[00:02:40] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:03:32] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:03:55] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:03:55] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:32] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:05:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1072635 (owner: 10TrainBranchBot) [00:08:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1072637 [00:08:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1072637 (owner: 10TrainBranchBot) [00:09:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:14:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:14:40] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:20:38] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:29:48] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (CORE_DIFF 24 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [00:35:34] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:38:53] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1072637 (owner: 10TrainBranchBot) [00:44:40] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:46:30] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:53:25] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:55:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10142991 (10phaultfinder) [01:00:24] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (CORE_DIFF 24 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [01:18:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [01:23:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [01:47:42] PROBLEM - Hadoop NodeManager on 
an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [01:57:46] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:01:44] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:15:42] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:39:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:54:14] (03PS4) 10Ebrahim: Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 [03:55:26] (03PS5) 10Ebrahim: Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 [03:55:55] (03PS6) 10Ebrahim: Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 [03:56:39] (03PS7) 10Ebrahim: Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 [04:20:25] (03CR) 10Ebrahim: "Asked on Meta:Babel, https://meta.wikimedia.org/wiki/Meta:Babel#Enable_the_dark_mode_for_Grants,_Research_and_Iberocoop_namespaces" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim) [04:28:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [04:33:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [04:45:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10143166 (10phaultfinder) [04:53:25] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:20:13] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10143172 (10phaultfinder) [05:44:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:49:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [06:02:01] (03CR) 10Muehlenhoff: Allow users to see rejected requests for permissions. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 (owner: 10Slyngshede) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:43] (03CR) 10Muehlenhoff: "Looks good from a technical perspective, but a few comments inline which are more from a process angle." [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 (owner: 10Slyngshede) [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:26:50] (03PS1) 10Stevemunene: Add new an worker keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/1072655 (https://phabricator.wikimedia.org/T353788) [06:31:54] (03CR) 10Filippo Giunchedi: [C:03+1] puppet8: account for unknown probe types [puppet] - 10https://gerrit.wikimedia.org/r/1072303 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [06:32:50] (03CR) 10Filippo Giunchedi: "LGTM, commit message needs adjusting tho" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [06:34:13] (03CR) 10Filippo Giunchedi: [C:03+1] alert: Failover from alert2002 to alert1002 [puppet] - 10https://gerrit.wikimedia.org/r/1071701 
(https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [06:34:54] (03CR) 10Muehlenhoff: Menu: Add menu entry for managers to view pending permission requests. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 (owner: 10Slyngshede) [06:34:55] (03CR) 10Filippo Giunchedi: "This will need rebasing on current dns repo (e.g. alerts CNAME alert2002 not alert1001)" [dns] - 10https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [06:38:34] (03CR) 10Muehlenhoff: Permission validation: Handle validation for manager approvals better. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 (owner: 10Slyngshede) [06:43:48] (03PS1) 10Filippo Giunchedi: team-sre: tweak MediaWikiLoginFailures threshold [alerts] - 10https://gerrit.wikimedia.org/r/1072657 (https://phabricator.wikimedia.org/T350597) [06:48:00] (03CR) 10Slyngshede: Menu: Add menu entry for managers to view pending permission requests. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 (owner: 10Slyngshede) [06:53:43] (03CR) 10Muehlenhoff: Menu: Add menu entry for managers to view pending permission requests. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 (owner: 10Slyngshede) [06:54:52] !log evacuating leadership for all partitions assigned to broker id 2005 on kafka-main-codfw - T363210 [06:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:56] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [06:56:20] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[2005,2010].codfw.wmnet with reason: Hardware refresh [06:56:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[2005,2010].codfw.wmnet with reason: Hardware refresh [07:00:05] Deploy window No deploys all day! 
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240913T0700) [07:02:03] (03PS1) 10Stevemunene: hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1072660 (https://phabricator.wikimedia.org/T353788) [07:02:05] (03PS1) 10Stevemunene: hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) [07:03:11] (03Abandoned) 10Stevemunene: trafficserver: add airflow-analytics-test discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [07:09:58] (03PS1) 10JMeybohm: kafka-main: Replace kafka-main2005 with kafka-main2010 [puppet] - 10https://gerrit.wikimedia.org/r/1072662 (https://phabricator.wikimedia.org/T363210) [07:13:20] (03PS1) 10JMeybohm: Replace kafka-main2005 with kafka-main2010 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072663 (https://phabricator.wikimedia.org/T363210) [07:23:46] (03CR) 10JMeybohm: [C:03+2] kafka-main: Replace kafka-main2005 with kafka-main2010 [puppet] - 10https://gerrit.wikimedia.org/r/1072662 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [07:27:33] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [07:31:54] (03PS1) 10Muehlenhoff: envoy: Add support for passing an array of sets to the firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1072690 [07:32:46] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:32:53] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:32:55] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [07:33:18] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [07:33:19] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[07:33:31] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:33:32] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:34:03] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:34:05] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[07:34:19] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[07:34:21] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[07:34:31] (CR) CI reject: [V:-1] envoy: Add support for passing an array of sets to the firewall service [puppet] - https://gerrit.wikimedia.org/r/1072690 (owner: Muehlenhoff)
[07:34:54] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[07:34:56] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[07:35:30] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[07:35:31] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:35:44] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:35:45] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:35:56] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:36:15] ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission kafka-main2005.codfw.wmnet - https://phabricator.wikimedia.org/T374688 (JMeybohm) NEW
[07:36:28] ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission kafka-main2005.codfw.wmnet - https://phabricator.wikimedia.org/T374688#10143297 (JMeybohm)
[07:39:19] (PS1) JMeybohm: Decom kafka-main2005 [puppet] - https://gerrit.wikimedia.org/r/1072695 (https://phabricator.wikimedia.org/T374688)
[07:39:39] (CR) Jelto: "one question in line" [puppet] - https://gerrit.wikimedia.org/r/1072662 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[07:42:11] SRE, Infrastructure-Foundations, vm-requests: eqiad: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374629#10143306 (elukey)
[07:43:09] (CR) JMeybohm: [C:+2] kafka-main: Replace kafka-main2005 with kafka-main2010 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072662 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[07:43:22] SRE, Infrastructure-Foundations, vm-requests: eqiad: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374629#10143308 (elukey) ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances | MFree | MFree avg | DFree | DFree avg | +----...
[07:43:34] (PS2) Muehlenhoff: envoy: Add support for passing an array of sets to the firewall service [puppet] - https://gerrit.wikimedia.org/r/1072690
[07:44:42] (PS1) Elukey: Add configuration for poolcounter100[6,7] [puppet] - https://gerrit.wikimedia.org/r/1072696 (https://phabricator.wikimedia.org/T374629)
[07:45:35] (CR) Elukey: [C:+2] sre.hosts.provision: improve Supermicro's bios settings [cookbooks] - https://gerrit.wikimedia.org/r/1071553 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[07:45:40] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1072696 (https://phabricator.wikimedia.org/T374629) (owner: Elukey)
[07:45:42] (CR) Elukey: [C:+2] sre.hosts.provision: refactor _config_dell_pxe() [cookbooks] - https://gerrit.wikimedia.org/r/1072553 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[07:46:11] SRE, Infrastructure-Foundations, vm-requests, Patch-For-Review: eqiad: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374629#10143312 (MoritzMuehlenhoff) +1
[07:46:23] (CR) Elukey: [C:+2] Add configuration for poolcounter100[6,7] [puppet] - https://gerrit.wikimedia.org/r/1072696 (https://phabricator.wikimedia.org/T374629) (owner: Elukey)
[07:46:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw
[07:46:44] (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1072663 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[07:47:29] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host poolcounter1006.eqiad.wmnet
[07:47:30] !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[07:50:25] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1072695 (https://phabricator.wikimedia.org/T374688) (owner: JMeybohm)
[07:50:43] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter1006.eqiad.wmnet - elukey@cumin1002"
[07:50:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter1006.eqiad.wmnet - elukey@cumin1002"
[07:50:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:50:47] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache poolcounter1006.eqiad.wmnet on all recursors
[07:50:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) poolcounter1006.eqiad.wmnet on all recursors
[07:51:16] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter1006.eqiad.wmnet - elukey@cumin1002"
[07:51:21] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter1006.eqiad.wmnet - elukey@cumin1002"
[07:52:13] !log installing nano updates from Bookworm point release
[07:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072690 (owner: Muehlenhoff)
[07:53:06] (CR) Jelto: kafka-main: Replace kafka-main2005 with kafka-main2010 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072662 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[07:53:18] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host poolcounter1006.eqiad.wmnet with OS bookworm
[07:53:29] (CR) Jelto: [C:+1] kafka-main: Replace kafka-main2005 with kafka-main2010 [puppet] - https://gerrit.wikimedia.org/r/1072662 (https://phabricator.wikimedia.org/T363210) (owner: JMeybohm)
[07:55:37] (03CR) 10Vgutierrez: [C:03+1] cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [07:56:41] (03PS1) 10Fabfur: hiera: testing haproxykafka on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072697 (https://phabricator.wikimedia.org/T374473) [07:58:12] (03CR) 10Fabfur: [C:03+2] cache:haproxy: hardcode $schema field [puppet] - 10https://gerrit.wikimedia.org/r/1072577 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [08:01:53] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10143324 (10MoritzMuehlenhoff) [08:02:25] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on poolcounter1006.eqiad.wmnet with reason: host reimage [08:02:43] (03CR) 10Vgutierrez: "no longer relevant for the latest PS" [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [08:05:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on poolcounter1006.eqiad.wmnet with reason: host reimage [08:06:34] (03CR) 10JMeybohm: [C:04-1] "Feel free to ignore the nits ofc., but the CHANGELOG format should follow the rest of the modules" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072502 (owner: 10Effie Mouzeli) [08:11:44] (03CR) 10Elukey: [V:03+2 C:03+2] spark: force a rebuild to pick up OS package upgrades [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1072150 (https://phabricator.wikimedia.org/T371874) (owner: 10Elukey) [08:12:13] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2184 [puppet] - 10https://gerrit.wikimedia.org/r/1072699 (https://phabricator.wikimedia.org/T335640) [08:13:10] (03CR) 10Elukey: [C:03+2] blubber: force rebuild to pick up git upgrades [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1071802 
(https://phabricator.wikimedia.org/T373976) (owner: 10Elukey) [08:14:16] (03CR) 10Jcrespo: [C:03+2] mariadb: Reenable notifications for db2184 [puppet] - 10https://gerrit.wikimedia.org/r/1072699 (https://phabricator.wikimedia.org/T335640) (owner: 10Jcrespo) [08:14:20] (03Merged) 10jenkins-bot: blubber: force rebuild to pick up git upgrades [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1071802 (https://phabricator.wikimedia.org/T373976) (owner: 10Elukey) [08:15:55] (03CR) 10Elukey: [C:03+1] puppetmaster::frontend|backend: Read the puppet-merge server from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1072543 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [08:16:25] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: gerrit1004.wikimedia.org [08:16:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: gerrit1004.wikimedia.org [08:16:39] (03CR) 10Elukey: [C:03+1] puppetserver: Pass the value of puppet_merge_server [puppet] - 10https://gerrit.wikimedia.org/r/1072494 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [08:17:43] (03CR) 10Elukey: [C:03+1] Do not use a login shell when dropping privileges [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1060789 (https://phabricator.wikimedia.org/T216832) (owner: 10Hashar) [08:18:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host poolcounter1006.eqiad.wmnet with OS bookworm [08:18:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host poolcounter1006.eqiad.wmnet [08:18:56] (03CR) 10Elukey: [C:03+1] test-cookbook: read spicerack config with sudo [puppet] - 10https://gerrit.wikimedia.org/r/1071810 (owner: 10Volans) [08:19:26] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host poolcounter1007.eqiad.wmnet [08:19:27] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [08:21:29] 
(03PS2) 10Fabfur: hiera: testing haproxykafka on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072697 (https://phabricator.wikimedia.org/T374473) [08:25:17] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter1007.eqiad.wmnet - elukey@cumin1002" [08:27:13] !log remove djangorestframework 3.14.0-2+wmf12u1 from apt.wikimedia.org, the bug fixed in that custom build has been integrated into Debian Bookworm via a point update and is no longer needed [08:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:53] (03CR) 10Fabfur: [C:03+2] hiera: testing haproxykafka on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1072697 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [08:28:05] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM poolcounter1007.eqiad.wmnet - elukey@cumin1002" [08:28:05] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:28:05] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache poolcounter1007.eqiad.wmnet on all recursors [08:28:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) poolcounter1007.eqiad.wmnet on all recursors [08:28:35] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter1007.eqiad.wmnet - elukey@cumin1002" [08:28:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM poolcounter1007.eqiad.wmnet - elukey@cumin1002" [08:29:51] !log rolling out djangorestbase update from Bookworm point release (replacing our previous bespoke build) [08:29:53] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:04] !log klausman@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [08:30:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1002 for host ml-l... [08:32:09] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host poolcounter1007.eqiad.wmnet with OS bookworm [08:35:33] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [08:37:19] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10143379 (10MoritzMuehlenhoff) [08:39:12] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [08:40:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10143389 (10phaultfinder) [08:40:11] (03PS1) 10Klausman: preseed: Add missing wildcard for ml-lab partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1072703 [08:42:07] (03CR) 10Klausman: [C:03+2] preseed: Add missing wildcard for ml-lab partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1072703 (owner: 10Klausman) [08:42:42] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on poolcounter1007.eqiad.wmnet with reason: host reimage [08:43:25] !log klausman@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-lab1001.eqiad.wmnet with OS bookworm [08:43:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143390 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1002 
for host ml-lab10... [08:45:10] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [08:46:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on poolcounter1007.eqiad.wmnet with reason: host reimage [08:47:33] !log klausman@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [08:47:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143392 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1002 for host ml-l... [08:48:11] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [08:53:25] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:59:33] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10143404 (10hashar) 05Stalled→03Open [09:00:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10143405 (10phaultfinder) [09:01:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host poolcounter1007.eqiad.wmnet with OS 
bookworm [09:01:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host poolcounter1007.eqiad.wmnet [09:02:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:02:47] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:06:46] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072243 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [09:07:36] (03CR) 10JMeybohm: [C:03+2] Replace kafka-main2005 with kafka-main2010 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072663 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:08:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10143412 (10cmooney) 05Open→03Resolved a:03cmooney [09:09:20] (03Merged) 10jenkins-bot: Replace kafka-main2005 with kafka-main2010 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072663 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:09:56] !log klausman@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-lab1001.eqiad.wmnet with OS bookworm [09:10:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1002 for host ml-lab10... 
[09:12:37] !log klausman@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [09:12:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1002 for host ml-l... [09:14:57] !log restoring leadership for all partitions assigned to broker id 2005 on kafka-main-codfw - T363210 [09:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:01] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [09:15:18] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2010.codfw.wmnet [09:15:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2010.codfw.wmnet [09:19:15] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [09:20:08] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [09:20:09] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [09:20:23] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2005.codfw.wmnet [09:20:26] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [09:20:28] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:20:57] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2059.codfw.wmnet [09:21:05] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:21:07] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [09:21:31] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2059.codfw.wmnet [09:21:36] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2060.codfw.wmnet [09:21:52] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [09:21:53] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [09:22:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2060.codfw.wmnet [09:22:15] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2301.codfw.wmnet [09:22:49] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [09:22:49] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2301.codfw.wmnet [09:22:50] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:22:54] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2302.codfw.wmnet [09:23:03] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:23:04] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [09:23:23] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [09:23:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2302.codfw.wmnet [09:23:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2303.codfw.wmnet [09:24:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2303.codfw.wmnet [09:24:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2304.codfw.wmnet [09:24:38] (03CR) 10DCausse: flink-app: customize calico label selector (032 comments) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [09:24:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2304.codfw.wmnet [09:24:51] !log klausman@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-lab1001.eqiad.wmnet with reason: host reimage [09:24:54] (03PS1) 10Alexandros Kosiaris: kubernetes20[59-60], mw230[1-5] -> wikikube-worker21[14-20] [puppet] - 10https://gerrit.wikimedia.org/r/1072712 (https://phabricator.wikimedia.org/T372878) [09:24:56] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2305.codfw.wmnet [09:25:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2305.codfw.wmnet [09:25:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:27:24] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:28:12] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-lab1001.eqiad.wmnet with reason: host reimage [09:28:47] (03CR) 10Alexandros Kosiaris: [C:03+2] kubernetes20[59-60], mw230[1-5] -> wikikube-worker21[14-20] [puppet] - 10https://gerrit.wikimedia.org/r/1072712 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [09:28:56] FIRING: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:30:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:33:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:34:23] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [09:34:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [09:34:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:34:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2005.codfw.wmnet [09:34:54] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2005.codfw.wmnet - https://phabricator.wikimedia.org/T374688#10143508 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kafka-main2005.codfw.wmnet` - kafka-main2005.codf... 
[09:36:57] (03CR) 10Vgutierrez: hiera: let purged use closest cluster on codfw, ulsfo and eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1071844 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [09:37:35] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission kafka-main2005.codfw.wmnet - https://phabricator.wikimedia.org/T374688#10143510 (10JMeybohm) a:05JMeybohm→03None [09:37:56] (03CR) 10JMeybohm: [C:03+1] "kafka-main-codfw is done, this can be merged now" [puppet] - 10https://gerrit.wikimedia.org/r/1071844 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [09:38:22] (03CR) 10Vgutierrez: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1071844 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [09:38:53] (03PS1) 10Fabfur: hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) [09:39:51] (03PS2) 10Fabfur: hiera: enable haproxykafka on cp3066 for testing [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) [09:41:09] !log klausman@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - klausman@cumin1002" [09:41:37] !log klausman@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - klausman@cumin1002" [09:41:38] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-lab1001.eqiad.wmnet with OS bookworm [09:41:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143519 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1002 for host ml-lab10... 
[09:42:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143533 (10klausman) [09:46:58] (03CR) 10Ilias Sarantopoulos: "@tklausmann@wikimedia.org can you merge please? I don't have +2 on this repo. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1063213 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [09:54:20] 06SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 2 VM request for poolcounter - https://phabricator.wikimedia.org/T374629#10143560 (10elukey) 05Open→03Resolved a:03elukey [09:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [09:55:00] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10143564 (10MoritzMuehlenhoff) [09:55:44] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:56:22] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.104 second response time https://wikitech.wikimedia.org/wiki/Swift [09:56:34] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Swift [09:57:20] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Swift [09:57:54] (03PS1) 10Elukey: services: update thumbor-eqiad to poolcounter1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072716 (https://phabricator.wikimedia.org/T332015) [09:57:56] 
(03PS1) 10Elukey: services: add new poolcounter nodes to MW configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072717 (https://phabricator.wikimedia.org/T332015) [09:58:00] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [09:58:00] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:58:16] !incidents [09:58:17] 5164 (UNACKED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [09:58:17] 5165 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [09:58:22] !ack 5164 [09:58:23] 5164 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [09:58:23] !ack 5165 [09:58:24] 5165 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [09:58:27] thanks [09:58:37] I'm wondering if we triggered that with kafka :_) [10:00:27] it's already down again, isn't it? 
[10:02:26] (03PS1) 10Vgutierrez: Revert "hiera: let purged use closest cluster on codfw, ulsfo and eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1072718 [10:03:00] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:03:00] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:03:15] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: let purged use closest cluster on codfw, ulsfo and eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1072718 (owner: 10Vgutierrez) [10:04:28] (03CR) 10Elukey: Swap poolcounter2003 with poolcounter2005 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [10:07:19] (03PS2) 10Elukey: Update the Debian changelog to build on Bookworm [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) [10:08:01] PROBLEM - MariaDB Replica Lag: s8 #page on db1172 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86326.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:09] !incidents [10:08:10] 5166 (UNACKED) db1172 (paged)/MariaDB Replica Lag: s8 (paged) [10:08:10] 5165 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [10:08:10] 5164 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [10:08:12] !ack 5166 [10:08:13] 5166 (ACKED) db1172 (paged)/MariaDB Replica Lag: s8 (paged) [10:09:03] that must be a schema change that went over downtime [10:09:20] Amir1 ^ [10:09:48] let me 
double check it is depooled [10:10:20] it's pooled [10:10:28] pooled? [10:10:28] for apu [10:10:31] *api [10:10:45] but not pooled in general ... whatever that means :) [10:11:13] sections.s8.groups.api.pooled: true [10:11:19] yeah, I believe that overrides it [10:11:25] let me double check the generated config [10:11:49] one is the normal state and the other is the global temporary status [10:12:20] yeah, no reference of it at eqiad.json [10:12:40] it would otherwise show errors on mw about the db not being available, or on the probe [10:12:52] so "sections.s8.pooled: false" overrides the groups.api one [10:12:59] yeah [10:13:08] (03PS1) 10Fabfur: prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [10:13:29] (03CR) 10CI reject: [V:04-1] prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [10:13:47] jayme: after all, the documentation says: to depool a host, set it as depooled, otherwise it would be a hell to depool [10:13:50] 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10143598 (10cmooney) Talking about this again I'm ok with the revised plan, with allocations similar to our POP sites. So for instance for codfw we can probably move ahead on this basis: * 2a02:ec8... [10:13:53] (03CR) 10Clément Goubert: [C:03+1] team-sre: tweak MediaWikiLoginFailures threshold [alerts] - 10https://gerrit.wikimedia.org/r/1072657 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [10:14:09] (03CR) 10CI reject: [V:04-1] Update the Debian changelog to build on Bookworm [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [10:14:53] jynus: okidoke. So rn I just downtime it and open a ticket for DBAs to inspect?
[10:15:06] jayme: vgutierrez I will downtime the host until monday, then pull Amir's ears [10:15:14] I will handle it, no worries [10:15:19] thx jynus [10:15:21] sweet, thanks! [10:18:14] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1172.eqiad.wmnet with reason: ongoing schema change [10:18:30] (03CR) 10Elukey: "The lintian errors are:" [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [10:18:30] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1172.eqiad.wmnet with reason: ongoing schema change [10:19:19] actually, that may be arnaudb, not amir, according to https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance [10:20:08] (03CR) 10Muehlenhoff: Update the Debian changelog to build on Bookworm (031 comment) [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [10:23:21] (03CR) 10Elukey: Update the Debian changelog to build on Bookworm (031 comment) [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [10:23:30] (03PS3) 10Elukey: Update the Debian changelog to build on Bookworm [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) [10:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:25:18] jayme, vgutierrez I will be around for some time still, but I am not sure the dbas are, today. 
Can you pass the Americas time the idea of what happened- I cannot be 100% sure it won't happen again on another host until automation/run changes [10:25:57] !log T12345 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'IloveuFlyTek' 'Theology1937' --ignorestatus [10:25:58] jynus: will do [10:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:01] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [10:26:30] vgutierrez: basically, IF not pooled, like it was the case, ack/downtime and not worry [10:28:21] FIRING: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:14] !log T374684 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'IloveuFlyTek' 'Theology1937' --ignorestatus [10:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, you can ignore the CI test." 
[debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [10:29:18] T374684: Unblock stuck global rename of IloveuFlyTek, Iosonopony, Mohamadanisahmad5, Monty.ch - https://phabricator.wikimedia.org/T374684 [10:29:53] !log T374684 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'Iosonopony' 'L.Sala' [10:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:08] (03CR) 10CI reject: [V:04-1] Update the Debian changelog to build on Bookworm [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/1071561 (https://phabricator.wikimedia.org/T331969) (owner: 10Elukey) [10:30:36] RESOLVED: [2x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072716 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [10:32:10] (03CR) 10Muehlenhoff: "(The failing nodes in PCC fail for unrelated reasons to this change)" [puppet] - 10https://gerrit.wikimedia.org/r/1072690 (owner: 10Muehlenhoff) [10:32:13] (03PS1) 10Vgutierrez: hiera: Switch purged@cp2037 back to main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1072720 (https://phabricator.wikimedia.org/T363210) [10:33:18] !log T374684 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Mohamadanisahmad5' 'Vanished user a53a2dd4f79a7bde25cf2ea2b2a309cb' [10:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:59] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3975/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072720 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [10:35:00] (03CR) 10JMeybohm: [C:03+1] hiera: Switch purged@cp2037 back to main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1072720 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [10:36:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2302:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2302 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:37:03] !log T374684 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Monty.ch' 'MajorFault' [10:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:07] T374684: Unblock stuck global rename of IloveuFlyTek, Iosonopony, Mohamadanisahmad5, Monty.ch - https://phabricator.wikimedia.org/T374684 [10:37:44] (03CR) 10Hnowlan: [C:03+1] services: update thumbor-eqiad to poolcounter1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072716 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [10:40:52] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Switch purged@cp2037 back to main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1072720 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [10:44:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1172.eqiad.wmnet with reason: Depooled recovering replag [10:44:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1172.eqiad.wmnet with reason: Depooled recovering replag [10:45:26] (03PS2) 10Fabfur: prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 
(https://phabricator.wikimedia.org/T374696) [10:45:49] (03CR) 10CI reject: [V:04-1] prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [10:49:50] (03PS3) 10Fabfur: prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [10:53:30] (03PS1) 10Btullis: Add ORKG triplestore to WDQS federation allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1072723 (https://phabricator.wikimedia.org/T366485) [10:54:07] (03PS4) 10Fabfur: prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [10:56:02] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3976/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072723 (https://phabricator.wikimedia.org/T366485) (owner: 10Btullis) [10:56:45] (03CR) 10CI reject: [V:04-1] prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [10:57:56] jynus: thanks. I'll take care of it [10:58:01] sorry for the mess [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240913T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The Deployer, to do GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240913T1100).
[11:00:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10143766 (10phaultfinder) [11:00:31] (03PS5) 10Fabfur: prometheus: enable haproxykafka scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [11:01:23] (03CR) 10Clément Goubert: [C:03+1] services: add new poolcounter nodes to MW configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072717 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [11:01:51] (03PS8) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [11:02:11] (03CR) 10Clément Goubert: [C:03+1] Swap poolcounter2003 with poolcounter2005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [11:03:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [11:04:05] 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: "list has X moderation requests waiting" email should provide a link - https://phabricator.wikimedia.org/T374694#10143781 (10Ladsgroup) [11:05:10] (03CR) 10Ladsgroup: "Legal gave their seal of approval" [puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent) [11:05:15] (03PS2) 10Varnent: Updated license information from CC 3.0 to CC 4.0 per request from Legal. [puppet] - 10https://gerrit.wikimedia.org/r/1072265 [11:05:19] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Updated license information from CC 3.0 to CC 4.0 per request from Legal. 
[puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent) [11:06:56] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072714 (https://phabricator.wikimedia.org/T374473) (owner: 10Fabfur) [11:15:12] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10143808 (10phaultfinder) [11:16:56] (03CR) 10Muehlenhoff: P:idp More precise base_dn for user lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930) (owner: 10Slyngshede) [11:25:13] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10143827 (10phaultfinder) [11:28:26] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for prometheus::pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/1072733 (https://phabricator.wikimedia.org/T135991) [11:30:37] (03PS1) 10Btullis: Update the URL of the WikiPathways SPARQL endpoint to use HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/1072734 (https://phabricator.wikimedia.org/T364448) [11:32:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072733 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:33:07] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3977/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072734 (https://phabricator.wikimedia.org/T364448) (owner: 10Btullis) [11:34:10] (03PS2) 10Slyngshede: Menu: Add menu entry for managers to view pending permission requests. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072547 [11:41:35] (03CR) 10Slyngshede: Permission validation: Handle validation for manager approvals better. 
(032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 (owner: 10Slyngshede) [11:44:02] (03PS9) 10Slyngshede: Permission validation: Handle validation for manager approvals better. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072528 [11:47:00] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2059 to wikikube-worker2114 [11:47:23] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [11:50:39] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2059 to wikikube-worker2114 - akosiaris@cumin1002" [11:52:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2059 to wikikube-worker2114 - akosiaris@cumin1002" [11:52:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:52:13] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2114 [11:53:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2114 [11:53:43] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2059 to wikikube-worker2114 [11:54:20] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2060 to wikikube-worker2115 [11:54:43] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [11:59:51] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2060 to wikikube-worker2115 - akosiaris@cumin1002" [12:05:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2060 to wikikube-worker2115 - 
akosiaris@cumin1002" [12:05:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:05:18] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2115 [12:05:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2115 [12:06:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2060 to wikikube-worker2115 [12:07:15] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2301 to wikikube-worker2116 [12:07:26] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:07:42] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10143968 (10MoritzMuehlenhoff) [12:09:21] (03PS2) 10Slyngshede: Allow users to see log entires made by managers. [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 [12:11:33] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2114.codfw.wmnet [12:11:43] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2115.codfw.wmnet [12:11:59] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2301 to wikikube-worker2116 - akosiaris@cumin1002" [12:12:39] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2114.codfw.wmnet with OS bullseye [12:12:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2115.codfw.wmnet with OS bullseye [12:12:49] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2114 [12:12:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2301 to 
wikikube-worker2116 - akosiaris@cumin1002" [12:12:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:12:52] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2116 [12:13:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2116 [12:13:43] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2301 to wikikube-worker2116 [12:15:11] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:15:26] (03CR) 10Slyngshede: Allow users to see log entires made by managers. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1072552 (owner: 10Slyngshede) [12:15:42] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2303 to wikikube-worker2118 [12:16:16] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1562942216 and 4009 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:17:16] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1510545320 and 4069 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:18:16] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6120 and 343 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:18:16] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 343 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:18:21] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:21] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2114 - akosiaris@cumin1002" [12:19:57] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:07] !incidents [12:20:08] 5166 (ACKED) db1172 (paged)/MariaDB Replica Lag: s8 (paged) [12:20:08] 5167 (UNACKED) ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad) [12:20:08] 5165 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [12:20:08] 5164 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [12:20:12] :-( [12:20:12] !ack 5167 [12:20:12] 5167 (ACKED) ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad) [12:20:24] ah..I was wondering already [12:20:40] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:20:48] neverending fun with wikifunctions [12:22:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2114 - akosiaris@cumin1002" [12:22:15] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:22:15] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2114.codfw.wmnet 102.0.192.10.in-addr.arpa 
2.0.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:22:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2114.codfw.wmnet 102.0.192.10.in-addr.arpa 2.0.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:22:19] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2114 [12:23:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2114 [12:23:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2114 [12:24:00] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2303 to wikikube-worker2118 - akosiaris@cumin1002" [12:24:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2303 to wikikube-worker2118 - akosiaris@cumin1002" [12:24:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:24:06] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2118 [12:24:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2115 [12:24:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2118 [12:24:22] (03CR) 10Vgutierrez: [C:04-1] "current code doesn't enable scraping given that haproxykafka ensure parameter is never set to present" [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [12:24:43] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:57] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2303 to wikikube-worker2118 [12:25:26] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2304 to wikikube-worker2119 [12:25:51] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:26:02] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:26:02] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:26:36] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2116.codfw.wmnet [12:26:58] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2116.codfw.wmnet with OS bullseye [12:27:29] (03CR) 10Filippo Giunchedi: [C:03+2] team-sre: tweak MediaWikiLoginFailures threshold [alerts] - 10https://gerrit.wikimedia.org/r/1072657 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [12:27:55] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for prometheus::pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/1072733 
(https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:29:14] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2304 to wikikube-worker2119 - akosiaris@cumin1002" [12:29:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2304 to wikikube-worker2119 - akosiaris@cumin1002" [12:29:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:29:45] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2119 [12:29:57] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2119 [12:30:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2304 to wikikube-worker2119 [12:30:54] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:31:11] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2305 to wikikube-worker2120 [12:31:57] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2117.codfw.wmnet on all recursors [12:32:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2117.codfw.wmnet on all recursors [12:33:50] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711 (10fgiunchedi) 03NEW [12:34:10] RESOLVED: KubernetesRsyslogDown: rsyslog on mw2302:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2302 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:35:07] (03PS1) 10Muehlenhoff: Remove old parsoid certs [puppet] - 
10https://gerrit.wikimedia.org/r/1072737 (https://phabricator.wikimedia.org/T359387) [12:35:24] (03PS1) 10Muehlenhoff: labs-private: Remove parsoid stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1072738 (https://phabricator.wikimedia.org/T357750) [12:35:27] FIRING: [2x] ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:35] (03PS2) 10Muehlenhoff: Remove old parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/1072737 (https://phabricator.wikimedia.org/T359387) [12:35:38] !incidents [12:35:39] 5166 (ACKED) db1172 (paged)/MariaDB Replica Lag: s8 (paged) [12:35:39] 5168 (UNACKED) ProbeDown sre (ip4 probes/service eqiad) [12:35:39] 5167 (RESOLVED) ProbeDown sre (10.2.2.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 eqiad) [12:35:39] 5165 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [12:35:39] 5164 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [12:35:44] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2115 - akosiaris@cumin1002" [12:35:48] !ack 5168 [12:35:48] 5168 (ACKED) ProbeDown sre (ip4 probes/service eqiad) [12:35:59] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:36:04] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:36:04] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: 
Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:36:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2115 - akosiaris@cumin1002" [12:36:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:36:05] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2115.codfw.wmnet 124.0.192.10.in-addr.arpa 4.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:36:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2115.codfw.wmnet 124.0.192.10.in-addr.arpa 4.2.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:36:09] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2115 [12:36:29] mmhh thanos' unhappy too [12:36:35] I'm taking a look [12:37:00] jayme: ^ FYI [12:37:12] saw, thanks [12:37:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2115 [12:37:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2115 [12:37:47] so...given that wikifunctions is known to be broken - how do we feel about making it non-paging until it's fixed? [12:38:07] there is really no reason to pull anyone out of the weekend when nothing can be done [12:38:29] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2116 [12:38:41] +1 [12:38:51] is it completely broken, or just can't handle load or what? 
[12:39:05] more like completely broken [12:39:12] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:33] if it's timing out or not taking conns or something, we might want to disable it somewhere in traffic if we're leaving it dead for a while [12:39:35] (03PS6) 10Fabfur: cache:haproxykafka: first stub classes to allow prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [12:39:37] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2305 to wikikube-worker2120 - akosiaris@cumin1002" [12:39:41] (03CR) 10Fabfur: cache:haproxykafka: first stub classes to allow prometheus scraping (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [12:39:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2305 to wikikube-worker2120 - akosiaris@cumin1002" [12:39:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:39:42] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2118.codfw.wmnet on all recursors [12:39:42] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2120 [12:39:42] so the impact doesn't spread through cache clusters [12:39:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2118.codfw.wmnet on all recursors [12:39:55] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:40:05] !log akosiaris@cumin1002 START - Cookbook 
sre.k8s.renumber-node Renumbering for host wikikube-worker2118.codfw.wmnet [12:40:19] bblack: IIUC there are a bunch of URLs that lead to some infinite loop that is killed after 60s, which saturates workers [12:40:28] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2120 [12:40:38] yeah that sounds like a good reason to disable it [12:40:47] those 60s tie up cp-node threads, too [12:40:51] but its an isolated mw instance, so the worker saturation does not spread to actual wikis [12:41:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2305 to wikikube-worker2120 [12:41:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2118.codfw.wmnet with OS bullseye [12:41:14] indeed, might still be a problem for cp nodes [12:41:30] !log bounce thanos-query-frontend on titan eqiad [12:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:49] but we have not seen that yet [12:42:01] (03PS7) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 [12:43:59] (03PS1) 10Hashar: rdbms: only count replication sources toward "masterConns" in getServerConnection() [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072739 (https://phabricator.wikimedia.org/T374534) [12:43:59] or maybe we can just turn down the timeout for the mw-wikifunctions backend to reduce the impact there [12:44:04] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:44:04] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:44:07] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2302 to wikikube-worker2117 [12:44:12] FIRING: [4x] JobUnavailable: Reduced availability for job 
thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:44:15] 06SRE, 06Infrastructure-Foundations, 10netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712 (10aborrero) 03NEW [12:44:59] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:45:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10144154 (10phaultfinder) [12:45:07] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2119.codfw.wmnet [12:45:10] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2116 - akosiaris@cumin1002" [12:45:25] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2119.codfw.wmnet with OS bullseye [12:45:27] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:45:36] bblack: given that cp did raise any issues up until now we might be okay without...do you know what the timeout is currently? 
With >60s at least users get some kind of proper error message [12:45:56] (03CR) 10Hashar: [C:03+2] "Amir suggested to backport it immediately in the interest of cutting the log spam in `rdbms` :)" [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072739 (https://phabricator.wikimedia.org/T374534) (owner: 10Hashar) [12:46:09] (03PS8) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 [12:46:15] (03PS1) 10JMeybohm: Disable paging for mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072740 (https://phabricator.wikimedia.org/T374231) [12:46:48] jayme: yeah we might be ok, so long as the traffic to those slow requests remains stable [12:47:12] I guess if not, someone could use standard requestctl to shut it off, vs figuring out some tricky timeout thing today. [12:47:52] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713 (10aborrero) 03NEW [12:48:13] right...also we've been in the "ballpark" of 5rps for it - I don't think it's expected to raise [12:48:21] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714 (10aborrero) 03NEW [12:49:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2116 - akosiaris@cumin1002" [12:49:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:00] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2116.codfw.wmnet 171.0.192.10.in-addr.arpa 1.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:49:03] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2116.codfw.wmnet 
171.0.192.10.in-addr.arpa 1.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:49:04] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2116 [12:49:12] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:49:32] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2302 to wikikube-worker2117 - akosiaris@cumin1002" [12:49:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2302 to wikikube-worker2117 - akosiaris@cumin1002" [12:49:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:36] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2117 [12:49:47] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715 (10aborrero) 03NEW [12:50:25] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10144206 (10aborrero) [12:50:28] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10144207 (10aborrero) [12:50:33] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495#10144208 (10aborrero) [12:50:52] (03CR) 10Clément Goubert: [C:03+1] 
Disable paging for mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072740 (https://phabricator.wikimedia.org/T374231) (owner: 10JMeybohm) [12:51:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2117 [12:51:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2116 [12:51:58] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2116 [12:52:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2119 [12:52:20] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2120.codfw.wmnet [12:52:24] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2302 to wikikube-worker2117 [12:52:35] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2120.codfw.wmnet with OS bullseye [12:52:52] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:53:17] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2117.codfw.wmnet [12:53:25] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:37] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2117.codfw.wmnet [12:53:58] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2117.codfw.wmnet on all recursors [12:54:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2117.codfw.wmnet on all recursors [12:54:10] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node 
Renumbering for host wikikube-worker2117.codfw.wmnet [12:54:33] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2117.codfw.wmnet with OS bullseye [12:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:55:00] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144214 (10aborrero) [12:55:16] (03CR) 10Fabfur: "Don't know spicerack kafka APIs but aside from the two minor observations on docstrings looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [12:55:29] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144223 (10aborrero) [12:56:05] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716 (10aborrero) 03NEW [12:56:19] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10144237 (10aborrero) [12:56:22] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10144238 (10aborrero) [12:56:28] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144239 (10aborrero) [12:56:34] (03CR) 10JMeybohm: [C:03+2] Disable paging for mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072740 (https://phabricator.wikimedia.org/T374231) (owner: 10JMeybohm) [12:57:11] 06SRE, 06cloud-services-team, 
06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144245 (10aborrero) [12:57:19] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2119 - akosiaris@cumin1002" [12:57:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2119 - akosiaris@cumin1002" [12:57:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:57:23] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2119.codfw.wmnet 174.0.192.10.in-addr.arpa 4.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:57:27] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2119.codfw.wmnet 174.0.192.10.in-addr.arpa 4.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:57:27] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2119 [12:57:33] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144241 (10aborrero) [12:57:35] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10144246 (10aborrero) [13:00:06] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [13:00:14] (03PS9) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 [13:00:19] (03CR) 10CI reject: [V:04-1] sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 
(owner: 10Vgutierrez) [13:00:36] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 31s) [13:01:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2119 [13:01:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2119 [13:01:31] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2120 [13:01:45] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:01:47] (03PS10) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 [13:02:02] (03CR) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [13:03:36] (03CR) 10Vgutierrez: cache:haproxykafka: first stub classes to allow prometheus scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [13:05:19] (03PS1) 10Muehlenhoff: deployment servers: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1072744 [13:05:28] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2120 - akosiaris@cumin1002" [13:05:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2120 - akosiaris@cumin1002" [13:05:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:05:33] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2120.codfw.wmnet 175.0.192.10.in-addr.arpa 
5.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:05:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2120.codfw.wmnet 175.0.192.10.in-addr.arpa 5.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:05:37] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2120 [13:09:42] (03CR) 10Fabfur: sre.cdn: Add transfer-purged-positions cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [13:10:04] (03PS4) 10Ssingh: P:ntp and nagios_core: add new command ntp_check_peer_and_stratum [puppet] - 10https://gerrit.wikimedia.org/r/1072276 [13:10:41] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2118.codfw.wmnet with OS bullseye [13:10:42] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2118.codfw.wmnet [13:10:55] (03CR) 10Ssingh: P:ntp and nagios_core: add new command ntp_check_peer_and_stratum (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [13:11:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2120 [13:11:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2120 [13:11:01] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3978/co" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [13:11:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072744 (owner: 10Muehlenhoff) [13:11:49] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host 
wikikube-worker2118.codfw.wmnet [13:12:06] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2118.codfw.wmnet with OS bullseye [13:12:13] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2117 [13:12:21] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1072276/3979/dns1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [13:15:13] (03Merged) 10jenkins-bot: rdbms: only count replication sources toward "masterConns" in getServerConnection() [core] (wmf/1.43.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1072739 (https://phabricator.wikimedia.org/T374534) (owner: 10Hashar) [13:15:23] (03CR) 10CI reject: [V:04-1] sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [13:16:53] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1072739|rdbms: only count replication sources toward "masterConns" in getServerConnection() (T374534)]] [13:16:57] T374534: Lots of "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" involving external store (2024-09-05) - https://phabricator.wikimedia.org/T374534 [13:17:12] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:17:20] (03PS2) 10Muehlenhoff: deployment servers: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1072744 [13:17:53] (03PS11) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 [13:20:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10144279 (10phaultfinder) [13:20:10] !log hashar@deploy1003 hashar: Backport for [[gerrit:1072739|rdbms: only count replication sources toward "masterConns" in getServerConnection() (T374534)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:21:03] !log akosiaris@cumin1002 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2117 - akosiaris@cumin1002" [13:21:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2117 - akosiaris@cumin1002" [13:21:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:21:07] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2117.codfw.wmnet 172.0.192.10.in-addr.arpa 2.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:21:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2117.codfw.wmnet 172.0.192.10.in-addr.arpa 2.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:21:12] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2117 [13:22:10] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2117 [13:22:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2117 [13:22:16] (03CR) 10Bking: flink-app: customize calico label selector (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072236 (https://phabricator.wikimedia.org/T373195) (owner: 10Bking) [13:22:17] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2118 [13:22:47] !log hashar@deploy1003 hashar: Continuing with sync [13:27:28] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072739|rdbms: only count replication sources toward "masterConns" in getServerConnection() (T374534)]] (duration: 10m 34s) [13:27:32] T374534: Lots of "Expectation (masterConns <= 0) by 
ApiMain::setRequestExpectations not met" involving external store (2024-09-05) - https://phabricator.wikimedia.org/T374534 [13:27:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072744 (owner: 10Muehlenhoff) [13:28:53] (03PS7) 10Fabfur: cache:haproxykafka: first stub classes to allow prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [13:29:06] (03CR) 10Fabfur: cache:haproxykafka: first stub classes to allow prometheus scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [13:33:50] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [13:33:52] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10144303 (10MoritzMuehlenhoff) [13:34:17] (03CR) 10Ssingh: [C:03+1] sre.cdn: Add transfer-purged-positions cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [13:37:24] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [13:38:21] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1072265 (owner: 10Varnent) [13:39:09] (03PS12) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 [13:39:34] (03CR) 10Vgutierrez: sre.cdn: Add transfer-purged-positions cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [13:39:47] (03PS1) 10Muehlenhoff: wmcs::novaproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1072751 [13:39:55] (03CR) 10Ssingh: [C:03+1] sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [13:40:44] !log 
akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2118 - akosiaris@cumin1002" [13:40:49] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2118 - akosiaris@cumin1002" [13:40:49] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:49] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2118.codfw.wmnet 173.0.192.10.in-addr.arpa 3.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:40:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2118.codfw.wmnet 173.0.192.10.in-addr.arpa 3.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:40:53] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2118 [13:42:02] (03PS1) 10Alexandros Kosiaris: Add wikikube-worker2117-2120 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1072752 [13:42:03] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2114.codfw.wmnet with OS bullseye [13:42:04] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2114.codfw.wmnet [13:44:26] (03CR) 10Alexandros Kosiaris: [C:03+2] Add wikikube-worker2117-2120 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1072752 (owner: 10Alexandros Kosiaris) [13:48:34] (03PS1) 10Vgutierrez: hiera: switch purged@codfw,ulsfo,eqsin back to codfw kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/1072753 (https://phabricator.wikimedia.org/T363210) [13:50:28] (03CR) 10Vgutierrez: [C:04-1] cache:haproxykafka: first stub classes to allow 
prometheus scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [13:52:07] (03CR) 10Muehlenhoff: [C:03+1] "I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690 for this" [puppet] - 10https://gerrit.wikimedia.org/r/1071925 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [13:52:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2118 [13:52:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2118 [13:52:39] (03CR) 10JHathaway: [C:03+2] puppet8: account for unknown probe types [puppet] - 10https://gerrit.wikimedia.org/r/1072303 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [13:52:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2114.codfw.wmnet [13:53:08] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1072753 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [13:53:14] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2114.codfw.wmnet with OS bullseye [13:53:39] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2114.codfw.wmnet with OS bullseye [13:53:40] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2114.codfw.wmnet [13:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [13:55:43] !log 
akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2115.codfw.wmnet with OS bullseye [13:55:43] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2115.codfw.wmnet [13:57:27] (03CR) 10Vgutierrez: [C:03+2] sre.cdn: Add transfer-purged-positions cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1072736 (owner: 10Vgutierrez) [13:57:36] (03PS1) 10Muehlenhoff: Revert "Temporarily disable stunnel for the Puppet 7 migration of deployment hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1072754 (https://phabricator.wikimedia.org/T349619) [13:58:32] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2114.codfw.wmnet [13:59:00] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2114.codfw.wmnet with OS bullseye [13:59:25] (03PS8) 10Fabfur: cache:haproxykafka: first stub classes to allow prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [13:59:40] (03CR) 10Fabfur: cache:haproxykafka: first stub classes to allow prometheus scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [13:59:47] (03CR) 10CI reject: [V:04-1] cache:haproxykafka: first stub classes to allow prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [14:00:05] (03CR) 10Ssingh: [C:03+1] "Site list looks good, PCC looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072753 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [14:00:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2115.codfw.wmnet [14:00:24] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2115.codfw.wmnet [14:00:56] !log akosiaris@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2115.codfw.wmnet [14:01:19] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2115.codfw.wmnet with OS bullseye [14:01:21] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2114.codfw.wmnet with OS bullseye [14:01:22] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2114.codfw.wmnet [14:01:26] (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1072753/3981/cp4052.ulsfo.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1072753 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [14:01:30] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2115.codfw.wmnet with OS bullseye [14:01:30] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2115.codfw.wmnet [14:02:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:03:13] (03PS1) 10FNegri: R:wmcs::db::wikireplicas remove access from cloudcumin [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) [14:04:08] (03CR) 10Alexandros Kosiaris: [C:03+1] "Makes sense to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072740 (https://phabricator.wikimedia.org/T374231) (owner: 10JMeybohm) [14:05:38] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2120.codfw.wmnet with reason: host reimage [14:06:08] (03CR) 10Muehlenhoff: [C:03+1] P:etcd::tlsproxy: add support for PKI certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:07:32] FIRING: KubernetesCalicoDown: wikikube-worker2114.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2114.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:08:11] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072755 (https://phabricator.wikimedia.org/T344599) (owner: 10FNegri) [14:08:39] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10144395 (10Jhancock.wm) or the delivery gets messed up. will update when I have it in hand. 
[14:08:48] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2116.codfw.wmnet with OS bullseye [14:08:49] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2116.codfw.wmnet [14:09:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2120.codfw.wmnet with reason: host reimage [14:09:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2118.codfw.wmnet with reason: host reimage [14:10:32] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: switch purged@codfw,ulsfo,eqsin back to codfw kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/1072753 (https://phabricator.wikimedia.org/T363210) (owner: 10Vgutierrez) [14:11:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2118.codfw.wmnet with reason: host reimage [14:11:59] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374422#10144401 (10Jhancock.wm) shipping has gone awry. 
will update when it's in hand [14:15:28] (03PS9) 10Fabfur: cache:haproxykafka: first stub classes to allow prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) [14:17:29] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.transfer-purged-positions rolling custom on P{cp2036*} and A:cp [14:18:20] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2119.codfw.wmnet with OS bullseye [14:18:20] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2119.codfw.wmnet [14:19:13] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.transfer-purged-positions (exit_code=0) rolling custom on P{cp2036*} and A:cp [14:19:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1172.eqiad.wmnet with reason: Schema change (T367856) [14:19:32] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:19:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1172.eqiad.wmnet with reason: Schema change (T367856) [14:20:49] (03CR) 10JHathaway: [C:03+2] puppet8: add explicit typecast [puppet] - 10https://gerrit.wikimedia.org/r/1072301 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [14:20:50] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [14:21:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10144446 (10jhathaway) [14:23:14] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM re: prometheus bits" [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [14:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:25:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10144482 (10phaultfinder) [14:30:02] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2118.codfw.wmnet with OS bullseye [14:31:07] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10144491 (10Ladsgroup) [14:31:42] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10144496 (10Ladsgroup) [14:31:53] (03CR) 10Ssingh: [C:03+2] P:ntp and nagios_core: add new command ntp_check_peer_and_stratum [puppet] - 10https://gerrit.wikimedia.org/r/1072276 (owner: 10Ssingh) [14:32:49] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.transfer-purged-positions rolling custom on P{cp2035*} and A:cp [14:33:17] !log homer cr*codfw* commit 'T372878' [14:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:20] T372878: Re-IP wikikube servers 
in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:33:24] !log homer lsw1-a6-codfw* commit 'T372878' [14:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:42] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.transfer-purged-positions (exit_code=0) rolling custom on P{cp2035*} and A:cp [14:35:34] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10144500 (10Clement_Goubert) Logistics... Thanks for the update! [14:35:38] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10144503 (10Ladsgroup) The ssh key you provided here is different the existi... [14:39:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:16] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2117.codfw.wmnet with OS bullseye [14:39:17] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2117.codfw.wmnet [14:40:46] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2118.codfw.wmnet [14:41:25] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 301, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:39] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2118.codfw.wmnet [14:41:40] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node 
(exit_code=99) Renumbering for host wikikube-worker2118.codfw.wmnet [14:42:36] (03CR) 10Vgutierrez: [C:03+1] cache:haproxykafka: first stub classes to allow prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [14:43:22] (03PS1) 10Ladsgroup: admin: Add Cyndywikime to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/1072758 (https://phabricator.wikimedia.org/T374595) [14:44:11] (03CR) 10CI reject: [V:04-1] admin: Add Cyndywikime to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/1072758 (https://phabricator.wikimedia.org/T374595) (owner: 10Ladsgroup) [14:44:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2120.codfw.wmnet with OS bullseye [14:47:29] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 383, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:48:12] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [14:48:40] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [14:50:29] (03PS1) 10DCausse: cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072759 [14:50:59] (03PS2) 10Ladsgroup: admin: Add Cyndywikime to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/1072758 (https://phabricator.wikimedia.org/T374595) [14:51:15] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove old parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/1072737 (https://phabricator.wikimedia.org/T359387) (owner: 10Muehlenhoff) [14:51:29] (03CR) 10Alexandros Kosiaris: [C:03+2] labs-private: Remove parsoid stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1072738 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:51:31] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] labs-private: Remove parsoid 
stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1072738 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [14:51:36] (03CR) 10Fabfur: [C:03+2] cache:haproxykafka: first stub classes to allow prometheus scraping [puppet] - 10https://gerrit.wikimedia.org/r/1072719 (https://phabricator.wikimedia.org/T374696) (owner: 10Fabfur) [14:51:49] (03CR) 10CI reject: [V:04-1] admin: Add Cyndywikime to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/1072758 (https://phabricator.wikimedia.org/T374595) (owner: 10Ladsgroup) [14:53:34] (03Abandoned) 10Ladsgroup: admin: Add Cyndywikime to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/1072758 (https://phabricator.wikimedia.org/T374595) (owner: 10Ladsgroup) [14:54:56] (03CR) 10Cwhite: "Seems like there's a local statsite instance currently in use. Any objections to using it rather than the main one?" [puppet] - 10https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [14:57:03] (03CR) 10Cwhite: "We'll need to coordinate a zuul restart to activate this. Rollback is a revert of this patch. Does someone from releng want to be involv" [puppet] - 10https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [14:58:00] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: LDAP access to the wmf group for Cyndywikime - https://phabricator.wikimedia.org/T374595#10144551 (10Ladsgroup) You seem to be already in wmf ldap group? 
https://ldap.toolforge.org/user/cyndywikime [15:01:10] (03PS1) 10Scott French: kubernetes: re-name / IP mw231[345] [puppet] - 10https://gerrit.wikimedia.org/r/1072762 (https://phabricator.wikimedia.org/T372878) [15:02:32] (03CR) 10Hnowlan: [C:03+1] kubernetes: re-name / IP mw231[345] [puppet] - 10https://gerrit.wikimedia.org/r/1072762 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [15:02:32] FIRING: [3x] KubernetesCalicoDown: wikikube-worker2114.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:04:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10144567 (10Jclark-ctr) 05Open→03Resolved [15:12:43] !log homer lsw1-a6-codfw* commit T372878 [15:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [15:14:03] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2313.codfw.wmnet [15:14:40] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2313.codfw.wmnet [15:15:06] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:15:08] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2314.codfw.wmnet [15:15:41] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2314.codfw.wmnet [15:16:04] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2315.codfw.wmnet [15:16:37] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2315.codfw.wmnet [15:16:49] (03PS1) 10Vgutierrez: sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - 10https://gerrit.wikimedia.org/r/1072763 [15:17:18] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2120.codfw.wmnet [15:17:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2120.codfw.wmnet [15:17:21] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2120.codfw.wmnet [15:17:32] FIRING: [3x] KubernetesCalicoDown: wikikube-worker2114.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:17:41] (03CR) 10Scott French: [C:03+2] kubernetes: re-name / IP mw231[345] [puppet] - 10https://gerrit.wikimedia.org/r/1072762 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [15:19:47] (03PS2) 10Vgutierrez: sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - 10https://gerrit.wikimedia.org/r/1072763 [15:20:06] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:53] (03PS3) 10Vgutierrez: sre.cdn.transfer-purged-positions: Do not use 
transfer_consumer_position [cookbooks] - https://gerrit.wikimedia.org/r/1072763 [15:22:11] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from mw2313 to wikikube-worker2121 [15:22:17] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2014.codfw.wmnet [15:22:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2014.codfw.wmnet [15:22:32] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:22:32] RESOLVED: [3x] KubernetesCalicoDown: wikikube-worker2114.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:23:07] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2111.codfw.wmnet [15:23:07] !log akosiaris@cumin1002 END (ERROR) - Cookbook sre.k8s.pool-depool-node (exit_code=97) pool for host wikikube-worker2111.codfw.wmnet [15:23:16] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2114.codfw.wmnet [15:23:18] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2114.codfw.wmnet [15:23:23] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2115.codfw.wmnet [15:23:25] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2115.codfw.wmnet [15:23:30] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2116.codfw.wmnet [15:23:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2116.codfw.wmnet [15:23:37] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2117.codfw.wmnet [15:23:39] !log akosiaris@cumin1002 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2117.codfw.wmnet [15:23:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2118.codfw.wmnet [15:23:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2118.codfw.wmnet [15:23:50] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2119.codfw.wmnet [15:23:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2119.codfw.wmnet [15:23:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:57] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2120.codfw.wmnet [15:23:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2120.codfw.wmnet [15:24:26] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:25:07] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10144586 (phaultfinder) [15:26:08] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2313 to wikikube-worker2121 - 
swfrench@cumin2002" [15:26:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2313 to wikikube-worker2121 - swfrench@cumin2002" [15:26:32] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:26:33] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2121 [15:26:45] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2121 [15:27:25] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2313 to wikikube-worker2121 [15:28:12] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from mw2314 to wikikube-worker2122 [15:28:34] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:28:40] (CR) JHathaway: [C:+2] puppet8: replace to_pson with to_json [puppet] - https://gerrit.wikimedia.org/r/1071962 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [15:31:53] (PS4) Vgutierrez: sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - https://gerrit.wikimedia.org/r/1072763 [15:32:04] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2314 to wikikube-worker2122 - swfrench@cumin2002" [15:32:33] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2314 to wikikube-worker2122 - swfrench@cumin2002" [15:32:33] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:32:35] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2122 [15:32:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2315:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2315 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:32:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2122 [15:33:30] (CR) JHathaway: [V:+1 C:+2] puppet8: ensure kerberos keytab type is binary [puppet] - https://gerrit.wikimedia.org/r/1072593 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [15:33:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2314 to wikikube-worker2122 [15:34:10] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from mw2315 to wikikube-worker2123 [15:34:31] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:37:46] (Abandoned) JHathaway: Revert "P:tlsproxy::instance: Drop numa_networking global" [puppet] - https://gerrit.wikimedia.org/r/1072290 (owner: JHathaway) [15:37:53] (PS1) Scott French: mw-debug: add initial "next" release (attempt 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1072764 (https://phabricator.wikimedia.org/T372604) [15:38:06] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2315 to wikikube-worker2123 - swfrench@cumin2002" [15:38:09] (Abandoned) JHathaway: WIP: test pcc do not merge [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway) [15:38:15] (CR) Filippo Giunchedi: [C:+1] zuul: set statsd-exporter to relay to local statsite instance [puppet] - https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089) (owner: Cwhite) [15:38:23] (CR) Filippo Giunchedi: [C:+1] zuul: send stats to prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/1072633 (https://phabricator.wikimedia.org/T233089) (owner: Cwhite) [15:38:42] 
!log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2315 to wikikube-worker2123 - swfrench@cumin2002" [15:38:42] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:43] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2123 [15:38:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2123 [15:39:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2315 to wikikube-worker2123 [15:40:36] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Drop PSON support - https://phabricator.wikimedia.org/T372667#10144667 (jhathaway) [15:43:33] (CR) CI reject: [V:-1] sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - https://gerrit.wikimedia.org/r/1072763 (owner: Vgutierrez) [15:45:35] (PS5) Vgutierrez: sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - https://gerrit.wikimedia.org/r/1072763 [15:46:18] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2121.codfw.wmnet wikikube-worker2122.codfw.wmnet wikikube-worker2123.codfw.wmnet on all recursors [15:46:21] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2121.codfw.wmnet wikikube-worker2122.codfw.wmnet wikikube-worker2123.codfw.wmnet on all recursors [15:48:22] (CR) Scott French: "Alexandros, since you kindly reviewed the original patch, if you could take a look at this attempt #2, that would be greatly appreciated!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1072764 (https://phabricator.wikimedia.org/T372604) (owner: Scott French) [15:49:25] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10144676 (Jhancock.wm) This is where to find the settings in the bios. {F57505755} once in the bios the ports will be labeled as such (they aren't intuitively named)... [15:49:31] (CR) Ebernhardson: [C:+1] cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only [deployment-charts] - https://gerrit.wikimedia.org/r/1072759 (owner: DCausse) [15:50:06] !log swfrench@cumin2002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2121.codfw.wmnet [15:50:36] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2121.codfw.wmnet with OS bullseye [15:50:48] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2121 [15:51:05] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [15:51:17] (PS1) JHathaway: puppet8: ensure gpg keyring type is binary [puppet] - https://gerrit.wikimedia.org/r/1072768 (https://phabricator.wikimedia.org/T372667) [15:51:29] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072768 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [15:52:28] (CR) Bking: [C:+2] wdqs: do not add categories on main and scholarly endpoints [puppet] - https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [15:53:05] (PS6) Vgutierrez: sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - https://gerrit.wikimedia.org/r/1072763 [15:53:21] (CR) Bking: [C:+2] wdqs: fix CATEGORY_ENDPOINT env var [puppet] - https://gerrit.wikimedia.org/r/1071877 (https://phabricator.wikimedia.org/T374016) (owner: DCausse) [15:54:08] (CR) DCausse: 
[C:+2] cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only [deployment-charts] - https://gerrit.wikimedia.org/r/1072759 (owner: DCausse) [15:54:10] (PS1) JHathaway: puppet8: ensure kerberos keytab type is binary [puppet] - https://gerrit.wikimedia.org/r/1072769 (https://phabricator.wikimedia.org/T372667) [15:54:32] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072769 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [15:55:11] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2121 - swfrench@cumin2002" [15:55:16] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2121 - swfrench@cumin2002" [15:55:16] (Merged) jenkins-bot: cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only [deployment-charts] - https://gerrit.wikimedia.org/r/1072759 (owner: DCausse) [15:55:16] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:17] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2121.codfw.wmnet 162.16.192.10.in-addr.arpa 2.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:55:20] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2121.codfw.wmnet 162.16.192.10.in-addr.arpa 2.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:55:21] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2121 [15:55:28] (CR) Ssingh: [C:+1] sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - https://gerrit.wikimedia.org/r/1072763 (owner: Vgutierrez) 
[15:55:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2121 [15:55:36] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2121 [15:56:40] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:56:47] !log swfrench@cumin2002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2122.codfw.wmnet [15:57:18] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2122.codfw.wmnet with OS bullseye [15:57:26] ops-eqiad, SRE, collaboration-services, DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10144691 (Dzahn) I ran the decom cookbook (without and with --force) but it errors out with ` spicerack.netbox.NetboxHostNotFoundError: gerrit... [15:57:28] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:57:30] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2122 [15:57:35] (PS1) JHathaway: puppet8: ensure dns cookie type is binary [puppet] - https://gerrit.wikimedia.org/r/1072770 (https://phabricator.wikimedia.org/T372667) [15:57:45] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072770 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [15:57:59] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [16:01:04] (PS1) DCausse: Revert "cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only" [deployment-charts] - https://gerrit.wikimedia.org/r/1072771 [16:01:44] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2122 - swfrench@cumin2002" [16:01:50] !log swfrench@cumin2002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2122 - swfrench@cumin2002" [16:01:50] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:51] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2122.codfw.wmnet 163.16.192.10.in-addr.arpa 3.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:01:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2122.codfw.wmnet 163.16.192.10.in-addr.arpa 3.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:01:55] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2122 [16:02:05] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2122 [16:02:05] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2122 [16:02:23] (CR) DCausse: [C:+2] Revert "cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only" [deployment-charts] - https://gerrit.wikimedia.org/r/1072771 (owner: DCausse) [16:02:27] (PS8) Bking: wdqs: common module and profile should not define categories_endpoint [puppet] - https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:02:30] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:02:41] (CR) CI reject: [V:-1] wdqs: common module and profile should not define categories_endpoint [puppet] - https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:03:05] !log swfrench@cumin2002 START - Cookbook sre.k8s.renumber-node Renumbering 
for host wikikube-worker2123.codfw.wmnet [16:03:23] (Merged) jenkins-bot: Revert "cirrus-streaming-updater: test resolve_canonical_bootstrap_servers_only" [deployment-charts] - https://gerrit.wikimedia.org/r/1072771 (owner: DCausse) [16:03:32] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2123.codfw.wmnet with OS bullseye [16:03:43] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2123 [16:05:29] (PS9) Bking: wdqs: common module and profile should not define categories_endpoint [puppet] - https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:05:35] (PS1) JHathaway: puppet8: ensure java ssh key type is binary [puppet] - https://gerrit.wikimedia.org/r/1072773 (https://phabricator.wikimedia.org/T372667) [16:05:40] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:05:47] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072773 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:05:56] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [16:06:02] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:06:08] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:07:18] (CR) Vgutierrez: [C:+2] sre.cdn.transfer-purged-positions: Do not use transfer_consumer_position [cookbooks] - https://gerrit.wikimedia.org/r/1072763 (owner: Vgutierrez) [16:07:31] !log performing friday deployment of jenkins-deploy (releases server) to fix broken job (see https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/81) [16:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:42] !log dduvall@deploy1003 Started deploy 
[releng/jenkins-deploy@e8b4e0b] (releasing): (no justification provided) [16:08:25] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@e8b4e0b] (releasing): (no justification provided) (duration: 00m 43s) [16:08:58] (CR) Bking: [C:+2] wdqs: common module and profile should not define categories_endpoint [puppet] - https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:09:49] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2123 - swfrench@cumin2002" [16:09:53] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2123 - swfrench@cumin2002" [16:09:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:54] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2123.codfw.wmnet 164.16.192.10.in-addr.arpa 4.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:09:57] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2123.codfw.wmnet 164.16.192.10.in-addr.arpa 4.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:09:58] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2123 [16:10:12] (PS7) DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) [16:10:30] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2123 [16:10:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2123 
[16:12:07] (CR) Bking: wdqs: do not add categories on main and scholarly endpoints [puppet] - https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:12:12] (CR) Bking: [V:+2 C:+2] wdqs: do not add categories on main and scholarly endpoints [puppet] - https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) (owner: DCausse) [16:12:37] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2121.codfw.wmnet with reason: host reimage [16:13:02] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Drop PSON support - https://phabricator.wikimedia.org/T372667#10144748 (jhathaway) [16:13:24] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.transfer-purged-positions rolling custom on P{cp2027*} and A:cp [16:15:24] (PS1) JHathaway: puppet8: ensure java keystore type is binary [puppet] - https://gerrit.wikimedia.org/r/1072777 (https://phabricator.wikimedia.org/T372667) [16:15:32] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.transfer-purged-positions (exit_code=0) rolling custom on P{cp2027*} and A:cp [16:15:35] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072777 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:16:00] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2121.codfw.wmnet with reason: host reimage [16:16:59] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10144759 (Jhancock.wm) more exposition! in the case of this particular configuration, these are the names of the ports on the server in the Advanced option menu. The o... 
[16:18:29] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.transfer-purged-positions rolling custom on P{cp[2028-2034,2038-2042].codfw.wmnet,cp[5017,5019-5020,5023,5027-5028,5030].eqsin.wmnet,cp[4038-4052].ulsfo.wmnet} and A:cp [16:18:47] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2122.codfw.wmnet with reason: host reimage [16:21:58] (CR) JHathaway: [C:+2] puppet8: ensure gpg keyring type is binary [puppet] - https://gerrit.wikimedia.org/r/1072768 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:23:16] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2122.codfw.wmnet with reason: host reimage [16:25:09] (CR) JHathaway: [C:+2] puppet8: ensure kerberos keytab type is binary [puppet] - https://gerrit.wikimedia.org/r/1072769 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:27:24] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2123.codfw.wmnet with reason: host reimage [16:27:58] SAL is down it seems, the web interface [16:29:38] aw, sounds like https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Tools.sal/SAL&diff=prev&oldid=2225689 didn’t work then 😔 [16:29:39] * Lucas_WMDE looks [16:30:34] hm, kubectl get events says “Container webservice failed liveness probe, will be restarted” 4m41s ago at least [16:30:40] and now another one 3s ago [16:31:09] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2123.codfw.wmnet with reason: host reimage [16:32:23] sukhe: better now? 
[16:32:53] thanks Lucas_WMDE <3 [16:35:18] and now for the actual reason I came back into this channel :D [16:35:24] I don’t even remember what the context for https://bash.toolforge.org/quip/HptC3JEBFFSCpsJzSng3 was [16:35:33] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2121.codfw.wmnet with OS bullseye [16:35:53] on the off chance that whoever quipped it is around… would you mind pinging me in future? I find it odd to only discover these via the list of new quips later 😅 [16:36:20] usually we take permission for sharing anything there (I wasn't the one who added it, just remarking) [16:36:21] (likewise https://bash.toolforge.org/quip/Sj3KapEBKFqumxvtIHYX, though I think I happened to see that one within a day of writing it so I still remembered the context ^^) [16:37:12] (CR) JHathaway: [C:+2] puppet8: ensure dns cookie type is binary [puppet] - https://gerrit.wikimedia.org/r/1072770 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:38:54] !log running homer lsw1-b3-codfw* commit 'T372878' [16:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:59] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:41:09] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:41:41] ^ expected - waiting on host reboot for session to come up [16:42:01] (CR) JHathaway: [C:+2] puppet8: ensure java ssh key type is binary [puppet] - https://gerrit.wikimedia.org/r/1072773 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:43:09] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:22] !log 
swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2122.codfw.wmnet with OS bullseye [16:43:56] sukhe: I was actually wondering about that and submitted https://github.com/bd808/quips/pull/31/files to document it in the quips tool itself, so feel free to reply there if you like (it sounds like what I inferred / guessed isn’t necessarily what other people think) [16:46:09] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:46:33] ops-eqiad, SRE, collaboration-services, DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10144861 (Dzahn) ` [puppetserver1001:~] $ sudo puppet node clean gerrit1004.wikimedia.org Notice: Certificate for gerrit1004.wikimedia.org has b... [16:46:45] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2121.codfw.wmnet [16:46:47] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2121.codfw.wmnet [16:46:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2121.codfw.wmnet [16:47:03] (CR) JHathaway: [C:+2] puppet8: ensure java keystore type is binary [puppet] - https://gerrit.wikimedia.org/r/1072777 (https://phabricator.wikimedia.org/T372667) (owner: JHathaway) [16:48:09] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:01] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Drop PSON support - https://phabricator.wikimedia.org/T372667#10144879 (jhathaway) [16:50:57] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host wikikube-worker2123.codfw.wmnet with OS bullseye [16:52:32] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2122.codfw.wmnet [16:52:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2122.codfw.wmnet [16:52:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2122.codfw.wmnet [16:53:25] FIRING: SystemdUnitFailed: wdqs-categories.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:25] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:55:07] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10144911 (phaultfinder) [16:55:41] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on wdqs[2021-2024].codfw.wmnet with reason: T373791 [16:55:45] T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791 [16:55:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wdqs[2021-2024].codfw.wmnet with reason: T373791 [16:56:14] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on wdqs[1021-1024].eqiad.wmnet with reason: 
T373935 [16:56:18] T373935: WDQS graph split: cleanup monitoring/alerting now that we are in production - https://phabricator.wikimedia.org/T373935 [16:56:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wdqs[1021-1024].eqiad.wmnet with reason: T373935 [16:57:49] !log running homer cr*codfw* commit 'T372878' [16:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:53] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:59:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 295, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:59:07] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2123.codfw.wmnet [16:59:09] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2123.codfw.wmnet [16:59:10] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2123.codfw.wmnet [17:02:31] (PS1) Ahmon Dancy: gitlab-settings: v1.7.0 for bugfix [puppet] - https://gerrit.wikimedia.org/r/1072785 [17:03:05] (CR) CI reject: [V:-1] gitlab-settings: v1.7.0 for bugfix [puppet] - https://gerrit.wikimedia.org/r/1072785 (owner: Ahmon Dancy) [17:03:37] (PS2) Ahmon Dancy: gitlab-settings: v1.7.0 for bugfix [puppet] - https://gerrit.wikimedia.org/r/1072785 [17:06:01] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 377, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:06] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10144964 (phaultfinder) [17:11:06] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374733 
(10Scott_French) 03NEW [17:11:11] (03PS1) 10Ahmon Dancy: gitlab: Sync people/wmde GitLab group w/ wmde LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1072786 [17:21:50] (03PS2) 10Alexandros Kosiaris: mediawiki-image-download: Drop to 5% [puppet] - 10https://gerrit.wikimedia.org/r/1070550 (https://phabricator.wikimedia.org/T366778) [17:29:43] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374733#10145018 (10akosiaris) [17:30:16] (03CR) 10Ssingh: "For some reason, PCC is running on old cp hosts (decommissioned for more than a year). Beyond that, I am still checking why this a NOOP an" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [17:38:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.transfer-purged-positions (exit_code=0) rolling custom on P{cp[2028-2034,2038-2042].codfw.wmnet,cp[5017,5019-5020,5023,5027-5028,5030].eqsin.wmnet,cp[4038-4052].ulsfo.wmnet} and A:cp [17:38:40] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645#10145027 (10EBernhardson) This would have been useful to debug T374662, aggregating the times out of elasticsearch is a bit hard as... [17:38:45] \o/( [17:42:20] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645#10145036 (10CDanis) Similar but different: {T304373} [17:49:12] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Vacation coverage for Katie Francis - https://phabricator.wikimedia.org/T374673#10145270 (10Dzahn) a:05Dzahn→03None [17:52:38] (03CR) 10JHathaway: "it may be an issue with the regex Hosts line? I'll file a task to look into it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [17:53:49] (03CR) 10Dzahn: [C:03+2] gitlab-settings: v1.7.0 for bugfix [puppet] - 10https://gerrit.wikimedia.org/r/1072785 (owner: 10Ahmon Dancy) [17:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [17:55:14] (03CR) 10Dzahn: [C:03+2] gitlab: Sync people/wmde GitLab group w/ wmde LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1072786 (owner: 10Ahmon Dancy) [17:55:51] Thanks mutante! [17:56:29] yw [17:59:10] (03CR) 10Scott French: [C:03+1] services: add new poolcounter nodes to MW configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072717 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [18:00:42] (03CR) 10Scott French: [C:03+1] Swap poolcounter2003 with poolcounter2005 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072206 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [18:10:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10145290 (10phaultfinder) [18:10:40] (03CR) 10Jforrester: Improve $wgFooterIcons override, remove $wmgWikimediaIcon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [18:11:46] (03PS4) 10Ssingh: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:12:54] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3986/console" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:16:03] 
(03PS3) 10Jasmine: icinga: add jasmine to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/1071964 [18:22:54] (03CR) 10Jasmine: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1071964 (owner: 10Jasmine) [18:23:29] (03PS5) 10Ssingh: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:23:50] (03CR) 10Scott French: "Thanks, Moritz! I'll keep you posted on an ETA for when this is happening." [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [18:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:24:18] (03CR) 10RLazarus: [C:03+2] icinga: add jasmine to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/1071964 (owner: 10Jasmine) [18:24:41] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3987/console" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:27:50] (03CR) 10Ssingh: [V:03+1] "Hi Jesse: After looking at this a bit more deeply, we got lucky here when the cp-specific block in realm.pp was removed. That would have r" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:28:53] (03PS6) 10Ssingh: haproxy: re-add numa support [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:28:54] (03CR) 10Ssingh: "Commit message updated." 
[puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:56:32] (03CR) 10JHathaway: [C:03+1] "makes sense, thanks for the careful review and updated patch" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [18:57:00] (03CR) 10JHathaway: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [19:05:02] (03PS1) 10Scott French: [DNM] service: move mwdebug-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1072796 (https://phabricator.wikimedia.org/T372604) [19:06:00] (03CR) 10Ssingh: "Will merge Monday morning, to be extra sure (even if it is a NOOP)" [puppet] - 10https://gerrit.wikimedia.org/r/1072566 (https://phabricator.wikimedia.org/T350008) (owner: 10JHathaway) [19:07:55] (03PS1) 10JHathaway: puppetserver: remove empty hiera data files [puppet] - 10https://gerrit.wikimedia.org/r/1072797 [19:08:37] (03PS1) 10Scott French: [DNM] service: move mwdebug-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1072798 (https://phabricator.wikimedia.org/T372604) [19:16:49] (03CR) 10JHathaway: [C:03+2] puppetserver: remove empty hiera data files [puppet] - 10https://gerrit.wikimedia.org/r/1072797 (owner: 10JHathaway) [19:39:56] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:45:52] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:45:56] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:50:08] 
10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10145515 (10phaultfinder) [19:52:28] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wdqs[1021-1024].eqiad.wmnet [19:52:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs[1021-1024].eqiad.wmnet [19:52:38] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wdqs[2021-2024].codfw.wmnet [19:52:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs[2021-2024].codfw.wmnet [20:00:55] RESOLVED: SystemdUnitFailed: wdqs-categories.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:20] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10145556 (10jhathaway) [20:25:13] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T374573#10145582 (10phaultfinder) [20:27:28] (03CR) 10Jdlrobson: "Thank you!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072623 (owner: 10Ebrahim) [20:40:18] (03CR) 10Bartosz Dziewoński: Improve $wgFooterIcons override, remove $wmgWikimediaIcon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071712 (owner: 10Bartosz Dziewoński) [20:53:25] FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:59:31] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [21:03:57] (03PS1) 10Dwisehaupt: frack: remove fraban2001 from dns for decommissioning [dns] - 10https://gerrit.wikimedia.org/r/1072812 (https://phabricator.wikimedia.org/T374741) [21:06:14] (03CR) 10Cwhite: [C:03+2] zuul: set statsd-exporter to relay to local statsite instance [puppet] - 10https://gerrit.wikimedia.org/r/1072632 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [21:07:22] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10145782 (10jhathaway) [21:08:23] (03PS1) 10Dwisehaupt: icinga: remove frban2001 for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1072813 (https://phabricator.wikimedia.org/T374741) [21:11:18] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [21:11:31] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10145786 (10jhathaway) 05Open→03Resolved 
a:03jhathaway All known uses of pson have been removed. However, since we cannot disable support on 7.23, I don't think there is anyth... [21:14:41] !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decommissioning frban2001 - dwisehaupt@cumin1002" [21:14:45] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decommissioning frban2001 - dwisehaupt@cumin1002" [21:14:46] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:21:56] (03CR) 10Dzahn: [C:03+2] vrts: switch inactive host vrts2001 to nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1072313 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:24:44] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on vrts2001.codfw.wmnet with reason: nftables migration [21:24:47] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on vrts2001.codfw.wmnet with reason: nftables migration [21:25:51] (03CR) 10Dzahn: [C:03+2] "looked good, rebooted" [puppet] - 10https://gerrit.wikimedia.org/r/1072313 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:54:42] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [22:23:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/1072823 [23:38:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1072823 (owner: 10TrainBranchBot)