[00:05:04] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2308 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[00:05:38] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release 20241023
[00:05:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:06:04] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 109244 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[00:06:34] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082580 (owner: 10TrainBranchBot)
[00:08:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082582
[00:08:33] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082582 (owner: 10TrainBranchBot)
[00:10:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:14:07] This was a needed version upgrade. The part that it alerts is kind of a bug but known.
[00:14:17] It shouldn't happen though when the cookbook is used for this.
[00:14:57] it's up and patched is what matters more :)
[00:14:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:15:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:21:38] PROBLEM - Host gerrit2003 is DOWN: PING CRITICAL - Packet loss = 100%
[00:22:24] !log gerrit2003 rebooting for T338470
[00:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:48] T338470: Rename gerrit2 unix user to gerrit and assign a fixed uid - https://phabricator.wikimedia.org/T338470
[00:22:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:23:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:23:52] RECOVERY - Host gerrit2003 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[00:26:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:26:40] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on gerrit2003.wikimedia.org with reason: reboot
[00:26:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gerrit2003.wikimedia.org with reason: reboot
[00:32:22] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[00:42:42] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10256987 (10Dzahn)
[00:43:03] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10256989 (10Dzahn)
[00:43:15] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10256990 (10Dzahn)
[00:44:28] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on gerrit2003.wikimedia.org with reason: in setup and T338470
[00:44:36] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082582 (owner: 10TrainBranchBot)
[00:44:42] T338470: Rename gerrit2 unix user to gerrit and assign a fixed uid - https://phabricator.wikimedia.org/T338470
[00:44:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on gerrit2003.wikimedia.org with reason: in setup and T338470
[00:46:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:01:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye
[01:02:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10257011 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye
[01:19:37] (03PS4) 10MacFan4000: ExtensionDistributor: Mark 1.43 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322)
[01:21:31] 10SRE-swift-storage, 10ConfirmEdit (CAPTCHA extension), 07Wikimedia-production-error: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T318941#10257037 (10Reedy) p:05Triage→03Low {F57638173 size=full} Still happening, but pretty low... October 9, 2024 had a spike https://...
[01:30:02] 10SRE-swift-storage, 06Commons, 10ConfirmEdit (CAPTCHA extension), 06Editing-team, and 3 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245#10257100 (10Reedy) I suspect we should see if the WMF captch...
[01:30:25] (03Restored) 10Reedy: Revert "Workaround for GenerateFancyCaptcha not running as expected in prod" [puppet] - 10https://gerrit.wikimedia.org/r/606021 (https://phabricator.wikimedia.org/T230245) (owner: 10Reedy)
[01:35:53] (03PS1) 10Reedy: AutoLoader: Use require_once rather than require [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082585 (https://phabricator.wikimedia.org/T378006)
[01:36:06] (03PS1) 10Reedy: AutoLoader: Use require_once rather than require [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082586 (https://phabricator.wikimedia.org/T378006)
[01:51:22] (03CR) 10Varnent: "I can leave the icon file - I mean to be fair it's not like a size where disk space savings are very notable. That said, given the audienc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) (owner: 10Varnent)
[02:01:29] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:13:50] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:14:48] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:15:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye
[02:16:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10257288 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye executed...
[02:19:52] PROBLEM - Hadoop NodeManager on an-worker1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:27:50] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:33:48] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:40:52] PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:41:52] RECOVERY - Hadoop NodeManager on an-worker1163 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:08:52] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:32:06] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082597 (https://phabricator.wikimedia.org/T219903)
[03:34:38] (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082598 (https://phabricator.wikimedia.org/T344471)
[03:35:11] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082597 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza)
[03:36:19] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082597 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza)
[03:43:45] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082598 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza)
[03:44:59] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082598 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza)
[03:47:48] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:49:46] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:53:55] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[03:54:11] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[03:54:13] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[03:54:32] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[03:54:33] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[03:54:47] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[03:54:58] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[03:55:18] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[03:55:19] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[03:55:40] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[03:55:42] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[03:56:00] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[04:32:22] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[04:48:58] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:49:10] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:57:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance
[04:58:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance
[04:58:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[04:58:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[04:58:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70573 and previous config saved to /var/cache/conftool/dbconfig/20241024-045830-arnaudb.json
[04:58:35] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[05:28:15] (03PS4) 10Pppery: Configure settings for annwiki, nrwiki, mywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081267 (https://phabricator.wikimedia.org/T375102)
[05:58:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70574 and previous config saved to /var/cache/conftool/dbconfig/20241024-055856-arnaudb.json
[05:59:01] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T0600)
[06:00:05] marostegui, Amir1, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T0600).
[06:01:30] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10257707 (10phaultfinder)
[06:07:51] (03PS2) 10Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287)
[06:08:15] (03CR) 10Abijeet Patro: "Hadn't published the change. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro)
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:14:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70575 and previous config saved to /var/cache/conftool/dbconfig/20241024-061403-arnaudb.json
[06:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10257713 (10phaultfinder)
[06:29:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70576 and previous config saved to /var/cache/conftool/dbconfig/20241024-062910-arnaudb.json
[06:32:30] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Idle - Tele2, AS1257/IPv4: Idle - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:33:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:42:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70577 and previous config saved to /var/cache/conftool/dbconfig/20241024-064418-arnaudb.json
[06:44:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance
[06:44:33] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:44:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance
[06:44:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70578 and previous config saved to /var/cache/conftool/dbconfig/20241024-064440-arnaudb.json
[06:53:28] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2039 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082490 (owner: 10Muehlenhoff)
[06:53:36] (03PS5) 10Slyngshede: Permissions: Cleanup code and reduce LDAP queries. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675
[06:53:51] (03CR) 10Slyngshede: Permissions: Cleanup code and reduce LDAP queries. (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 (owner: 10Slyngshede)
[06:58:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 (owner: 10Slyngshede)
[06:59:21] (03CR) 10Slyngshede: [C:03+2] Permissions: Cleanup code and reduce LDAP queries. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 (owner: 10Slyngshede)
[07:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:04] (03Merged) 10jenkins-bot: Permissions: Cleanup code and reduce LDAP queries. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 (owner: 10Slyngshede)
[07:04:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet
[07:06:32] PROBLEM - Host ldap-maint2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:10:30] RECOVERY - Host ldap-maint2001 is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms
[07:15:39] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2039.codfw.wmnet
[07:16:30] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:18:32] (03CR) 10Tacsipacsi: "I see." [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro)
[07:21:12] (03CR) 10Ayounsi: [C:03+1] Update static reverse PTR records for frack records codfw [dns] - 10https://gerrit.wikimedia.org/r/1082525 (https://phabricator.wikimedia.org/T374176) (owner: 10Cathal Mooney)
[07:22:17] (03PS5) 10Elukey: tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 (owner: 10Volans)
[07:22:18] (03PS2) 10Elukey: tox: add Jenkins settings to reduce its execution time [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082524
[07:22:31] (03CR) 10Elukey: [C:03+1] tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 (owner: 10Volans)
[07:22:57] (03CR) 10DCausse: [C:03+1] "lgtm, should we keep this patch open while all the plugins are ready?" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: 10Ebernhardson)
[07:26:09] (03PS1) 10Muehlenhoff: Switch ganeti2041-ganeti2044 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082691 (https://phabricator.wikimedia.org/T376594)
[07:32:32] !log arnaudb@cumin1002 START - Cookbook sre.mysql.sanitize-pii Setting up permissions and view database PII for wikis annwiki in section s5
[07:32:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-pii (exit_code=0) Setting up permissions and view database PII for wikis annwiki in section s5
[07:32:51] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2041-ganeti2044 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082691 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff)
[07:33:58] !log arnaudb@cumin1002 START - Cookbook sre.mysql.sanitize-pii Checking PII for wikis annwiki in section s5
[07:34:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-pii (exit_code=0) Checking PII for wikis annwiki in section s5
[07:37:59] (03PS1) 10Muehlenhoff: Switch ganeti2035 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082701
[07:43:40] (03CR) 10Elukey: "5 mins vs 14 mins :D" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082524 (owner: 10Elukey)
[07:45:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70579 and previous config saved to /var/cache/conftool/dbconfig/20241024-074506-arnaudb.json
[07:45:26] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[07:45:46] (03PS1) 10Slyngshede: New version, 0.1.0. [software/bitu] - 10https://gerrit.wikimedia.org/r/1082702
[07:46:45] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10257838 (10ayounsi) It's quite a big task overall, splitting it into several well defined sub-tasks will make it easier to accomplish. For example...
[07:51:20] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2035 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082701 (owner: 10Muehlenhoff)
[07:51:21] (03PS2) 10Slyngshede: New version, 0.1.0. [software/bitu] - 10https://gerrit.wikimedia.org/r/1082702
[07:52:08] (03PS1) 10Elukey: sre.hosts.reimage: fix reimage failed sentence [cookbooks] - 10https://gerrit.wikimedia.org/r/1082703
[07:53:21] (03CR) 10Ayounsi: [C:03+2] vlan migration report: add one example host per group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 (owner: 10Ayounsi)
[07:55:17] (03Merged) 10jenkins-bot: vlan migration report: add one example host per group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 (owner: 10Ayounsi)
[07:55:26] !log restart ircstream on irc.wikimedia.org to remove a performance experiment
[07:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:34] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[07:56:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[07:57:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
[07:57:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10257859 (10ops-monitoring-bot) Draining ganeti2035.codfw.wmnet of running VMs
[07:58:58] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10257860 (10MoritzMuehlenhoff)
[07:59:46] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:00:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:00:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70580 and previous config saved to /var/cache/conftool/dbconfig/20241024-080013-arnaudb.json
[08:00:22] 06SRE, 06Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#10257862 (10Peachey88)
[08:00:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
[08:01:06] !log installing libssh2 security updates
[08:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:09] (03CR) 10Ayounsi: [C:03+2] Netbox: run the vlan_migration report every 2 hours [puppet] - 10https://gerrit.wikimedia.org/r/1081071 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[08:02:38] (03PS1) 10Kevin Bazira: ml-services: normalize text input in langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082704 (https://phabricator.wikimedia.org/T377751)
[08:05:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
[08:05:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
[08:07:23] 06SRE, 06Infrastructure-Foundations, 10Mail, 10vrts, 10Znuny: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10257878 (10Peachey88)
[08:09:55] (03PS1) 10Ayounsi: Netbox: vlan_migration timer, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1082706
[08:10:33] (03PS32) 10Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146)
[08:11:29] FIRING: SystemdUnitFailed: netbox_report_vlan_migration_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:33] (03PS1) 10Elukey: sre.hosts.reimage: clear puppetdb's state upon rollback (if needed) [cookbooks] - 10https://gerrit.wikimedia.org/r/1082707 (https://phabricator.wikimedia.org/T371400)
[08:13:08] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master
[08:13:25] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
[08:13:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
[08:14:48] (03CR) 10Ayounsi: [C:03+2] Netbox: vlan_migration timer, fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1082706 (owner: 10Ayounsi)
[08:15:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70581 and previous config saved to /var/cache/conftool/dbconfig/20241024-081520-arnaudb.json
[08:16:36] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10257896 (10cmooney) >>! In T377996#10257838, @ayounsi wrote: > It's quite a big task overall, splitting it into several well defined sub-tasks wil...
[08:17:45] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
[08:17:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
[08:18:11] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10257898 (10ayounsi) > If we don't want to use dummy interface names I think the simple way forward is Option 1, which seems like a big improvement...
[08:18:38] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: normalize text input in langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082704 (https://phabricator.wikimedia.org/T377751) (owner: 10Kevin Bazira)
[08:19:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1082702 (owner: 10Slyngshede)
[08:19:40] (03Merged) 10jenkins-bot: ml-services: normalize text input in langid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082704 (https://phabricator.wikimedia.org/T377751) (owner: 10Kevin Bazira)
[08:21:32] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master
[08:23:10] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:23:54] !log installing bash/zsh updates from bookworm point release
[08:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:25] RESOLVED: SystemdUnitFailed: netbox_report_vlan_migration_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:55] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:30:23] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:30:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70582 and previous config saved to /var/cache/conftool/dbconfig/20241024-083027-arnaudb.json
[08:30:55] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[08:32:22] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[08:33:06] PROBLEM - MariaDB read only pc5 on pc1017 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:33:15] checking
[08:33:24] PROBLEM - MariaDB Replica SQL: pc5 on pc1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:33:27] (03PS1) 10Slyngshede: P:idp Make Redis database number configurable. [puppet] - 10https://gerrit.wikimedia.org/r/1082711 (https://phabricator.wikimedia.org/T377937)
[08:33:48] https://orchestrator.wikimedia.org/web/cluster/alias/pc5 pc5 has an issue
[08:34:00] PROBLEM - MariaDB Event Scheduler pc5 on pc1017 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[08:34:34] the host probably crashed, put the only other spare in its place
[08:34:46] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4367/console" [puppet] - 10https://gerrit.wikimedia.org/r/1082711 (https://phabricator.wikimedia.org/T377937) (owner: 10Slyngshede)
[08:34:49] pc1017 has crashed yep
[08:34:58] mariadb or the host?
[08:35:04] mariadb [08:35:08] I'm checking to restart it atm [08:35:53] https://phabricator.wikimedia.org/P70583 [08:36:00] PROBLEM - mysqld processes on pc1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:36:12] it's the same issue as https://phabricator.wikimedia.org/T375382 [08:37:01] (probably) [08:37:21] jynus: could you handle the move? it'll be faster as I never did it [08:37:24] I don't see many mw errors, so that's good [08:37:28] PROBLEM - MariaDB Replica IO: pc5 on pc1017 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:37:39] (03CR) 10Fabfur: [C:03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/1082548 (owner: 10Ssingh) [08:38:05] you can start replication, pc auto handles stale entries [08:38:33] we are not in a rush to change it if we are not presenting errors, I think you should do it [08:39:05] could I just innodb_force_recovery ? [08:39:17] it doesn't boot normally? [08:39:32] nope, it claims there is a corrupted page id [08:39:40] then yes, for pc only [08:39:40] so I guess a forced recovery would fix it [08:39:43] ack [08:39:47] we don't care about its data [08:39:58] my take indeed, thanks for the confirmation [08:39:58] must be hw issue like pc1013 [08:40:51] also double check the mariadb version :-D [08:41:00] PROBLEM - MariaDB Replica Lag: pc5 on pc1017 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:41:14] sigh [08:41:31] then I can't downgrade, right?
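The forced-recovery path agreed on above can be sketched as follows. This is a hedged outline, not what was actually run: the config path is an assumption, and level 1 is shown only because parsercache data is disposable (higher innodb_force_recovery levels skip progressively more recovery work and can destroy data).

```shell
# Assumed override path; drop in a temporary config with the lowest level.
cat > /etc/mysql/mariadb.conf.d/zz-force-recovery.cnf <<'EOF'
[mysqld]
innodb_force_recovery = 1
EOF
systemctl start mariadb
# Once the corrupted data has been dealt with (trivial here, since pc
# data is a self-healing cache), remove the override and restart clean;
# the server should not be left running with force recovery enabled.
rm /etc/mysql/mariadb.conf.d/zz-force-recovery.cnf
systemctl restart mariadb
```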
[08:41:39] we will put the other host instead [08:41:42] don't worry for now [08:41:45] sad [08:41:48] put it up, restart replication [08:41:59] will ensure mariadb version first :P [08:42:01] then we give us a break and plan [08:42:33] Aviate, Navigate, Communicate [08:43:19] manuel would be proud of me :-D [08:43:55] mariadb downgraded, will move on to replication start, I can use GTID from scratch right? [08:43:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1082150 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [08:44:36] we don't care about the position, really, as long as it is replicating [08:44:43] (for pc, ofc) [08:44:48] ack, will change master to w/ master_auto_position [08:44:54] wait [08:44:57] ? [08:45:09] did you put the instance up or it doesn't boot? [08:45:27] its on a faulty mariadb version [08:45:32] that's ok [08:45:34] so I skipped the instance fix part [08:45:36] we put the service up [08:45:42] ah? 
i'll try then wait [08:45:43] then we do the rest, that's what I meant [08:45:55] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4368/co" [puppet] - 10https://gerrit.wikimedia.org/r/1082497 (https://phabricator.wikimedia.org/T377878) (owner: 10Btullis) [08:46:06] I hope it will survive at least for some minutes [08:46:21] instance starting back [08:46:26] RECOVERY - MariaDB Replica SQL: pc5 on pc1017 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:46:28] RECOVERY - MariaDB Replica IO: pc5 on pc1017 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:46:32] ok, orchestrator looks happy [08:46:37] even with lag [08:46:46] restart replication on both when you can [08:47:00] RECOVERY - mysqld processes on pc1017 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:47:00] RECOVERY - MariaDB Replica Lag: pc5 on pc1017 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:02] multimaster needs something for this?
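Restarting replication as requested above comes down to two statements against the local instance; a minimal sketch (the sudo mysql invocation is illustrative):

```shell
sudo mysql -e "START SLAVE;"
# Verify: Slave_IO_Running and Slave_SQL_Running should both be "Yes",
# and Seconds_Behind_Master should be trending toward 0.
sudo mysql -e "SHOW SLAVE STATUS\G"
```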
[08:47:02] RECOVERY - MariaDB Event Scheduler pc5 on pc1017 is OK: Version 10.6.19-MariaDB-log, Uptime 53s, read_only: False, event_scheduler: True, 1137.90 QPS, connection latency: 0.041112s, query latency: 0.000836s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [08:47:08] RECOVERY - MariaDB read only pc5 on pc1017 is OK: Version 10.6.19-MariaDB-log, Uptime 58s, read_only: False, event_scheduler: True, 1053.85 QPS, connection latency: 0.032306s, query latency: 0.001057s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:47:14] and that's when I say, take a break, plan better, create a ticket, etc [08:47:26] we should have a few minutes now that things are up [08:47:26] haha ack indeed [08:48:01] check pc1017, may need a bump on replication too [08:48:55] the idea is to fix the outage asap, then to prevent it, assuming we will have some time for that [08:49:45] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10257925 (10LSobanski) [08:49:49] arnaudb: running start slave on pc1017, ok? [08:49:58] ack, I was unsure how to do it [08:50:00] please proceed [08:50:17] connect to it -> execute "START SLAVE;" [08:50:27] sure but I was unsure it was the right command x) [08:50:38] run "show slave status;" to make sure it is working [08:50:53] again, this is pc, we don't care too much about consistency [08:51:11] in other cases we would have recloned it right away, keep it down until properly fixed [08:51:30] yeah but between consistency caring and "breaking replication because I issued a wrong SQL command", I tried to avoid both :D [08:51:37] (03CR) 10Urbanecm: [C:03+1] "Agreed. Just wanted that to be considered. LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) (owner: 10Varnent) [08:51:38] pc is just a cache, and it self heals for stale data thanks to mw logic [08:51:52] that's ok, pc is great for that because service > data [08:51:53] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10257924 (10elukey) There is something strange (at least for me) in the BMC's storage view: {F57639015} The 24 disks look paired by slot, and each cou... [08:52:11] in other cases (s* and es*), data > service [08:52:32] so it is just a question of knowing what is going on and the right approach in every case [08:52:49] ok, let's now create a ticket, and hope it gives us a breath to see the next steps [08:53:19] "pc2017 crashed" [08:53:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10257929 (10MatthewVernon) ISTR there are two banks of 12 disks in these systems, so maybe only 1 controller is playing ball? We do need all 24 disks av... [08:53:46] sorry, that would be pc1017, right? [08:53:57] (03PS3) 10Btullis: Add new an-worker nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1082497 (https://phabricator.wikimedia.org/T377878) [08:54:50] I'll create the ticket don't worry! [08:54:57] tappof: we invite you to practice with us if you want! [08:55:08] it was pc1017 who crashed [08:55:31] so this part we can double confirm, but shouldn't wait much [08:56:05] as it happened in the past that pc1013 crashed a bit then was stuck in a loop [08:56:23] but we can take some time to make sure we are doing it right [08:56:43] deploy a puppet change making the only other spare a master [08:57:03] (03CR) 10Slyngshede: [C:03+2] New version, 0.1.0.
[software/bitu] - 10https://gerrit.wikimedia.org/r/1082702 (owner: 10Slyngshede) [08:57:15] then we will do the dbctl change [08:57:30] then we repoint the replication [08:57:49] actually, we can make it replicate from pc2017 beforehand so it warms up a little [08:58:24] we should also warn people that cache hitratio (and thus performance) may be affected a little [08:58:52] plus we should downgrade mariadb versions if needed (the ticket will help do a checklist of all of that) [08:59:23] ^ arnaudb let me know if that helps, and how I can help or review, double check, etc [08:59:24] (03Merged) 10jenkins-bot: New version, 0.1.0. [software/bitu] - 10https://gerrit.wikimedia.org/r/1082702 (owner: 10Slyngshede) [08:59:46] jynus: arnaudb yes, sure [09:00:47] T378068 [09:00:47] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [09:01:08] tappof: this is now handled by the DBAs (so not necessary) but it is a rare occasion to be able to practice dba recovery, if you want, even if just an observer/double checking patches, etc [09:01:17] so up to you [09:02:21] "InnoDB: File './ibdata1' is corrupted" that's very weird if it happens by software [09:02:32] unless, you know, you write random bits to it [09:02:43] so my guess would be an i/o issue [09:02:51] dmesg showed nothing [09:02:56] :-( [09:03:13] check if you are on top of that health log from BMC etc [09:03:26] yeah will do, I'm trying to cross ref with our mariadb issue first [09:03:32] yes [09:03:34] that can wait [09:03:41] let's focus on the replacement, you are right [09:03:41] because the lack of dmesg info made me lean that way [09:04:05] which is the spare we have? [09:04:15] pc1013, I'll be starting preparing it soon [09:04:23] oh no [09:04:27] I think we had another [09:04:36] I downgraded it, it's been fixed, let's use this one!
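Repointing the spare to replicate from the codfw host, as planned above, might look like this on MariaDB. Note that the master_auto_position option mentioned earlier is MySQL syntax; the MariaDB GTID equivalent is MASTER_USE_GTID. The hostname comes from the discussion; replication credentials are assumed to be already configured on the instance.

```shell
sudo mysql <<'EOF'
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'pc2017.codfw.wmnet', -- warm-up source named above
  MASTER_USE_GTID = slave_pos;        -- resume from the replica's own GTID state
START SLAVE;
EOF
```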
[09:04:47] not the one that crashed in a loop, please arnaudb [09:04:48] but first, let's see if it breaks again [09:05:00] if we have 2, let's choose the other [09:05:12] let me check [09:05:58] we have a spare but on codfw [09:06:28] pc1014 should be free? [09:06:40] (03CR) 10Btullis: [C:03+2] Add new an-worker nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1082497 (https://phabricator.wikimedia.org/T377878) (owner: 10Btullis) [09:06:45] it is replicating from pc1011, I doubt it is pooled elsewhere [09:07:01] it is, however, in 10.6.19 [09:07:13] oh we can grab hosts from pc1? [09:07:17] yeah [09:07:19] because it has 2 spares [09:07:25] ack was unaware of this [09:07:27] cold cache > rebooting host [09:07:56] we should thank Amir by adding an extra spare [09:08:04] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:08:09] when he increased the cluster size [09:08:16] so, I'll make you review the SQL in private first as it's out of my comfort zone [09:08:19] so let's do a checklist [09:08:33] but yea, first make it replicate from the codfw host [09:08:37] then we do puppet [09:08:43] then the actual dbctl failover [09:09:07] (03Merged) 10jenkins-bot: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:09:17] what I would do, if we had the time, is replicate from the first binlog position available in non gtid mode [09:09:28] but anything that makes it work would work :-D [09:12:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on pc[1014,1017].eqiad.wmnet with reason: pc maintenance T378068 [09:12:07] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [09:12:17] !log
arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on pc[1014,1017].eqiad.wmnet with reason: pc maintenance T378068 [09:12:39] hosts downtimed, starting replication from spare [09:12:45] 06SRE, 06Infrastructure-Foundations, 10netops: Automate interface configuration for pfw firewalls using Netbox data - https://phabricator.wikimedia.org/T378070 (10cmooney) 03NEW p:05Triage→03Medium [09:13:25] (03PS1) 10Cathal Mooney: Interface automation templates for pfw devices [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) [09:14:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2003.codfw.wmnet [09:17:38] (03PS2) 10Cathal Mooney: Interface automation templates for pfw devices [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) [09:17:51] jynus: So, you are detaching pc1014 from pc1 and reusing it for pc5? [09:18:07] tappof: exactly [09:18:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2003.codfw.wmnet [09:18:21] ideally, and more easily, pc1 would have crashed [09:18:30] but we are down to only 1 spare, so we are using that [09:19:05] (03CR) 10Cathal Mooney: Interface automation templates for pfw devices (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1082716 (https://phabricator.wikimedia.org/T378070) (owner: 10Cathal Mooney) [09:20:40] (03PS1) 10Slyngshede: IDM: Switch over to upgraded Bitu instance.
[dns] - 10https://gerrit.wikimedia.org/r/1082718 [09:21:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075922 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [09:21:34] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Automate interface configuration for pfw firewalls using Netbox data - https://phabricator.wikimedia.org/T378070#10258033 (10cmooney) [09:21:39] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10258034 (10cmooney) [09:21:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1003.eqiad.wmnet [09:21:45] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10258049 (10cmooney) [09:22:49] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1285-1286,1288-1289].eqiad.wmnet [09:22:49] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:23:14] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:23:48] (03PS1) 10EoghanGaffney: apt-staging: Add empty secret for staging_secring.gpg [labs/private] - 10https://gerrit.wikimedia.org/r/1082719 [09:25:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1285-1286,1288-1289].eqiad.wmnet [09:25:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1003.eqiad.wmnet [09:28:09] tappof: I'll afk for a while to eat, what's left to be done (jynus is still here) would be to puppetize what we did (pc1014 is now a spare of pc5), and change the topology in dbctl. 
Please work with jynus, I'll be back soon and I'm not far from my computer, feel free to reach out via Signal (call id on office wiki) or trigger a page if needed [09:28:10] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bookworm [09:31:10] (03CR) 10EoghanGaffney: [V:03+2 C:03+2] apt-staging: Add empty secret for staging_secring.gpg [labs/private] - 10https://gerrit.wikimedia.org/r/1082719 (owner: 10EoghanGaffney) [09:31:13] ack arnaudb [09:32:16] tappof: as I said before, I will be pushing the buttons, but I would appreciate someone looking over my shoulder and +1 the pending patches [09:33:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1082718 (owner: 10Slyngshede) [09:33:03] yes jynus [09:34:15] https://orchestrator.wikimedia.org/web/cluster/alias/pc5 pc1014 looking fine so far [09:34:20] just with stale data [09:34:39] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1285.eqiad.wmnet with OS bookworm [09:35:02] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1286.eqiad.wmnet with OS bookworm [09:35:10] will now do puppet deploy [09:35:33] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1288.eqiad.wmnet with OS bookworm [09:35:56] yes jynus, I was looking into that (orchestrator) [09:36:38] writing the patch, taking my time as we are not technically under the incident [09:36:46] but last time it kept rebooting once and again [09:36:52] (03PS1) 10Muehlenhoff: Bump the versions for OpenJDK to 11.0.25 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082720 [09:36:55] so we should switchover asap [09:37:12] pc1013 did so recently :-( [09:37:47] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1289.eqiad.wmnet with OS bookworm [09:38:22] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - 
kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:38:34] So, Orchestrator gives me a picture of the cluster configuration from a DB/SQL perspective, while dbctl is used to configure the user-facing balancer. Is that right jynus? [09:39:00] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:39:23] "user-facing" ... quoted .. clearly [09:40:30] BGP errors for kubernetes-eqiad are mine [09:40:34] tappof: that's right [09:40:51] dbctl will change etcd so it will be the one that will affect mw dynamic config [09:41:14] but first I must ensure puppet works ok (it should be ok as is, in an emergency) [09:41:27] but I want to make sure I change the identifiers before pooling it [09:41:34] for coordination, mostly [09:43:55] (03PS1) 10Jcrespo: parsercache: Move pc1014 to be the master of pc5 instead of pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1082721 (https://phabricator.wikimedia.org/T378068) [09:48:06] tappof: could I have a review?
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082721 [09:48:40] I am mostly aiming for a sanity check [09:48:57] jynus: yes jynus I'm taking a look [09:50:08] (03CR) 10Tiziano Fogli: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1082721 (https://phabricator.wikimedia.org/T378068) (owner: 10Jcrespo) [09:51:11] (03CR) 10Cathal Mooney: Authdns: add class to create zonefile snippets for K8s PTR delegation (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [09:51:21] (03CR) 10Jcrespo: [C:03+2] parsercache: Move pc1014 to be the master of pc5 instead of pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1082721 (https://phabricator.wikimedia.org/T378068) (owner: 10Jcrespo) [09:52:30] oh, nice I can merge patches individually [09:52:40] whoever implemented that, may thanks <3 [09:52:50] *many [09:53:20] (03CR) 10Cathal Mooney: [C:03+2] Update static reverse PTR records for frack records codfw [dns] - 10https://gerrit.wikimedia.org/r/1082525 (https://phabricator.wikimedia.org/T374176) (owner: 10Cathal Mooney) [09:53:33] (03PS1) 10Btullis: Update maintainership for all DPE owned container images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) [09:54:10] (03CR) 10Volans: [C:03+1] "approach LGTM, nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1082707 (https://phabricator.wikimedia.org/T371400) (owner: 10Elukey) [09:54:15] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [09:54:38] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1082703 (owner: 10Elukey) [09:54:59] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [09:55:10] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker1288.eqiad.wmnet with reason: host reimage [09:55:48] I will be restarting pc1014 just in case [09:56:02] and then if everything looks fine, do the dbctl change [09:56:05] for dbctl I think we just need to edit the section pc5 for eqiad and switch pc1017 with pc1014 jynus .. and then commit the config ... [09:57:23] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage [09:57:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [09:59:19] (03CR) 10Volans: [C:03+2] tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 (owner: 10Volans) [09:59:24] !log restart pc1014 T378068 [09:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:29] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [09:59:57] (03CR) 10Volans: [C:03+1] "LGTM, one suggestion inline" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082524 (owner: 10Elukey) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1000) [10:00:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage [10:02:05] 10ops-codfw, 06SRE, 06DC-Ops: Renumber frack server mgmt IPs in codfw - https://phabricator.wikimedia.org/T371468#10258140 (10cmooney) [10:03:04] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1014.eqiad.wmnet with reason: moved pc number [10:03:08] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1014.eqiad.wmnet with reason: moved pc number [10:03:19] need to redo the downtime because icinga ids changed, so downtime was lost [10:03:37] got renamed from pc1 to pc5 or 
so [10:03:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage [10:03:58] (03CR) 10Varnent: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) (owner: 10Varnent) [10:04:03] although we should remove the downtime except the replica lag afterwards, make it an ack [10:04:10] (03CR) 10Varnent: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) (owner: 10Varnent) [10:06:16] (03CR) 10Elukey: [C:03+1] "Asked a couple of questions on the versioning, but if we want to keep the -so feel free to proceed :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082720 (owner: 10Muehlenhoff) [10:08:00] (03CR) 10Btullis: [C:03+1] "I have made the same change in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1082722 - I don't mind rebasi" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082720 (owner: 10Muehlenhoff) [10:08:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [10:09:06] tappof: a "dbctl config diff" to see how it looks?
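The dbctl review being requested above would, from a cumin host, look roughly like the following. The subcommand spellings follow the dbctl documentation and are an assumption here, not a transcript of what was actually run:

```shell
dbctl instance pc1017 depool                        # take the crashed master out
dbctl --scope eqiad section pc5 set-master pc1014   # promote the spare
dbctl instance pc1014 pool
dbctl config diff                                   # review the pending change
dbctl config commit -m "promoting pc1014 as the master of pc5 T378068"
```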
[10:09:15] on a cumin host [10:09:28] for a virtual review (it's been long since I've used dbctl) [10:09:28] (03CR) 10Btullis: Update maintainership for all DPE owned container images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:09:48] (03CR) 10Btullis: Update maintainership for all DPE owned container images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:10:23] That's what I was expecting, so it looks good to me jynus [10:10:27] later we can refine to see how to set them up as candidates, etc [10:10:32] for now I'm just doing the change [10:10:58] confirming read only is off [10:11:06] so deploying now, and cross fingers [10:11:12] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10258148 (10BTullis) a:05BTullis→03None [10:11:51] !log jynus@cumin1002 dbctl commit (dc=all): 'promoting pc1014 as the master of pc5 T378068', diff saved to https://phabricator.wikimedia.org/P70584 and previous config saved to /var/cache/conftool/dbconfig/20241024-101150-jynus.json [10:11:56] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [10:12:03] checking for mw errors [10:12:35] as well as performance of pc and host [10:13:22] hit ratio has gone down, as expected, but nothing dramatic [10:13:22] (03PS2) 10Btullis: Update maintainership for all DPE owned container images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) [10:13:29] we may have lost at most 1/5th of cache [10:13:32] (03Merged) 10jenkins-bot: tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 (owner: 10Volans) [10:13:57] I see less than 10% of slowdown [10:14:15] hit slowdown, I mean,
not overall request latency [10:14:33] (03PS1) 10Cathal Mooney: Update example config.yaml file with additional "ignore" messages [software/homer] - 10https://gerrit.wikimedia.org/r/1082725 (https://phabricator.wikimedia.org/T378070) [10:14:42] app servers seem to not even have noticed it [10:15:19] pc1014 seems healthy [10:15:23] (03PS1) 10Wangombe: Translate Event Logging: Enable using $wgTranslateEnableEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) [10:15:25] mariadb wise [10:15:37] (03CR) 10Btullis: [C:03+1] Bump the versions for OpenJDK to 11.0.25 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082720 (owner: 10Muehlenhoff) [10:16:17] (03Abandoned) 10Cathal Mooney: Update example config.yaml file with additional "ignore" messages [software/homer] - 10https://gerrit.wikimedia.org/r/1082725 (https://phabricator.wikimedia.org/T378070) (owner: 10Cathal Mooney) [10:17:53] (03CR) 10Muehlenhoff: "Oh, I wasn't aware of your patch. Let's go with yours, it's more complete anyway since it also covers Java 17." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082720 (owner: 10Muehlenhoff) [10:17:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1286.eqiad.wmnet with OS bookworm [10:17:59] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-redacteddb1001.eqiad.wmnet [10:18:17] (03Abandoned) 10Muehlenhoff: Bump the versions for OpenJDK to 11.0.25 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082720 (owner: 10Muehlenhoff) [10:18:28] (03CR) 10Muehlenhoff: Update maintainership for all DPE owned container images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:18:37] Ok, jynus, so it seems we're good. I thank you and arnaudb for the session.
:) [10:18:39] Now, arnaudb, I have an appointment with the doctor. I'll be away for about half an hour, I think. I have the PC with me just in case... [10:18:53] yeah, no problem [10:18:56] I think we are done [10:19:02] at least for now [10:19:05] thanks for your help [10:19:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1289.eqiad.wmnet with OS bookworm [10:20:29] (03CR) 10Muehlenhoff: Update maintainership for all DPE owned container images (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:21:06] (03PS1) 10Cathal Mooney: Add additional ignore line to Juniper warnings for Homer [puppet] - 10https://gerrit.wikimedia.org/r/1082728 (https://phabricator.wikimedia.org/T378070) [10:21:07] actually, I forgot 1 thing, which is to change the replication topology on codfw, doing [10:21:24] !log reboot apus frontends T376800 [10:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:21:59] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [10:21:59] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:09] ACKNOWLEDGEMENT - MariaDB Replica Lag: pc5 on pc1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 71047.33 seconds Jcrespo expected because T378068 - The acknowledgement expires at: 2024-10-25 10:21:28.
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1288.eqiad.wmnet with OS bookworm [10:23:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:26:22] (03PS1) 10Muehlenhoff: Add a hook to build Java 8 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082730 [10:26:27] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: stopped being the active one, stopping replication [10:26:30] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: stopped being the active one, stopping replication [10:26:49] (03PS2) 10Slyngshede: IDM: Switch over to upgraded Bitu instance. [dns] - 10https://gerrit.wikimedia.org/r/1082718 [10:27:09] (03CR) 10CI reject: [V:04-1] Add a hook to build Java 8 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082730 (owner: 10Muehlenhoff) [10:27:27] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1285.eqiad.wmnet with OS bookworm [10:27:27] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:29:08] (03CR) 10Slyngshede: [C:03+2] IDM: Switch over to upgraded Bitu instance. 
[dns] - 10https://gerrit.wikimedia.org/r/1082718 (owner: 10Slyngshede) [10:30:15] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-redacteddb1001.eqiad.wmnet [10:32:02] (03PS1) 10Hnowlan: kask: remove all support for in-service TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082731 (https://phabricator.wikimedia.org/T363996) [10:33:11] (03PS2) 10Muehlenhoff: Add a hook to build Java 8 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082730 [10:33:58] (03CR) 10CI reject: [V:04-1] Add a hook to build Java 8 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082730 (owner: 10Muehlenhoff) [10:35:09] (03CR) 10Muehlenhoff: Update maintainership for all DPE owned container images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:36:23] (03PS3) 10Btullis: Update maintainership for all Java, Flink, ad Spark related container images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) [10:36:43] (03PS3) 10Muehlenhoff: Add a hook to build Java 8 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082730 [10:38:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:38:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (as far as the Java images are concerned)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:38:33] (03CR) 10Btullis: Update maintainership for all Java, Flink, ad Spark related container images (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:40:24] (03PS1) 10Muehlenhoff: Switch ganeti2038 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082735 [10:40:38] (03CR) 
10Slyngshede: [V:03+2 C:03+2] Upgrade, v7.0.9, and enable Redis ticket registry. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1082150 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [10:43:19] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bookworm [10:45:00] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2038 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082735 (owner: 10Muehlenhoff) [10:51:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:53:25] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: ml2: template extension_drivers config option [puppet] - 10https://gerrit.wikimedia.org/r/1082736 (https://phabricator.wikimedia.org/T377740) [10:54:15] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082736 (https://phabricator.wikimedia.org/T377740) (owner: 10Arturo Borrero Gonzalez) [10:54:51] PROBLEM - MariaDB Event Scheduler pc2 on pc2012 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [10:55:05] urgh [10:55:14] PROBLEM - MariaDB read only pc2 #page on pc2012 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:55:27] I just got back [10:55:42] arnaudb: is that expected? 
[10:55:46] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, parse2009.codfw.wmnet, wikikube-worker2010.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2023.codfw.wmnet, parse2013.codfw.wmnet, kubernetes201 [10:55:46] wmnet, mw2449.codfw.wmnet, wikikube-worker2045.codfw.wmnet, mw2356.codfw.wmnet, kubernetes2022.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2014.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker2018.codfw.wmnet, mw2336.codfw.wmnet, wikikube-worker2094.codfw.wmnet, parse2007.codfw.wmnet, mw2374.codfw.wmnet, wikikube-worker2046.codfw.wmnet, wikikube-worker2095.codfw.wmnet, wikikube-worker2042.codfw.wmnet, mw2337.codfw.wmnet, wikik [10:55:46] er2039.codfw.wmnet, wikikube-worker2068.codfw.wmnet, mw2417.codfw.wmnet, wikikube-worker2047.codfw.wmnet, wikikube-worker2118.codfw.wmnet, mw2418.codfw.wmnet, wikikube-worker2067.codfw. 
https://wikitech.wikimedia.org/wiki/PyBal [10:55:57] FIRING: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:55:58] I didn't touch pc2 [10:56:06] not Emperor, I'm checking [10:56:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet [10:56:22] ack, LMK if you need anything [10:56:32] FIRING: [9x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:56:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10258238 (10ops-monitoring-bot) Draining ganeti2038.codfw.wmnet of running VMs [10:56:38] <_joe_> I guess first thing is to go look at the server [10:56:42] <_joe_> is anyone doing it? [10:56:47] I am [10:56:51] I'm on it also [10:56:52] <_joe_> thanks [10:57:00] ERROR 1040 (HY000): Too many connections [10:57:02] <_joe_> so, I think we're down right now more or less? [10:57:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 4.605% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:57:46] <_joe_> these are all consequences of pc down I guess? [10:57:49] I'm back now arnaudb [10:58:03] well, welcome back! 
thing just caught on fire again [10:58:05] PROBLEM - MariaDB Replica IO: pc2 on pc1012 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1040, Errmsg: error connecting to master repl2024@pc2012.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Too many connections https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:58:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes2046.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2046.codfw.wmnet, mw2375.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2447.codfw.wmn [10:58:11] 70.codfw.wmnet, mw2368.codfw.wmnet, wikikube-worker2113.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, mw2351.codfw.wmnet, mw2440.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.cod [10:58:11] , wikikube-worker2023.codfw.wmnet, wikikube-worker2124.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.cod https://wikitech.wikimedia.org/wiki/PyBal [10:58:13] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.005e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [10:58:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:58:38] <_joe_> we're basically down in codfw [10:58:40] $ sudo ss -lntpuae|rg -i 3306|wc -l [10:58:40] 35282 [10:58:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [10:58:59] <_joe_> jynus: no requests have dropped if anything [10:59:00] there is indeed a lot of hits on this port, but mysql is still up. jynus should we restart the daemon? [10:59:08] * kamila_ will update statuspage unless someone stops them [10:59:10] sure [10:59:12] <_joe_> they're idle waiting for pc2012 [10:59:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web (k8s) 21.84s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:59:17] kamila_: please and thank you! 
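The `sudo ss -lntpuae|rg -i 3306|wc -l` check quoted above counts socket-table entries touching the MariaDB port. A minimal self-contained sketch of the same idea, using plain `grep` instead of ripgrep and canned `ss`-style output (addresses invented) so it runs anywhere:

```shell
# Count socket entries referencing port 3306, as in the check in the log.
# Canned output stands in for live `sudo ss -lntpuae` so this is self-contained.
ss_output='tcp  ESTAB  0  0    10.0.0.1:3306  10.0.0.2:51234
tcp  ESTAB  0  0    10.0.0.1:3306  10.0.0.3:51235
tcp  LISTEN 0  128  0.0.0.0:22     0.0.0.0:*'
# grep -c counts matching lines; here the two ESTAB lines reference 3306.
count=$(printf '%s\n' "$ss_output" | grep -c 3306)
echo "$count"
```

In the incident the equivalent count was 35282: tens of thousands of connections piled up against the saturated pc2012, most of them idle and waiting rather than doing work.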
[10:59:23] np arnaudb [10:59:26] <_joe_> I would rather kill any outstanding queries [10:59:44] <_joe_> the mean latency is up to 15 seconds [10:59:46] (03CR) 10Btullis: [V:03+2 C:03+2] Update maintainership for all Java, Flink, ad Spark related container images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [10:59:50] ah, good idea but I restarted the daemon already [11:00:04] (03CR) 10Btullis: [V:03+2 C:03+2] Update maintainership for all Java, Flink, ad Spark related container images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082722 (https://phabricator.wikimedia.org/T373534) (owner: 10Btullis) [11:01:02] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:01:06] `2024-10-24 10:59:30 0 [Note] /opt/wmf-mariadb106/bin/mysqld (initiated by: unknown): Normal shutdown` → waiting for a minute or so to kill the daemon [11:01:31] the previous restart took 2 minutes or so [11:01:36] so normal long reboot [11:01:37] ack [11:01:56] * arnaudb increments the wait counter [11:02:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 14.05% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:02:48] FIRING: [9x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:02:53] (03PS3) 10Clément Goubert: php*-fpm-multiversion: Add helper scripts for mwcron, mwscript [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) [11:02:59] <_joe_> what is the status of that server? 
[11:03:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:03:24] mariadb is going down still [11:03:34] yep, keepalives are going down as well, no new conn [11:03:44] problem is if you kill it you will have a long start too [11:03:46] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:03:51] <_joe_> yeah ofc [11:03:55] so we should try to wait a bit [11:03:59] <_joe_> I guess mysql just stopped responding to tcp [11:04:06] <_joe_> and that made things recover a bit now [11:04:14] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:04:42] (03CR) 10Clément Goubert: "Thanks for the review, everything should be addressed." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) (owner: 10Clément Goubert) [11:05:00] arnaudb: we should prepare for a downgrade, I didn't see a traffic increase [11:05:03] response time seems to be recovering [11:05:19] PROBLEM - MariaDB Replica Lag: pc2 on pc1012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:05:22] I can do that if you are with the server [11:05:23] jynus: are we sure this is a causality? [11:05:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [11:05:28] yes please [11:05:29] o idea [11:05:31] no idea [11:05:40] just want to move the binary there and wait [11:05:45] <_joe_> arnaudb: where do you see response times recovering?
[11:05:50] ack, there is one on my /home/arnaudb on cumin1002 [11:05:57] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:58] _joe_: https://grafana-rw.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&refresh=5m&viewPanel=4 [11:06:04] "kinda" [11:06:09] <_joe_> yeah that includes both DCs [11:06:20] <_joe_> mean latency for mw-web in codfw is still 10 seconds [11:06:50] jynus: sanity check: do I pull the trigger for a kill -9 ? [11:06:52] <_joe_> I'm wondering if a roll restart of the pods would help here [11:06:58] <_joe_> please don't [11:07:05] ack [11:07:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [11:07:09] <_joe_> don't kill -9 mysql unless there's no alternative [11:07:25] there is no .deb in your home in cumin1002 [11:07:26] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet [11:08:14] jynus: fixed, sorry for the confusion [11:08:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:09:11] <_joe_> can we replace this server with a replica/fail over? 
[11:09:20] we just did that [11:09:23] for the other host [11:09:27] we run out of spares [11:09:35] <_joe_> sigh [11:09:40] we can put the one that keeps crashing or the one that just crashed [11:09:43] <_joe_> ok so the answer to my question is "no" [11:09:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:09:58] jynus: there is one spare but if this is the ripple effect of the .19 upgrade → this will be a rolling spare [11:10:07] arnaudb: they are 10.6.16 [11:10:13] so that is not an issue I think [11:10:22] indeed [11:10:34] kill the host, has been stopping for 15 minutes already [11:10:36] <_joe_> did mysql restart at all on that server btw? [11:10:37] disk write throughput seems to be decreasing [11:10:39] it won't come back up [11:10:57] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:14] last seen: 16 minutes ago [11:12:02] <_joe_> jynus: if you kill it, it might help the applications indeed [11:12:23] I am not the one with the keyboard, I can only ask at the moment [11:12:23] <_joe_> but at the same time, without a spare, what do we do? [11:12:31] it's a cache [11:12:35] we can just downgrade it and repool it [11:12:38] wdyt?
kill it and restart it [11:12:53] ack, on it [11:12:55] just make it start [11:13:18] restarting [11:13:21] <_joe_> kamila_: I was thinking [11:13:26] restarted [11:13:27] up [11:13:33] <_joe_> we could try to depool codfw for reads for mw-web [11:13:39] <_joe_> to ease the load there [11:13:51] RECOVERY - MariaDB Event Scheduler pc2 on pc2012 is OK: Version 10.6.16-MariaDB-log, Uptime 36s, read_only: False, event_scheduler: True, 2143.19 QPS, connection latency: 0.015821s, query latency: 0.000421s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [11:13:54] <_joe_> wdyt? [11:13:59] processlist seems stable now [11:14:05] _joe_: seems harmless [11:14:05] RECOVERY - MariaDB Replica IO: pc2 on pc1012 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:14:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet [11:14:19] RECOVERY - MariaDB Replica Lag: pc2 on pc1012 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:14:19] <_joe_> these are much worse otoh for the appservers [11:14:22] RECOVERY - MariaDB read only pc2 #page on pc2012 is OK: Version 10.6.16-MariaDB-log, Uptime 65s, read_only: False, event_scheduler: True, 1842.61 QPS, connection latency: 0.014986s, query latency: 0.000461s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:14:42] things should be ok now, but without knowing what caused it... [11:14:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
[11:14:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:14:48] it is the second crash in 1 hour [11:14:55] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:15:03] <_joe_> jynus: oh the second in one hour for the same server? [11:15:06] yep, there is an overload [11:15:11] _joe_: no, different server [11:15:16] on the same workpool, pc5 first [11:15:18] pc2 now [11:15:19] that is why we run out of spare [11:15:29] PROBLEM - MariaDB Event Scheduler pc4 on pc2015 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [11:15:30] PROBLEM - MariaDB read only pc4 #page on pc2015 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:15:32] <_joe_> ok, sorry, what's the status of the db now? 
[11:15:33] connection list is growing [11:15:37] <_joe_> sigh [11:15:39] PROBLEM - MariaDB Replica IO: pc4 on pc1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1040, Errmsg: error connecting to master repl2024@pc2015.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Too many connections https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:15:40] well, this one broke [11:15:41] (03PS4) 10Clément Goubert: php*-fpm-multiversion: Add helper scripts for mwcron, mwscript [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) [11:15:48] _joe_: should I do the mw-web depool? [11:15:50] * arnaudb grabs his anti freeze gun [11:15:52] ah, a 3rd crash [11:15:53] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:15:57] something is overloading connections [11:16:03] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:16:05] we just moved the traffic [11:16:06] handling the server [11:16:12] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:16:14] <_joe_> kamila_: if anything should help the servers, so yes [11:16:18] <_joe_> well no wait [11:16:19] I don't think mariadb 
is the cause this time [11:16:20] ack, on it [11:16:22] the server didn't crash [11:16:25] ok waiting [11:16:27] it is just running out of resources [11:16:33] neither do I jynus that feels like something else [11:16:33] <_joe_> verify we have enough resources in eqiad [11:16:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:16:44] will get to pc2015 [11:16:46] it must be memcache or something on top of parsercache dbs [11:16:55] we should, given switchover was not long ago? but ok, checking [11:17:03] because it is 3 different shards [11:17:06] <_joe_> kamila_: you are right, let's go [11:17:09] arnaudb@pc2015:~ $ sudo mysql [11:17:10] ERROR 1040 (HY000): Too many connections [11:17:10] → same issue [11:17:16] <_joe_> jynus: might be cascading issues [11:17:18] something is wrong with traffic or cache workflow [11:17:29] <_joe_> so let us move traffic first [11:18:02] _joe_: do you want me to hold on restarting mariadb on pc2015? [11:18:07] or can I go?
[11:18:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:34] <_joe_> arnaudb: do whatever you need [11:18:41] ack [11:18:54] !log oblivian@cumin1002 START - Cookbook sre.discovery.service-route depool mw-web-ro in codfw: maintenance [11:19:08] <_joe_> ok [11:19:18] letting mariadb a 2min window to reboot properly, outside of this: kill -9 [11:19:18] the fact that we don't have enough connections for root (which has 10 reserved) is a fail [11:19:19] <_joe_> read-only traffic should shift all to eqiad [11:19:34] I cannot connect and observe [11:19:45] must be something wrong with grants there [11:19:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:20:35] !incidents [11:20:35] 5349 (ACKED) [3x] ProbeDown sre (ip6 text-https:443 probes/service http_text-https_ip6) [11:20:35] 5351 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [11:20:35] 5352 (ACKED) pc2015 (paged)/MariaDB read only pc4 (paged) [11:20:36] 5353 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [11:20:36] 5348 (RESOLVED) pc2012 (paged)/MariaDB read only pc2 (paged) [11:20:36] 5350 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [11:20:44] oh, mariadb handled itself! 
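The "ERROR 1040 (HY000): Too many connections" above locked out even the responders: once `max_connections` is exhausted, ordinary logins fail too. MariaDB's usual escape hatch for this is a separate administrative listener; a minimal my.cnf sketch of that pattern (port number and limits here are illustrative, not WMF's actual configuration, though the "10 reserved" for root mentioned above presumably corresponds to a setting like this):

```ini
[mysqld]
max_connections       = 5000   # the limit that was exhausted during the incident
# Dedicated admin listener that is not subject to max_connections,
# so operators can still connect while the main port is saturated:
extra_port            = 3307
extra_max_connections = 10
```

With something like this in place, `mysql --port 3307` on the host can still reach the server at saturation, which is exactly the observability that was missing here.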
[11:20:57] RESOLVED: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:01] <_joe_> arnaudb: we're moving traffic to eqiad [11:21:09] <_joe_> that should move away pressure from codfw [11:21:14] no doubt [11:21:17] <_joe_> just read-only traffic for mw-web [11:21:30] FIRING: [17x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:31] pc2 looks fine now [11:21:33] restarting mariadb && repl for this instance [11:21:42] pc4 still down [11:21:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:21:52] <_joe_> jynus: we removed most requests from there heh [11:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 9.524% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:22:26] <_joe_> jynus, arnaudb can you check pc in eqiad now? 
[11:22:32] RECOVERY - MariaDB read only pc4 #page on pc2015 is OK: Version 10.6.18-MariaDB-log, Uptime 53s, read_only: False, event_scheduler: True, 5294.93 QPS, connection latency: 0.033313s, query latency: 0.000842s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:22:32] RECOVERY - MariaDB Event Scheduler pc4 on pc2015 is OK: Version 10.6.18-MariaDB-log, Uptime 53s, read_only: False, event_scheduler: True, 5299.61 QPS, connection latency: 0.023691s, query latency: 0.000840s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [11:22:32] the question is if it will move somewhere else [11:22:39] RECOVERY - MariaDB Replica IO: pc4 on pc1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:22:40] <_joe_> if the problem is indeed traffic induced we should've moved it [11:22:48] FIRING: [17x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:06] https://orchestrator.wikimedia.org/web/cluster/alias/pc4 jynus its weird that pc1016 stopped its replication thread as well [11:23:09] will start it again [11:23:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:23:31] looks like reading is happy now, updating statuspage [11:23:35] could be 2 the misses from the original crashed host, but the misses were very low [11:23:41] mw-web is still serving 3krps in codfw [11:23:46] <_joe_> I would wait a bit tbh [11:23:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:23:49] bet 
something is making lots of writes [11:23:50] ok [11:23:51] <_joe_> claime: uh that's a lot [11:23:53] (03CR) 10Jgiannelos: pcs: Configure prometheus metrics (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [11:23:58] !log oblivian@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool mw-web-ro in codfw: maintenance [11:23:59] yeah, misses are very low, it is not that [11:24:05] pc4 alive and well [11:24:13] <_joe_> claime: those are get requests [11:24:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web (k8s) 2.01s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:24:18] <_joe_> can you check the access logs? [11:24:19] * arnaudb stays warm for the next one that pops [11:24:33] I'm trying to find anything weird in superset [11:24:34] yep [11:24:48] it's dropping to 2krps now [11:24:51] RESOLVED: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:24:53] traffic looks fine, but maybe some change on cache specifically? 
[11:24:55] <_joe_> ah ok [11:25:03] <_joe_> these are just requests we're finally responding to [11:25:13] also, the previous crash was on eqiad [11:25:20] and 200s are back at normal levels [11:25:25] and this overload was on codfw, so I think it is related atm [11:25:33] *it is not [11:25:39] ah*, I was about task [11:25:56] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:25:59] so something else created a huge amount of connections, still trying to find where [11:26:25] mw-web traffic down ~150rps right now now in codfw [11:26:41] yeah it was just a spike due to finally responding to req [11:26:43] one thing we have seen recently is a host getting stalled, but not on that specific version [11:26:44] sorry for the red herring [11:26:59] claime: no worries [11:27:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [11:27:17] that could be another reason- the server getting soft-locked [11:27:40] 10.6.16 has been quite battle tested afair [11:27:55] (the latest one to crash at least was on that version) [11:27:59] so, let me get this straight. We think all of this was due to 1 single server? 
eqiad up to 7.5krps and looks to be holding [11:28:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:28:15] akosiaris: not really clear atm on what is the _root_ cause [11:28:42] there is spam at dmesg of "Data hash table of /var/log/journal/" is that normal? [11:28:55] are they frequent? [11:29:14] very [11:29:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [11:29:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:30:01] is there a way to tell if the rate of queries to the parsercache increased? with that we will be able to tell if the host stopped responding before getting maxed out connections [11:30:13] * kamila_ updating statuspage to say reading is OK now [11:30:14] that means that there was some log rotation during the process afaict, maybe because of the network spam? [11:30:33] ok, I have a smoking gun [11:30:36] Did we ever get an update on the status of pc hosts in eqiad? [11:30:49] arnaudb: Sure, it's too early for the root cause, I just wanna make sure I understood correctly that our current theory is that whatever happened in 1 single server caused all of this. [11:31:01] there was a smart_failure on pc2012 just before the alert [11:31:15] I have a theory, but I cannot prove it [11:31:16] wait no [11:31:19] Ignore. [11:31:27] jynus: expand please!
[11:31:31] jynus: we can help prove or disprove it [11:31:36] that's the point of theories ;-) [11:31:45] new pc hosts were setup recently [11:32:01] those seemed to be unreliable, in fact, one crashed a few minutes ago [11:32:15] that increased load, but not much [11:32:19] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.006e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [11:32:26] but maybe enough to trigger the query killer [11:32:40] except on the new hosts, the query killer is not installer properly [11:32:51] *installed [11:33:07] T374496 is more than a month old, this should have been an issue sooner no? [11:33:07] T374496: Bring pc5 into rotation - https://phabricator.wikimedia.org/T374496 [11:33:30] arnaudb: but casually, a host crashed a few minutes ago [11:33:48] leave it cooking for a while to overload... and there you have it [11:34:02] so it's a slow cooking overload [11:34:06] must taste good [11:34:14] just a few minutes, arnaudb [11:34:31] the extra load came from the extra reparses due to the cold cache of pc5 after crash [11:34:57] let's maybe force reinstall query killer everywhere to avoid proving your theory right this way? [11:35:34] before that, to prove it we should check that load did not increase much, just enough to not be able to keep up [11:36:24] we had in the past outages because db hosts were pooled with no query killer by accident [11:36:51] as it causes errors but makes sure they don't snowball [11:37:36] <_joe_> what's going on with mirrormaker btw?
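The query killer discussed above is WMF-internal tooling not shown in the log (pt-kill is the common off-the-shelf equivalent); its core selection step can still be illustrated. A toy sketch with canned `SHOW PROCESSLIST`-style data, column layout and threshold invented for the example:

```shell
# Pick thread ids whose runtime exceeds a threshold; a real killer would feed
# these to `mysqladmin kill` (or KILL QUERY). Columns: id user seconds state.
processlist='101 wikiuser 5 Sending_data
102 wikiuser 4000 Sending_data
103 root 2 init
104 wikiuser 301 Sending_data'
threshold=300
# awk filters on the third column (runtime in seconds) and prints the id.
to_kill=$(printf '%s\n' "$processlist" | awk -v t="$threshold" '$3 > t {print $1}')
echo "$to_kill"
```

The point made at 11:36 is exactly this trade-off: killing slow queries produces user-visible errors, but it stops a few stuck queries from snowballing into connection exhaustion like the one seen here.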
[11:38:22] (03PS1) 10JMeybohm: Migrate wikikube-worker128[5689] to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1082746 (https://phabricator.wikimedia.org/T377876) [11:38:46] I can still see increased db traffic [11:39:02] not sure that would be related to cache, I would expect only es to be extra loaded [11:39:16] oh, that's eqiad [11:39:20] that part is expected [11:39:25] <_joe_> eqiad is expected yes [11:39:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10258317 (10phaultfinder) [11:40:10] what was the time of the incident? [11:40:27] Side question, Orchestrator is showing replication lag for pc5, is this expected and a result of the pc1017 crash? [11:40:38] sobanski: yeah, ignore that [11:40:43] sobanski: yep, this is no problem [11:40:44] jynus: first alert fired at 10:54 UTC [11:40:45] that's a glitch from the previous incident [11:40:50] not relevant [11:40:53] 👍 [11:41:09] ok, some signal [11:41:21] s8 increased traffic by 5x [11:41:25] database-wise [11:41:35] on codfw [11:41:51] https://grafana.wikimedia.org/goto/FLBxCzWNR?orgId=1 [11:42:15] it would be weird if that were just from extra misses [11:42:26] I cannot rule that out, [11:42:28] wikidata then [11:42:38] but there is signal there of something [11:43:07] when I checked I was seeing a 10% increase in misses [11:43:28] (03CR) 10Jgiannelos: [C:03+2] pcs: Configure prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [11:43:38] 10:52 yeah, that lines up [11:44:24] (03CR) 10JMeybohm: [C:03+2] Migrate wikikube-worker128[5689] to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1082746 (https://phabricator.wikimedia.org/T377876) (owner: 10JMeybohm) [11:44:35] (03Merged) 10jenkins-bot: pcs: Configure prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos)
[11:44:36] could we have had a mass cache rebuild, caused or not by the previous crash? [11:44:43] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:44:45] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:44:49] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:44:53] ah that would make sense jynus as consistency was askew [11:45:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet [11:46:46] (03CR) 10Muehlenhoff: [C:03+2] Add a hook to build Java 8 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082730 (owner: 10Muehlenhoff) [11:47:50] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1285.eqiad.wmnet with OS bookworm [11:48:24] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1286.eqiad.wmnet with OS bookworm [11:48:33] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1288.eqiad.wmnet with OS bookworm [11:48:39] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1289.eqiad.wmnet with OS bookworm [11:49:48] repeating _j.oe_'s question: what's up with mirrormaker? and is it bad? [11:49:49] if we had a mass cache rebuild or a ton of edit traffic then would that explain it?
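The cold-cache theory above hinges on a simple piece of arithmetic: behind a parser cache, database load scales with the *miss* rate, so a modest-looking rise in misses multiplies backend traffic. A hedged back-of-the-envelope, with illustrative numbers rather than measurements from this incident:

```python
# Back-of-the-envelope for cache-miss amplification: backend queries/s is
# roughly request rate times miss rate. All numbers below are illustrative
# assumptions, not figures from the incident.

def backend_load(requests_per_s, miss_rate, queries_per_miss=1.0):
    """Approximate queries/s hitting the database behind a cache."""
    return requests_per_s * miss_rate * queries_per_miss

normal = backend_load(10_000, 0.02)  # steady state: 2% misses
cold = backend_load(10_000, 0.10)    # cold cache after a crash: 10% misses
print(cold / normal)  # 5.0
```

Under these assumed numbers, going from 2% to 10% misses quintuples backend load even though the front-end request rate never changed, which is the shape of signal being discussed (a 10% miss level alongside a ~5x database traffic increase).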
[11:50:40] https://calendar.google.com/calendar/u/0/r/week edit count seems to be a bit steady (except for the holes) [11:50:57] oops [11:51:09] https://grafana.wikimedia.org/goto/EMbi3zWHg?orgId=1 copy pasta issue :D [11:51:29] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet [11:51:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet [11:52:05] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:52:51] the only relevant thing I see on edit is that there was a bump on save failures while we were in the danger zone https://grafana.wikimedia.org/goto/PxZ4qzWNg?orgId=1 [11:53:58] that's expected I suppose [11:56:22] https://grafana.wikimedia.org/goto/-bx5qzZNg?orgId=1 on the same time window we see disk usage moving upward due to recaching [11:57:55] (03PS1) 10Muehlenhoff: Switch ganeti2036 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082747 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1200) [12:04:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10258405 (10phaultfinder) [12:05:52] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2036 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082747 (owner: 10Muehlenhoff) 
[12:07:37] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [12:07:49] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [12:08:01] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage [12:08:10] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage [12:10:36] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [12:12:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [12:13:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10258417 (10ops-monitoring-bot) Draining ganeti2036.codfw.wmnet of running VMs [12:13:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:13:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet [12:13:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:14:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage [12:14:23] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete config-master cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1075922 
(https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [12:14:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [12:15:44] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1001.eqiad.wmnet [12:16:41] lots of "enwiki:parsoid-pcache:idhash:61454258-0!useParsoid=1" (with different ids) [12:16:56] that's for a different channel [12:17:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [12:20:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [12:20:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [12:21:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:21:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet [12:21:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet [12:21:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage [12:22:20] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082760 [12:22:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1001.eqiad.wmnet [12:28:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet [12:29:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10258470 (10phaultfinder) [12:29:42] !log
mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [12:29:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1285.eqiad.wmnet with OS bookworm [12:29:56] (03CR) 10Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [12:30:01] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet [12:32:22] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:33:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1288.eqiad.wmnet with OS bookworm [12:33:57] (03PS1) 10Muehlenhoff: Switch ganeti2037 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082775 [12:34:15] !log bump qemu migration speed to 1000 for esams, ulsfo, eqsin, drmrs, magru clusters [12:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:25] !log bump qemu migration speed to 1000 for esams, ulsfo, eqsin, drmrs, magru Ganeti clusters [12:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:28] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet [12:37:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet [12:38:27] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet [12:38:36] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:38:44] !log jayme@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host wikikube-worker1286.eqiad.wmnet with OS bookworm [12:40:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1289.eqiad.wmnet with OS bookworm [12:40:52] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:42:55] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2037 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082775 (owner: 10Muehlenhoff) [12:44:44] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:44:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:45:13] !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-main-eqiad cluster: Roll restart of jvm daemons. [12:46:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet [12:55:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-main-eqiad cluster: Roll restart of jvm daemons. 
[12:57:21] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1285.eqiad.wmnet with OS bookworm [12:58:42] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1286.eqiad.wmnet with OS bookworm [12:59:10] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1289.eqiad.wmnet with OS bookworm [12:59:34] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1288.eqiad.wmnet with OS bookworm [13:00:05] Urbanecm and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:41] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:57] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:03:38] (03PS1) 10Slyngshede: idm downgrade [dns] - 10https://gerrit.wikimedia.org/r/1082777 [13:05:59] (03PS1) 10CDanis: changeprop: parsoidCachePrewarm: halve concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082778 [13:06:09] (03PS2) 10Elukey: sre.hosts.reimage: clear puppetdb's state upon rollback (if needed) [cookbooks] - 10https://gerrit.wikimedia.org/r/1082707 (https://phabricator.wikimedia.org/T371400) [13:07:44] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [13:08:27] (03CR) 
10Elukey: sre.hosts.reimage: clear puppetdb's state upon rollback (if needed) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1082707 (https://phabricator.wikimedia.org/T371400) (owner: 10Elukey) [13:08:47] (03CR) 10Btullis: [C:03+1] changeprop: parsoidCachePrewarm: halve concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082778 (owner: 10CDanis) [13:09:00] (03CR) 10Giuseppe Lavagetto: [C:03+1] changeprop: parsoidCachePrewarm: halve concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082778 (owner: 10CDanis) [13:09:16] (03CR) 10Kamila Součková: [C:03+1] changeprop: parsoidCachePrewarm: halve concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082778 (owner: 10CDanis) [13:10:19] <_joe_> jayme: we're in an outage and we've added load on eqiad, maybe it could be a good idea to pause reimages [13:10:39] (03PS2) 10CDanis: changeprop: parsoidCachePrewarm: halve concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082778 (https://phabricator.wikimedia.org/T378076) [13:10:47] (03CR) 10CDanis: [C:03+2] changeprop: parsoidCachePrewarm: halve concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082778 (https://phabricator.wikimedia.org/T378076) (owner: 10CDanis) [13:10:49] 06SRE, 06DBA, 13Patch-For-Review, 07Wikimedia-production-error: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076#10258620 (10CDanis) p:05Triage→03High [13:11:09] _joe_: it's the same 4 nodes all the time and they've not been pooled all day [13:11:51] (03Merged) 10jenkins-bot: changeprop: parsoidCachePrewarm: halve concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082778 (https://phabricator.wikimedia.org/T378076) (owner: 10CDanis) [13:11:54] and they are in a suboptimal state, so fixing them would increase capacity by 4 [13:12:53] so I'd argue I'm not making things worse [13:14:35] !log cdanis@deploy2002 helmfile [codfw] START
helmfile.d/services/changeprop-jobqueue: apply [13:15:26] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:16:50] (03CR) 10Ssingh: [C:03+1] idm downgrade [dns] - 10https://gerrit.wikimedia.org/r/1082777 (owner: 10Slyngshede) [13:16:58] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [13:18:16] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [13:18:43] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage [13:18:59] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage [13:19:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10258649 (10phaultfinder) [13:20:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [13:21:03] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082 (10JoelyRooke-WMDE) 03NEW [13:21:07] FIRING: [3x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [13:21:37] <_joe_> cdanis: want me to take a look at the diffs for changeprop-jobqueue? 
[13:21:50] _joe_: no it's ok [13:21:58] it was just a change to the docker-registry fqdn [13:22:13] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:22:43] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082781 [13:23:22] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:23:44] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage [13:25:28] <_joe_> prewarm jobs are already going down :) [13:25:44] (03PS1) 10Slyngshede: Navigation: Fix anonymous check. [software/bitu] - 10https://gerrit.wikimedia.org/r/1082782 [13:26:29] (03CR) 10Ssingh: Authdns: add class to create zonefile snippets for K8s PTR delegation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [13:26:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage [13:29:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [13:30:57] (03CR) 10Ssingh: "Marked my only comment as resolved, thanks. Can you confirm that running it still gives the desired output since I noticed another code ch" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [13:36:30] (03CR) 10Ssingh: "Not required as an INCLUDE supersedes this." 
[dns] - 10https://gerrit.wikimedia.org/r/360767 (owner: 10Rush) [13:36:33] (03Abandoned) 10Ssingh: wip: labstore: this was moved to a private link with 192 addressing [dns] - 10https://gerrit.wikimedia.org/r/360767 (owner: 10Rush) [13:37:36] (03Abandoned) 10Slyngshede: idm downgrade [dns] - 10https://gerrit.wikimedia.org/r/1082777 (owner: 10Slyngshede) [13:39:02] (03CR) 10Slyngshede: "Issue seems related to the switch over of earlier. If a user attempts to sign in just as we do the DNS change, the OIDC tokens can be redi" [dns] - 10https://gerrit.wikimedia.org/r/1082777 (owner: 10Slyngshede) [13:40:33] !log oblivian@cumin2002 START - Cookbook sre.discovery.service-route pool mw-web-ro in codfw: maintenance [13:40:44] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1285.eqiad.wmnet with OS bookworm [13:42:02] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: fix reimage failed sentence [cookbooks] - 10https://gerrit.wikimedia.org/r/1082703 (owner: 10Elukey) [13:42:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet [13:42:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10258750 (10ops-monitoring-bot) Draining ganeti2037.codfw.wmnet of running VMs [13:43:22] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:30] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1289.eqiad.wmnet with OS bookworm [13:43:30] (03Abandoned) 10Ssingh: cloud: labs* VLANs are renamed in the switches [dns] - 
10https://gerrit.wikimedia.org/r/443135 (owner: 10Rush) [13:44:27] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082781 (owner: 10PipelineBot) [13:44:30] (03Abandoned) 10Ssingh: openstack: add labnet100[34] VLAN 1120 reservations [dns] - 10https://gerrit.wikimedia.org/r/445023 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [13:45:29] (03CR) 10Ssingh: "; wmf-zone-validator-ignore=MISSING_PTR_FOR_NAME_AND_IP supersedes these and is currently what we use." [dns] - 10https://gerrit.wikimedia.org/r/493101 (owner: 10BBlack) [13:45:38] !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool mw-web-ro in codfw: maintenance [13:45:42] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:56] (03Abandoned) 10Ssingh: [WIP] Mark Non-WMF IPs for zone_validator [dns] - 10https://gerrit.wikimedia.org/r/493101 (owner: 10BBlack) [13:46:42] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1288.eqiad.wmnet with OS bookworm [13:47:02] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:48] (03CR) 10Ssingh: "It seems like there is consensus on keeping zero.wp.org redirect, so doing that. Abandoning the patch." 
[dns] - 10https://gerrit.wikimedia.org/r/521966 (owner: 10Jforrester) [13:47:50] (03Abandoned) 10Ssingh: Drop zero.wikipedia.org redirect to www.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/521966 (owner: 10Jforrester) [13:49:00] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1286.eqiad.wmnet with OS bookworm [13:49:01] (03Abandoned) 10Ssingh: Add missing PTR records for mwdebug and tegola-vector-tiles [dns] - 10https://gerrit.wikimedia.org/r/723471 (owner: 10Effie Mouzeli) [13:50:25] (03CR) 10Ssingh: "No longer required." [dns] - 10https://gerrit.wikimedia.org/r/749218 (https://phabricator.wikimedia.org/T291541) (owner: 10Majavah) [13:50:26] (03Abandoned) 10Ssingh: add discovery record for puppet [dns] - 10https://gerrit.wikimedia.org/r/749218 (https://phabricator.wikimedia.org/T291541) (owner: 10Majavah) [13:50:51] (03Abandoned) 10Ssingh: point puppet.SITE to discovery record [dns] - 10https://gerrit.wikimedia.org/r/749219 (owner: 10Majavah) [13:52:46] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum an overlarge container db [13:53:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum an overlarge container db [13:53:05] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10258787 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=34f60f1c-3678-43fe-adcb-c7e8283b7f6c) set by mvernon@cumin... [13:56:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet [13:56:27] (03CR) 10Ssingh: "Hi: is this still required or can we abandon this? (Nothing specific to this patch, we are just cleaning up ops/dns.git). Thank you." 
[dns] - 10https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) (owner: 10CDanis) [13:57:34] !log restarting swift after vacuum on ms-be1066 T377827 [13:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:50] (03CR) 10Ssingh: "Hi: is this still required or can we abandon this? (Nothing specific to this patch, we are just cleaning up ops/dns.git). Thank you." [dns] - 10https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) (owner: 10Arturo Borrero Gonzalez) [13:57:53] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082781 (owner: 10PipelineBot) [13:57:56] T377827: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827 [13:58:38] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:54] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082781 (owner: 10PipelineBot) [13:59:14] (03PS1) 10Urbanecm: Add maintenance script to move all flow boards on a wiki to a subpage [extensions/Flow] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082791 (https://phabricator.wikimedia.org/T371738) [13:59:25] (03PS1) 10Urbanecm: Add maintenance script to move all flow boards on a wiki to a subpage [extensions/Flow] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082792 (https://phabricator.wikimedia.org/T371738) [13:59:42] !log mvernon@cumin1002 START - Cookbook sre.hosts.remove-downtime for ms-be1066.eqiad.wmnet [13:59:42] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1066.eqiad.wmnet [14:00:20] (03Abandoned) 10CDanis: add sretools.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) (owner: 10CDanis) [14:00:40] RECOVERY - Host 
ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms [14:01:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet [14:02:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet [14:02:24] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10258800 (10Fabfur) Hi @RobH , do you need anything from us (Traffic) for this? Can we help? [14:02:29] what's the status with the incident, please? is it a good idea to do a MW deployment, or should I wait with that? [14:02:48] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:03:02] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10258801 (10MatthewVernon) So, I did: - disable puppet - stop swift-* and rsync - did the vacuum as the "swift" user - ran pupp... [14:03:15] claime: cdanis: maybe you can answer ^^? [14:03:25] urbanecm: you may proceed [14:03:29] thanks! 
[14:03:44] (03CR) 10Urbanecm: [C:03+2] Add maintenance script to move all flow boards on a wiki to a subpage [extensions/Flow] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082791 (https://phabricator.wikimedia.org/T371738) (owner: 10Urbanecm) [14:03:47] (03CR) 10Urbanecm: [C:03+2] Add maintenance script to move all flow boards on a wiki to a subpage [extensions/Flow] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082792 (https://phabricator.wikimedia.org/T371738) (owner: 10Urbanecm) [14:05:12] jouncebot: nowandnext [14:05:12] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [14:05:12] In 0 hour(s) and 54 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1500) [14:05:36] ah Flow deployment thing :) [14:07:41] (03CR) 10Ssingh: "This has already been merged, abandoning." [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [14:07:44] (03Abandoned) 10Ssingh: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [14:08:04] 10ops-codfw, 06DC-Ops: Relabel servers: restbase-dev200[1-3] to cassandra-dev200[1-3] - https://phabricator.wikimedia.org/T324806#10258822 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm this ticket got lost. it's completed now. 
[14:08:13] !log gmodena@deploy2002 Started deploy [analytics/refinery@413e5d9]: 2024-10-24 refinery hotfix deployment [analytics/refinery@413e5d91] [14:08:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet [14:08:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10258836 (10ops-monitoring-bot) Draining ganeti2013.codfw.wmnet of running VMs [14:09:20] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10258837 (10Jhancock.wm) is this ticket still needed? [14:10:09] (03PS3) 10Ssingh: wmftest: Remove old performance team setup. [dns] - 10https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669) (owner: 10Phedenskog) [14:10:29] hashar: you wanna ship something too? i'm prepping for some maintenance later today [14:10:51] I wanted to restart the CI Jenkins, I will do it after :) [14:11:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10258845 (10Jhancock.wm) I don't know if this is relevant or not. I noticed when I was reseating the DIMM in 2083 that the second bank is connected to t... [14:12:09] (03Merged) 10jenkins-bot: Add maintenance script to move all flow boards on a wiki to a subpage [extensions/Flow] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082791 (https://phabricator.wikimedia.org/T371738) (owner: 10Urbanecm) [14:12:29] (03CR) 10Ssingh: [C:03+2] wmftest: Remove old performance team setup. [dns] - 10https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669) (owner: 10Phedenskog) [14:12:48] (03CR) 10Ssingh: [C:03+2] "Approvals were there, patch rebased and merging. This is part of cleanup of ops/dns.git."
[dns] - 10https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669) (owner: 10Phedenskog) [14:14:34] (03Merged) 10jenkins-bot: Add maintenance script to move all flow boards on a wiki to a subpage [extensions/Flow] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082792 (https://phabricator.wikimedia.org/T371738) (owner: 10Urbanecm) [14:15:29] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1082791|Add maintenance script to move all flow boards on a wiki to a subpage (T371738)]], [[gerrit:1082792|Add maintenance script to move all flow boards on a wiki to a subpage (T371738)]] [14:15:52] T371738: Write a script to automatically migrate Flow boards to sub-pages - https://phabricator.wikimedia.org/T371738 [14:16:02] !log gmodena@deploy2002 Finished deploy [analytics/refinery@413e5d9]: 2024-10-24 refinery hotfix deployment [analytics/refinery@413e5d91] (duration: 07m 48s) [14:17:34] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:17:38] ok nice [14:18:21] !log running authdns-update for CR 1042919 [14:18:22] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:22] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:24] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:24] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:24] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:24] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:24] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:24] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns3004 is CRITICAL: Local zone 
files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:25] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:25] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:26] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:26] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:27] PROBLEM - check if authdns-update was run after a change was submitted to dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:27] PROBLEM - check if authdns-update was run 
after a change was submitted to dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 04abb4b2cd5b7f414b26df18e5937d10d007e0a5, dns.git is 866ffadfeb18a99810ecc36778383ae7aa9421a4) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:18:46] sorry about the alert spam but I couldn't test it yesterday as ircecho was not happy [14:19:03] !log gmodena@deploy2002 Started deploy [analytics/refinery@413e5d9] (thin): 2024-10-24 refinery hotfix deployment THIN [analytics/refinery@413e5d91] [14:19:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2013.codfw.wmnet [14:20:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet [14:20:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10258885 (10ops-monitoring-bot) Draining ganeti2013.codfw.wmnet of running VMs [14:21:17] (03PS1) 10Muehlenhoff: Remove ganeti2013 from active Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1082795 (https://phabricator.wikimedia.org/T376594) [14:22:01] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [14:22:04] elukey@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[14:22:20] (PS1) Ssingh: P:dns::auth: improve text for authdns-update check [puppet] - https://gerrit.wikimedia.org/r/1082796
[14:22:34] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns1004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:22:57] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082791|Add maintenance script to move all flow boards on a wiki to a subpage (T371738)]], [[gerrit:1082792|Add maintenance script to move all flow boards on a wiki to a subpage (T371738)]] (duration: 07m 28s)
[14:23:15] (CR) Elukey: [C:+2] "Test-cookbooked, all good!" [cookbooks] - https://gerrit.wikimedia.org/r/1082707 (https://phabricator.wikimedia.org/T371400) (owner: Elukey)
[14:23:21] T371738: Write a script to automatically migrate Flow boards to sub-pages - https://phabricator.wikimedia.org/T371738
[14:23:22] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns1005 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:22] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns1006 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns2005 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns2004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns2006 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:25] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns4004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:25] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns4003 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:25] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns3003 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:25] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns3004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:26] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns6001 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:26] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns6002 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:27] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns7002 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:27] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns7001 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:28] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns5003 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[14:23:37] (CR) Tacsipacsi: tables-catalog: Add translate_message_group_subscriptions table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro)
[14:24:02] !log gmodena@deploy2002 Finished deploy [analytics/refinery@413e5d9] (thin): 2024-10-24 refinery hotfix deployment THIN [analytics/refinery@413e5d91] (duration: 04m 59s)
[14:24:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:26:11] (PS2) Ssingh: P:dns::auth: improve text for authdns-update check [puppet] - https://gerrit.wikimedia.org/r/1082796
[14:26:35] hashar: i'm done
[14:27:27] !log gmodena@deploy2002 Started deploy [analytics/refinery@413e5d9] (hadoop-test): 2024-10-24 refinery hotfix deployment TEST [analytics/refinery@413e5d91]
[14:27:38] urbanecm: thanks!
[14:28:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2085 to codfw - jhancock@cumin2002"
[14:28:44] ops-eqiad, SRE, collaboration-services, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10258935 (LSobanski)
[14:29:00] (PS19) Bking: elasticsearch: monitor snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146)
[14:30:55] (CR) Ssingh: [C:+2] P:dns::auth: improve text for authdns-update check [puppet] - https://gerrit.wikimedia.org/r/1082796 (owner: Ssingh)
[14:30:56] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[14:31:18] (PS3) Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table [puppet] - https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287)
[14:31:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2085 to codfw - jhancock@cumin2002"
[14:31:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:31:31] !log gmodena@deploy2002 Finished deploy [analytics/refinery@413e5d9] (hadoop-test): 2024-10-24 refinery hotfix deployment TEST [analytics/refinery@413e5d91] (duration: 04m 03s)
[14:31:34] (CR) Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro)
[14:32:08] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2085
[14:32:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2085
[14:34:41] (CR) Tacsipacsi: tables-catalog: Add translate_message_group_subscriptions table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro)
[14:36:37] (PS7) Muehlenhoff: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers [puppet] - https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[14:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:07] (PS1) Snwachukwu: Add cu_log table to sqoop job [puppet] - https://gerrit.wikimedia.org/r/1082800 (https://phabricator.wikimedia.org/T364398)
[14:38:55] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[14:40:01] (CR) Muehlenhoff: [C:+2] Test puppet-managed /var/lib/ganeti/known_hosts on ganeti-test2003 [puppet] - https://gerrit.wikimedia.org/r/1076188 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[14:42:19] !log Restarting CI Jenkins
[14:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:59] (Abandoned) Arturo Borrero Gonzalez: wikimediacloud.org: add openstack-next.eqiad1.wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) (owner: Arturo Borrero Gonzalez)
[14:46:47] (PS3) JHathaway: EFI: install grub on all EFI partitions [puppet] - https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949)
[14:47:06] (CR) Milimetric: [C:+1] Add cu_log table to sqoop job [puppet] - https://gerrit.wikimedia.org/r/1082800 (https://phabricator.wikimedia.org/T364398) (owner: Snwachukwu)
[14:48:48] (CR) Atieno: [C:+1] ExtensionDistributor: Mark 1.43 as beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322) (owner: MacFan4000)
[14:48:57] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002
[14:50:16] !log ihurbain@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:51:24] (PS3) CDanis: haproxy: gpc_rate arrays to all clusters [puppet] - https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144)
[14:51:25] (CR) CDanis: [C:-2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) (owner: CDanis)
[14:52:50] (PS7) Ssingh: Duplicate names by design: add zone validator ignore [dns] - https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761) (owner: Volans)
[14:53:02] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1285-1286,1288-1289].eqiad.wmnet
[14:53:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1285-1286,1288-1289].eqiad.wmnet
[14:55:24] (CR) Giuseppe Lavagetto: [C:+1] haproxy: gpc_rate arrays to all clusters [puppet] - https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) (owner: CDanis)
[14:55:52] (CR) Ssingh: "Currently (without this change):" [dns] - https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761) (owner: Volans)
[14:56:31] SRE, SRE-Access-Requests, Machine-Learning-Team, LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10259119 (calbon) Approved @MoritzMuehlenhoff
[14:57:01] (CR) Ssingh: "Patch rebased, removed old records that are no longer relevant." [dns] - https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761) (owner: Volans)
[14:57:31] (CR) Fabfur: [C:+1] haproxy: gpc_rate arrays to all clusters [puppet] - https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) (owner: CDanis)
[14:58:00] (CR) CDanis: [C:+2] haproxy: gpc_rate arrays to all clusters [puppet] - https://gerrit.wikimedia.org/r/1082236 (https://phabricator.wikimedia.org/T371144) (owner: CDanis)
[15:00:04] dancy and jeena: Time to do the Train log triage deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1500).
[15:02:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:33] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes1005.eqiad.wmnet
[15:02:43] (PS1) Jgiannelos: Revert "pcs: Configure prometheus metrics" [deployment-charts] - https://gerrit.wikimedia.org/r/1082803
[15:03:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes1005.eqiad.wmnet
[15:03:09] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1005.eqiad.wmnet
[15:03:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:03:54] (CR) Isabelle Hurbain-Palatin: [C:+1] Revert "pcs: Configure prometheus metrics" [deployment-charts] - https://gerrit.wikimedia.org/r/1082803 (owner: Jgiannelos)
[15:05:05] (CR) Jgiannelos: [C:+2] Revert "pcs: Configure prometheus metrics" [deployment-charts] - https://gerrit.wikimedia.org/r/1082803 (owner: Jgiannelos)
[15:06:06] (Merged) jenkins-bot: Revert "pcs: Configure prometheus metrics" [deployment-charts] - https://gerrit.wikimedia.org/r/1082803 (owner: Jgiannelos)
[15:08:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2085.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:08:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1005.eqiad.wmnet
[15:08:34] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes1005.eqiad.wmnet
[15:08:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes1005.eqiad.wmnet
[15:09:13] SRE, SRE-Access-Requests, Machine-Learning-Team, LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10259181 (isarantopoulos) From the ML side we suggest to proceed with providing @KartikMistry access s...
[15:09:38] (PS1) Muehlenhoff: Cover one more case in the setup of Envoy firewall rules [puppet] - https://gerrit.wikimedia.org/r/1082806
[15:09:43] (PS1) Isabelle Hurbain-Palatin: mobileapps: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1082807
[15:10:05] (PS2) Muehlenhoff: Cover one more case in the setup of Envoy firewall rules [puppet] - https://gerrit.wikimedia.org/r/1082806
[15:10:44] (CR) Jgiannelos: [C:+2] mobileapps: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1082807 (owner: Isabelle Hurbain-Palatin)
[15:11:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2085.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:12:00] (Merged) jenkins-bot: mobileapps: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1082807 (owner: Isabelle Hurbain-Palatin)
[15:13:06] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes1005.eqiad.wmnet
[15:13:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes1005.eqiad.wmnet
[15:13:41] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1005.eqiad.wmnet
[15:14:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:15:12] !log ihurbain@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:15:16] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1082806 (owner: Muehlenhoff)
[15:15:36] !log ihurbain@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:15:49] SRE, SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10259243 (WMDE-leszek) I approve this request on WMDE end. thank you!
[15:16:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:16:07] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:16:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1005.eqiad.wmnet
[15:16:17] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes1005.eqiad.wmnet
[15:16:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes1005.eqiad.wmnet
[15:17:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2085.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:18:31] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes1006.eqiad.wmnet
[15:19:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes1006.eqiad.wmnet
[15:21:06] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1006.eqiad.wmnet
[15:21:07] FIRING: [3x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:21:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:22:09] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:23:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1006.eqiad.wmnet
[15:23:38] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes1006.eqiad.wmnet
[15:23:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes1006.eqiad.wmnet
[15:23:43] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes1015.eqiad.wmnet
[15:24:16] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes1015.eqiad.wmnet
[15:26:07] FIRING: [4x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:26:18] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1015.eqiad.wmnet
[15:26:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:27:11] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:27:22] FIRING: [5x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:28:46] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1015.eqiad.wmnet
[15:28:49] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes1015.eqiad.wmnet
[15:28:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes1015.eqiad.wmnet
[15:28:55] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes1016.eqiad.wmnet
[15:29:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2085.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:29:27] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes1016.eqiad.wmnet
[15:30:07] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2005.codfw.wmnet
[15:30:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2005.codfw.wmnet
[15:31:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:31:30] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes1016.eqiad.wmnet
[15:32:30] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10259382 (Jhancock.wm)
[15:32:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:32:42] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2005.codfw.wmnet
[15:33:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:33:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes1016.eqiad.wmnet
[15:34:02] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes1016.eqiad.wmnet
[15:34:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes1016.eqiad.wmnet
[15:35:03] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:35:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2005.codfw.wmnet
[15:35:23] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2005.codfw.wmnet
[15:35:25] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2005.codfw.wmnet
[15:35:28] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2006.codfw.wmnet
[15:35:52] (PS20) Bking: elasticsearch: monitor snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146)
[15:36:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2006.codfw.wmnet
[15:36:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 288, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:07] FIRING: [4x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:36:28] (CR) CI reject: [V:-1] elasticsearch: monitor snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[15:37:22] FIRING: [5x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:37:35] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:38:03] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2006.codfw.wmnet
[15:39:03] (CR) Ssingh: [C:+2] Duplicate names by design: add zone validator ignore [dns] - https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761) (owner: Volans)
[15:39:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:40:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2006.codfw.wmnet
[15:40:35] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2006.codfw.wmnet
[15:40:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2006.codfw.wmnet
[15:40:40] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2015.codfw.wmnet
[15:41:07] FIRING: [5x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:41:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2015.codfw.wmnet
[15:41:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2086 to codfw - jhancock@cumin2002"
[15:41:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2086 to codfw - jhancock@cumin2002"
[15:41:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:41:58] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2086
[15:42:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2086
[15:42:22] FIRING: [5x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:43:05] (PS1) Cparle: Add config for testing T375264 on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1082809 (https://phabricator.wikimedia.org/T377988)
[15:43:15] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2015.codfw.wmnet
[15:43:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2086.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:44:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:45:18] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@325d943]: Deploy latest DAGs to analytics Airflow instance. T377999.
[15:45:37] T377999: Run Dumps 2.0 main DAG at a daily cadence rather than hourly. - https://phabricator.wikimedia.org/T377999
[15:45:44] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2015.codfw.wmnet
[15:45:46] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2015.codfw.wmnet
[15:45:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2015.codfw.wmnet
[15:45:51] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2016.codfw.wmnet
[15:46:07] FIRING: [6x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[15:46:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2016.codfw.wmnet
[15:46:25] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@325d943]: Deploy latest DAGs to analytics Airflow instance. T377999. (duration: 01m 07s)
[15:47:49] (PS21) Bking: elasticsearch: monitor snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146)
[15:48:25] (CR) CI reject: [V:-1] elasticsearch: monitor snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[15:48:27] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2016.codfw.wmnet
[15:49:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:49:41] (CR) C. Scott Ananian: [C:-1] "I suggest abandoning this backport: T378006#10259525" [core] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1082586 (https://phabricator.wikimedia.org/T378006) (owner: Reedy)
[15:49:49] (CR) C.
Scott Ananian: [C:04-1] "I suggest abandoning this backport: T378006#10259525" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082585 (https://phabricator.wikimedia.org/T378006) (owner: 10Reedy) [15:50:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2016.codfw.wmnet [15:50:58] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes2016.codfw.wmnet [15:51:00] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes2016.codfw.wmnet [15:52:31] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [15:52:37] (03PS3) 10Ssingh: tox.ini: add Python 3.11 to interpreters (and remove 3.7) [dns] - 10https://gerrit.wikimedia.org/r/1082548 [15:53:17] jouncebot nowandnext [15:53:17] For the next 0 hour(s) and 6 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1500) [15:53:18] In 0 hour(s) and 6 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1600) [15:53:38] I'm going to roll the train to group1 to let it marinate before the usual train window. 
[15:53:38] (03PS22) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [15:54:14] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [15:55:26] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082811 (https://phabricator.wikimedia.org/T375659) [15:55:28] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082811 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [15:56:07] FIRING: [7x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [15:56:12] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082811 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [15:57:22] FIRING: [7x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [15:59:29] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug-next_4453: Servers wikikube-worker1280.eqiad.wmnet, kubernetes1010.eqiad.wmnet, parse1011.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1419.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, mw1442.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1484.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worke [15:59:29] iad.wmnet, wikikube-worker1260.eqiad.wmnet, mw1435.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, 
wikikube-worker1287.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1457.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, mw1483.eqiad.wmnet, kubernetes1059.eqiad.wmnet, wikikube-worker1022.e [15:59:29] et, mw1486.eqiad.wmnet, wikikube-worker1272.eqiad.wmnet, kubernetes1018.eqiad.wmnet, wikikube-worker1286.eqiad.wmnet, parse1001.eqiad.wmnet, wikikube-worker1267.eqiad.wmnet, wikikube-wo https://wikitech.wikimedia.org/wiki/PyBal [16:00:05] jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1600). nyaa~ [16:00:05] Pppery: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:07] Here [16:00:07] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug-next_4453: Servers wikikube-worker1280.eqiad.wmnet, kubernetes1010.eqiad.wmnet, parse1013.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1480.eqiad.wmnet, parse1009.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, mw1435.eqiad.wmnet, mw1424.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eq [16:00:07] t, wikikube-worker1287.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1033.eqiad.wmnet, wikikube-worker1257.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1469.eqiad.wmnet, wikikube-worker1256.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1483.eqiad.wmnet, wikikube-worker1242.eqiad.wmnet, wikikube-w [16:00:07] 7.eqiad.wmnet, mw1468.eqiad.wmnet, kubernetes1028.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1031.eqiad.wmnet, 
kubernetes1024.eqiad.wmnet, mw1439.eqi https://wikitech.wikimedia.org/wiki/PyBal [16:01:07] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:01:07] FIRING: [5x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [16:01:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2086.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:01:29] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:02:22] RESOLVED: [5x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [16:03:22] Pppery: hello! sorry to be late, looking [16:03:28] No problem [16:05:29] Pppery: can you talk me through why it's okay to temporarily route those wikis to the incubator url? 
that surprises me [16:05:34] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.28 refs T375659 [16:06:05] I didn't think anything would break if they pointed there for a few hours [16:06:06] T375659: 1.43.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T375659 [16:06:18] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:06:30] Since the Incubator page itself has a pointer to the page they currently redirect to [16:06:42] If you think it's better to do these in the other order, then that's fine with me [16:06:55] but the timing was more convenient for me in this order [16:07:14] that's my instinct, yeah -- I see your point re testability, but we can test it on one host at the Apache level before rolling it out anyway [16:08:06] if you'd like we can find an open time to deploy it earlier than the next puppet window [16:08:12] (after the backport is in, I mean) [16:09:24] The backport is scheduled for UTC late today, but anyway this entire patch tree is non-urgent cleanup that I decided to do spontaneously, [16:12:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2087 to codfw - jhancock@cumin2002" [16:12:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2087 to codfw - jhancock@cumin2002" [16:12:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:13:04] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2087 [16:13:05] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2088 [16:13:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2088 [16:13:21] Pppery: got it, yeah -- this redirects patch is 
low-risk enough that I'm comfortable deploying it on a Friday, we could meet back here at 16 UTC tomorrow (24 hours later) if that works for you [16:13:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2087 [16:13:36] or if you'd rather just put it in the next Puppet window on Tuesday that's fine too [16:15:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2087.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:15:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2088.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:15:57] I think I'll just wait until Tuesday [16:17:16] sure :) talk to you then, sorry for the wait [16:17:54] (03PS1) 10Ahmon Dancy: AbuseLogPager: Fix passing `false` as message parameter [extensions/AbuseFilter] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082813 (https://phabricator.wikimedia.org/T377917) [16:18:19] (nothing else for the Puppet window today) [16:25:59] (03CR) 10EarlyWarningBot: "Failed command: "npm run selenium-test"" [extensions/AbuseFilter] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082813 (https://phabricator.wikimedia.org/T377917) (owner: 10Ahmon Dancy) [16:26:16] sigh [16:28:00] (03CR) 10CI reject: [V:04-1] AbuseLogPager: Fix passing `false` as message parameter [extensions/AbuseFilter] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082813 (https://phabricator.wikimedia.org/T377917) (owner: 10Ahmon Dancy) [16:28:18] (03CR) 10Ahmon Dancy: "recheck" [extensions/AbuseFilter] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082813 (https://phabricator.wikimedia.org/T377917) (owner: 10Ahmon Dancy) [16:32:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2087.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:42:28] (03PS1) 10Giuseppe Lavagetto: Fix encoding of user names 
[software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1082814 [16:42:39] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Fix encoding of user names [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1082814 (owner: 10Giuseppe Lavagetto) [16:43:17] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix encoding of usernames with non-ascii letters - oblivian@cumin1002" [16:43:19] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix encoding of usernames with non-ascii letters - oblivian@cumin1002 [16:43:48] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix encoding of usernames with non-ascii letters - oblivian@cumin1002 [16:43:49] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix encoding of usernames with non-ascii letters - oblivian@cumin1002" [16:48:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [extensions/AbuseFilter] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082813 (https://phabricator.wikimedia.org/T377917) (owner: 10Ahmon Dancy) [16:48:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2088.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:50:33] (03PS23) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [16:50:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:51:09] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [16:51:57] (03PS24) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [16:52:34] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [16:56:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [16:57:39] (03PS25) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [16:58:15] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [16:58:47] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:59:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10259834 (10Jhancock.wm) [17:00:05] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1700) [17:04:09] (03PS26) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [17:04:45] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [17:04:52] !log `mwscript-k8s -f extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php -- --wiki=nowiki` (running as `mw-script.codfw.ui7285yu`; T376749) [17:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:24] T376749: Run Flow migration script at Phase 0 wikis - https://phabricator.wikimedia.org/T376749 [17:05:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:06:10] (03Merged) 10jenkins-bot: AbuseLogPager: Fix passing `false` as message parameter [extensions/AbuseFilter] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082813 (https://phabricator.wikimedia.org/T377917) (owner: 10Ahmon Dancy) [17:06:38] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1082813|AbuseLogPager: Fix passing `false` as message parameter (T377917)]] [17:06:56] T377917: Special:AbuseLog InvalidArgumentException when logged out - https://phabricator.wikimedia.org/T377917 [17:09:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2013.codfw.wmnet [17:09:11] !log dancy@deploy2002 dancy: Backport for [[gerrit:1082813|AbuseLogPager: Fix passing `false` as message parameter 
(T377917)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:09:17] !log dancy@deploy2002 dancy: Continuing with sync [17:10:11] (03PS27) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [17:10:46] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [17:13:35] (03PS28) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [17:13:57] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082813|AbuseLogPager: Fix passing `false` as message parameter (T377917)]] (duration: 07m 18s) [17:14:14] T377917: Special:AbuseLog InvalidArgumentException when logged out - https://phabricator.wikimedia.org/T377917 [17:17:31] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-10-24-122318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082823 [17:20:47] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:22:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [17:22:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [17:23:32] (03PS2) 10Andrea Denisse: alert: Ensure vopsbot database is synced between the alertmanager hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082820 (https://phabricator.wikimedia.org/T375143) [17:23:32] (03CR) 10Andrea Denisse: "PCC results: 
https://puppet-compiler.wmflabs.org/output/1082820/4370/" [puppet] - 10https://gerrit.wikimedia.org/r/1082820 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [17:33:24] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-10-24-122318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082823 (owner: 10BryanDavis) [17:34:24] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-10-24-122318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082823 (owner: 10BryanDavis) [17:35:54] (03PS29) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [17:36:30] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [17:38:04] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:38:23] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:39:19] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [17:40:22] (03PS1) 10Ottomata: admin data.yaml - explicit approval is not needed for analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) [17:41:11] (03CR) 10CI reject: [V:04-1] admin data.yaml - explicit approval is not needed for analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) (owner: 10Ottomata) [17:41:48] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th), 13Patch-For-Review: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10259980 
(10Ottomata) Updated docs: https://wikitech.wikimedia.org/w/index.php?title=SRE%2FProduction... [17:41:55] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:42:18] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:42:25] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:42:51] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:44:44] (03PS2) 10Ottomata: admin - explicit approval not needed for analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1082826 (https://phabricator.wikimedia.org/T370424) [17:51:22] (03PS6) 10Elukey: sre.hosts.provision: initial UEFI support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [17:51:22] (03PS3) 10Elukey: sre.hosts.provision: make UEFI opt-out [cookbooks] - 10https://gerrit.wikimedia.org/r/1078539 (owner: 10Ayounsi) [17:53:33] (03CR) 10Elukey: "Finally rebased on top of the latest changes, Arzhel/Jesse lemme know if it looks correct or not!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [17:54:33] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10260062 (10RobH) Sorry, been tied up doing orders and quotes this week but should free up time to complete the directions today/tomorrow and have ya'll d... [18:00:04] dancy and jeena: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T1800).
[18:01:36] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082827 (https://phabricator.wikimedia.org/T375659) [18:01:37] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082827 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:02:22] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082827 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:06:30] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:21] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.28 refs T375659 [18:09:53] T375659: 1.43.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T375659 [18:21:49] (03CR) 10JHathaway: [C:03+1] "looks good to me, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [18:30:25] (03PS31) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [18:31:10] (03CR) 10Ssingh: liberica: provide a liberica module (0321 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [18:33:23] (03PS1) 10Gergő Tisza: chore: Move authevents logging into AuthManager [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082829 (https://phabricator.wikimedia.org/T341650) [18:33:36] (03PS1) 10Gergő Tisza: chore: AuthManager::autoCreateUser log authevents now [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082830 
(https://phabricator.wikimedia.org/T341650) [18:35:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082829 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [18:35:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082830 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [18:39:24] (03PS32) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [18:42:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [18:46:38] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [18:52:48] (03PS43) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [18:53:01] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [18:53:25] (03CR) 10CDobbins: "ferm_mss_cfg{endpoint="208.80.153.232:443",interface="ens13",protocol="IPv4"} 1440.0" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [18:53:38] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 
(https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [18:54:26] (03PS33) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [19:02:44] (03PS1) 10DLynch: Enable edit check on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082834 (https://phabricator.wikimedia.org/T377551) [19:03:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082834 (https://phabricator.wikimedia.org/T377551) (owner: 10DLynch) [19:09:31] (03PS1) 10Ahmon Dancy: Use SpecialPage::getRobotPolicy to set robot policy [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082835 (https://phabricator.wikimedia.org/T378108) [19:13:05] PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:13:41] (03CR) 10Eevans: "Just playing Devil's Advocate / For posterity sake: Given that Kask != (session|echo)store (at least in theory), are there any scenarios w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082731 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [19:16:47] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:23:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082835 
(https://phabricator.wikimedia.org/T378108) (owner: 10Ahmon Dancy) [19:25:51] (03CR) 10Dzahn: [C:03+1] "Let me also do this only on bookworm (gerrit2003) first. Might be easier to just test there and phase it out." [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [19:25:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:26:38] (03Merged) 10jenkins-bot: Use SpecialPage::getRobotPolicy to set robot policy [extensions/FundraiserLandingPage] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082835 (https://phabricator.wikimedia.org/T378108) (owner: 10Ahmon Dancy) [19:26:57] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1082835|Use SpecialPage::getRobotPolicy to set robot policy (T378108)]] [19:27:26] T378108: PHP Deprecated: Use of MediaWiki\Output\OutputPage::setIndexPolicy with index after noindex was deprecated in MediaWiki 1.43. [Called from MediaWiki\Extension\FundraiserLandingPage\Specials\FundraiserLandingPage::execute] - https://phabricator.wikimedia.org/T378108 [19:27:31] (03CR) 10Dzahn: [C:03+1] "the /.ssh/config part is obviously already limited to specific host names of the current prod servers.. not going to add new host there." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [19:29:07] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:29:17] !log dancy@deploy2002 dancy: Backport for [[gerrit:1082835|Use SpecialPage::getRobotPolicy to set robot policy (T378108)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:29:23] !log dancy@deploy2002 dancy: Continuing with sync [19:31:47] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:34:05] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082835|Use SpecialPage::getRobotPolicy to set robot policy (T378108)]] (duration: 07m 08s) [19:34:07] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:34:39] T378108: PHP Deprecated: Use of MediaWiki\Output\OutputPage::setIndexPolicy with index after noindex was deprecated in MediaWiki 1.43. 
[Called from MediaWiki\Extension\FundraiserLandingPage\Specials\FundraiserLandingPage::execute] - https://phabricator.wikimedia.org/T378108 [19:37:05] RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:46:49] PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:47:53] (03PS6) 10Andrea Denisse: alert: Ensure vopsbot database is synced between the alertmanager hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082820 (https://phabricator.wikimedia.org/T375143) [19:53:08] (03CR) 10Scott French: [C:03+1] "Two minor comments, but otherwise looks good!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) (owner: 10Clément Goubert) [19:53:28] 06SRE, 06SRE-OnFire: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355#10260558 (10Pppery) [20:00:53] jouncebot: hello? [20:01:02] jouncebot: nowandnext [20:01:02] For the next 0 hour(s) and 58 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T2000) [20:01:02] In 9 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241025T0600) [20:01:49] welp, I can deploy at any rate [20:01:53] It knows and yet it does not... [20:02:11] here [20:02:15] cool [20:02:33] tgr|away: are you actually away? or around for utc late? 
[20:02:49] RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:02:59] Pppery: let's get started with your changes [20:02:59] thcipriani: present [20:03:12] tgr|away: great, I'll get your changes a mergin' [20:03:24] thx [20:03:40] The first change is mostly not testable because it gets shadowed by puppet code, but a few minor cases are testable [20:03:51] the two patches can be deployed together [20:04:00] fine with me [20:04:26] (core CI has been taking 30-40 minutes recently, hopefully we'll have better luck today) [20:04:31] ooooh good [20:04:40] Pppery: sorry, I meant my patches [20:04:52] sorry, I conflated you and thcipriani [20:05:01] You both have blue names on my IRC display and start with "t" [20:05:18] (03CR) 10Thcipriani: [C:03+2] "BACKPORT" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082829 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:05:28] (03CR) 10Thcipriani: [C:03+2] "BACKPORT" [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082830 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:05:54] Pppery: :D I do that frequently [20:07:33] (03PS1) 10Dzahn: gerrit2003: set SSH disable_nist_kex to true on new bookwork host [puppet] - 10https://gerrit.wikimedia.org/r/1082846 (https://phabricator.wikimedia.org/T315942) [20:08:39] (03CR) 10Dzahn: [C:03+1] alert: Ensure vopsbot database is synced from active to passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [20:09:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079055 (https://phabricator.wikimedia.org/T376923) (owner: 
10Pppery) [20:10:32] (03Merged) 10jenkins-bot: Deploy missing.php redirects for Allemanic German [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079055 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery) [20:10:49] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1079055|Deploy missing.php redirects for Allemanic German (T376923)]] [20:11:09] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [20:11:16] (03CR) 10Dzahn: [C:03+1] "sorry, I voted on the wrong patch. I meant https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082325" [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [20:12:31] (03CR) 10Dzahn: [C:03+1] "yes, this should fix it (and also that chroot error due to stunnel not installed on the active host), LGTM !:)" [puppet] - 10https://gerrit.wikimedia.org/r/1082820 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [20:12:44] (03CR) 10Andrea Denisse: [C:03+2] alert: Ensure vopsbot database is synced between the alertmanager hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082820 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [20:13:09] !log thcipriani@deploy2002 thcipriani, pppery: Backport for [[gerrit:1079055|Deploy missing.php redirects for Allemanic German (T376923)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:27] Pppery: not testable correct? 
[20:13:35] A few edge cases are [20:13:47] But you can just proceed and I'll test the entire patch next week when the puppet part is deployed [20:14:16] okie doke, rolling forward [20:19:56] sorry, filing a task for an error I noticed [20:20:21] !log thcipriani@deploy2002 thcipriani, pppery: Continuing with sync [20:24:57] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079055|Deploy missing.php redirects for Allemanic German (T376923)]] (duration: 14m 08s) [20:25:16] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [20:25:51] still got 10 minutes for the core patch, let's do namespaces [20:28:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081267 (https://phabricator.wikimedia.org/T375102) (owner: 10Pppery) [20:28:52] (03Merged) 10jenkins-bot: Configure settings for annwiki, nrwiki, mywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081267 (https://phabricator.wikimedia.org/T375102) (owner: 10Pppery) [20:29:08] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1081267|Configure settings for annwiki, nrwiki, mywikisource (T375102 T377160 T363270)]] [20:29:36] T375102: Post-creation work for nrwiki - https://phabricator.wikimedia.org/T375102 [20:29:36] T377160: Post-creation work for annwiki - https://phabricator.wikimedia.org/T377160 [20:29:37] T363270: Post-creation work for mywikisource - https://phabricator.wikimedia.org/T363270 [20:31:20] !log thcipriani@deploy2002 thcipriani, pppery: Backport for [[gerrit:1081267|Configure settings for annwiki, nrwiki, mywikisource (T375102 T377160 T363270)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:31:24] testing [20:34:29] proceed [20:34:48] This set up namespace aliases on two different wikis, so you may want to run namespaceDupes, but I
didn't see any conflicting titles [20:35:34] ack, thanks for testing, going live [20:35:42] !log thcipriani@deploy2002 thcipriani, pppery: Continuing with sync [20:37:03] (03Merged) 10jenkins-bot: chore: Move authevents logging into AuthManager [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082829 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:37:07] (03Merged) 10jenkins-bot: chore: AuthManager::autoCreateUser log authevents now [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082830 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [20:38:17] jouncebot: nowandnext [20:38:17] For the next 0 hour(s) and 21 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T2000) [20:38:17] In 9 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241025T0600) [20:38:28] I'd like to make a beta deploy once others are done. [20:38:39] Dreamy_Jazz: ack [20:39:47] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:39:50] My config patch should be very quick, I'd hope, in terms of not holding that up.
[20:40:03] (03PS1) 10Dreamy Jazz: [beta] Disable auto-promotion to checkuser-temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082851 (https://phabricator.wikimedia.org/T377884) [20:40:17] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081267|Configure settings for annwiki, nrwiki, mywikisource (T375102 T377160 T363270)]] (duration: 11m 09s) [20:40:35] Thanks [20:40:37] T375102: Post-creation work for nrwiki - https://phabricator.wikimedia.org/T375102 [20:40:38] T377160: Post-creation work for annwiki - https://phabricator.wikimedia.org/T377160 [20:40:38] T363270: Post-creation work for mywikisource - https://phabricator.wikimedia.org/T363270 [20:41:24] Pppery: annwiki 0 pages to fix, 0 were resolvable. [20:42:47] As I expected [20:42:47] Pppery: mywikisource 0 pages to fix, 0 were resolvable. [20:43:03] But thanks for confirming [20:43:11] sure thing :) [20:43:16] tgr|away: you're up! I'll deploy them together, correct? [20:43:26] thcipriani: yes, thanks [20:44:28] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1082829|chore: Move authevents logging into AuthManager (T341650 T375510 T375505)]], [[gerrit:1082830|chore: AuthManager::autoCreateUser log authevents now (T341650 T375510 T375505)]] [20:44:48] T341650: Update authentication metrics for IP masking - https://phabricator.wikimedia.org/T341650 [20:44:48] T375510: Temp accounts Grafana Dashboard: Rate of account creation - https://phabricator.wikimedia.org/T375510 [20:44:49] T375505: Temp accounts Grafana Dashboard: Rate of temporary account creation - https://phabricator.wikimedia.org/T375505 [20:46:43] !log thcipriani@deploy2002 tgr, thcipriani: Backport for [[gerrit:1082829|chore: Move authevents logging into AuthManager (T341650 T375510 T375505)]], [[gerrit:1082830|chore: AuthManager::autoCreateUser log authevents now (T341650 T375510 T375505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) 
[20:47:08] ^ tgr|away check please [20:53:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [20:55:39] tgr|away: still checking? [20:55:47] RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:56:35] hm, it does not seem to be logging [20:56:57] but it's not causing any problems either, so maybe let's continue and I'll re-test in production [20:57:43] hrm, alrighty, I can go ahead and sync [20:57:51] the logging worked fine locally, so maybe some sort of production config difference [20:58:01] thanks for checking, difference is odd [20:58:05] !log thcipriani@deploy2002 tgr, thcipriani: Continuing with sync [20:58:10] (the patch is a noop other than logging changes) [21:02:38] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082829|chore: Move authevents logging into AuthManager (T341650 T375510 T375505)]], [[gerrit:1082830|chore: AuthManager::autoCreateUser log authevents now (T341650 T375510 T375505)]] (duration: 18m 10s) [21:02:56] ^ tgr|away sync'd good luck [21:03:06] T341650: Update authentication metrics for IP masking - https://phabricator.wikimedia.org/T341650 [21:03:06] T375510: Temp accounts Grafana Dashboard: Rate of account creation - https://phabricator.wikimedia.org/T375510 [21:03:06] T375505: Temp accounts Grafana Dashboard: Rate of temporary account creation - https://phabricator.wikimedia.org/T375505 [21:03:07] alright Kemayo yer up [21:03:11] thanks! 
[21:03:15] thcipriani: yay [21:03:35] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143 (10RobH) 03NEW [21:04:10] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10260810 (10RobH) [21:04:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082834 (https://phabricator.wikimedia.org/T377551) (owner: 10DLynch) [21:04:51] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10260818 (10RobH) a:03ABran-WMF @ABran-WMF, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the... [21:04:54] (03Merged) 10jenkins-bot: Enable edit check on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082834 (https://phabricator.wikimedia.org/T377551) (owner: 10DLynch) [21:05:12] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1082834|Enable edit check on nlwiki (T377551)]] [21:05:41] T377551: Activate Reference Check at nlwiki - https://phabricator.wikimedia.org/T377551 [21:06:52] 06SRE, 10Wikimedia-Mailing-lists: Create a mail address for Russian Wikipedia oversighters - https://phabricator.wikimedia.org/T378069#10260835 (10Dzahn) Yes, basically. But the details depend on how you configure it. You would be expected to name like 2 admins who maintain the list and can change settings acc... [21:07:25] !log thcipriani@deploy2002 thcipriani, kemayo: Backport for [[gerrit:1082834|Enable edit check on nlwiki (T377551)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:40] ^ Kemayo check please [21:07:53] thcipriani: One second [21:08:02] thcipriani: Works fine! 
[21:08:16] 06SRE, 10Wikimedia-Mailing-lists: Create a mail address for Russian Wikipedia oversighters - https://phabricator.wikimedia.org/T378069#10260850 (10Dzahn) Here are the relevant Wikipedia articles about the different software that is used: list: https://en.wikipedia.org/wiki/GNU_Mailman VRT queue: https://en... [21:09:35] Kemayo: great, going live [21:09:41] !log thcipriani@deploy2002 thcipriani, kemayo: Continuing with sync [21:11:46] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146 (10RobH) 03NEW [21:12:19] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10260881 (10RobH) [21:13:32] thcipriani: are you still deploying? I'd like to do an interwiki cache update [21:13:41] I want to deploy after [21:13:46] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10260883 (10RobH) a:03ABran-WMF @ABran-WMF, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the... [21:13:54] legoktm: yep, still deploying, then Dreamy_Jazz has claimed the conch [21:13:58] (03PS34) 10Ryan Kemper: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:14:02] kk, I'll get in line after Dreamy :) [21:14:12] should be quick from my end [21:14:20] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082834|Enable edit check on nlwiki (T377551)]] (duration: 09m 07s) [21:14:29] ^ Kemayo all done!
[21:14:39] T377551: Activate Reference Check at nlwiki - https://phabricator.wikimedia.org/T377551 [21:14:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082851 (https://phabricator.wikimedia.org/T377884) (owner: 10Dreamy Jazz) [21:14:47] thcipriani: excellent, thanks! [21:14:59] thanks! Alright, late window complete [21:15:25] (03Merged) 10jenkins-bot: [beta] Disable auto-promotion to checkuser-temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082851 (https://phabricator.wikimedia.org/T377884) (owner: 10Dreamy Jazz) [21:15:57] legoktm: Over to you. My change was beta only. [21:16:14] awesome [21:16:15] (03PS1) 10Pppery: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) [21:16:17] ty [21:18:52] (03PS2) 10Pppery: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) [21:19:26] (03PS3) 10Pppery: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) [21:20:11] (03PS4) 10Pppery: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) [21:21:18] hm, I think it might be broken because of the mwscript changes [21:22:14] oh, no, T347982 [21:22:15] T347982: scap update-interwiki-cache is broken - https://phabricator.wikimedia.org/T347982 [21:22:16] huh [21:24:00] !log Ran `foreachwiki emptyUserGroup.php checkuser-temporary-account-viewer` on the beta wikis. 
[21:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:00] (03CR) 10Dzahn: [C:03+1] "We should pass just /srv/vopsbot as module_path, not a file name. For once that will copy vopsbot.db and the schema.sql and not just the s" [puppet] - 10https://gerrit.wikimedia.org/r/1082820 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [21:25:02] !log removing 1 file for legal compliance [21:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:51] (03PS1) 10Legoktm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082855 [21:26:31] (03CR) 10Ryan Kemper: [C:03+1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:26:53] (03CR) 10Bking: [C:03+2] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:27:51] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:28:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by legoktm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082855 (owner: 10Legoktm) [21:29:00] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082855 (owner: 10Legoktm) [21:29:15] !log legoktm@deploy2002 Started scap sync-world: Backport for [[gerrit:1082855|Update interwiki cache]] [21:30:55] (03CR) 10BCornwall: [C:03+1] tox.ini: add Python 3.11 to interpreters (and remove 3.7) [dns] - 10https://gerrit.wikimedia.org/r/1082548 (owner: 10Ssingh) [21:31:37] !log legoktm@deploy2002 legoktm: Backport for [[gerrit:1082855|Update 
interwiki cache]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:05] Funny - I actually thought about explicitly scheduling "update interwiki cache" for this backport window - I decided that it wasn't urgent so could wait until the interwiki cache was updated anyway with the next wiki creation [21:32:16] !log legoktm@deploy2002 legoktm: Continuing with sync [21:32:17] But no objection to what you're doing [21:32:39] Also while you're doing interwiki stuff can I get a code review on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1080093? [21:32:51] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:32:54] (03PS1) 10Andrea Denisse: alert: Ensure the db and schema path is synced between alert hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082856 (https://phabricator.wikimedia.org/T375143) [21:32:54] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1082856/4375/" [puppet] - 10https://gerrit.wikimedia.org/r/1082856 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [21:33:20] Pppery: I got excited by you approving the ccorg one :) [21:34:02] +2'd [21:34:57] (03CR) 10Dzahn: [C:03+1] alert: Ensure the db and schema path is synced between alert hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082856 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [21:37:06] !log legoktm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082855|Update interwiki cache]] (duration: 07m 51s) [21:38:42] I'm done [21:38:55] thanks [22:06:30] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:09:41] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1082846/4376/gerrit2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1082846 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [22:14:21] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10261067 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082846 in relation to T315942 [22:14:26] (03CR) 10Dzahn: [C:03+1] "for now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082846" [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [22:15:00] (03CR) 10Dzahn: "How about week of November 4th?:)" [puppet] - 10https://gerrit.wikimedia.org/r/1059156 (owner: 10Dzahn) [22:15:57] (03CR) 10Dzahn: [C:04-1] "UID needs to be over 900 and be added to admin module" [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [22:19:35] (03CR) 10Zabe: [C:03+2] s8: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082579 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [22:20:18] (03Merged) 10jenkins-bot: s8: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082579 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [22:20:46] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1082579|s8: Reduce revision-slots cache expiry to 60 seconds (T183490)]] [22:21:13] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:23:04] !log zabe@deploy2002 zabe: Backport for [[gerrit:1082579|s8: Reduce revision-slots cache expiry to 60 seconds (T183490)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:23:15] !log zabe@deploy2002 zabe: Continuing with sync [22:27:50] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082579|s8: Reduce revision-slots cache expiry to 60 seconds (T183490)]] (duration: 07m 03s) [22:28:08] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:57:06] (03CR) 10Andrea Denisse: [C:03+2] alert: Ensure the db and schema path is synced between alert hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082856 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [23:09:38] !log removing 3 files for legal compliance [23:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:12] (03PS5) 10MacFan4000: ExtensionDistributor: Mark 1.43 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322) [23:19:31] RECOVERY - SSH on rdb1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:38:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082866 [23:38:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082866 (owner: 10TrainBranchBot)