[00:39:06] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1252983
[00:39:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1252983 (owner: TrainBranchBot)
[00:52:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1252983 (owner: TrainBranchBot)
[00:53:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::5e5e:ab00:d3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:58:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::5e5e:ab00:d3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:08:58] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253007
[01:08:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253007 (owner: TrainBranchBot)
[01:27:19] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253007 (owner: TrainBranchBot)
[01:35:00] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:00:48] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:40] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 52s)
[02:18:21] (CR) Scott French: "Thanks, Raine!" [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková)
[02:33:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:35:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:16] (PS2) Anzx: bowiki: update logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1253046 (https://phabricator.wikimedia.org/T419268)
[03:25:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:55:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:09:38] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158 (Papaul) NEW
[04:16:11] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159 (Papaul) NEW
[04:35:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[05:18:00] (PS1) AikoChou: ml-services: update image for revise tone task generator in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1253156 (https://phabricator.wikimedia.org/T416904)
[05:20:59] (CR) AikoChou: [C:+2] ml-services: update image for revise tone task generator in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1253156 (https://phabricator.wikimedia.org/T416904) (owner: AikoChou)
[05:23:20] (Merged) jenkins-bot: ml-services: update image for revise tone task generator in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1253156 (https://phabricator.wikimedia.org/T416904) (owner: AikoChou)
[05:25:51] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[05:35:00] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:53:02] (PS2) Kevin Bazira: ml-services: add gpt-oss-safeguard-20b isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1251269 (https://phabricator.wikimedia.org/T418350)
[06:01:28] (PS1) AikoChou: ml-services: update image for revise tone task generator on prod [deployment-charts] - https://gerrit.wikimedia.org/r/1253196 (https://phabricator.wikimedia.org/T416904)
[06:06:07] (CR) AikoChou: "Tested on staging. This change ensures the revise-tone task generator always processes edits on testwiki. People can create tasks for QA p" [deployment-charts] - https://gerrit.wikimedia.org/r/1253196 (https://phabricator.wikimedia.org/T416904) (owner: AikoChou)
[06:19:08] (CR) Itamar Givon: [C:+1] "Same as Silvan, LGTM." [dumps] - https://gerrit.wikimedia.org/r/1251169 (https://phabricator.wikimedia.org/T401296) (owner: WMDE-leszek)
[06:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:39:55] (CR) Muehlenhoff: "Puppetising the entire template has proven to be a bit of a pain since Oracle changes it fairly often and then we need to fiddle in the ch" [puppet] - https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: Elukey)
[06:39:55] (PS2) Daniel Kinzler: rest-gateway: handle trust level C with invalid token. [deployment-charts] - https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106)
[06:40:03] (CR) Daniel Kinzler: rest-gateway: handle trust level C with invalid token. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: Daniel Kinzler)
[06:46:31] (CR) Daniel Kinzler: [C:-1] rest-gateway rate limit: add DENY policy and class (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1250598 (owner: Daniel Kinzler)
[06:52:37] !nowandnext
[06:55:50] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host lists1004.wikimedia.org
[06:59:47] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1253046 (https://phabricator.wikimedia.org/T419268) (owner: Anzx)
[07:00:04] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T0700).
[07:00:04] katherine_g, codenamenoreste, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:33] o/
[07:02:30] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists1004.wikimedia.org
[07:03:40] (PS3) Daniel Kinzler: rest-gateway rate limit: add DENY policy and class [deployment-charts] - https://gerrit.wikimedia.org/r/1250598
[07:03:54] (PS10) Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969)
[07:06:01] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host doc2003.codfw.wmnet
[07:10:02] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc2003.codfw.wmnet
[07:14:18] (PS11) Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969)
[07:21:54] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host aphlict1002.eqiad.wmnet
[07:25:12] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11712110 (ayounsi) ` an-master1003: skipping host (Make sure the redundant master is active.) an-worker1220: skipping host (no depool...
[07:25:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:50] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict1002.eqiad.wmnet
[07:26:09] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet
[07:28:32] (CR) Arnaudb: [C:+1] "very good idea, thanks for the implementation Jelto" [puppet] - https://gerrit.wikimedia.org/r/1251406 (owner: Jelto)
[07:33:04] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet
[07:33:42] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host stewards1001.eqiad.wmnet
[07:37:37] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stewards1001.eqiad.wmnet
[07:38:27] o/ here a bit late but ready to deploy mine whenever others are done
[07:40:17] (PS1) KartikMistry: Update cxserver to 2026-03-16-071247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1253260 (https://phabricator.wikimedia.org/T420004)
[07:42:10] katherine_g: i think nobody is deploying at the moment, if you can would it be possible to deploy a patch i scheduled on this window
[07:43:13] (CR) TrainBranchBot: [C:+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251276 (https://phabricator.wikimedia.org/T419950) (owner: Kgraessle)
[07:44:13] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11712154 (ayounsi) ` alert1002: Couldn't get or parse depool Hiera key an-worker1151: skipping host (no depool needed) an-worker...
[07:45:10] (Merged) jenkins-bot: Fix broken survey links on PersonalDashboard [mediawiki-config] - https://gerrit.wikimedia.org/r/1251276 (https://phabricator.wikimedia.org/T419950) (owner: Kgraessle)
[07:45:53] !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1251276|Fix broken survey links on PersonalDashboard (T419950)]]
[07:45:57] T419950: Fix broken survey links on PersonalDashboard - https://phabricator.wikimedia.org/T419950
[07:51:25] (CR) Arnaudb: [C:+2] gerrit: sync httpd config to ATS [puppet] - https://gerrit.wikimedia.org/r/1251092 (https://phabricator.wikimedia.org/T417998) (owner: Arnaudb)
[07:52:56] !log installing Linux 5.10.251 on Bullseye hosts
[07:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:01] anzx: it looks like your patch doesn't have a +1 or +2
[07:56:44] katherine_g: i will reschedule mine for next window
[07:57:15] anzx: ok thank you sounds good
[07:57:36] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11712178 (MoritzMuehlenhoff)
[07:58:30] (Abandoned) Slyngshede: hiera: use short for meassures [puppet] - https://gerrit.wikimedia.org/r/1251029 (owner: Slyngshede)
[07:59:13] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[08:01:31] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:02:38] (PS1) Arnaudb: trafficserver: Enable connection re-use for gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1253268 (https://phabricator.wikimedia.org/T417998)
[08:02:40] (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1253268 (https://phabricator.wikimedia.org/T417998) (owner: Arnaudb)
[08:04:01] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:04:29] !log kgraessle@deploy2002 kgraessle: Backport for [[gerrit:1251276|Fix broken survey links on PersonalDashboard (T419950)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:04:32] T419950: Fix broken survey links on PersonalDashboard - https://phabricator.wikimedia.org/T419950
[08:05:27] !log kgraessle@deploy2002 kgraessle: Continuing with sync
[08:06:27] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[08:07:43] FIRING: [4x] ProbeDown: Service gitlab1004:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:09:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:13] (CR) Elukey: "+1 totally" [puppet] - https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: Elukey)
[08:10:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet
[08:12:43] RESOLVED: [4x] ProbeDown: Service gitlab1004:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:14:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:27] jmm@cumin2002 drain-node (PID 2989844) is awaiting input
[08:15:32] (PS4) Daniel Kinzler: rest-gateway: per-path overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130)
[08:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:17:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet
[08:18:02] !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251276|Fix broken survey links on PersonalDashboard (T419950)]] (duration: 32m 09s)
[08:18:06] T419950: Fix broken survey links on PersonalDashboard - https://phabricator.wikimedia.org/T419950
[08:20:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:20:54] (CR) Arnaudb: [C:+2] trafficserver: Enable connection re-use for gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1253268 (https://phabricator.wikimedia.org/T417998) (owner: Arnaudb)
[08:21:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:22:08] !log taavi@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3
[08:24:02] (PS6) Trueg: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415)
[08:24:11] (CR) Trueg: wikidata-platform: wdqs-queryhammer chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[08:25:09] (CR) Bartosz Wójtowicz: [C:+1] "LGTM, thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1251269 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:25:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:25:30] (CR) Trueg: "The values in the configmap of the chart were an accidental copy from kafka-mirrormaker. The queryhammer chart does not have any configmap" [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[08:27:46] (CR) Elukey: "mmm wait a sec, from the docs I read:" [puppet] - https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: Elukey)
[08:27:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet
[08:28:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet
[08:29:25] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2005.wikimedia.org
[08:29:52] (CR) Majavah: [C:+2] apt: Add keyfile for debian-debug/backports [puppet] - https://gerrit.wikimedia.org/r/1251275 (https://phabricator.wikimedia.org/T419957) (owner: Majavah)
[08:30:14] (PS1) Arnaudb: trafficserver: Enable connection re-use for gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1253279 (https://phabricator.wikimedia.org/T417998)
[08:32:37] SRE, Infrastructure Security: Sensible updates of java.security properties - https://phabricator.wikimedia.org/T282545#11712292 (elukey) Very interesting, from the link it seems that we could create an override file simply adding something like `-Djava.security.properties=/etc/sysconfig/jvm.java.security...
[08:33:27] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2005.wikimedia.org
[08:35:00] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idp-test1005.wikimedia.org
[08:36:22] (CR) Muehlenhoff: "Sure, let's go ahead with the old approach to not block the Kafka migration. I'll have a look at the patch shortly." [puppet] - https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: Elukey)
[08:36:29] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11712298 (taavi)
[08:36:59] SRE, Infrastructure-Foundations, netops, ServiceOps new: Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11712299 (taavi)
[08:39:01] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test1005.wikimedia.org
[08:40:27] (CR) Vgutierrez: [C:+1] trafficserver: Enable connection re-use for gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1253279 (https://phabricator.wikimedia.org/T417998) (owner: Arnaudb)
[08:41:50] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1253279 (https://phabricator.wikimedia.org/T417998) (owner: Arnaudb)
[08:43:03] (CR) Kevin Bazira: [C:+2] ml-services: add gpt-oss-safeguard-20b isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1251269 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:43:11] (PS5) Elukey: sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216)
[08:43:12] (CR) Elukey: sre.hosts.provision: Allow more optional BIOS values for Supermicro (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: Elukey)
[08:44:25] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[08:44:25] !log jmm@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-test-eqiad
[08:44:55] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idp1005.wikimedia.org
[08:44:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet
[08:45:35] (Merged) jenkins-bot: ml-services: add gpt-oss-safeguard-20b isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1251269 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:47:35] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:48:43] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003"
[08:48:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet
[08:48:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003"
[08:48:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:48:53] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp1005.wikimedia.org
[08:49:15] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:49:18] (CR) Gmodena: wikidata-platform: wdqs-queryhammer chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[08:50:13] (CR) Arnaudb: [C:+2] trafficserver: Enable connection re-use for gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1253279 (https://phabricator.wikimedia.org/T417998) (owner: Arnaudb)
[08:54:36] (PS1) Slyngshede: IDP: Failover for OS patching [dns] - https://gerrit.wikimedia.org/r/1253377
[08:54:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet
[08:55:14] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1253377 (owner: Slyngshede)
[08:56:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet
[08:56:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet
[08:57:24] (CR) Slyngshede: [C:+2] IDP: Failover for OS patching [dns] - https://gerrit.wikimedia.org/r/1253377 (owner: Slyngshede)
[08:58:27] !log slyngshede@dns1004 START - running authdns-update
[08:59:51] !log slyngshede@dns1004 END - running authdns-update
[09:00:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet
[09:01:47] (PS2) Arnaudb: trafficserver: Enable connection re-use for gerrit [puppet] - https://gerrit.wikimedia.org/r/1253376 (https://phabricator.wikimedia.org/T417998)
[09:02:20] (CR) Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[09:03:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet
[09:05:57] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idp2005.wikimedia.org
[09:06:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet
[09:08:17] (CR) Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[09:08:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet
[09:09:50] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2005.wikimedia.org
[09:10:48] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11712377 (Jclark-ctr)
[09:11:36] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM cloudidp2001-dev.codfw.wmnet
[09:13:57] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts tcp-proxy4001.ulsfo.wmnet
[09:14:34] (CR) Arnaudb: [C:+2] trafficserver: Enable connection re-use for gerrit [puppet] - https://gerrit.wikimedia.org/r/1253376 (https://phabricator.wikimedia.org/T417998) (owner: Arnaudb)
[09:14:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet
[09:15:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet
[09:15:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet
[09:15:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet
[09:15:39] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cloudidp2001-dev.codfw.wmnet
[09:16:12] (CR) Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[09:18:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:19:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet
[09:20:29] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org
[09:20:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-test-eqiad
[09:21:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:21:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts tcp-proxy4001.ulsfo.wmnet
[09:21:16] SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11712407 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `tcp-proxy4001.ulsfo.wmnet` - tcp-proxy4001.ulsfo.wmnet...
[09:22:00] !log failover Ganeti master in magru to ganeti7004
[09:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:32] SRE, Infrastructure-Foundations, netops, ServiceOps new, Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11712410 (Gehel)
[09:22:41] SRE, Infrastructure-Foundations, netops, ServiceOps new, Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11712412 (Gehel)
[09:23:18] (CR) Ayounsi: [C:+1] cmooney: remove temp. ssh key [homer/public] - https://gerrit.wikimedia.org/r/1251284 (owner: Cathal Mooney)
[09:23:33] (CR) Cathal Mooney: [C:+2] cmooney: remove temp. ssh key [homer/public] - https://gerrit.wikimedia.org/r/1251284 (owner: Cathal Mooney)
[09:24:22] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org
[09:24:22] PROBLEM - ganeti-wconfd running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:25:57] (Merged) jenkins-bot: cmooney: remove temp. ssh key [homer/public] - https://gerrit.wikimedia.org/r/1251284 (owner: Cathal Mooney)
[09:26:14] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org
[09:26:46] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[09:29:02] (PS4) Phuedx: mw::maintenance: Remove ExperimentationLab periodic job [puppet] - https://gerrit.wikimedia.org/r/1249932 (https://phabricator.wikimedia.org/T419428)
[09:29:29] moritzm: ^
[09:29:38] (PS1) Slyngshede: IDM: OS patching [dns] - https://gerrit.wikimedia.org/r/1253391
[09:30:07] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org
[09:30:10] (PS5) Phuedx: mw::maintenance: Remove ExperimentationLab periodic job [puppet] - https://gerrit.wikimedia.org/r/1249932 (https://phabricator.wikimedia.org/T419428)
[09:31:08] SRE-SLO, Abstract Wikipedia team, ServiceOps new, Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11712455 (elukey) I followed up again today and created https://wikitech.wikimedia.org/wiki/User...
[09:32:02] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:32:24] jclark@cumin1003 netbox (PID 3658975) is awaiting input [09:33:21] (03CR) 10Slyngshede: [C:03+2] IDM: OS patching [dns] - 10https://gerrit.wikimedia.org/r/1253391 (owner: 10Slyngshede) [09:33:52] !log slyngshede@dns1004 START - running authdns-update [09:34:14] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:34:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:34:27] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:35:00] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:35:04] 06SRE, 10Infrastructure Security: Sensible updates of java.security properties - https://phabricator.wikimedia.org/T282545#11712492 (10MoritzMuehlenhoff) >>! In T282545#11712292, @elukey wrote: > Very interesting, from the link it seems that we could create an override file simply adding something like `-Djava... 
[09:35:17] !log slyngshede@dns1004 END - running authdns-update [09:36:09] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:37:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:38:24] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [09:38:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [09:38:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:38:40] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:39:09] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org [09:41:49] jmm@cumin2002 netbox (PID 3010493) is awaiting input [09:43:06] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org [09:43:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decom tcp-proxy4001 - jmm@cumin2002" [09:43:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decom tcp-proxy4001 - jmm@cumin2002" [09:43:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:46:11] !log jmm@cumin2002 START - Cookbook 
sre.hosts.decommission for hosts tcp-proxy4002.ulsfo.wmnet [09:48:05] (03PS1) 10Muehlenhoff: Remove tcp-proxy4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/1253397 (https://phabricator.wikimedia.org/T418993) [09:48:06] Gerrit is very slow for me at the moment (getting 502 errors intermittently), does anyone else have issues with it? (Might be the wrong channel to ask, but I guess there are enough Gerrit users here to get an answer) [09:48:17] (03PS1) 10Arnaudb: trafficserver: disable connection re-use for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1253396 (https://phabricator.wikimedia.org/T417998) [09:48:40] FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:07] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11712621 (10cmooney) Can we hold off on any work related to this? I am planning to dr... [09:49:18] 06SRE, 10Infrastructure Security: Sensible updates of java.security properties - https://phabricator.wikimedia.org/T282545#11712622 (10elukey) I remember that Traffic suggested the use of hardened TLS settings to allow the use case of mTLS when pushing webrequest data to Kafka Jumbo, and we probably wanted to... [09:49:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11712623 (10cmooney) Can we hold off on any work related to this? I am planning... 
[09:51:02] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:51:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet [09:51:42] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:51:53] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:53:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11712668 (10cmooney) [09:53:40] RESOLVED: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:53:40] RESOLVED: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:50] Jhs: judging by https://phabricator.wikimedia.org/T417998#11712639 seems like it may be a known issue [09:54:14] (it is also very slow for me rn) [09:54:34] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:54:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:54:48] A_smart_kitten, aight, nice. as long as the right people are aware and fixing it, i'm happy :) thanks! 
[09:54:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: tcp-proxy4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:54:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts tcp-proxy4002.ulsfo.wmnet [09:55:03] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11712686 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `tcp-proxy4002.ulsfo.wmnet` - tcp... [09:55:08] RECOVERY - MariaDB Replica IO: s3 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:55:08] RECOVERY - MariaDB Replica SQL: s3 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:55:56] jmm@cumin2002 drain-node (PID 3013086) is awaiting input [09:56:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [09:56:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1000) [10:01:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2004.codfw.wmnet [10:02:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [10:02:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node 
ganeti7001.magru.wmnet [10:03:40] FIRING: JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:07] RECOVERY - MariaDB Replica Lag: s3 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:27] elukey@cumin1003 provision (PID 3664672) is awaiting input [10:07:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2004.codfw.wmnet [10:08:40] RESOLVED: JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:08:47] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:09:27] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:09:36] 07Puppet, 06collaboration-services, 10Gerrit: Edit puppet-merge to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org? 
- https://phabricator.wikimedia.org/T420184 (ABran-WMF) NEW [10:14:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3008.esams.wmnet [10:17:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [10:18:37] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11712803 (Jclark-ctr) [10:19:23] ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11712805 (jcrespo) Please take your time, as I said it can be down for some time. My only question is to please update the state of the h... [10:19:44] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11712806 (Jclark-ctr) a:Jclark-ctr→Jgreen @Jgreen Hey jeff these are setup and reachable.
password should be set to default dell password [10:20:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2009.codfw.wmnet [10:20:10] (03CR) 10Elukey: [C:03+1] hardware.upgrade-firmware: Fix usage path [cookbooks] - 10https://gerrit.wikimedia.org/r/1244788 (owner: 10BCornwall) [10:20:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:25] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [10:21:30] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251540 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [10:22:07] (03CR) 10Btullis: [C:03+2] Update HaproxyKafkaNoMessages for team-data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1251293 (https://phabricator.wikimedia.org/T419829) (owner: 10Btullis) [10:22:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:23:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:23:14] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: handle trust level C with invalid token. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: 10Daniel Kinzler) [10:23:28] (03PS1) 10Elukey: sre.hosts.provision: adapt for new dse-k8s-workers [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 [10:23:56] (03Merged) 10jenkins-bot: Update HaproxyKafkaNoMessages for team-data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1251293 (https://phabricator.wikimedia.org/T419829) (owner: 10Btullis) [10:24:17] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:24:36] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:25:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2009.codfw.wmnet [10:25:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [10:25:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3008.esams.wmnet [10:28:49] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:28:57] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:29:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet [10:30:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251539 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [10:31:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet [10:32:03] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update image for revise tone task generator on prod [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1253196 (https://phabricator.wikimedia.org/T416904) (owner: 10AikoChou) [10:32:56] (03PS1) 10Btullis: Update the druid-public cluster and configure new Java cmdline options [puppet] - 10https://gerrit.wikimedia.org/r/1253415 (https://phabricator.wikimedia.org/T278056) [10:33:02] (03CR) 10Muehlenhoff: [C:03+2] cfssl: Run tests on Bullseye and Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1251089 (owner: 10Muehlenhoff) [10:34:01] (03CR) 10Muehlenhoff: [C:03+2] mcrounter: Run spec tests on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1248768 (owner: 10Muehlenhoff) [10:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:35:18] (03PS2) 10Btullis: Update the druid-public cluster and configure new Java cmdline options [puppet] - 10https://gerrit.wikimedia.org/r/1253415 (https://phabricator.wikimedia.org/T278056) [10:35:58] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253415 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [10:38:38] (03PS1) 10Btullis: Increase the size of the WAL volume for postgresql-airflow-sre [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253422 (https://phabricator.wikimedia.org/T402512) [10:39:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet [10:39:32] (03PS3) 10Btullis: Update the druid-public cluster and configure new Java cmdline options [puppet] - 10https://gerrit.wikimedia.org/r/1253415 (https://phabricator.wikimedia.org/T278056) [10:39:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2010.codfw.wmnet [10:39:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for 
draining ganeti node ganeti3007.esams.wmnet [10:40:43] (PS1) Mszwarc: Always use external actor for interwiki rights logs on target wiki [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253423 (https://phabricator.wikimedia.org/T6055) [10:40:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253423 (https://phabricator.wikimedia.org/T6055) (owner: Mszwarc) [10:41:49] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11712882 (gmodena) cc @Ottomata [10:43:34] (CR) Joal: [C:+1] "LGTM - no need to change the JVM config as we've not experienced issues so far" [puppet] - https://gerrit.wikimedia.org/r/1253415 (https://phabricator.wikimedia.org/T278056) (owner: Btullis) [10:43:49] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11712886 (DSantamaria) Approved!
[10:45:16] (03CR) 10Btullis: [C:03+2] Update the druid-public cluster and configure new Java cmdline options [puppet] - 10https://gerrit.wikimedia.org/r/1253415 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [10:46:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host druid1009.eqiad.wmnet with OS bookworm [10:46:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host druid1010.eqiad.wmnet with OS bookworm [10:46:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host druid1011.eqiad.wmnet with OS bookworm [10:46:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host druid1012.eqiad.wmnet with OS bookworm [10:47:00] (03PS14) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [10:47:01] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host druid1013.eqiad.wmnet with OS bookworm [10:47:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2010.codfw.wmnet [10:48:32] (03PS1) 10JavierMonton: stream: mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253425 (https://phabricator.wikimedia.org/T408918) [10:49:55] (03PS2) 10Btullis: Increase the size of the WAL volume for postgresql-airflow-sre [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253422 (https://phabricator.wikimedia.org/T402512) [10:50:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1010.eqiad.wmnet, druid1012.eqiad.wmnet, druid1013.eqiad.wmnet are marked down but pooled: druid-public-coordinator_8081: Servers druid1010.eqiad.wmnet, druid1012.eqiad.wmnet, druid1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:50:42] (03PS1) 10JavierMonton: stream: 
mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253426 (https://phabricator.wikimedia.org/T408918) [10:50:43] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1010.eqiad.wmnet, druid1012.eqiad.wmnet, druid1013.eqiad.wmnet are marked down but pooled: druid-public-coordinator_8081: Servers druid1010.eqiad.wmnet, druid1012.eqiad.wmnet, druid1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:53:40] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [10:55:03] (03CR) 10Btullis: [C:03+2] Increase the size of the WAL volume for postgresql-airflow-sre [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253422 (https://phabricator.wikimedia.org/T402512) (owner: 10Btullis) [10:55:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11712926 (10Jclark-ctr) [10:55:24] (03CR) 10Jcrespo: [C:03+2] mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [10:55:59] btullis: merge? [10:56:54] (03Merged) 10jenkins-bot: Increase the size of the WAL volume for postgresql-airflow-sre [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253422 (https://phabricator.wikimedia.org/T402512) (owner: 10Btullis) [10:57:56] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1013.eqiad.wmnet with reason: host reimage [10:59:20] I assumed yes, your change looked minor [11:00:04] arnaudb : That opportune time for a gerrit primary reboot deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1100). 
[11:00:20] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host gerrit2003.wikimedia.org [11:01:06] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1012.eqiad.wmnet with reason: host reimage [11:02:37] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage [11:02:46] FIRING: [8x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [11:02:49] 10ops-eqdfw, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11712958 (10FCeratto-WMF) a:05FCeratto-WMF→03None [11:02:51] FIRING: [2x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [11:02:53] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage [11:04:05] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage [11:05:30] 10ops-eqdfw, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11712973 (10FCeratto-WMF) DC-Ops: could you please check if everything is ok on the bios/firmware side and if there are hardware issues? 
[11:05:34] ops-eqdfw, DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11712974 (FCeratto-WMF) In progress→Open [11:06:05] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [11:06:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [11:06:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1013.eqiad.wmnet with reason: host reimage [11:07:32] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2003.wikimedia.org [11:07:46] RESOLVED: [7x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [11:07:51] RESOLVED: [14x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [11:09:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1012.eqiad.wmnet with reason: host reimage [11:09:31] (PS2) Elukey: sre.hosts.provision: refactor bios if/else branches [cookbooks] - https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) [11:10:26] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:10:43] !log elukey@cumin1003 END (FAIL) -
Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:12:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1011.eqiad.wmnet with reason: host reimage [11:12:19] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe-codfw [11:12:19] (03PS11) 10Effie Mouzeli: Update chart metadata for various charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250597 (https://phabricator.wikimedia.org/T412693) [11:12:59] !log mvernon@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe-eqiad [11:14:11] (03PS1) 10Elukey: sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216) [11:14:31] PROBLEM - Druid middlemanager on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:14:31] PROBLEM - Druid overlord on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:14:51] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:15:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:15:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage [11:16:25] PROBLEM - Host ms-fe2009 is DOWN: PING CRITICAL - Packet loss = 100% [11:16:25] PROBLEM - Host ms-fe1009 is DOWN: PING 
CRITICAL - Packet loss = 100% [11:16:55] RECOVERY - Host ms-fe2009 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [11:17:16] (03CR) 10Effie Mouzeli: [C:03+2] Update chart metadata for various charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250597 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [11:17:32] RECOVERY - Host ms-fe1009 is UP: PING OK - Packet loss = 0%, RTA = 3.31 ms [11:17:46] PROBLEM - Druid historical on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:17:46] PROBLEM - Druid coordinator on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:19:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage [11:20:34] PROBLEM - Druid middlemanager on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:20:34] PROBLEM - Druid overlord on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:22:20] PROBLEM - Host druid1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:28] PROBLEM - Druid overlord on druid1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:22:30] PROBLEM - Druid middlemanager on druid1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:22:49] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on dse-k8s-worker[1012,1015-1017].eqiad.wmnet with reason: Adding 10 Gbps NIC [11:23:09] (03PS1) 10Sergio Gimeno: AccountCreation: track account registrations for WE1.8 experiments [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253450 (https://phabricator.wikimedia.org/T416100) [11:23:33] RECOVERY - Druid middlemanager on druid1013 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:23:33] RECOVERY - Druid overlord on druid1013 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:23:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253450 (https://phabricator.wikimedia.org/T416100) (owner: 10Sergio Gimeno) [11:24:01] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:24:19] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:24:28] PROBLEM - Host ms-fe1010 is DOWN: PING CRITICAL - Packet loss = 100% [11:24:58] RECOVERY - Host ms-fe1010 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [11:25:29] PROBLEM - Druid broker on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:25:40] FIRING: 
SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:47] RECOVERY - Druid coordinator on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:25:49] RECOVERY - Host druid1012 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:26:03] PROBLEM - Host dse-k8s-worker1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:17] PROBLEM - Host ms-fe2010 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:22] (03CR) 10Muehlenhoff: java: add java-21-security erb template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: 10Elukey) [11:26:33] PROBLEM - Host dse-k8s-worker1017 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:33] PROBLEM - Host dse-k8s-worker1016 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:33] PROBLEM - Host dse-k8s-worker1015 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:47] PROBLEM - Druid historical on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:26:51] RECOVERY - Host ms-fe2010 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [11:27:30] PROBLEM - Druid historical on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:27:30] PROBLEM - Druid coordinator on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:27:30] PROBLEM - Druid broker on 
druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:27:40] PROBLEM - Host druid1011 is DOWN: PING CRITICAL - Packet loss = 100% [11:29:18] PROBLEM - Druid middlemanager on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:29:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1013.eqiad.wmnet with OS bookworm [11:29:20] RECOVERY - Host druid1011 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [11:29:30] PROBLEM - Druid middlemanager on druid1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:29:30] PROBLEM - Druid overlord on druid1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:30:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet [11:31:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1012.eqiad.wmnet with OS bookworm [11:31:23] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:31:32] PROBLEM - Druid broker on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:31:32] RECOVERY - Druid broker on druid1013 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main 
server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:31:32] RECOVERY - Druid overlord on druid1011 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:31:33] RECOVERY - Druid middlemanager on druid1011 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:31:34] PROBLEM - Druid overlord on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:31:42] PROBLEM - Host ms-fe1011 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:48] RECOVERY - Druid historical on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:32:38] RECOVERY - Host ms-fe1011 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [11:33:02] (03CR) 10Clément Goubert: [C:04-1] "There's a logic footgun with the way lua handles table ordering that should be at least mentioned, if not handled." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [11:33:43] jmm@cumin2002 drain-node (PID 3035787) is awaiting input [11:33:56] PROBLEM - Host druid1010 is DOWN: PING CRITICAL - Packet loss = 100% [11:34:07] 06SRE, 06Infrastructure-Foundations: Consider reducing verbosity of IRC logging - https://phabricator.wikimedia.org/T419919#11713106 (10Volans) Removing the cumin tag as cumin doesn't log to IRC at all. Adding the SRE one as this is not a technical problem but a workflow one that involves everyone touching pro... 
[11:34:25] PROBLEM - Host ms-fe2011 is DOWN: PING CRITICAL - Packet loss = 100% [11:34:29] PROBLEM - Druid coordinator on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:34:35] RECOVERY - Host druid1010 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [11:34:57] RECOVERY - Host ms-fe2011 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [11:35:33] PROBLEM - Druid coordinator on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:35:33] PROBLEM - Druid middlemanager on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:35:33] PROBLEM - Druid broker on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:35:34] PROBLEM - Druid historical on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:36:14] (03Merged) 10jenkins-bot: Update chart metadata for various charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250597 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [11:36:29] PROBLEM - Druid overlord on druid1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:37:03] PROBLEM - Host druid1009 is DOWN: PING CRITICAL - Packet loss = 100% [11:37:29] RECOVERY - Druid overlord on druid1010 is OK: PROCS OK: 1 process with command name java, args 
org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:37:33] RECOVERY - Druid middlemanager on druid1010 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:37:33] RECOVERY - Druid coordinator on druid1010 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:37:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1011.eqiad.wmnet with OS bookworm [11:37:43] RECOVERY - Host druid1009 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [11:38:29] PROBLEM - Druid coordinator on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:38:29] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:38:33] PROBLEM - Druid overlord on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:38:33] PROBLEM - Druid broker on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:09] PROBLEM - Host ms-fe1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:40:17] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:29] RECOVERY - Druid middlemanager 
on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:29] RECOVERY - Druid coordinator on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:33] RECOVERY - Druid historical on druid1010 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:33] RECOVERY - Druid overlord on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:33] RECOVERY - Druid broker on druid1010 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:37] RECOVERY - Host ms-fe1012 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [11:40:47] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:40:51] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:41:17] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:41:33] RECOVERY - Druid broker on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:41:47] PROBLEM - Host ms-fe2012 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:53] (03CR) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [11:42:55] RECOVERY - Host ms-fe2012 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [11:43:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1010.eqiad.wmnet with OS bookworm [11:45:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [11:45:18] (03PS15) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) [11:46:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1009.eqiad.wmnet with OS bookworm [11:47:48] (03PS1) 10Slyngshede: Permissions: format longer comments [software/bitu] - 10https://gerrit.wikimedia.org/r/1253455 (https://phabricator.wikimedia.org/T401720) [11:49:11] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [11:49:21] PROBLEM - Host ms-fe1013 is DOWN: PING CRITICAL - Packet loss = 100% [11:49:43] (03PS2) 10Slyngshede: Permissions: format longer comments [software/bitu] - 10https://gerrit.wikimedia.org/r/1253455 (https://phabricator.wikimedia.org/T401720) [11:50:01] RECOVERY - Host ms-fe1013 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [11:50:25] PROBLEM - Host ms-fe2013 is DOWN: PING CRITICAL - Packet loss = 100% [11:50:53] RECOVERY - Host ms-fe2013 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [11:51:47] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.37 ms [11:52:25] (03PS1) 10Jcrespo: mediabackups: Modify syntax for new systemd and package version [puppet] - 10https://gerrit.wikimedia.org/r/1253457 (https://phabricator.wikimedia.org/T410020) [11:53:01] (03CR) 10CI reject: [V:04-1] mediabackups: Modify syntax for new systemd and package version [puppet] - 10https://gerrit.wikimedia.org/r/1253457 
(https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:53:09] (03CR) 10AikoChou: [C:03+2] "Thanks for the review!!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253196 (https://phabricator.wikimedia.org/T416904) (owner: 10AikoChou) [11:54:33] (03PS2) 10Jcrespo: mediabackups: Modify syntax for new systemd and package version [puppet] - 10https://gerrit.wikimedia.org/r/1253457 (https://phabricator.wikimedia.org/T410020) [11:55:53] (03Merged) 10jenkins-bot: ml-services: update image for revise tone task generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253196 (https://phabricator.wikimedia.org/T416904) (owner: 10AikoChou) [11:56:22] (03PS2) 10Muehlenhoff: Switch the netinsights role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1242372 [11:56:49] (03CR) 10Volans: sre.loadbalancer: Provide check-ipip cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [11:57:09] !log btullis@cumin1003 START - Cookbook sre.hosts.remove-downtime for druid[1009-1013].eqiad.wmnet [11:57:11] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.remove-downtime (exit_code=97) for druid[1009-1013].eqiad.wmnet [11:57:17] !log btullis@cumin1003 START - Cookbook sre.hosts.remove-downtime for druid[1009-1013].eqiad.wmnet [11:57:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for druid[1009-1013].eqiad.wmnet [11:58:07] PROBLEM - Host ms-fe2014 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:45] RECOVERY - Host ms-fe2014 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms [12:00:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti3006.esams.wmnet [12:00:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti3006.esams.wmnet [12:01:24] (03PS1) 10Sergio Gimeno: fix(anon warning): remove wring type=signup 
param [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253461 (https://phabricator.wikimedia.org/T415160) [12:01:48] (03CR) 10EggRoll97: [C:04-1] "This doesn't appear to clean up that admins can grant the editor group." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [12:03:07] PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:37] RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [12:04:03] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Support fractional routing for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [12:05:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253461 (https://phabricator.wikimedia.org/T415160) (owner: 10Sergio Gimeno) [12:08:40] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11713232 (10cmooney) >>! In T411054#11696665, @BTullis wrote: > Will all of the switches in rows C & D be getting this configuration change? Yes we need to fix it on a... 
[12:09:38] (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [12:10:36] mvernon@cumin2002 roll-restart-reboot-swift-ms-proxies (PID 3031201) is awaiting input [12:10:43] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir4001.ulsfo.wmnet [12:11:09] PROBLEM - Host ms-fe1015 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:55] RECOVERY - Host ms-fe1015 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [12:13:36] (03PS1) 10Elukey: sre.hosts.provision: use PATCH and PUT to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [12:13:51] jmm@cumin2002 decommission (PID 3042877) is awaiting input [12:14:56] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:15:24] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:15:36] (03PS3) 10Slyngshede: Permissions: format longer comments [software/bitu] - 10https://gerrit.wikimedia.org/r/1253455 (https://phabricator.wikimedia.org/T401720) [12:19:17] PROBLEM - Host ms-fe1016 is DOWN: PING CRITICAL - Packet loss = 100% [12:20:16] !log failover Ganeti master in esams to ganeti3005 [12:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:22] !log failover Ganeti master in esams to ganeti3008 [12:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:31] PROBLEM - Host ncredir4001 is DOWN: PING CRITICAL - Packet loss = 100% [12:20:51] RECOVERY - Host ms-fe1016 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [12:21:05] (03CR) 10Btullis: [C:03+1] stream: mw-content-history-reconcile-enrich 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1253425 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [12:21:18] 10ops-eqdfw, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11713271 (10Ladsgroup) We still will need to do work once it's been fixed and need to be informed of the progress. Putting back the DBA tag. [12:21:19] (03CR) 10Btullis: [C:03+1] stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253426 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [12:21:39] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:21:39] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:21:45] PROBLEM - ganeti-wconfd running on ganeti3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:22:54] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:25:17] (03PS3) 10Jcrespo: mediabackups: Modify syntax for new systemd and package version [puppet] - 10https://gerrit.wikimedia.org/r/1253457 (https://phabricator.wikimedia.org/T410020) [12:25:22] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl1002.eqiad.wmnet [12:25:41] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[12:26:23] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253457 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [12:27:36] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1017 [12:27:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7002.magru.wmnet [12:27:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir4001.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:28:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1017 [12:30:04] PROBLEM - SSH on wikikube-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:30:13] RECOVERY - Host dse-k8s-worker1017 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [12:30:18] (03CR) 10Jcrespo: [C:03+2] mediabackups: Modify syntax for new systemd and package version [puppet] - 10https://gerrit.wikimedia.org/r/1253457 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [12:30:55] jmm@cumin2002 decommission (PID 3042877) is awaiting input [12:31:37] mvernon@cumin1003 roll-restart-reboot-swift-ms-proxies (PID 3673555) is awaiting input [12:31:41] (03CR) 10Slyngshede: [C:03+2] Permissions: format longer comments [software/bitu] - 10https://gerrit.wikimedia.org/r/1253455 (https://phabricator.wikimedia.org/T401720) (owner: 10Slyngshede) [12:31:55] RECOVERY - SSH on wikikube-ctrl1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:32:00] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:32:08] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:32:09] RECOVERY - Host dse-k8s-worker1012 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [12:32:21] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:33:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7002.magru.wmnet [12:33:39] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:33:39] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:33:39] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:34:10] (03CR) 10Muehlenhoff: [C:03+2] Switch the netinsights role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1242372 (owner: 10Muehlenhoff) [12:34:21] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:35:14] (03Merged) 10jenkins-bot: Permissions: format longer comments [software/bitu] - 10https://gerrit.wikimedia.org/r/1253455 (https://phabricator.wikimedia.org/T401720) (owner: 10Slyngshede) [12:35:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:37:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir4001.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:37:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:37:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir4001.ulsfo.wmnet [12:37:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir4002.ulsfo.wmnet [12:37:56] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11713386 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir4001.ulsfo.wmnet` - ncred... [12:38:21] (03PS1) 10Bartosz Wójtowicz: ml-services: Lower MAX_MODEL_LEN for CoPE-A-9B. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253480 (https://phabricator.wikimedia.org/T418832) [12:38:54] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11713395 (10MoritzMuehlenhoff) [12:39:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [12:39:44] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:39:44] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:40:09] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command 
on namespace 'revise-tone-task-generator' for release 'main' . [12:41:15] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl1002.eqiad.wmnet [12:42:11] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl1003.eqiad.wmnet [12:42:17] (03CR) 10Slyngshede: [C:03+2] P:idp switch default OIDC profile format to FLAT [puppet] - 10https://gerrit.wikimedia.org/r/1250944 (owner: 10Slyngshede) [12:42:22] PROBLEM - Host ncredir4002 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet [12:43:33] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Lower MAX_MODEL_LEN for CoPE-A-9B. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253480 (https://phabricator.wikimedia.org/T418832) (owner: 10Bartosz Wójtowicz) [12:44:12] (03CR) 10Clément Goubert: [C:03+1] Remove PSP related code from admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:44:30] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Lower MAX_MODEL_LEN for CoPE-A-9B. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253480 (https://phabricator.wikimedia.org/T418832) (owner: 10Bartosz Wójtowicz) [12:44:38] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:45:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [12:46:41] (03Merged) 10jenkins-bot: ml-services: Lower MAX_MODEL_LEN for CoPE-A-9B. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1253480 (https://phabricator.wikimedia.org/T418832) (owner: 10Bartosz Wójtowicz) [12:47:03] PROBLEM - SSH on wikikube-ctrl1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:48:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [12:48:16] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-gutter-eqiad [12:48:53] RECOVERY - SSH on wikikube-ctrl1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:50:14] jmm@cumin2002 decommission (PID 3049187) is awaiting input [12:50:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [12:51:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl1003.eqiad.wmnet [12:51:33] PROBLEM - Host ganeti3005 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:35] PROBLEM - Host mc-gp1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:49] RECOVERY - Host mc-gp1004 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [12:57:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1300). [13:00:05] James_F, Msz2001, Sergi0, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] o/ [13:00:13] PROBLEM - Host ms-fe1017 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:17] PROBLEM - Host ms-fe2015 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:18] I'll self-deploy mine quickly. [13:00:19] o/ [13:00:24] And then get out of the way. [13:00:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251487 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [13:01:03] RECOVERY - Host ms-fe2015 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [13:01:09] RECOVERY - Host ms-fe1017 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [13:01:33] (03CR) 10Btullis: [C:03+2] Disable the x1 section on an-redacteddb1001 until we can populate it [puppet] - 10https://gerrit.wikimedia.org/r/1251494 (https://phabricator.wikimedia.org/T407485) (owner: 10Btullis) [13:02:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4003.ulsfo.wmnet [13:02:36] (03Merged) 10jenkins-bot: Replace direct BagOStuff with WANObjectCache [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251487 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [13:02:59] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1251487|Replace direct BagOStuff with WANObjectCache (T419666)]] [13:03:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:03:03] T419666: WikiLambda: Replace direct usage of BagOStuff with WANObjectCache - https://phabricator.wikimedia.org/T419666 [13:03:53] !log jiji@cumin1003 END (ERROR) - Cookbook sre.memcached.roll-reboot-restart (exit_code=97) rolling reboot on A:memcached-gutter-eqiad [13:04:26] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:04:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:04:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir4002.ulsfo.wmnet [13:04:41] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11713526 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir4002.ulsfo.wmnet` - ncred... [13:04:48] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-gutter-eqiad [13:05:03] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl1004.eqiad.wmnet [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:27] PROBLEM - mysqld processes on an-redacteddb1001 is CRITICAL: PROCS CRITICAL: 9 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:06:54] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1251487|Replace direct BagOStuff with WANObjectCache (T419666)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:07:22] !log jforrester@deploy2002 jforrester: Continuing with sync [13:07:25] PROBLEM - Host ms-fe1018 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:27] (03PS2) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) [13:07:29] PROBLEM - Host ms-fe2016 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:33] Msz2001: Are you able to take over and deploy your patch and anzx's? [13:08:36] Yes [13:08:49] Excellent, as soon as my sync is done I'll bow out. [13:09:17] PROBLEM - Host mc-gp1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:09:35] PROBLEM - Host wikikube-ctrl1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:09:41] RECOVERY - Host ms-fe1018 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [13:09:47] RECOVERY - Host ms-fe2016 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [13:09:52] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti3005.esams.wmnet [13:09:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti3005.esams.wmnet [13:09:53] RECOVERY - Host mc-gp1005 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:10:09] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11713554 (10Ottomata) Approved! [13:11:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4003.ulsfo.wmnet [13:11:37] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:11:58] RECOVERY - Host wikikube-ctrl1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:13:31] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11713574 (MoritzMuehlenhoff) [13:13:44] (CR) Ottomata: "Let's leave the release name at staging, otherwise LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: AKhatun) [13:13:50] FIRING: ProbeDown: Service ganeti3005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:24] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251487|Replace direct BagOStuff with WANObjectCache (T419666)]] (duration: 11m 25s) [13:14:28] T419666: WikiLambda: Replace direct usage of BagOStuff with WANObjectCache - https://phabricator.wikimedia.org/T419666 [13:14:32] Msz2001: Over to you. [13:14:44] Starting my patch [13:14:56] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253423 (https://phabricator.wikimedia.org/T6055) (owner: Mszwarc) [13:15:10] FIRING: [2x] GanetiBGPDown: BGP session down between ganeti3005 and asw1-by27-esams - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [13:15:16] echo 'https://en.wikipedia.org/static/images/project-logos/frwiki.png' | mwscript purgeList.php [13:15:48] (CR) Ssingh: "@kharlan@wikimedia.org: Happy to do it today, let me know if you prefer us to do it or you would like to be around when we do it."
[puppet] - https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: Kosta Harlan) [13:16:22] (CR) Ottomata: stream: deploy edit-type stream to production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: AKhatun) [13:16:39] (CR) Ottomata: "Ah wait, the s3 HA bucket path does matter. Suggestion inline." [deployment-charts] - https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: AKhatun) [13:19:12] anzx: frwiki's logo needs purging? [13:19:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3004.esams.wmnet [13:19:55] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host wikikube-ctrl1004.eqiad.wmnet [13:20:15] (PS3) Elukey: java: add java-21-security erb template [puppet] - https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) [13:20:48] (CR) Elukey: java: add java-21-security erb template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: Elukey) [13:20:56] mvernon@cumin1003 roll-restart-reboot-swift-ms-proxies (PID 3673555) is awaiting input [13:21:02] !log drain edgeuno transit for optic replacement - T415743 [13:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:05] T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743 [13:21:08] Msz2001: i had copied it for bowiki, by mistake pasted it here, please ignore it [13:21:18] np [13:21:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [13:21:27] jiji@cumin1003 roll-reboot-restart (PID 3696293) is awaiting input [13:22:17] wikikube-ctrl1004.eqiad.wmnet is ok, it just took a little too long to
recover but is now up [13:22:27] PROBLEM - Host ms-fe2017 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:39] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and A:durum [13:23:12] (PS1) AOkoth: miscweb: add wmf-navigator values - empty httpd [deployment-charts] - https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) [13:23:33] RECOVERY - Host ms-fe2017 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [13:23:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [13:24:16] (Abandoned) Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1235826 (owner: Dpogorzelski) [13:24:46] (PS3) Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) [13:24:52] (PS2) AOkoth: miscweb: add wmf-navigator values - empty httpd [deployment-charts] - https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) [13:25:13] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough [13:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3004.esams.wmnet [13:25:40] (CR) Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg) [13:25:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [13:27:31] PROBLEM - Bird Internet Routing Daemon on durum2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:27:35] PROBLEM - Host ganeti2025 is DOWN: PING CRITICAL - Packet loss = 100%
[13:28:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:28:31] RECOVERY - Bird Internet Routing Daemon on durum2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:28:41] RECOVERY - Host ganeti2025 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [13:28:51] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-gutter-eqiad [13:29:25] (Merged) jenkins-bot: Always use external actor for interwiki rights logs on target wiki [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253423 (https://phabricator.wikimedia.org/T6055) (owner: Mszwarc) [13:29:43] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1253423|Always use external actor for interwiki rights logs on target wiki (T6055)]] [13:29:46] T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055 [13:30:04] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-gutter-eqiad [13:30:10] FIRING: [3x] GanetiBGPDown: BGP session down between ganeti3005 and asw1-by27-esams - group - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [13:30:11] PROBLEM - Bird Internet Routing Daemon on doh1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:30:21] yeah well this should have been downtimed [13:31:13] RECOVERY - Bird Internet Routing Daemon on doh1001 is OK: PROCS OK: 1 process with command name bird
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:31:27] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1253423|Always use external actor for interwiki rights logs on target wiki (T6055)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:31:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [13:31:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [13:31:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [13:32:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [13:32:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2004.codfw.wmnet [13:33:10] RESOLVED: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:33:50] FIRING: [2x] ProbeDown: Service ganeti2025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:33:59] PROBLEM - Host mc-gp1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:34:31] PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:34:45] !log mszwarc@deploy2002 mszwarc: Continuing with sync [13:34:46] yeah I suspect downtiming is broken somehow :) [13:35:02] mvernon@cumin2002 roll-restart-reboot-swift-ms-proxies (PID 3031201) is awaiting input [13:35:03] ops-magru: Inbound errors on interface
cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11713641 (ayounsi) > netops drains link in advance of work during EU AM. Done. [13:35:09] RECOVERY - Host mc-gp1006 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:35:30] (CR) Mszwarc: [C:+2] "Accepting ahead of deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1253046 (https://phabricator.wikimedia.org/T419268) (owner: Anzx) [13:35:34] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1253490 [13:35:35] RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:35:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:36:27] PROBLEM - Bird Internet Routing Daemon on doh1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:36:29] (Merged) jenkins-bot: bowiki: update logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1253046 (https://phabricator.wikimedia.org/T419268) (owner: Anzx) [13:37:27] RECOVERY - mysqld processes on an-redacteddb1001 is OK: PROCS OK: 8 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:37:27] RECOVERY - Bird Internet Routing Daemon on doh1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:37:38] (PS1) Jcrespo: mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) [13:38:02] (PS2) Jcrespo: mediabackups: Fix bug in
which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) [13:38:14] (PS3) Jcrespo: mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) [13:38:25] FIRING: [16x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:38:30] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1253492 [13:38:36] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253423|Always use external actor for interwiki rights logs on target wiki (T6055)]] (duration: 08m 53s) [13:38:39] T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055 [13:38:41] jmm@cumin2002 drain-node (PID 3061230) is awaiting input [13:38:44] anzx: Proceeding with yours [13:38:49] ok [13:38:57] (CR) CI reject: [V:-1] mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo) [13:39:04] (PS4) Jcrespo: mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) [13:39:18] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1253046|bowiki: update logos (T419268)]] [13:39:22] T419268: Update site logo for bo.wikipedia (spelling correction) - https://phabricator.wikimedia.org/T419268 [13:39:43] (PS5) Jcrespo: mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) [13:39:44] !log jmm@cumin2002 END (PASS) - Cookbook
sre.hosts.reboot-single (exit_code=0) for host netflow2004.codfw.wmnet [13:39:47] (CR) CI reject: [V:-1] mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo) [13:39:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [13:40:20] (CR) CI reject: [V:-1] mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo) [13:41:08] !log mszwarc@deploy2002 mszwarc, anzx: Backport for [[gerrit:1253046|bowiki: update logos (T419268)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:41:30] Msz2001: looks good, ok to sync [13:41:36] !log mszwarc@deploy2002 mszwarc, anzx: Continuing with sync [13:42:14] (PS6) Jcrespo: mediabackups: Fix bug in which services were defined in duplicate [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) [13:42:55] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) (owner: Jcrespo) [13:43:01] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-gutter-eqiad [13:43:09] PROBLEM - Bird Internet Routing Daemon on doh2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:43:25] RESOLVED: [16x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:43:55] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-gutter-codfw [13:44:12]
RECOVERY - Bird Internet Routing Daemon on doh2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:44:41] (Restored) Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1235826 (owner: Dpogorzelski) [13:44:47] (PS5) Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1235826 [13:44:47] (PS1) Dpogorzelski: kserve: bump to upstream version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253495 (https://phabricator.wikimedia.org/T419722) [13:45:21] (Abandoned) Dpogorzelski: kserve: update to version 0.16 [deployment-charts] - https://gerrit.wikimedia.org/r/1235835 (owner: Dpogorzelski) [13:45:27] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw1003.eqiad.wmnet with OS trixie [13:45:36] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253046|bowiki: update logos (T419268)]] (duration: 06m 17s) [13:45:40] T419268: Update site logo for bo.wikipedia (spelling correction) - https://phabricator.wikimedia.org/T419268 [13:45:44] Msz2001: please run above to purge images of bowiki https://www.irccloud.com/pastebin/67de05BT/ [13:45:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [13:45:52] PROBLEM - Bird Internet Routing Daemon on durum6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:45:53] (Abandoned) Dpogorzelski: kserve: bump to upstream version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253495 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski) [13:45:57] (Abandoned) Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1251552
(owner: PipelineBot) [13:46:04] (Abandoned) Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1251863 (owner: PipelineBot) [13:46:32] (Abandoned) Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1253490 (owner: PipelineBot) [13:46:33] (Abandoned) Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1235826 (owner: Dpogorzelski) [13:46:46] Purged [13:47:05] Msz2001: Thanks for deploying [13:47:15] sergi0: You can deploy. I finished [13:47:34] (PS1) Dpogorzelski: kserve: bump to upstream version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253496 (https://phabricator.wikimedia.org/T419722) [13:47:42] PROBLEM - Host mc-gp2004 is DOWN: PING CRITICAL - Packet loss = 100% [13:47:52] RECOVERY - Bird Internet Routing Daemon on durum6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:47:54] (Abandoned) Dpogorzelski: kserve: bump to upstream version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253496 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski) [13:48:25] FIRING: [22x] BFDdown: BFD session down between asw1-b13-drmrs and 10.136.1.23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:48:32] (PS1) Btullis: Allow members of analytics-wikidata-users access to stat hosts [puppet] - https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) [13:49:07] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) (owner: Btullis) [13:49:14] RECOVERY - Host mc-gp2004 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [13:49:22] FIRING: GnmiTargetDown: lsw1-e8-eqiad is
unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [13:49:34] PROBLEM - Bird Internet Routing Daemon on doh2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:49:58] PROBLEM - Host cloudgw1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:18] PROBLEM - Host ms-fe1019 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:30] PROBLEM - Host ms-fe2018 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:34] RECOVERY - Bird Internet Routing Daemon on doh2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:51:23] (PS1) Dpogorzelski: kserve: update to version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) [13:51:46] RECOVERY - Host ms-fe1019 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [13:52:00] RECOVERY - Host ms-fe2018 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms [13:52:31] (Abandoned) Dpogorzelski: kserve: update to version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski) [13:53:25] RESOLVED: [18x] BFDdown: BFD session down between asw1-b13-drmrs and 10.136.1.23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:53:28] meh, UTC confused [13:53:28] PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:53:50] RECOVERY - Host cloudgw1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [13:54:16] (PS2) Btullis: Allow
members of analytics-wikidata-users access to stat hosts [puppet] - https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) [13:54:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1003.eqiad.wmnet [13:54:28] RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:54:40] (CR) TrainBranchBot: [C:+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253461 (https://phabricator.wikimedia.org/T415160) (owner: Sergio Gimeno) [13:54:40] (CR) TrainBranchBot: [C:+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253450 (https://phabricator.wikimedia.org/T416100) (owner: Sergio Gimeno) [13:54:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [13:54:56] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) (owner: Btullis) [13:55:41] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:55:58] (Restored) Dpogorzelski: kserve: update to version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski) [13:56:03] (PS2) Dpogorzelski: kserve: update to version 0.17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) [13:57:41] RECOVERY - Bird Internet Routing Daemon on doh3005 is OK: PROCS OK: 1 process with command name bird
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:57:47] PROBLEM - Host ms-fe1020 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:47] PROBLEM - Host ms-fe2019 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:55] (PS1) Ssingh: geo-resources: update IP addresses for ulsfo services [dns] - https://gerrit.wikimedia.org/r/1253503 (https://phabricator.wikimedia.org/T418971) [13:58:35] PROBLEM - Host ganeti2026 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:47] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:59:38] (CR) JavierMonton: [C:+2] stream: mw-content-history-reconcile-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1253425 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton) [13:59:39] PROBLEM - Bird Internet Routing Daemon on durum1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:59:43] RECOVERY - Host ms-fe1020 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [13:59:43] RECOVERY - Host ganeti2026 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [13:59:55] (CR) JavierMonton: [C:+2] stream: mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1253426 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton) [14:00:01] RECOVERY - Host ms-fe2019 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [14:00:38] jiji@cumin1003 roll-reboot-restart (PID 3700611) is awaiting input [14:01:39] RECOVERY - Bird Internet Routing Daemon on durum1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:01:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1003.eqiad.wmnet [14:01:49] (Merged) jenkins-bot: stream: mw-content-history-reconcile-enrich [deployment-charts] -
https://gerrit.wikimedia.org/r/1253425 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton) [14:02:27] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1253505 [14:02:40] !log arnaudb@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on gerrit2002.wikimedia.org with reason: T418256 [14:02:44] T418256: Deploy Phab/Phorge 2026-02-24 - https://phabricator.wikimedia.org/T418256 [14:02:51] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:03:05] (Merged) jenkins-bot: stream: mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1253426 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton) [14:03:07] (PS1) Ssingh: service.yaml/WMCS cloudgw: update IPs for ulsfo-lb (text, upload, gerrit, ncredir) [puppet] - https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) [14:03:41] (CR) CI reject: [V:-1] service.yaml/WMCS cloudgw: update IPs for ulsfo-lb (text, upload, gerrit, ncredir) [puppet] - https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: Ssingh) [14:03:52] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage [14:04:14] (PS2) Ssingh: service.yaml/WMCS cloudgw: update IPs for ulsfo-lb (text/upload/gerrit/ncredir) [puppet] - https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) [14:04:51] RECOVERY - Bird Internet Routing Daemon on doh3006 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:04:58] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: testing [14:05:38] ops-ulsfo, SRE, DC-Ops,
Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11713777 (ssingh) @Papaul / @ayounsi : Patches should be ready for this. Any preference for the day of (this) week for when we should do this? [14:05:54] (PS3) Aude: Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) [14:06:45] starting some rolling reboots of wikikube workers shortly [14:07:31] PROBLEM - Bird Internet Routing Daemon on durum5001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:08:22] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:08:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [14:08:25] FIRING: [14x] BFDdown: BFD session down between cr1-codfw and 208.80.153.38 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:08:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet [14:08:31] RECOVERY - Bird Internet Routing Daemon on durum5001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:08:54] (Merged) jenkins-bot: fix(anon warning): remove wring type=signup param [extensions/MobileFrontend] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253461 (https://phabricator.wikimedia.org/T415160) (owner: Sergio Gimeno) [14:08:57] (Merged) jenkins-bot: AccountCreation: track account registrations for WE1.8 experiments [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1253450 (https://phabricator.wikimedia.org/T416100)
(owner: Sergio Gimeno) [14:08:58] (CR) Cathal Mooney: [C:+1] decom cookbook: add --homer parameter [cookbooks] - https://gerrit.wikimedia.org/r/1251099 (owner: Ayounsi) [14:09:17] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1253461|fix(anon warning): remove wring type=signup param (T415160)]], [[gerrit:1253450|AccountCreation: track account registrations for WE1.8 experiments (T416100)]] [14:09:21] T415160: Logged-Out Warning Message: first iteration design changes for mobile - https://phabricator.wikimedia.org/T415160 [14:09:22] T416100: Logged-Out Warning Message: Instrumentation and Experiment Setup for first iteration A/B Test - https://phabricator.wikimedia.org/T416100 [14:09:47] PROBLEM - Bird Internet Routing Daemon on doh4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:10:54] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage [14:11:04] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:11:05] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1253461|fix(anon warning): remove wring type=signup param (T415160)]], [[gerrit:1253450|AccountCreation: track account registrations for WE1.8 experiments (T416100)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:11:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:11:23] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:11:28] mvernon@cumin2002 roll-restart-reboot-swift-ms-proxies (PID 3031201) is awaiting input [14:11:41] (CR) Lerickson: [C:+1] "Thanks Ben!" [puppet] - https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) (owner: Btullis) [14:11:47] RECOVERY - Bird Internet Routing Daemon on doh4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:12:25] PROBLEM - Host ms-fe1021 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:53] RECOVERY - Host ms-fe1021 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:13:25] RESOLVED: [14x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:13:27] PROBLEM - Bird Internet Routing Daemon on durum5002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:13:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:13:55] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:14:36] !log sgimeno@deploy2002 sgimeno: Continuing with sync [14:16:29] RECOVERY - Bird Internet Routing Daemon on durum5002 is OK: PROCS OK: 1 process with command name bird
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:16:49] PROBLEM - Bird Internet Routing Daemon on doh4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:17:25] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:17:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:17:51] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11713863 (10Jhancock.wm) It's not posting at the moment. I have some tricks to try today and if not, i have some decommed servers i can pull... [14:17:55] PROBLEM - Host ms-fe2020 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:25] FIRING: [18x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:18:33] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253461|fix(anon warning): remove wring type=signup param (T415160)]], [[gerrit:1253450|AccountCreation: track account registrations for WE1.8 experiments (T416100)]] (duration: 09m 16s) [14:18:37] T415160: Logged-Out Warning Message: first iteration design changes for mobile - https://phabricator.wikimedia.org/T415160 [14:18:38] T416100: Logged-Out Warning Message: Instrumentation and Experiment Setup for first iteration A/B Test - https://phabricator.wikimedia.org/T416100 [14:18:49] RECOVERY - Bird Internet Routing Daemon on doh4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:19:23] RECOVERY - Host ms-fe2020 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [14:20:40] FIRING: SystemdUnitFailed: 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:42] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:20:51] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:20:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:21:00] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1002-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:21:13] !log blake@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on P{wikikube-worker[1002-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:21:13] SRE, Infrastructure-Foundations, Patch-For-Review, Security: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166#11713874 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This has been resolved, the Docker registry and the Prometheus export... 
[14:21:23] RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:22:23] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1002-1003].eqiad.wmnet [14:22:24] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1002-1003].eqiad.wmnet [14:22:39] SRE-tools, Infrastructure-Foundations, serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11713899 (elukey) p:Triage→Low [14:22:50] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1002-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:23:25] RESOLVED: [16x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:25:04] SRE, Traffic: Anycast ns[01].wikimedia.org for IPv4 - https://phabricator.wikimedia.org/T366193#11713908 (ssingh) >>! In T366193#11686108, @cmooney wrote: > @ssingh in terms of the IPv6 anycast plans what is the current situation? > > I notice some patches like [[ https://gerrit.wikimedia.org/r/c/operat... 
[14:25:29] PROBLEM - Bird Internet Routing Daemon on doh5001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:25:49] (03CR) 10Jcrespo: [C:03+2] mediabackups: Fix bug in which services were defined in duplicate [puppet] - 10https://gerrit.wikimedia.org/r/1253491 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [14:26:19] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:26:29] RECOVERY - Bird Internet Routing Daemon on doh5001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:26:31] PROBLEM - Host ms-fe1022 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:45] RECOVERY - Host ms-fe1022 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:27:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11713919 (10cmooney) p:05Triage→03Low [14:28:01] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11713920 (10cmooney) p:05Triage→03Low [14:28:33] PROBLEM - Host wikikube-worker1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:33] PROBLEM - Host wikikube-worker1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:51] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:28:51] PROBLEM - Bird Internet Routing Daemon on durum3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:28:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:29:03] RECOVERY - Host wikikube-worker1002 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [14:29:03] RECOVERY - Host wikikube-worker1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:29:18] 06SRE, 06Infrastructure-Foundations: Consider reducing verbosity of IRC logging - https://phabricator.wikimedia.org/T419919#11713927 (10cmooney) p:05Triage→03Low [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1430) [14:30:20] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:30:38] (03PS1) 10Effie Mouzeli: hieradata: migrate memcached cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) [14:30:52] mvernon@cumin2002 roll-restart-reboot-swift-ms-proxies (PID 3031201) is awaiting input [14:30:53] RECOVERY - Bird Internet Routing Daemon on durum3006 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:31:07] (03PS2) 10Elukey: sre.hosts.provision: use PATCH and PUT to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [14:31:27] PROBLEM - Bird Internet Routing Daemon on doh5002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:31:33] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:31:55] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:33:25] FIRING: [16x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.14 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:33:28] RECOVERY - Bird Internet Routing Daemon on doh5002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:33:59] PROBLEM - Host mc-gp2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:07] PROBLEM - Host ms-fe1023 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:23] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1003.eqiad.wmnet with OS trixie [14:34:45] RECOVERY - Host ms-fe1023 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:34:55] jouncebot: nowandnext [14:34:55] For the next 0 hour(s) and 25 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1430) [14:34:55] In 0 hour(s) and 55 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1530) [14:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:35:02] 06SRE, 06Traffic: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868#11713955 (10ssingh) That's interesting, thanks for debugging. What is weird is that a restart of anycast-healthchecker then should have fixed this in theory? 
[14:35:15] RECOVERY - Host mc-gp2005 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [14:35:57] PROBLEM - Bird Internet Routing Daemon on durum7003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:36:41] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:36:57] RECOVERY - Bird Internet Routing Daemon on durum7003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:37:02] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:37:25] PROBLEM - Host ms-fe2021 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:23] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:38:25] RESOLVED: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:38:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [14:38:53] PROBLEM - Bird Internet Routing Daemon on doh6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:38:59] RECOVERY - Host ms-fe2021 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [14:39:53] RECOVERY - Bird Internet Routing Daemon on doh6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:40:23] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:40:31] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 
'main' . [14:40:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2296:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2296 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:40:42] blake@cumin1003 reboot-nodes (PID 3707224) is awaiting input [14:41:19] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:41:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [14:41:24] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:41:33] (03CR) 10Elukey: "The code works, but for some reason PUT doesn't as well, and it seemed to from spicerack-shell. Need to dig into what's going on." [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [14:41:33] !log blake@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1002-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:41:43] (03PS3) 10Elukey: sre.hosts.provision: use PATCH and PUT to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [14:42:57] PROBLEM - Bird Internet Routing Daemon on durum7004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:43:19] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:43:39] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:44:26] (03CR) 10JHathaway: [C:03+1] 
sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [14:44:53] PROBLEM - Bird Internet Routing Daemon on doh6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:44:57] RECOVERY - Bird Internet Routing Daemon on durum7004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:44:57] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1002-1003].eqiad.wmnet [14:44:59] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1002-1003].eqiad.wmnet [14:45:13] PROBLEM - Host ganeti2027 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:15] PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:16] (03CR) 10JHathaway: [C:03+1] confluent: kafka::broker: Fix legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1251540 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [14:45:23] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:23] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1004-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:45:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2296:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2296 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:46:03] RECOVERY - Host ganeti2027 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [14:46:23] RECOVERY - BFD status on 
asw1-b13-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:46:53] RECOVERY - Bird Internet Routing Daemon on doh6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:47:11] (03PS1) 10Mszwarc: Add APCOND_OATH_HAS2FA to UserRequirementsPrivateConditions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253520 [14:47:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [14:47:56] (03CR) 10Muehlenhoff: java: add java-21-security erb template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: 10Elukey) [14:48:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [14:48:30] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:48:45] mvernon@cumin2002 roll-restart-reboot-swift-ms-proxies (PID 3031201) is awaiting input [14:48:50] FIRING: [2x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:10] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:49:15] PROBLEM - Host mc-gp2006 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:49:51] PROBLEM - Bird Internet Routing Daemon on durum4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:50:18] (03CR) 10Muehlenhoff: "I think it would be better to apply this in two batches? So first via hieradata/role/codfw/mediawiki/memcached.yaml and then eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [14:50:25] RECOVERY - Host kubestagemaster2005 is UP: PING WARNING - Packet loss = 77%, RTA = 35.60 ms [14:50:28] !log mvernon@cumin1003 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe-eqiad [14:50:32] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:50:43] RECOVERY - Host mc-gp2006 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [14:50:53] RECOVERY - Bird Internet Routing Daemon on durum4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:51:04] (03PS4) 10Elukey: java: add java-21-security erb template [puppet] - 10https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) [14:51:15] (03CR) 10JHathaway: [C:03+1] P:kafka::broker::monitoring: Fix legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1251539 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [14:51:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:51:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [14:51:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:51:31] PROBLEM - Host 
wikikube-worker1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:51:35] PROBLEM - Host wikikube-worker1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:51:53] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11714012 (10RobH) They had an issue where they couldn't locate the fiber listed and instead skipped the work entirely! I need to review the photos and find out wha... [14:51:55] PROBLEM - Bird Internet Routing Daemon on doh7003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:52:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [14:53:01] RECOVERY - Host wikikube-worker1005 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [14:53:03] RECOVERY - Host wikikube-worker1004 is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms [14:53:07] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-gutter-codfw [14:53:17] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet [14:53:50] FIRING: [4x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:53:55] RECOVERY - Bird Internet Routing Daemon on doh7003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:54:24] (03Merged) 10jenkins-bot: Revert "mediawiki.util: Prefer prev step over non-standard in adjustThumbWidthForSteps" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253518 (https://phabricator.wikimedia.org/T419927) (owner: 10Ladsgroup) [14:54:45] !log ladsgroup@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1253518|Revert "mediawiki.util: Prefer prev step over non-standard in adjustThumbWidthForSteps" (T419927)]] [14:54:48] T419927: PNGs are being displayed at a too-low, blurry resolution - https://phabricator.wikimedia.org/T419927 [14:55:31] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:55:32] RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:55:37] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [14:55:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2296:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2296 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:56:31] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1253518|Revert "mediawiki.util: Prefer prev step over non-standard in adjustThumbWidthForSteps" (T419927)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:56:43] PROBLEM - Host ganeti2028 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:53] PROBLEM - Bird Internet Routing Daemon on durum4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:57:17] PROBLEM - Host ml-staging-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:57:23] PROBLEM - Host cloudgw1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:57:41] RECOVERY - Host ganeti2028 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [14:57:51] RECOVERY - Bird Internet Routing Daemon on durum4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:58:35] (03CR) 10AKhatun: stream: deploy edit-type stream to production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [14:58:51] RECOVERY - Host cloudgw1004 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [14:59:56] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11714044 (10jcrespo) >>! In T419970#11713863, @Jhancock.wm wrote: > It's not posting at the moment. I have some tricks to try today and if n... 
[15:00:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [15:00:45] RECOVERY - Host ml-staging-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.61 ms [15:00:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [15:00:55] PROBLEM - Bird Internet Routing Daemon on doh7004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:01:52] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:01:54] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1004.eqiad.wmnet [15:01:55] RECOVERY - Bird Internet Routing Daemon on doh7004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:02:11] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough [15:02:53] PROBLEM - Bird Internet Routing Daemon on durum4003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:02:57] (03PS4) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [15:03:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:03:50] FIRING: [5x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:51] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:04:52] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2001.codfw.wmnet [15:04:53] RECOVERY - Bird Internet Routing Daemon on durum4003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:05:03] 06SRE, 06Infrastructure-Foundations: Consider reducing verbosity of IRC logging - https://phabricator.wikimedia.org/T419919#11714069 (10herron) >>! In T419919#11713106, @Volans wrote: > * Each cookbook owners can decide to log the START/END or alternatively a single DONE line using Spicerack API [1] depending... [15:05:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2296:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2296 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:06:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: 10Elukey) [15:07:13] PROBLEM - Host wikikube-worker1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:25] PROBLEM - Host wikikube-worker1006 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:31] PROBLEM - Host ms-fe2022 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:59] RECOVERY - Host ms-fe2022 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [15:08:55] RECOVERY - Host wikikube-worker1006 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:08:55] RECOVERY - Host wikikube-worker1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:09:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [15:09:51] PROBLEM - Bird Internet Routing Daemon on durum4004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:10:15] PROBLEM - Host wikikube-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:15] (03CR) 10Elukey: "So on X13+ the options listed in the /SD enpoint are different/renamed from the Bios ones. 
So we'll need to use a different set of key/val" [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [15:10:51] RECOVERY - Bird Internet Routing Daemon on durum4004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:11:25] RECOVERY - Host wikikube-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [15:11:35] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and A:durum [15:12:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11714086 (10elukey) This requires more work, since those models are X13 of a very new generation that don't accept BIO... [15:13:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [15:13:23] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:14:14] (03CR) 10Dpogorzelski: [C:03+2] kserve: update to version 0.17 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [15:14:23] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] kserve: update to version 0.17 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [15:15:35] PROBLEM - Host ms-fe2023 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:39] PROBLEM - Host ganeti2029 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:03] !log 
fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1022.eqiad.wmnet with reason: Rebooting clouddb1022 T419960 [15:16:13] RECOVERY - Host ms-fe2023 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [15:17:14] (03CR) 10Elukey: [C:03+2] java: add java-21-security erb template [puppet] - 10https://gerrit.wikimedia.org/r/1251836 (https://phabricator.wikimedia.org/T420083) (owner: 10Elukey) [15:18:03] RECOVERY - Host ganeti2029 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [15:20:17] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1253518|Revert "mediawiki.util: Prefer prev step over non-standard in adjustThumbWidthForSteps" (T419927)]] [15:20:20] T419927: PNGs are being displayed at a too-low, blurry resolution - https://phabricator.wikimedia.org/T419927 [15:20:41] PROBLEM - MariaDB Replica SQL: x3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:20:53] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2001.codfw.wmnet [15:21:06] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1023.eqiad.wmnet with reason: Rebooting clouddb1023 T419960 [15:21:13] PROBLEM - MariaDB Replica IO: s3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:21:13] PROBLEM - MariaDB Replica SQL: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:21:13] PROBLEM - MariaDB Replica IO: x3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:21:21] PROBLEM - MariaDB read only wikireplica-x3 on clouddb1022 is 
CRITICAL: Could not connect to localhost:3363 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:21:21] PROBLEM - MariaDB read only s3 on clouddb1022 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:21:21] PROBLEM - MariaDB read only x3 on clouddb1022 is CRITICAL: Could not connect to localhost:3363 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:21:21] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1022 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:21:32] (03CR) 10Elukey: kserve: update to version 0.17 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1253498 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [15:21:35] PROBLEM - mysqld processes on clouddb1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:22:35] PROBLEM - Host wikikube-worker1012 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:37] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2002.codfw.wmnet [15:22:37] PROBLEM - Host wikikube-worker1011 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:27] PROBLEM - Host ms-fe2024 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:53] (03PS1) 10Dwisehaupt: Shift fundraising read handle to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1253525 (https://phabricator.wikimedia.org/T420018) [15:23:59] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:24:03] RECOVERY - Host wikikube-worker1012 is UP: PING OK - Packet loss = 0%, RTA = 4.82 ms [15:24:04] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:24:05] RECOVERY - Host wikikube-worker1011 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [15:24:05] PROBLEM - Host clouddb1022 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:24:23] RECOVERY - Host ms-fe2024 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [15:24:45] (03PS2) 10AKhatun: stream: deploy edit-type stream to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) [15:25:43] RECOVERY - Host clouddb1022 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:26:13] PROBLEM - MariaDB Replica IO: s3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:26:13] PROBLEM - MariaDB Replica SQL: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:26:13] PROBLEM - MariaDB Replica IO: x3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:26:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:26:21] PROBLEM - MariaDB read only x3 on clouddb1022 is CRITICAL: Could not connect to localhost:3363 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:26:21] PROBLEM - MariaDB read only wikireplica-x3 on clouddb1022 is CRITICAL: Could not connect to localhost:3363 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:26:21] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1022 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:26:21] PROBLEM - MariaDB read only s3 on clouddb1022 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:26:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:26:35] PROBLEM - mysqld processes on clouddb1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:26:41] PROBLEM - MariaDB Replica SQL: x3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:26:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2029.codfw.wmnet [15:26:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2029.codfw.wmnet [15:26:52] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [15:27:03] PROBLEM - SSH on wikikube-ctrl2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:27:30] (03CR) 10Milimetric: [C:03+2] "merging this to be deployed with the train tomorrow. 
This is an isolated change - new stream configuration - should not interfere with an" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250249 (https://phabricator.wikimedia.org/T417050) (owner: 10TChin) [15:27:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe-codfw [15:28:25] (03Merged) 10jenkins-bot: Add stream config for attribution research [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250249 (https://phabricator.wikimedia.org/T417050) (owner: 10TChin) [15:28:50] FIRING: [2x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:28:53] RECOVERY - SSH on wikikube-ctrl2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:29:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:29:23] (03CR) 10Jgreen: [C:03+1] Shift fundraising read handle to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1253525 (https://phabricator.wikimedia.org/T420018) (owner: 10Dwisehaupt) [15:29:35] RECOVERY - mysqld processes on clouddb1022 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:29:56] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure Security, 10media-backups: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11714176 (10jcrespo) Moritz: I 
would like your assessment on deploying a new storag... [15:30:04] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1530). [15:30:21] RECOVERY - MariaDB read only s3 on clouddb1022 is OK: Version 10.11.16-MariaDB, Uptime 57s, read_only: True, event_scheduler: False, 20.09 QPS, connection latency: 0.016287s, query latency: 0.000714s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:30:21] RECOVERY - MariaDB read only wikireplica-x3 on clouddb1022 is OK: Version 10.11.16-MariaDB, Uptime 49s, read_only: True, event_scheduler: False, 28.55 QPS, connection latency: 0.012311s, query latency: 0.000397s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:30:21] RECOVERY - MariaDB read only x3 on clouddb1022 is OK: Version 10.11.16-MariaDB, Uptime 49s, read_only: True, event_scheduler: False, 17.24 QPS, connection latency: 0.017261s, query latency: 0.000560s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:30:21] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1022 is OK: Version 10.11.16-MariaDB, Uptime 57s, read_only: True, event_scheduler: False, 22.75 QPS, connection latency: 0.027291s, query latency: 0.000832s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:31:05] (03CR) 10Dwisehaupt: [C:03+2] Shift fundraising read handle to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1253525 (https://phabricator.wikimedia.org/T420018) (owner: 10Dwisehaupt) [15:31:13] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 765.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:31:14] !log dwisehaupt@dns1006 START - running authdns-update [15:31:35] PROBLEM - 
MariaDB Replica Lag: x3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 780.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:31:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [15:32:05] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2002.codfw.wmnet [15:32:13] RECOVERY - MariaDB Replica IO: s3 on clouddb1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:32:13] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:32:13] RECOVERY - MariaDB Replica IO: x3 on clouddb1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:32:13] RECOVERY - MariaDB Replica SQL: s3 on clouddb1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:32:35] RECOVERY - MariaDB Replica Lag: x3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:32:40] !log dwisehaupt@dns1006 END - running authdns-update [15:32:41] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2003.codfw.wmnet [15:32:41] RECOVERY - MariaDB Replica SQL: x3 on clouddb1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:34:14] (03CR) 10Jelto: [C:03+1] "lgtm, thank you for the migration" [puppet] - 10https://gerrit.wikimedia.org/r/1253397 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [15:35:07] !log jmm@cumin2002 
START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [15:35:13] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1022.eqiad.wmnet [15:35:14] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1022.eqiad.wmnet [15:37:09] PROBLEM - SSH on wikikube-ctrl2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:37:43] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1023.eqiad.wmnet with reason: Rebooting clouddb1023 T419960 [15:38:12] (03PS1) 10Elukey: profile::kafka::broker: allow to set use_modern_jvm_default_opts [puppet] - 10https://gerrit.wikimedia.org/r/1253535 [15:38:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:38:19] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:38:23] (03CR) 10BCornwall: [C:03+2] hardware.upgrade-firmware: Fix usage path [cookbooks] - 10https://gerrit.wikimedia.org/r/1244788 (owner: 10BCornwall) [15:38:41] the deploy failed twice. 
So I'm not going to try again [15:38:44] (03PS2) 10Elukey: profile::kafka::broker: allow to set use_modern_jvm_default_opts [puppet] - 10https://gerrit.wikimedia.org/r/1253535 [15:38:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [15:38:53] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253535 (owner: 10Elukey) [15:38:55] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:38:59] RECOVERY - SSH on wikikube-ctrl2003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:39:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:39:47] PROBLEM - Host ganeti2030 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:12] Amir1: mw-deploy? (Curious if it is due to the docker registry or other things) [15:40:15] RECOVERY - Host ganeti2030 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [15:40:26] yeah, mw-deploy [15:40:30] the error was very weird [15:40:31] PROBLEM - Host wikikube-worker1016 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:35] PROBLEM - Host wikikube-worker1015 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:45] didn't dig too deep though [15:40:55] https://spiderpig.wikimedia.org/jobs/1553 [15:41:48] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2003.codfw.wmnet [15:41:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:42:01] RECOVERY - Host wikikube-worker1016 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [15:42:02] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [15:42:07] RECOVERY - Host wikikube-worker1015 is UP: PING OK - Packet loss = 0%, 
RTA = 0.32 ms [15:42:08] (03PS1) 10Muehlenhoff: Remove ncredir4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/1253538 (https://phabricator.wikimedia.org/T418993) [15:42:32] !log reimage cp2041 for HAProxy testing (T419825) [15:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:36] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [15:43:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [15:43:16] (03CR) 10BCornwall: [C:03+1] "Values match on my review." [dns] - 10https://gerrit.wikimedia.org/r/1253503 (https://phabricator.wikimedia.org/T418971) (owner: 10Ssingh) [15:43:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [15:43:36] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS trixie [15:43:44] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1023.eqiad.wmnet [15:43:44] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1023.eqiad.wmnet [15:43:50] FIRING: [2x] ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:43:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [15:44:02] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1024.eqiad.wmnet [15:45:02] (03CR) 10Clément Goubert: [C:03+1] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [15:46:08] 10ops-codfw, 06DC-Ops, 10Phabricator: phab2002: SEL System 
Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228 (10Aklapper) 03NEW [15:46:27] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1024.eqiad.wmnet with reason: Rebooting clouddb1024 T419960 [15:46:27] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) (owner: 10Btullis) [15:47:01] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncmonitor1001.eqiad.wmnet [15:47:10] STDERR: [15:47:10] Error: UPGRADE FAILED: release next failed, and has been rolled back due to atomic being set: context deadline exceeded [15:47:10] COMBINED OUTPUT: [15:47:10] Error: UPGRADE FAILED: release next failed, and has been rolled back due to atomic being set: context deadline exceeded [15:47:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [15:47:44] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2004.codfw.wmnet [15:48:07] PROBLEM - Host cp2041 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:58] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229 (10MoritzMuehlenhoff) 03NEW [15:52:04] PROBLEM - Host ganeti2031 is DOWN: PING CRITICAL - Packet loss = 100% [15:52:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncmonitor1001.eqiad.wmnet [15:52:50] !log jayme@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad [15:53:01] Amir1: are you seeing deployment issues? 
[15:53:12] swfrench-wmf: yup :( [15:53:44] RECOVERY - Host ganeti2031 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [15:53:45] I can't say whether my patch is breaking things (unlikely but not impossible) or it's just the k8s deciding to choose violence today [15:53:53] (03PS2) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) [15:53:55] Amir1: I think I see what might be doing it - is it failing in eqiad specifically? [15:53:57] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet [15:54:31] (03CR) 10Elukey: [C:03+2] profile::kafka::broker: allow to set use_modern_jvm_default_opts [puppet] - 10https://gerrit.wikimedia.org/r/1253535 (owner: 10Elukey) [15:54:39] yeah, it looks like eqiad [15:54:40] https://spiderpig.wikimedia.org/jobs/1553 [15:54:46] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2004.codfw.wmnet [15:54:50] (click to show the logs) [15:55:02] (03CR) 10SBassett: Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [15:55:14] (03CR) 10Elukey: [C:03+2] "merged since it was trivial and I need to unblock logs in deployment-prep, lemme know if anything concerns you :)" [puppet] - 10https://gerrit.wikimedia.org/r/1253535 (owner: 10Elukey) [15:56:28] PROBLEM - Host wikikube-worker1019 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:34] PROBLEM - Host wikikube-worker1020 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:03] (03CR) 10Muehlenhoff: [C:03+2] Remove tcp-proxy4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/1253397 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [15:57:18] Amir1: thanks for confirming. 
I think we know what's happening, and we should have it set in a few minutes [15:57:25] (03PS1) 10Dwisehaupt: Shift fundraising db read handle back [dns] - 10https://gerrit.wikimedia.org/r/1253546 (https://phabricator.wikimedia.org/T420018) [15:57:29] Amir1: it's because of the reboots sorry, we're on it [15:57:32] PROBLEM - Host clouddb1024 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:43] <3 [15:57:45] No worries [15:57:55] As long as it's not my patch causing issues, I don't mind [15:57:56] RECOVERY - Host wikikube-worker1019 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [15:58:02] RECOVERY - Host wikikube-worker1020 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [15:58:42] RECOVERY - Host clouddb1024 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [15:58:56] (03CR) 10Jgreen: [C:03+1] Shift fundraising db read handle back [dns] - 10https://gerrit.wikimedia.org/r/1253546 (https://phabricator.wikimedia.org/T420018) (owner: 10Dwisehaupt) [15:59:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2031.codfw.wmnet [15:59:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2031.codfw.wmnet [15:59:22] PROBLEM - MariaDB read only s4 on clouddb1024 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:59:22] PROBLEM - MariaDB read only wikireplica-s4 on clouddb1024 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:59:36] PROBLEM - MariaDB Replica IO: s4 on clouddb1024 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:40] PROBLEM - MariaDB Replica SQL: s4 on clouddb1024 is CRITICAL: CRITICAL slave_sql_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:44] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet [16:00:06] !log blake@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on P{wikikube-worker[1004-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:00:34] PROBLEM - mysqld processes on clouddb1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:03:21] (03PS1) 10Andrew Bogott: codfw1dev cloud backups: upgrade to 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1253548 (https://phabricator.wikimedia.org/T406516) [16:03:22] RECOVERY - MariaDB read only wikireplica-s4 on clouddb1024 is OK: Version 10.11.16-MariaDB, Uptime 38s, read_only: True, event_scheduler: False, 23.26 QPS, connection latency: 0.013212s, query latency: 0.000395s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:03:22] RECOVERY - MariaDB read only s4 on clouddb1024 is OK: Version 10.11.16-MariaDB, Uptime 38s, read_only: True, event_scheduler: False, 23.35 QPS, connection latency: 0.026219s, query latency: 0.000610s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:03:24] (03PS1) 10Andrew Bogott: Openstack eqiad1 -> version 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1253549 (https://phabricator.wikimedia.org/T406516) [16:03:34] RECOVERY - mysqld processes on clouddb1024 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:03:43] (03PS1) 10Btullis: Add dummy analytics-wikidata keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/1253550 (https://phabricator.wikimedia.org/T404073) [16:03:50] FIRING: [2x] ProbeDown: Service ganeti2031:1811 has failed probes 
(tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:03:57] (03CR) 10Btullis: [V:03+2 C:03+2] Add dummy analytics-wikidata keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/1253550 (https://phabricator.wikimedia.org/T404073) (owner: 10Btullis) [16:04:05] !log jayme@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [16:04:13] (03CR) 10Dwisehaupt: [C:03+2] Shift fundraising db read handle back [dns] - 10https://gerrit.wikimedia.org/r/1253546 (https://phabricator.wikimedia.org/T420018) (owner: 10Dwisehaupt) [16:04:13] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) (owner: 10Btullis) [16:04:20] 10ops-eqiad, 06DC-Ops: High (relatively) number of memcached errors in eqiad - https://phabricator.wikimedia.org/T420223#11714399 (10neriah) [16:04:25] !log dwisehaupt@dns1006 START - running authdns-update [16:04:34] !log jayme@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw [16:04:36] RECOVERY - MariaDB Replica IO: s4 on clouddb1024 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:04:40] RECOVERY - MariaDB Replica SQL: s4 on clouddb1024 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:04:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253548 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [16:04:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253549 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [16:05:02] !log fabfur@cumin1003 START 
- Cookbook sre.hosts.downtime for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage [16:05:08] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet [16:05:53] !log dwisehaupt@dns1006 END - running authdns-update [16:06:06] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet [16:06:07] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-ctrl2005.codfw.wmnet [16:06:07] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS trixie [16:06:43] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1004-1007,1011-1012,1015-1016,1019-1021,1029-1031,1034-1168,1240-1289,1291-1327].eqiad.wmnet [16:07:26] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1004-1007,1011-1012,1015-1016,1019-1021,1029-1031,1034-1168,1240-1289,1291-1327].eqiad.wmnet [16:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:03] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1024.eqiad.wmnet [16:09:08] (03PS1) 10C. 
Scott Ananian: Fix double post-processing in legacy preview case [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253551 (https://phabricator.wikimedia.org/T419908) [16:09:11] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev cloud backups: upgrade to 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1253548 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [16:09:14] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1024.eqiad.wmnet [16:09:15] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1024.eqiad.wmnet [16:09:55] Amir1: you should be unblocked for your deploy [16:09:59] ping bjensen when you're done [16:10:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253551 (https://phabricator.wikimedia.org/T419908) (owner: 10C. 
Scott Ananian) [16:10:09] noted, thanks [16:10:56] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet [16:10:57] PROBLEM - Host cp6016 is DOWN: PING CRITICAL - Packet loss = 100% [16:11:04] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage [16:11:19] !log jayme@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw [16:11:40] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet [16:12:02] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet [16:12:55] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1253518|Revert "mediawiki.util: Prefer prev step over non-standard in adjustThumbWidthForSteps" (T419927)]] [16:12:59] T419927: PNGs are being displayed at a too-low, blurry resolution - https://phabricator.wikimedia.org/T419927 [16:13:09] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-ctrl2005.codfw.wmnet [16:14:43] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1253518|Revert "mediawiki.util: Prefer prev step over non-standard in adjustThumbWidthForSteps" (T419927)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:15:51] (03PS1) 10Gehel: alerts(blazegraph): reduce severity of CategoriesQueryServiceUpdateLagTooHigh to warning [alerts] - 10https://gerrit.wikimedia.org/r/1253552 (https://phabricator.wikimedia.org/T420235) [16:16:28] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:17:40] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet [16:18:16] (03CR) 10Btullis: [C:03+2] Allow members of analytics-wikidata-users access to stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1253497 (https://phabricator.wikimedia.org/T404073) (owner: 10Btullis) [16:20:23] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253518|Revert "mediawiki.util: Prefer prev step over non-standard in adjustThumbWidthForSteps" (T419927)]] (duration: 07m 28s) [16:20:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:27] T419927: PNGs are being displayed at a too-low, blurry resolution - https://phabricator.wikimedia.org/T419927 [16:23:57] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11714533 (10Gehel) [16:27:33] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [16:27:41] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1012 [16:29:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1012 [16:30:20] bjensen: I'm done with my deploy [16:30:37] jouncebot: nowandnext [16:30:37] No deployments 
scheduled for the next 0 hour(s) and 29 minute(s) [16:30:37] In 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1700) [16:30:37] In 0 hour(s) and 29 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1700) [16:30:38] Amir1: thanks! sorry for the disturbance earlier [16:31:39] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:32:51] !log my bad, accidentally merged https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1250249, will read docs on config deployment better [16:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:48] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11714572 (10RobH) Updated the idrac, then the backplane firmware, and as it was rebooting to update the BIOS firmware the SEL updated with: ` A configuration related issue on the device Backpla... 
[16:35:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [16:36:54] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11714576 (10BTullis) a:05AWesterinen→03BTullis [16:36:57] 10ops-eqiad, 06SRE, 06DC-Ops: High (relatively) number of memcached errors in eqiad - https://phabricator.wikimedia.org/T420223#11714578 (10jijiki) Looking at the memcached servers for which mcrouter in eqiad records errors, it seems that the majority are in codfw. | IP | Port | Errors | Hostname | |---|--... [16:37:05] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1020-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:37:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253520 (owner: 10Mszwarc) [16:37:42] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11714585 (10BTullis) [16:38:19] (03Merged) 10jenkins-bot: Add APCOND_OATH_HAS2FA to UserRequirementsPrivateConditions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253520 (owner: 10Mszwarc) [16:38:39] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1253520|Add APCOND_OATH_HAS2FA to UserRequirementsPrivateConditions]] [16:39:56] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2041.codfw.wmnet with OS trixie [16:40:28] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1253520|Add APCOND_OATH_HAS2FA to UserRequirementsPrivateConditions]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [16:41:01] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11714603 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:41:04] !log mszwarc@deploy2002 mszwarc: Continuing with sync [16:41:25] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11714606 (10BTullis) [16:41:35] !log reimage cp2042 for HAProxy testing (T419825) [16:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:38] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [16:42:03] !log fabfur@cumin1003 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS trixie [16:42:33] !log rebooting backends of releases.wikimedia.org [16:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:09] RECOVERY - Host ganeti3005 is UP: PING OK - Packet loss = 0%, RTA = 80.25 ms [16:43:11] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11714612 (10RobH) a:03MoritzMuehlenhoff Ok, bios update done and its booting to the debian loader so handing back to @MoritzMuehlenhoff Please resolve this task once you are aware this host is... 
[16:43:50] RESOLVED: ProbeDown: Service ganeti3005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:55] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253520|Add APCOND_OATH_HAS2FA to UserRequirementsPrivateConditions]] (duration: 06m 15s) [16:45:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:45:10] RESOLVED: [2x] GanetiBGPDown: BGP session down between ganeti3005 and asw1-by27-esams - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [16:45:12] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11714620 (10BTullis) I have noticed a discrepancy in the Wikimedia Developer Account username. The correct username should b... [16:46:39] PROBLEM - Host cp2042 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:56] !log phab2002 - rebooting [16:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:37] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11714633 (10BTullis) [16:50:16] RECOVERY - Host ganeti3005 is UP: PING OK - Packet loss = 0%, RTA = 80.36 ms [16:50:45] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11714636 (10MoritzMuehlenhoff) [16:50:48] (03PS1) 10Mszwarc: Configure external link tracking on 12 wikis (167 ext.
domains) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) [16:51:13] FIRING: [2x] ProbeDown: Service phab2002:25 has failed probes (tcp_phabricator_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:58] (03CR) 10Ssingh: "Scott: @bcornwall@wikimedia.org can help with the deployment at 10:45 PT/13:45 ET." [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [16:52:16] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11714645 (10MoritzMuehlenhoff) Thanks! I'll reimage the server to ensure all is working fine and will resolve the task when this is completed. [16:54:50] (03CR) 10Andrew Bogott: [C:03+2] Openstack eqiad1 -> version 'flamingo' [puppet] - 10https://gerrit.wikimedia.org/r/1253549 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [16:55:56] (03CR) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:56:13] RESOLVED: [2x] ProbeDown: Service phab2002:25 has failed probes (tcp_phabricator_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:22] that was my reboot. did not expect it to trigger. but good to see the recovery. 
[16:59:50] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp6016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [16:59:50] PROBLEM - haproxy process on cp6016 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1700) [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T1700). [17:00:29] (03CR) 10SomeRandomDeveloper: Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [17:00:50] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp6016 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:00:50] RECOVERY - haproxy process on cp6016 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [17:02:28] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS trixie [17:02:46] (03PS2) 10Catrope: Enable passwordless login in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248665 (https://phabricator.wikimedia.org/T419198) [17:03:25] !log fabfur@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [17:03:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/1248665 (https://phabricator.wikimedia.org/T419198) (owner: 10Catrope) [17:06:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudservices1005 (172.20.2.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:06:58] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp403[7-9].ulsfo.wmnet} and A:cp [17:08:48] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [17:09:57] (03PS2) 10Kosta Harlan: Instrument clicks on external links to selected domains [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253572 (https://phabricator.wikimedia.org/T419837) [17:10:12] PROBLEM - Host wikikube-worker1036 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 2019.88 ms [17:10:26] RECOVERY - Host wikikube-worker1036 is UP: PING WARNING - Packet loss = 33%, RTA = 938.38 ms [17:10:38] (03PS1) 10Filippo Giunchedi: cr-cloud: allow cumin/cloudcumin traffic [homer/public] - 10https://gerrit.wikimedia.org/r/1253574 (https://phabricator.wikimedia.org/T419996) [17:11:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudservices1005 (172.20.2.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:12:25] (03CR) 10Filippo Giunchedi: "See also task for more context. 
tl;dr we want the cumin openstack backend to be able to talk to the openstack api" [homer/public] - 10https://gerrit.wikimedia.org/r/1253574 (https://phabricator.wikimedia.org/T419996) (owner: 10Filippo Giunchedi) [17:12:40] (03PS1) 10Herron: icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) [17:14:06] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:18:06] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:18:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:18:31] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4037.ulsfo.wmnet [17:23:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:23:57] (03PS2) 10Effie Mouzeli: hieradata: migrate eqiad memcached cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) [17:24:40] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11714832 (10RobH) I want to ensure I'm reading the photos correctly, but the update from remote hands is the fiber wasn't found, and it appears to me that the fiber... [17:25:06] (03CR) 10Effie Mouzeli: "I split it in two, however we will try to wrap this up shortly, given we would have to reboot anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [17:27:27] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11714848 (10RobH) {F72909066} {F72909067} {F72909068} {F72909069} [17:32:07] !log fabfur@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2042.codfw.wmnet with OS trixie [17:35:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:35:57] (03CR) 10Harroyo-wmf: [C:03+1] Configure external link tracking on 12 wikis (167 ext. domains) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [17:37:39] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6016.* [17:39:23] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6015.drmrs.wmnet with OS trixie [17:39:38] (03PS1) 10Ryan Kemper: wdqs: Disable lag check for categories remediation [puppet] - 10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) [17:42:30] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [17:43:43] (03PS2) 10Mszwarc: Configure external link tracking on 12 wikis (411 ext. 
domains) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) [17:44:56] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11714936 (10RobH) Confirm with @cmooney via IRC that 70152 is indeed xe-0//0 in these photos and updated the remote hands for Wednesday. > We had the wrong cabl... [17:45:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:46:22] (03CR) 10Scott French: [C:03+1] "Thanks, Blake!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [17:46:41] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11714951 (10cmooney) >>! In T415743#11714936, @RobH wrote: > Please re-drain this link Wednesday in advance of this work, thank you! Cool thanks Rob will do. [17:49:31] (03CR) 10DCausse: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [17:49:37] FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [17:50:47] (03PS2) 10Ryan Kemper: wdqs: Disable lag check for categories remediation [puppet] - 10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) [17:52:46] (03PS3) 10Ryan Kemper: wdqs: Disable lag check for categories remediation [puppet] - 10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) [17:53:20] (03CR) 10Ryan Kemper: "PCC caught an issue that I have fixed in PS2 (PS3 was just updating commit message accordingly). Basically, we fixed the undef issue at on" [puppet] - 10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [17:55:05] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [17:56:47] (03CR) 10Ottomata: stream: deploy edit-type stream to production (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [17:58:13] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4038.ulsfo.wmnet [17:58:15] (03CR) 10AKhatun: stream: deploy edit-type stream to production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [17:59:35] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage [18:00:41] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Disable lag check for categories remediation [puppet] - 
10https://gerrit.wikimedia.org/r/1253583 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [18:02:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:03:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage [18:06:55] (03CR) 10SBassett: Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:07:11] (03CR) 10SBassett: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:07:25] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:01] (03CR) 10Alex.sanford: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:10:06] (03CR) 10Catrope: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:12:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:14:30] (03CR) 10Ottomata: [C:03+1] "Comment on where configs live, but +1 otherwise." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:14:35] (03CR) 10Rsilvola: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:15:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [18:17:23] (03CR) 10BCornwall: [C:03+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:17:35] (03CR) 10SomeRandomDeveloper: Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:20:02] hi, there used to be a limit on the max number of edits (maybe 200k?) for renaming a username. I remember this limit was lifted when global accounts were introduced. Could renaming a user with 4.5 million edits cause any problems? 
[18:20:03] (03CR) 10BCornwall: [V:03+1 C:03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8279/co" [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:20:54] (03CR) 10SBassett: [C:03+1] Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:23:57] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11715178 (10Papaul) @ssingh thank you let me get back with you tomorrow. I have to double check some things in Netbox. [18:24:49] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp6001.drmrs.wmnet [reason: trixie reimaging] [18:26:01] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS trixie [18:26:29] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp6002.drmrs.wmnet [reason: trixie reimaging] [18:27:00] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS trixie [18:34:50] (03PS1) 10BCornwall: hiera: Remove single_backend from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) [18:34:55] (03CR) 10AKhatun: [C:03+2] stream: deploy edit-type stream to production (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:35:25] (03CR) 10Jdlrobson: "scheduling for later today" [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251157 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [18:35:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6015.drmrs.wmnet with OS trixie [18:36:11] (03CR) 10Anne Tomasevich: "Removing my -1 now that I39332d309422c22d8e8a024b98a1adb819df61e4 is in place and ready for deployment, thanks @jrobson@wikimedia.org!" [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251157 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [18:36:16] (03PS2) 10Jdlrobson: Enable languages in main menu on Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251158 (https://phabricator.wikimedia.org/T419730) [18:36:57] (03Merged) 10jenkins-bot: stream: deploy edit-type stream to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:37:25] (03PS1) 10BCornwall: hiera: Set default codfw storage_elements [puppet] - 10https://gerrit.wikimedia.org/r/1253606 (https://phabricator.wikimedia.org/T401832) [18:38:00] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4039.ulsfo.wmnet [18:38:00] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp403[7-9].ulsfo.wmnet} and A:cp [18:38:45] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6015.* [18:39:45] (03PS1) 10Jdlrobson: Don't output language HTML when no languages present [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253604 (https://phabricator.wikimedia.org/T419730) [18:39:46] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp404[5-6].ulsfo.wmnet} and A:cp [18:43:13] (03CR) 10BCornwall: [V:03+1] "PCC 
SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8280/co" [puppet] - 10https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [18:45:18] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [18:46:34] (03CR) 10A smart kitten: Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [18:47:19] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [18:47:39] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp4046.ulsfo.wmnet} and A:cp [18:49:02] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6001.drmrs.wmnet with reason: host reimage [18:49:22] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4045.ulsfo.wmnet [18:52:48] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6002.drmrs.wmnet with reason: host reimage [18:56:49] (03PS1) 10BCornwall: conftool/hiera: Remove old codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1253616 (https://phabricator.wikimedia.org/T419753) [18:57:23] !log cdobbins@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4046.ulsfo.wmnet [18:57:24] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp4046.ulsfo.wmnet} and A:cp [18:58:12] (03PS2) 10BCornwall: conftool/hiera: Remove old codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1253616 (https://phabricator.wikimedia.org/T419753) [19:00:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:01:04] (03PS1) 10BCornwall: site.pp: Remove most of old codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1253617 (https://phabricator.wikimedia.org/T419753) [19:02:19] 06SRE, 06Data-Platform-SRE: Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264 (10ssingh) 03NEW [19:02:26] 06SRE, 06Data-Platform-SRE: Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11715404 (10ssingh) p:05Triage→03Medium [19:02:38] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [19:02:49] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [19:05:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [19:06:11] (03CR) 10Ssingh: [C:03+1] "In another commit, we should also update:" [puppet] - 10https://gerrit.wikimedia.org/r/1253616 (https://phabricator.wikimedia.org/T419753) (owner: 10BCornwall) [19:07:14] (03CR) 10Ssingh: [C:03+1] site.pp: Remove most of old codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1253617 (https://phabricator.wikimedia.org/T419753) (owner: 10BCornwall) [19:10:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [19:11:57] 10ops-codfw, 06SRE, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11715411 (10Jhancock.wm) @Aklapper this notifies on the physical server if something goes wrong. like if a power su... [19:12:16] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6001.drmrs.wmnet with OS trixie [19:13:00] (03PS1) 10Ladsgroup: Revert "Media: Use previous step for non-standard width between steps and original" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253622 (https://phabricator.wikimedia.org/T419927) [19:13:19] jouncebot: nowandnext [19:13:19] No deployments scheduled for the next 0 hour(s) and 46 minute(s) [19:13:19] In 0 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T2000) [19:13:27] (03CR) 10Ladsgroup: [C:03+2] Revert "Media: Use previous step for non-standard width between steps and original" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253622 (https://phabricator.wikimedia.org/T419927) (owner: 10Ladsgroup) [19:13:38] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:15:09] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: 
apply [19:15:14] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:16:16] (03CR) 10BCornwall: [C:03+2] conftool/hiera: Remove old codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1253616 (https://phabricator.wikimedia.org/T419753) (owner: 10BCornwall) [19:16:26] (03CR) 10BCornwall: [C:03+2] site.pp: Remove most of old codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1253617 (https://phabricator.wikimedia.org/T419753) (owner: 10BCornwall) [19:16:52] !log fabfur@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp2041.codfw.wmnet with reason: Testing hosts - not for production [19:17:14] !log fabfur@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp2042.codfw.wmnet with reason: Testing hosts - not for production [19:17:26] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6002.drmrs.wmnet with OS trixie [19:17:47] (03PS1) 10Bartosz Dziewoński: Fix client credentials access tokens [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253623 (https://phabricator.wikimedia.org/T417278) [19:19:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253623 (https://phabricator.wikimedia.org/T417278) (owner: 10Bartosz Dziewoński) [19:21:12] 10ops-codfw, 06SRE, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11715437 (10Dzahn) "management controller unavailable" sounds like the management console/DRAC is not working norma... 
[19:21:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS trixie [19:21:41] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11715438 (10Dzahn) [19:22:16] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:46:54] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [19:47:04] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage [19:47:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253572 (https://phabricator.wikimedia.org/T419837) (owner: 10Kosta Harlan) [19:47:27] !log releases2003 - rm rsync-srv-org-wikimedia-releases-releases2003.* - alerts flapping since server reboot - puppet code needs to be improved to ensure units are removed when primary server is switched (T420246) [19:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:30] T420246: SystemdUnitFailed - https://phabricator.wikimedia.org/T420246 [19:47:57] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [19:48:18] (03CR) 10Dreamy Jazz: [C:03+2] Uninstall GlobalBlocking from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251589 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [19:49:20] (03Merged) 10jenkins-bot: Uninstall GlobalBlocking from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251589 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [19:50:33] (03PS1) 10BCornwall: hiera: Set cp2041/2042 to 
single_backend: true [puppet] - 10https://gerrit.wikimedia.org/r/1253629 [19:51:23] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6003.drmrs.wmnet with reason: host reimage [19:51:51] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251582|Uninstall AbuseFilter from closed wikis with no AbuseFilter logs (T420063)]] (duration: 09m 26s) [19:51:55] T420063: Uninstall AbuseFilter from wikis which are closed and have no AbuseLog entries - https://phabricator.wikimedia.org/T420063 [19:52:12] (03CR) 10BCornwall: [C:03+2] hiera: Set cp2041/2042 to single_backend: true [puppet] - 10https://gerrit.wikimedia.org/r/1253629 (owner: 10BCornwall) [19:52:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephmon2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:52:47] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1251589|Uninstall GlobalBlocking from closed wikis (T420062)]] [19:52:48] (03CR) 10Vgutierrez: [C:03+1] hiera: Set cp2041/2042 to single_backend: true [puppet] - 10https://gerrit.wikimedia.org/r/1253629 (owner: 10BCornwall) [19:52:51] T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062 [19:52:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [19:53:55] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS trixie [19:53:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:54:03] 
(03PS3) 10Kosta Harlan: Configure external link tracking on 12 wikis (411 ext. domains) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [19:54:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS trixie [19:54:37] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1251589|Uninstall GlobalBlocking from closed wikis (T420062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:54:46] (03PS4) 10Kosta Harlan: Configure external link aggregate usage on 12 wikis for top domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [19:55:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2007-dev.codfw.wmnet with OS bookworm [19:55:07] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11715546 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephmon2007-dev.codfw.wmnet with OS bookworm [19:55:13] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6004.drmrs.wmnet with reason: host reimage [19:55:32] FIRING: [6x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:55:43] (03PS1) 10Dzahn: releases: remove rsync systemd units when primary server changes [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) [19:57:13] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [19:57:24] (03PS6) 10Dreamy Jazz: Disable CheckUser on closed wikis where no checks were ever made [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1251848 (https://phabricator.wikimedia.org/T420062) [19:57:28] PROBLEM - Host wikikube-worker1036 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 4187.76 ms [19:57:31] (03PS6) 10Dreamy Jazz: Uninstall SecurePoll from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251865 (https://phabricator.wikimedia.org/T420062) [19:57:37] (03PS14) 10Dreamy Jazz: DiscussionTools: Uninstall wikis closed before permalinks were deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251888 (https://phabricator.wikimedia.org/T420052) [19:57:48] RECOVERY - Host wikikube-worker1036 is UP: PING WARNING - Packet loss = 50%, RTA = 616.67 ms [19:58:26] (03PS1) 10Xcollazo: Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253633 (https://phabricator.wikimedia.org/T419055) [19:59:57] (03PS2) 10Dzahn: releases: remove rsync systemd units when primary server changes [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T2000). [20:00:05] cscott, RoanKattouw, MatmaRex, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:19] hi [20:00:27] hello [20:00:44] o/ [20:01:07] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251589|Uninstall GlobalBlocking from closed wikis (T420062)]] (duration: 08m 20s) [20:01:11] T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062 [20:01:27] Will do the rest of my changes after other deployments [20:01:27] Im here but it will be a couple of minutes before I'm at a keyboard so folks should feel free to cut the line and get started [20:01:31] Handing over to someone else [20:01:41] The WikimediaEvents patch can go out with something else [20:02:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251848 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [20:02:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251865 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [20:02:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251888 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz) [20:02:42] I will self serve my patch in a minute [20:03:15] (03CR) 10Dzahn: [C:03+1] miscweb: add wmf-navigator values - empty httpd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [20:03:50] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[2027-2040].codfw.wmnet [20:04:09] RoanKattouw: can you bundle the 
WikimediaEvents patch with it please? [20:04:19] I’ll do the config one that actually enables it later in the window [20:05:00] Will do [20:05:56] i'm at a keyboard now, so i'll slide in after roan [20:06:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248665 (https://phabricator.wikimedia.org/T419198) (owner: 10Catrope) [20:06:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253572 (https://phabricator.wikimedia.org/T419837) (owner: 10Kosta Harlan) [20:06:56] (03CR) 10Ottomata: [C:03+1] Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253633 (https://phabricator.wikimedia.org/T419055) (owner: 10Xcollazo) [20:07:19] (03Merged) 10jenkins-bot: Enable passwordless login in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248665 (https://phabricator.wikimedia.org/T419198) (owner: 10Catrope) [20:08:12] i'd appreciate if someone could sync my patches as well. they are low-risk and can be bundled with whatever [20:08:23] (03CR) 10Xcollazo: [C:03+2] Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1253633 (https://phabricator.wikimedia.org/T419055) (owner: 10Xcollazo) [20:08:42] I can do them after cscott does his [20:09:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS trixie [20:10:08] (03Merged) 10jenkins-bot: Instrument clicks on external links to selected domains [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253572 (https://phabricator.wikimedia.org/T419837) (owner: 10Kosta Harlan) [20:10:26] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1248665|Enable passwordless login in production (T419198)]], [[gerrit:1253572|Instrument clicks on external links to selected domains (T419837)]] [20:10:31] T419198: Deploy passwordless login - https://phabricator.wikimedia.org/T419198 [20:10:31] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837 [20:10:32] FIRING: [6x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:10:44] (03Merged) 10jenkins-bot: Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1253633 (https://phabricator.wikimedia.org/T419055) (owner: 10Xcollazo) [20:11:40] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage [20:12:02] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [20:12:07] (03CR) 10Dreamy Jazz: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [20:12:16] !log catrope@deploy2002 kharlan, catrope: Backport for [[gerrit:1248665|Enable passwordless login in production (T419198)]], [[gerrit:1253572|Instrument clicks on external links to selected domains (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:12:56] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [20:13:15] !log catrope@deploy2002 kharlan, catrope: Continuing with sync [20:15:11] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6003.drmrs.wmnet with OS trixie [20:15:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2007-dev.codfw.wmnet with reason: host reimage [20:15:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2041.codfw.wmnet with reason: host reimage [20:16:12] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [20:17:09] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248665|Enable passwordless login in production (T419198)]], [[gerrit:1253572|Instrument clicks on external links to selected domains (T419837)]] (duration: 06m 43s) [20:17:13] 
T419198: Deploy passwordless login - https://phabricator.wikimedia.org/T419198 [20:17:14] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837 [20:17:19] cscott: Go ahead [20:17:27] ok, thanks! [20:17:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253551 (https://phabricator.wikimedia.org/T419908) (owner: 10C. Scott Ananian) [20:18:04] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp700[1-4].magru.wmnet} and A:cp [20:18:18] can I go after cscott? [20:19:04] If it's okay with MatmaRex [20:19:20] sure [20:19:43] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp70[09-12].magru.wmnet} and A:cp [20:19:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2007-dev.codfw.wmnet with reason: host reimage [20:19:49] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp6003.drmrs.wmnet [reason: trixie reimaging] [20:19:50] …if you sync my patches as well ;) i don't have access [20:20:12] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp6005.drmrs.wmnet [reason: trixie reimaging] [20:20:18] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6004.drmrs.wmnet with OS trixie [20:20:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:10] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp6005.drmrs.wmnet with OS trixie [20:21:15] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: 
name=cp6004.drmrs.wmnet [reason: trixie reimaging] [20:21:54] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp6006.drmrs.wmnet [reason: trixie reimaging] [20:22:07] I was hoping to just sync my config patch, as it's late here. Perhaps RoanKattouw could ship your patches? [20:22:21] Yes happy to [20:22:32] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp6006.drmrs.wmnet with OS trixie [20:22:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2042.codfw.wmnet with reason: host reimage [20:23:17] thanks [20:23:31] brett@cumin2002 decommission (PID 3155499) is awaiting input [20:25:38] (03Merged) 10jenkins-bot: Fix double post-processing in legacy preview case [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253551 (https://phabricator.wikimedia.org/T419908) (owner: 10C. Scott Ananian) [20:25:59] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1253551|Fix double post-processing in legacy preview case (T419908)]] [20:26:04] T419908: DiscussionTools duplicates the meta-data sub-header content when Previewing - https://phabricator.wikimedia.org/T419908 [20:27:46] !log cscott@deploy2002 cscott: Backport for [[gerrit:1253551|Fix double post-processing in legacy preview case (T419908)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:28:33] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7001.magru.wmnet [20:28:38] brett@cumin2002 decommission (PID 3155499) is awaiting input [20:28:49] !log cscott@deploy2002 cscott: Continuing with sync [20:29:19] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7009.magru.wmnet [20:30:32] FIRING: [5x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:32:52] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253551|Fix double post-processing in legacy preview case (T419908)]] (duration: 06m 52s) [20:32:56] T419908: DiscussionTools duplicates the meta-data sub-header content when Previewing - https://phabricator.wikimedia.org/T419908 [20:33:01] ok over to kostajh [20:33:16] (03PS3) 10Herron: kafkamon: rename class [puppet] - 10https://gerrit.wikimedia.org/r/1253505 (https://phabricator.wikimedia.org/T418858) [20:33:31] or MatmaRex sorry wasn't watching the backlog conversation [20:33:47] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [20:33:55] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [20:34:05] maybe its RoanKattouw who is going to sync kostajh and MatmaRex's patches? 
[20:34:40] thanks [20:34:42] I thought kostajh was going to sync just his own patch, and then I was going to sync MatmaRex's patches after that [20:34:45] I can sync my config patch [20:34:50] yes [20:34:53] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6014.* [20:35:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [20:36:46] (03Merged) 10jenkins-bot: Configure external link aggregate usage on 12 wikis for top domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253566 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [20:37:05] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1253566|Configure external link aggregate usage on 12 wikis for top domains (T419837)]] [20:37:09] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837 [20:38:55] !log kharlan@deploy2002 kharlan, mszwarc: Backport for [[gerrit:1253566|Configure external link aggregate usage on 12 wikis for top domains (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:39:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:39:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2041.codfw.wmnet with OS trixie [20:40:12] !log kharlan@deploy2002 kharlan, mszwarc: Continuing with sync [20:41:03] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage [20:41:22] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp2041.codfw.wmnet with reason: Testing hosts - not for production [20:41:29] (03PS1) 10CDanis: wmf-auto-restart: also require the base Service [puppet] - 10https://gerrit.wikimedia.org/r/1253645 [20:41:42] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253645 (owner: 10CDanis) [20:41:44] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage [20:42:21] jhancock@cumin2002 reimage (PID 3150147) is awaiting input [20:43:26] !log brett@cumin2002 START - Cookbook sre.dns.netbox [20:43:38] (03CR) 10CI reject: [V:04-1] wmf-auto-restart: also require the base Service [puppet] - 10https://gerrit.wikimedia.org/r/1253645 (owner: 10CDanis) [20:44:04] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253566|Configure external link aggregate usage on 12 wikis for top domains (T419837)]] (duration: 06m 59s) [20:44:08] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837 [20:44:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:44:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephmon2007-dev.codfw.wmnet with OS bookworm [20:44:33] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11715774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephmon2007-dev.codfw.wmnet with OS bookworm completed: - cloudcephmon2007-dev... [20:44:43] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2042.codfw.wmnet with OS trixie [20:45:09] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage [20:45:10] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp2042.codfw.wmnet with reason: Testing hosts - not for production [20:45:21] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11715777 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @Andrew i swear i didn't forget about you. This is complete. 
[20:45:32] FIRING: [6x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:45:52] RoanKattouw: over to you [20:46:40] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS trixie [20:47:06] (03PS2) 10CDanis: wmf-auto-restart: also require the base Service [puppet] - 10https://gerrit.wikimedia.org/r/1253645 [20:47:20] (03PS3) 10CDanis: wmf-auto-restart: also require the base Service [puppet] - 10https://gerrit.wikimedia.org/r/1253645 [20:47:50] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253645 (owner: 10CDanis) [20:48:43] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS trixie [20:49:12] brett@cumin2002 decommission (PID 3155499) is awaiting input [20:49:23] (03CR) 10CI reject: [V:04-1] wmf-auto-restart: also require the base Service [puppet] - 10https://gerrit.wikimedia.org/r/1253645 (owner: 10CDanis) [20:49:30] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage [20:49:40] (03PS1) 10Xcollazo: Revert "stream: mw-content-history-reconcile-enrich" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253650 [20:50:30] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[2027-2040].codfw.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [20:50:35] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[2027-2040].codfw.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [20:50:36] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:37] 
!log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[2027-2040].codfw.wmnet [20:51:29] Thanks, I'll finish up with MatmaRex's patches now [20:52:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253623 (https://phabricator.wikimedia.org/T417278) (owner: 10Bartosz Dziewoński) [20:52:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253625 (https://phabricator.wikimedia.org/T414338) (owner: 10Bartosz Dziewoński) [20:52:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253626 (https://phabricator.wikimedia.org/T418957) (owner: 10Bartosz Dziewoński) [20:52:21] (03Abandoned) 10CDanis: wmf-auto-restart: also require the base Service [puppet] - 10https://gerrit.wikimedia.org/r/1253645 (owner: 10CDanis) [20:53:27] (03CR) 10Xcollazo: [V:03+2 C:03+2] Revert "stream: mw-content-history-reconcile-enrich" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253650 (owner: 10Xcollazo) [20:54:03] thanks [20:54:35] (03Merged) 10jenkins-bot: Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253625 (https://phabricator.wikimedia.org/T414338) (owner: 10Bartosz Dziewoński) [20:54:41] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [20:54:51] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [20:55:23] (03Merged) 10jenkins-bot: Configure $wgApiClientErrorSampleRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253626 (https://phabricator.wikimedia.org/T418957) (owner: 10Bartosz Dziewoński) [20:55:29] (03CR) 
10Dzahn: [C:03+1] "since https://phabricator.wikimedia.org/T417998 has been closed which was about aligning timeouts between ATS and Apache - the next step w" [puppet] - 10https://gerrit.wikimedia.org/r/1241048 (https://phabricator.wikimedia.org/T246763) (owner: 10Hashar) [20:56:31] (03PS2) 10BCornwall: hiera: Remove single_backend from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) [20:57:05] (03Merged) 10jenkins-bot: Fix client credentials access tokens [extensions/OAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253623 (https://phabricator.wikimedia.org/T417278) (owner: 10Bartosz Dziewoński) [20:57:26] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]] [20:57:38] T417278: Choosing client credentials grant for OAuth 2 results in an anonymous access token - https://phabricator.wikimedia.org/T417278 [20:57:38] T419921: TypeError: MediaWiki\Extension\OAuth\ResourceServer::getUser(): Return value must be of type MediaWiki\User\User, false returned - https://phabricator.wikimedia.org/T419921 [20:57:38] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [20:57:39] T418957: Add client-side logging for non-MediaWiki action API errors (HTTP 429) - https://phabricator.wikimedia.org/T418957 [20:59:17] !log catrope@deploy2002 matmarex, catrope: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [21:00:05] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T2100). [21:00:20] Hold on, I'm still doing a backport deploy [21:00:27] MatmaRex: Please test your patches [21:01:04] Are there patches to deploy in the Security window? [21:01:11] I also have patches to deploy in the backport window [21:01:30] RoanKattouw: looks good [21:01:39] !log catrope@deploy2002 matmarex, catrope: Continuing with sync [21:03:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:04:56] (03CR) 10Xcollazo: [V:03+2 C:03+2] "Linking this back to https://phabricator.wikimedia.org/T408918#11715866" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253650 (owner: 10Xcollazo) [21:05:32] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]] (duration: 08m 06s) [21:05:45] T417278: Choosing client credentials grant for OAuth 2 results in an anonymous access token - https://phabricator.wikimedia.org/T417278 [21:05:45] OK I'm done [21:05:45] T419921: TypeError: MediaWiki\Extension\OAuth\ResourceServer::getUser(): Return value must be of type MediaWiki\User\User, false returned - https://phabricator.wikimedia.org/T419921 [21:05:46] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [21:05:46] T418957: Add client-side logging for non-MediaWiki action API errors (HTTP 429) - 
https://phabricator.wikimedia.org/T418957 [21:06:22] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage [21:07:26] Dreamy_Jazz: Go ahead, I don't believe there are any security patches today [21:07:55] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [21:08:22] (03PS1) 10Xcollazo: Revert "stream: mw-page-content-change-enrich" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253653 [21:08:25] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host zuul1003.eqiad.wmnet with OS trixie [21:08:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:08:46] (03CR) 10Xcollazo: [V:03+2 C:03+2] Revert "stream: mw-page-content-change-enrich" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253653 (owner: 10Xcollazo) [21:09:18] thanks RoanKattouw [21:09:22] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8281/co" [puppet] - 10https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [21:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:09:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage [21:10:11] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7002.magru.wmnet [21:10:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251848 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [21:10:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251865 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [21:10:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251888 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz) [21:10:55] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7010.magru.wmnet [21:10:55] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6005.drmrs.wmnet with OS trixie [21:11:39] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp6005.drmrs.wmnet [reason: trixie reimaging] [21:11:48] (03CR) 10EggRoll97: [C:03+1] "Acknowledged that this will resolve once the permission is removed, it was never added to wgAddGroups in the first place." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [21:12:04] (03Merged) 10jenkins-bot: Disable CheckUser on closed wikis where no checks were ever made [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251848 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [21:12:07] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp6007.drmrs.wmnet [reason: trixie reimaging] [21:12:07] (03Merged) 10jenkins-bot: Uninstall SecurePoll from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251865 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [21:12:11] (03Merged) 10jenkins-bot: DiscussionTools: Uninstall wikis closed before permalinks were deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251888 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz) [21:12:30] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1251848|Disable CheckUser on closed wikis where no checks were ever made (T420062)]], [[gerrit:1251865|Uninstall SecurePoll from closed wikis (T420062)]], [[gerrit:1251888|DiscussionTools: Uninstall wikis closed before permalinks were deployed (T420052)]] [21:12:40] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp6007.drmrs.wmnet with OS trixie [21:12:40] T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062 [21:12:41] T420052: Drop extensions from closed wikis where the database tables are unused - https://phabricator.wikimedia.org/T420052 [21:13:28] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [21:14:20] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1251848|Disable CheckUser on closed wikis where no checks were ever made (T420062)]], [[gerrit:1251865|Uninstall SecurePoll from 
closed wikis (T420062)]], [[gerrit:1251888|DiscussionTools: Uninstall wikis closed before permalinks were deployed (T420052)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:14:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:14:48] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [21:14:50] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8282/console" [puppet] - 10https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [21:14:52] (03PS1) 10Arlolra: Deploy PRV to XX wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) [21:15:00] PROBLEM - Host wikikube-worker1291 is DOWN: PING CRITICAL - Packet loss = 100% [21:15:05] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6006.drmrs.wmnet with OS trixie [21:15:09] (03CR) 10BCornwall: hiera: Remove single_backend from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [21:15:25] (03PS2) 10BCornwall: hiera: Set default codfw storage_elements [puppet] - 10https://gerrit.wikimedia.org/r/1253606 (https://phabricator.wikimedia.org/T401832) [21:15:32] FIRING: [7x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:15:53] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp6006.drmrs.wmnet [reason: trixie reimaging] [21:16:12] !log
cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp6008.drmrs.wmnet [reason: trixie reimaging] [21:17:10] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8284/co" [puppet] - 10https://gerrit.wikimedia.org/r/1253606 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [21:17:18] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS trixie [21:17:54] (03PS1) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655 [21:18:40] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251848|Disable CheckUser on closed wikis where no checks were ever made (T420062)]], [[gerrit:1251865|Uninstall SecurePoll from closed wikis (T420062)]], [[gerrit:1251888|DiscussionTools: Uninstall wikis closed before permalinks were deployed (T420052)]] (duration: 06m 10s) [21:18:45] T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062 [21:18:45] T420052: Drop extensions from closed wikis where the database tables are unused - https://phabricator.wikimedia.org/T420052 [21:19:23] !log Evening UTC backport window done [21:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:22:36] (03PS2) 10Arlolra: Deploy PRV to XX wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) [21:22:47] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul1003.eqiad.wmnet with 
reason: host reimage [21:23:50] (03PS2) 10Herron: icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) [21:24:29] (03PS3) 10Herron: icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) [21:25:03] (03CR) 10Dzahn: "I was going to warn of this a little bit but then I saw the "execcondition" and that removed the concern. :)" [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: 10Herron) [21:25:26] (03CR) 10CI reject: [V:04-1] icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: 10Herron) [21:27:38] (03PS4) 10Herron: icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) [21:27:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:28:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1003.eqiad.wmnet with reason: host reimage [21:29:52] the "wide-spread puppet" alert is not that widespread - seems caused by the cp6 reimages [21:30:13] generally that alert seems to hang out right under the alerting threshold at all times [21:31:06] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 5 others: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11716017 (10matmarex) Two of the patches from this task (and their backports...
[21:32:14] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage [21:32:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:35:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS trixie [21:36:16] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [21:36:59] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6013.* [21:37:37] (03PS2) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655 [21:38:15] (03CR) 10Codename Noreste: "On the $wgAddGroups and $wgRemoveGroups, I couldn't find the code that the editor user group can be granted or removed by admins (for idwi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [21:38:47] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage [21:39:01] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS trixie [21:39:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1036:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1036 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:39:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS trixie [21:40:10] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: 
name=cp6012.* [21:40:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6010.drmrs.wmnet with OS trixie [21:41:57] 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884#11716070 (10RobH) Sent a gentle followup to the Dell team today. [21:42:15] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [21:42:53] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul1003.eqiad.wmnet with OS trixie [21:49:27] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7003.magru.wmnet [21:49:38] FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [21:52:26] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7011.magru.wmnet [21:55:43] I'm going to fiddle in mw-experimental unless someone has a reason I shouldn't. 
[21:58:12] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp6008.drmrs.wmnet with OS trixie [21:58:55] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [21:59:01] (03PS2) 10RLazarus: _mediawiki-common_: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247677 (https://phabricator.wikimedia.org/T411807) [21:59:12] (03PS2) 10RLazarus: wikifunctions and friends: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247678 (https://phabricator.wikimedia.org/T411807) [21:59:36] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [22:02:12] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282 (10Andrew) 03NEW [22:02:19] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS trixie [22:03:08] 06SRE, 06Infrastructure-Foundations: Consider reducing verbosity of IRC logging - https://phabricator.wikimedia.org/T419919#11716165 (10bd808) >>! In T419919#11713106, @Volans wrote: > And AFAIK both wikitech SAL and toolforge SAL don't have any live update mechanism right now. So bypassing IRC to store them d... 
[22:03:48] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6007.drmrs.wmnet with OS trixie [22:04:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [22:05:46] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp6007.drmrs.wmnet [reason: trixie reimaging] [22:06:42] (03PS1) 10Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T419663) [22:07:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage [22:07:40] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:07:41] (03CR) 10Neriah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T419663) (owner: 10Neriah) [22:13:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:17:38] !log blake@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1020-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [22:18:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:19:47] (03CR) 10Jforrester: [C:03+2] _mediawiki-common_: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247677 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [22:19:49] (03CR) 10Jforrester: [C:03+2] wikifunctions and friends: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247678 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [22:20:41] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [22:21:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:22:40] (03Merged) 10jenkins-bot: _mediawiki-common_: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247677 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [22:23:06] (03Merged) 10jenkins-bot: wikifunctions and friends: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247678 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [22:24:23] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [22:26:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:27:28] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker[1285-1289,1291-1299].eqiad.wmnet [22:27:32] !log jforrester@deploy2002 Started scap sync-world: T411807 [22:27:36] T411807: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807 [22:28:10] !log jforrester@deploy2002 jforrester: T411807 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:28:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:30:10] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS trixie [22:30:14] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11716213 (10Papaul) Removed interface et-0/0/1.1221 from both routers and cleanup all reference for sandbox1-ulsfo in Netbox ` - unit 1221 { - descr... 
[22:30:45] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7004.magru.wmnet [22:30:46] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp700[1-4].magru.wmnet} and A:cp [22:31:34] !log blake@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1285-1289,1291-1299].eqiad.wmnet [22:31:57] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7012.magru.wmnet [22:31:58] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp70[09-12].magru.wmnet} and A:cp [22:32:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6010.drmrs.wmnet with OS trixie [22:33:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:33:53] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11716224 (10Papaul) @ssingh I double check all for the new prefix 198.35.26.224/27 in Netbox all looks good. You can make your changes any time. Please... 
[22:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:35:52] !log jforrester@deploy2002 jforrester: Continuing with sync [22:35:59] (03PS2) 10Scott French: mw-(api-int|web): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253664 (https://phabricator.wikimedia.org/T364245) [22:35:59] (03CR) 10Scott French: "Thanks in advance for the review, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253664 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [22:37:00] !log jforrester@deploy2002 Finished scap sync-world: T411807 (duration: 11m 10s) [22:37:03] T411807: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807 [22:44:19] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11716272 (10Papaul) [22:46:03] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11716280 (10colewhite) [22:49:24] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6008.drmrs.wmnet with OS trixie [22:52:24] 06SRE, 06Infrastructure-Foundations, 10netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11716296 (10Papaul) 05Open→03Resolved Both diagrams for esams are now up to date. Closing this. 
[22:54:09] (03PS3) 10Jforrester: mc: Shift the Wikifunctions MC route from /local/wf/ to //wf-wan/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247687 (https://phabricator.wikimedia.org/T411807) [22:54:41] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet [reason: trixie reimaging] [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260316T2300) [23:04:03] Are any deploys still in process? [23:05:38] ok starting backport process [23:06:51] (03Abandoned) 10Jforrester: [DNM] memcached: Point to the replicated Wikifunctions cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1229232 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [23:07:01] Jdlrobson: No, go for it. [23:08:33] (03CR) 10RLazarus: [C:03+1] mw-(api-int|web): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253664 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [23:09:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253604 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [23:09:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251157 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [23:13:38] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:20:57] (03Merged) 10jenkins-bot: Support duplication of languages in header and main menu [skins/Vector] 
(wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251157 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [23:21:00] (03CR) 10CI reject: [V:04-1] Don't output language HTML when no languages present [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253604 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [23:22:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253604 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [23:28:43] (03PS3) 10Bartosz Dziewoński: rest-gateway: handle trust level C with invalid token. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: 10Daniel Kinzler) [23:28:56] (03CR) 10Bartosz Dziewoński: "(Un-tagged T420011, since I closed it in favor of the other task)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: 10Daniel Kinzler) [23:30:07] (03CR) 10Bartosz Dziewoński: [C:03+1] rest-gateway: handle trust level C with invalid token. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: 10Daniel Kinzler) [23:32:24] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp601(0|1).* [23:36:05] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp6009.drmrs.wmnet with OS trixie [23:37:36] (03Merged) 10jenkins-bot: Don't output language HTML when no languages present [skins/Vector] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1253604 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [23:39:22] spiderpig question.. this is new.. my 2 patches merged but it is reporting ERROR [23:39:42] How can I sync the staged changes? 
[23:41:15] (03PS3) 10Jdlrobson: Enable languages in main menu on Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251158 (https://phabricator.wikimedia.org/T419730) [23:41:41] I am guessing from this experience I can't deploy 2 chained patches at the same time with spiderpig? [23:41:50] Specifically it says "The change '1253604' failed build tests and could not be merged" [23:42:51] (Yes, despite it being a merged patch whose pipeline builds succeeded.) [23:43:02] And my understanding is that whoever deploys next will accidentally deploy my changes. [23:44:06] Unless the repo is now wedged into a state where nobody can deploy because there's some internal test that'll always fail with the current state of skins/Vector, I guess. [23:44:15] there is a retry button [23:45:19] hmm who is still around in RelEng land? I tried pinging @dduvall and @dancy [23:49:21] ok RoanKattouw is giving me some advice [23:49:59] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1253604|Don't output language HTML when no languages present (T419730)]], [[gerrit:1251157|Support duplication of languages in header and main menu (T419730)]] [23:50:03] T419730: Vector 2022 should support duplication of languages in header and sidebar - https://phabricator.wikimedia.org/T419730 [23:50:30] RoanKattouw suggested retrying and sure enough that seems to be working [23:50:40] Jdlrobson: I can take a look in 5 [23:50:54] ah yeah, retry makes sense [23:50:59] Super weird though, it claimed The change '1253604' failed build tests and could not be merged , but that change was in fact merged [23:51:07] The second retry seems to have worked so far [23:51:19] Logs of the failed attempts are at https://spiderpig.wikimedia.org/jobs/1564 and https://spiderpig.wikimedia.org/jobs/1565 [23:51:24] If the patchset is already merged, spiderpig will skip that step [23:51:37] we can look at why that occurred tomorrow [23:51:47] !log jdlrobson@deploy2002 jdlrobson: 
Backport for [[gerrit:1253604|Don't output language HTML when no languages present (T419730)]], [[gerrit:1251157|Support duplication of languages in header and main menu (T419730)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:51:59] 🚡 [23:52:07] OK I think I might understand what happened [23:52:44] The first time (job 1564), TrainBranchBot +2ed the change, and the build failed, so the change didn't merge and "scap backport" aborted. This makes sense. [23:52:49] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [23:53:05] The second time (job 1565), TrainBranchBot +2ed it again, then saw that it had previously failed and aborted immediately. This seems like a bug [23:53:47] Then later, while Jon was still asking confused questions here, the second CI run finished and did succeed, and the change merged [23:54:22] So then when Jon just tried the third time, the patch was already merged, and everything went fine [23:54:31] ( dduvall ---^^ ) [23:56:02] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage [23:56:43] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1253604|Don't output language HTML when no languages present (T419730)]], [[gerrit:1251157|Support duplication of languages in header and main menu (T419730)]] (duration: 06m 44s) [23:56:47] T419730: Vector 2022 should support duplication of languages in header and sidebar - https://phabricator.wikimedia.org/T419730 [23:59:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251158 (https://phabricator.wikimedia.org/T419730) (owner: 10Jdlrobson) [23:59:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage