[00:00:05] (Merged) jenkins-bot: Enable languages in main menu on Russian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1251158 (https://phabricator.wikimedia.org/T419730) (owner: Jdlrobson)
[00:00:44] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1251158|Enable languages in main menu on Russian Wikipedia (T419730)]]
[00:02:25] RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:02:39] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1251158|Enable languages in main menu on Russian Wikipedia (T419730)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:02:43] T419730: Vector 2022 should support duplication of languages in header and sidebar - https://phabricator.wikimedia.org/T419730
[00:03:49] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync
[00:04:49] RoanKattouw: ok, thank you for that. perhaps it's a race condition between the messages/actions taken by gerrit and spiderpig. please file a bug and we will have a closer look
[00:04:50] Jdlrobson: are you unblocked for your deployment now?
[00:06:38] (PS12) Bartosz Dziewoński: rest-gateway rate limiting: add CORS headers [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler)
[00:06:48] (CR) Bartosz Dziewoński: [C:+1] "Looks right, as far as I can tell."
[deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler)
[00:07:41] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251158|Enable languages in main menu on Russian Wikipedia (T419730)]] (duration: 06m 57s)
[00:07:45] T419730: Vector 2022 should support duplication of languages in header and sidebar - https://phabricator.wikimedia.org/T419730
[00:08:01] ok all done. Thanks for the troubleshooting dduvall RoanKattouw
[00:08:16] RoanKattouw: would you be able to raise a bug since it seems like you have a good handle on what happened here
[00:08:49] I gotta run but I'll file one tomorrow
[00:09:38] Jdlrobson: good to hear! RoanKattouw: thank you!
[00:20:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6009.drmrs.wmnet with OS trixie
[00:26:11] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6009.*
[00:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:32:59] ops-eqiad, SRE, DC-Ops: Cable cleanup in rack - https://phabricator.wikimedia.org/T420266#11716635 (VRiley-WMF) Open→Resolved This is completed
[00:35:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on
wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:39:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1253686
[00:39:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1253686 (owner: TrainBranchBot)
[00:42:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:43:52] (CR) Ssingh: [C:+1] hiera: Remove single_backend from codfw [puppet] - https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: BCornwall)
[00:44:35] (CR) Ssingh: [C:+1] hiera: Set default codfw storage_elements [puppet] - https://gerrit.wikimedia.org/r/1253606 (https://phabricator.wikimedia.org/T401832) (owner: BCornwall)
[00:52:15] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1253686 (owner: TrainBranchBot)
[01:08:59] (PS1) TrainBranchBot: Branch commit for
wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253707
[01:08:59] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253707 (owner: TrainBranchBot)
[01:15:32] FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:20:22] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:20:22] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:21:22] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:21:22] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:26:07] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253707 (owner: TrainBranchBot)
[01:42:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:43:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main -
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:48:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:49:38] FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[01:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:57:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:57:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:00:05]
Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0200)
[02:00:49] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:02:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:52] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.20 [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1253715 (https://phabricator.wikimedia.org/T413811)
[02:08:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.20 [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1253715 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[02:09:00] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 10s)
[02:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for
Mediawiki mw-web releases routed via main at codfw: 23.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:20:45] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.20 [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1253715 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[02:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.54% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:33:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:38:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:52:15] FIRING: PHPFPMTooBusy: Not enough
idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0300)
[03:02:06] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1253737 (https://phabricator.wikimedia.org/T413811)
[03:02:08] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1253737 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[03:02:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:03:05] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1253737 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[03:03:36] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.20 refs T413811
[03:03:40] T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[03:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.08% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:13:38] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:43:10] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.20 refs T413811 (duration: 39m 34s)
[03:43:14] T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0400)
[04:01:19] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.17 (duration: 01m 17s)
[04:13:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:23:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle -
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:35:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:37:34] (PS1) Dwisehaupt: Point fundraising read handle back at the origin server [dns] - https://gerrit.wikimedia.org/r/1253750 (https://phabricator.wikimedia.org/T420155)
[04:39:36] (CR) Dwisehaupt: [C:+2] Point fundraising read handle back at the origin server [dns] - https://gerrit.wikimedia.org/r/1253750 (https://phabricator.wikimedia.org/T420155) (owner: Dwisehaupt)
[04:39:58] !log dwisehaupt@dns1005 START - running authdns-update
[04:41:23] !log dwisehaupt@dns1005 END - running authdns-update
[04:44:15] RESOLVED: PHPFPMTooBusy: Not enough idle
PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:00:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:26] ops-codfw, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420308 (phaultfinder) NEW
[05:15:47] FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:49:38] FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI -
https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[05:55:03] Deploying cxserver.
[05:55:10] (CR) KartikMistry: [C:+2] Update cxserver to 2026-03-16-071247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1253260 (https://phabricator.wikimedia.org/T420004) (owner: KartikMistry)
[05:57:23] (Merged) jenkins-bot: Update cxserver to 2026-03-16-071247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1253260 (https://phabricator.wikimedia.org/T420004) (owner: KartikMistry)
[05:58:34] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:58:59] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0600)
[06:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0600).
[06:04:48] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:05:19] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:06:37] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:07:13] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:08:09] !log Updated cxserver to 2026-03-16-071247-production (T420004)
[06:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:12] T420004: Translating an article with ContentTranslation may fail with Critical error - https://phabricator.wikimedia.org/T420004
[06:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:54:23] (CR) Ayounsi: "lgtm but leaving the last call to the traffic team" [puppet] - https://gerrit.wikimedia.org/r/1253538 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[06:55:09] (CR) Arnaudb: [C:+2] gerrit: ProxyTimeout shorter than Jetty's idle timeout [puppet] - https://gerrit.wikimedia.org/r/1241048 (https://phabricator.wikimedia.org/T246763) (owner: Hashar)
[06:55:28] (CR) Ayounsi: [C:+2] decom cookbook: add --homer parameter [cookbooks] - https://gerrit.wikimedia.org/r/1251099 (owner: Ayounsi)
[06:55:53] (CR) Ayounsi: [C:+2] "Not for now, but eventually yes." [cookbooks] - https://gerrit.wikimedia.org/r/1251099 (owner: Ayounsi)
[06:59:40] (CR) Ayounsi: [C:+1] cr-cloud: allow cumin/cloudcumin traffic [homer/public] - https://gerrit.wikimedia.org/r/1253574 (https://phabricator.wikimedia.org/T419996) (owner: Filippo Giunchedi)
[07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other?
A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bookworm
[07:00:59] (Merged) jenkins-bot: decom cookbook: add --homer parameter [cookbooks] - https://gerrit.wikimedia.org/r/1251099 (owner: Ayounsi)
[07:01:03] ops-esams, SRE, DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11717077 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti3005.esams.wmnet with OS bookworm
[07:03:46] ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11717082 (ayounsi) I created a meeting to not forget, and invited you both just in case.
[07:05:26] (PS1) Muehlenhoff: Remove LDAP access for lmixter [puppet] - https://gerrit.wikimedia.org/r/1253996
[07:08:28] (CR) Muehlenhoff: [C:+2] Remove LDAP access for lmixter [puppet] - https://gerrit.wikimedia.org/r/1253996 (owner: Muehlenhoff)
[07:13:38] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:19:06] (CR) Muehlenhoff: systemd::timer::job: add ExecCondition support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1253655 (owner: Herron)
[07:21:25] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[07:23:24] !log jmm@cumin2002 START - Cookbook
sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
[07:24:43] (CR) Ryan Kemper: [C:+1] alerts(blazegraph): reduce severity of CategoriesQueryServiceUpdateLagTooHigh to warning [alerts] - https://gerrit.wikimedia.org/r/1253552 (https://phabricator.wikimedia.org/T420235) (owner: Gehel)
[07:25:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
[07:25:21] ml-etcd2001 will go down for a Ganeti reboot
[07:25:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[07:27:28] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:29:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[07:30:56] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.69 ms
[07:31:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
[07:31:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
[07:32:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet
[07:32:04] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2033.codfw.wmnet
[07:32:34] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.restart-gerrit Restarting Gerrit on gerrit2003
[07:32:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
[07:33:28] RECOVERY - Host wikikube-worker1291 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[07:34:23] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.restart-gerrit (exit_code=0) Restarting Gerrit on gerrit2003
[07:35:32] FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod -
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:35:55] jmm@cumin2002 drain-node (PID 3322888) is awaiting input
[07:37:49] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1285-1289,1291-1299].eqiad.wmnet
[07:37:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1285-1289,1291-1299].eqiad.wmnet
[07:38:24] (CR) Ayounsi: "This should be able to be pushed anytime, and then followed up by a cleanup patch for the older ranges." [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: Ayounsi)
[07:41:42] jmm@cumin2002 drain-node (PID 3322888) is awaiting input
[07:41:50] (CR) Filippo Giunchedi: [C:+2] cr-cloud: allow cumin/cloudcumin traffic [homer/public] - https://gerrit.wikimedia.org/r/1253574 (https://phabricator.wikimedia.org/T419996) (owner: Filippo Giunchedi)
[07:46:00] (PS1) Arnaudb: gerrit: fix typo on isRegex for alerting downtime [cookbooks] - https://gerrit.wikimedia.org/r/1254035
[07:46:07] (CR) Arnaudb: [C:+2] gerrit: fix typo on isRegex for alerting downtime [cookbooks] - https://gerrit.wikimedia.org/r/1254035 (owner: Arnaudb)
[07:46:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org
[07:46:40] (CR) Arnaudb: [C:+1] "thanks for the change, looks good to me" [puppet] - https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[07:48:18] (CR) Arnaudb: [C:+1] miscweb: add wmf-navigator values - empty httpd [deployment-charts] - https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[07:50:36] (Merged) jenkins-bot: gerrit: fix typo on isRegex for alerting downtime [cookbooks] - https://gerrit.wikimedia.org/r/1254035 (owner: Arnaudb)
[07:51:20]
(03PS16) 10Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [07:51:38] (03CR) 10Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [07:51:44] (03CR) 10Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [07:52:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3005.esams.wmnet with OS bookworm [07:52:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org [07:52:39] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11717168 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti3005.esams.wmnet with OS bookworm completed: - ganeti3005 (**PASS**) - Downtimed on I... 
[07:54:14] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [07:55:29] (03CR) 10Elukey: [C:03+2] P:kafka::broker::monitoring: Fix legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1251539 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [07:55:45] (03CR) 10Elukey: [C:03+2] confluent: kafka::broker: Fix legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1251540 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [07:57:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org [08:00:05] andre and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0800). [08:02:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [08:04:55] (03PS1) 10Slyngshede: geo-maps: update Meta geo mapping [dns] - 10https://gerrit.wikimedia.org/r/1254092 [08:08:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:09:26] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:12:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet [08:13:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:02] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host parsoidtest1001.eqiad.wmnet 
[08:14:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3005.esams.wmnet to cluster esams03 and group B [08:14:25] !log powercycling bast2003 (stuck on reboot) [08:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3005.esams.wmnet to cluster esams03 and group B [08:18:00] PROBLEM - Host bast2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet [08:19:56] PROBLEM - Host ml-staging-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:55] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parsoidtest1001.eqiad.wmnet [08:21:04] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320 (10MoritzMuehlenhoff) 03NEW [08:21:31] (03CR) 10Mszwarc: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [08:24:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet [08:24:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet [08:25:33] (03CR) 10Gehel: [C:03+2] alerts(blazegraph): reduce severity of CategoriesQueryServiceUpdateLagTooHigh to warning [alerts] - 10https://gerrit.wikimedia.org/r/1253552 (https://phabricator.wikimedia.org/T420235) (owner: 10Gehel) [08:26:10] RECOVERY - Host ml-staging-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 31.42 ms [08:27:42] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host contint1002.wikimedia.org [08:28:59] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host 
mc-misc1001.eqiad.wmnet [08:29:05] the CI host (Jenkins/Zuul) is being restarted [08:30:27] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11717310 (10MoritzMuehlenhoff) 05Open→03Resolved p:05Triage→03Medium I've reimaged ganeti3005 and re-added it to the cluster. [08:31:41] (03PS5) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [08:31:55] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [08:32:17] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:32:36] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:34:18] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint1002.wikimedia.org [08:34:22] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet [08:34:36] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.restart-gerrit Restarting Gerrit on gerrit2002 [08:35:47] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.restart-gerrit (exit_code=0) Restarting Gerrit on gerrit2002 [08:36:04] (03PS17) 10Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [08:36:10] (03CR) 10Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 
(https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [08:38:55] (03PS17) 10Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [08:40:33] hashar: can I sync a config patch? [08:40:47] or is the train running now? cc andre [08:40:57] kostajh, train is blocked, go ahead [08:40:58] the train will run in 20 minutes [08:41:06] the deployment calendar is off by one hour because of the DST confusion time [08:41:09] no, Daylight Confusion Time [08:41:19] ok, I'm starting [08:41:34] (03PS18) 10Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [08:41:42] hashar: Deployment Calendar is bound to SF/PST time, so train window started 40min ago, AFAIK [08:41:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [08:41:56] (03PS3) 10Elukey: sre.hosts.provision: refactor bios if/else branches [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) [08:41:56] (03PS2) 10Elukey: sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216) [08:41:56] (03PS6) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [08:42:28] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:42:45] (03Merged) 10jenkins-bot: hcaptcha: 
Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [08:42:47] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:43:34] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1250575|hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (T419125)]] [08:43:39] T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125 [08:44:44] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host bast2003.wikimedia.org [08:45:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [08:45:29] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:45:35] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:45:35] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:45:35] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:46:10] FIRING: [2x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:46:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and 
cr2-drmrs (208.80.153.205) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:47:13] 06SRE, 06Data-Platform-SRE: Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11717376 (10Gehel) I don't think this alert should have been paging. The workloads we run on k8s are all supposed to be able to be down for extended periods. [08:47:28] 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11717377 (10Gehel) [08:48:59] !log kharlan@deploy2002 harroyo-wmf, kharlan: Backport for [[gerrit:1250575|hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (T419125)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:49:02] T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125 [08:49:28] (03PS1) 10Effie Mouzeli: site.pp: switch mc-misc to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1254111 (https://phabricator.wikimedia.org/T398611) [08:49:52] jmm@cumin2002 drain-node (PID 3338663) is awaiting input [08:50:17] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: migrate eqiad memcached cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [08:50:57] 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11717388 (10MoritzMuehlenhoff) >>! In T420264#11717376, @Gehel wrote: > I don't think this alert should have been paging. The workloads we run on k8s are al... 
[08:51:10] FIRING: [6x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:51:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:52:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [08:52:47] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-canary [08:53:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254111 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [08:53:33] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: switch mc-misc to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1254111 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [08:54:10] !log kharlan@deploy2002 Sync cancelled. 
[08:54:30] (03PS1) 10Kosta Harlan: Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254114 [08:54:59] (03PS2) 10Kosta Harlan: Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125) [08:55:06] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125) (owner: 10Kosta Harlan) [08:55:11] (03CR) 10TrainBranchBot: "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125) (owner: 10Kosta Harlan) [08:56:56] (03Merged) 10jenkins-bot: Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125) (owner: 10Kosta Harlan) [08:57:07] !log rebuilt the trixie d-i image for the 13.4 point release T420240 [08:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:11] T420240: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240 [08:57:23] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1254114|Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] [08:57:27] T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125 [08:57:37] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11717402 (10MoritzMuehlenhoff) [08:57:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [08:57:44] 
!log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [08:58:39] (03PS1) 10Arnaudb: gerrit: cookbook to reboot gerrit primary instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) [08:58:52] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-canary [09:00:40] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:47] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1254114|Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:02:51] T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125 [09:03:51] !log kharlan@deploy2002 kharlan: Continuing with sync [09:05:13] !log mvernon@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe [09:06:08] !log increase VRRP priority on eqiad vlans on CR2 to shift active gateway to cr2-eqiad T420180 [09:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:12] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [09:06:29] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:10:00] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254114|Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] (duration: 12m 36s) [09:10:05] T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125 [09:11:10] RESOLVED: [6x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:11:39] RESOLVED: [6x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:11:41] (03PS1) 10Effie Mouzeli: site.pp: switch mc-wf hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1254115 (https://phabricator.wikimedia.org/T398611) [09:15:06] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc1001.eqiad.wmnet [09:15:39] RECOVERY - OSPF status 
on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:20:22] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet [09:20:25] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc1002.eqiad.wmnet [09:21:38] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-eqiad [09:25:36] (03CR) 10MVernon: [C:04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T419663) (owner: 10Neriah) [09:25:39] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1002.eqiad.wmnet [09:25:45] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc2001.codfw.wmnet [09:27:48] (03PS1) 10JavierMonton: stream: mw-content-history-reconcile-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254118 (https://phabricator.wikimedia.org/T408918) [09:30:39] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:31:02] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2001.codfw.wmnet [09:31:07] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [09:36:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254115 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [09:37:17] (03CR) 10Jaime Nuche: [C:03+1] releases: remove rsync systemd units when primary server changes [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn) [09:38:13] !log 
installing openssl bugfix updates on trixie hosts [09:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:16] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: switch mc-wf hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1254115 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [09:39:39] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable gpu_memory_utilization flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254124 (https://phabricator.wikimedia.org/T418350) [09:40:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [09:42:03] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet [09:42:12] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:43:34] (03CR) 10JavierMonton: [C:03+2] stream: mw-content-history-reconcile-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254118 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [09:43:34] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254124 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [09:45:19] (03Merged) 10jenkins-bot: stream: mw-content-history-reconcile-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254118 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [09:47:56] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet [09:47:59] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet [09:48:20] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports configurable gpu_memory_utilization flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254124 
(https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [09:48:23] RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:49:38] FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:50:08] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:50:21] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports configurable gpu_memory_utilization flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254124 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [09:51:07] 10SRE-tools, 06Infrastructure-Foundations: offboard-user: Migrate Phabricator API access from user.query() to user.search() - https://phabricator.wikimedia.org/T420324 (10MoritzMuehlenhoff) 03NEW [09:52:06] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:52:31] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11717551 (10MoritzMuehlenhoff) [09:53:23] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:53:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet [09:54:08] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [09:54:12] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet [09:54:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2003.codfw.wmnet [09:54:53] FIRING: [2x] HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:55:31] ACKNOWLEDGEMENT - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2026-03-17 00:36:31 is 3 MiB, but the previous one was 3 MiB, a change of +30.0 % Jcrespo expected - The acknowledgement expires at: 2026-03-24 09:55:13. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:55:31] ACKNOWLEDGEMENT - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2026-03-17 00:40:04 is 3 MiB, but the previous one was 3 MiB, a change of +29.9 % Jcrespo expected - The acknowledgement expires at: 2026-03-24 09:55:13. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:56:28] !log mvernon@cumin1003 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe [09:56:44] !log shift traffic from codfw to eqiad off Arelion CCT to Lumen [09:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:19] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [09:57:29] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [09:57:54] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:58:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2003.codfw.wmnet [09:58:37] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:58:50] jmm@cumin2002 drain-node (PID 3353921) is awaiting input [09:59:01] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:59:24] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:59:52] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:59:56] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1000) [10:00:19] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet [10:00:22] (03PS8) 10Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) [10:00:23] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet [10:00:58] (03CR) 10Daniel Kinzler: rest-gateway: per-route jwt overrides (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [10:01:34] (03CR) 10Daniel Kinzler: "I completely changed how this works, no Lua involved anymore." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [10:01:36] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:01:37] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:01:47] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:02:54] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:03:06] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:03:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:04:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:05:16] (03PS1) 10JavierMonton: stream: mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918) [10:06:39] 10SRE-tools, 06Infrastructure-Foundations, 10Phabricator: offboard-user: Migrate Phabricator API access from user.query() to user.search() - https://phabricator.wikimedia.org/T420324#11717595 (10Aklapper) FYI pretty similar tasks: https://phabricator.wikimedia.org/maniphest/query/lV7c54v0tL3z/#R [10:06:42] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet [10:07:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:08:04] (03PS1) 10JavierMonton: stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918) [10:08:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet [10:08:22] (03PS1) 10Arnaudb: gerrit: add a ttl on ProxyPass to jetty [puppet] - 10https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189) [10:08:22] (03CR) 10Arnaudb: "no cc for Traffic here; because the issue seem to come from the internal reverse proxy, or less likely from jetty's config. 
Yesterday, @dz" [puppet] - 10https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189) (owner: 10Arnaudb) [10:09:11] ml-etcd2002 and dse-k8s-ctrl will go down for a Ganeti reboot [10:09:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet [10:10:53] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:01] PROBLEM - Host dse-k8s-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet [10:12:31] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 6 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [10:12:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1003.eqiad.wmnet [10:13:19] ^^^ VRRP status is me, it is fine [10:13:26] (03CR) 10TChin: [C:03+1] stream: mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [10:13:35] (03CR) 10TChin: [C:03+1] stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [10:14:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet [10:15:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet [10:15:29] RECOVERY - Host dse-k8s-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 31.91 ms [10:15:32] FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:15:44] (03CR) 10Jelto: [C:03+1] "lgtm" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [10:15:55] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 32.20 ms [10:16:14] FIRING: [2x] ProbeDown: Service dse-k8s-ctrl2001:6443 has failed probes (http_dse_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#dse-k8s-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:16:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [10:16:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1003.eqiad.wmnet [10:17:14] (03CR) 10JavierMonton: [C:03+2] stream: mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [10:19:01] !incidents [10:19:01] 7767 (UNACKED) [2x] ProbeDown sre (dse-k8s-ctrl2001:6443 probes/custom codfw) [10:19:07] (03Merged) 10jenkins-bot: stream: mw-content-history-reconcile-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [10:19:11] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:19:11] !ack 7767 [10:19:12] 7767 (ACKED) [2x] ProbeDown sre (dse-k8s-ctrl2001:6443 probes/custom codfw) [10:19:18] !log Delete `job/growthexperiments-listtaskcounts-29513771` from mw-cron (job stuck for more than a month) [10:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:41] jmm@cumin2002 drain-node (PID 3359051) is awaiting input [10:20:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:32] FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:20:51] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:20:51] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudgw2003-dev.codfw.wmnet [10:21:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:21:14] RESOLVED: [2x] ProbeDown: Service dse-k8s-ctrl2001:6443 has failed probes (http_dse_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#dse-k8s-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:21:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:21:38] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:21:52] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:24:35] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:25:34] !log disable EVPN IBGP peering between ssw1-d1-eqiad and ssw1-d8-eqiad T420180 [10:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:38] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [10:26:53] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:27:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:27:14] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2003-dev.codfw.wmnet [10:27:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:27:38] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudgw2004-dev.codfw.wmnet [10:28:27] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:29:26] !log stop announcing directly connected routes to L3 switches from cr1-eqiad T420180 [10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:39] FIRING: CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and ssw1-d8-eqiad (10.64.128.18) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=ssw1-d1-eqiad:9804&var-bgp_group=ibgp_evpn&var-bgp_neighbor=ssw1-d8-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:31:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet [10:31:53] ^^^ that core bgp one is me too... 
silencing [10:33:11] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [10:33:38] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2004-dev.codfw.wmnet [10:34:59] (03Merged) 10jenkins-bot: stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [10:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:35:44] jmm@cumin2002 drain-node (PID 3359051) is awaiting input [10:35:49] ^^ this is not me but a possible spanner in the works [10:36:35] actually that is the as-yet unconfigured 100G circuit, so not an issue [10:36:56] jmm@cumin2002 drain-node (PID 3363050) is awaiting input [10:37:20] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:37:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [10:38:03] aux-k8s-etcd2005 will go down for a Ganeti reboot [10:38:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet [10:39:03] (03PS4) 10Elukey: sre.hosts.provision: refactor bios if/else branches [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) [10:39:03] (03PS3) 10Elukey: sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216) [10:39:04] (03PS7) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [10:39:32] !log javiermonton@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [10:39:49] !log javiermonton@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:39:53] RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:40:07] PROBLEM - Host aux-k8s-etcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [10:40:22] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:40:29] RECOVERY - Host aux-k8s-etcd2005 is UP: PING OK - Packet loss = 0%, RTA = 32.14 ms [10:40:48] (03PS4) 10Muehlenhoff: Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) [10:41:30] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:41:53] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:42:18] !log cease announcing routed networks from ssw1-d1-eqiad to cr1-eqiad in BGP T420180 [10:42:21] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: handle trust level C with invalid token. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: 10Daniel Kinzler) [10:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:22] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [10:43:07] !log javiermonton@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [10:43:19] !log javiermonton@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:43:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet [10:43:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet [10:43:42] (03PS2) 10Daniel Kinzler: rest-gateway: test: check for www-authenticate error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) [10:43:46] (03PS1) 10Fabfur: aptrepo: new haproxy32 component for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1254146 
(https://phabricator.wikimedia.org/T419825) [10:43:47] (03PS2) 10Daniel Kinzler: rest gateways: tests: check heathz first [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250637 [10:44:35] (03CR) 10Daniel Kinzler: rest-gateway rate limit: add DENY policy and class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler) [10:44:35] (03Merged) 10jenkins-bot: rest-gateway: handle trust level C with invalid token. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: 10Daniel Kinzler) [10:44:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) (owner: 10Muehlenhoff) [10:45:04] !log javiermonton@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [10:45:12] !log javiermonton@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:45:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet [10:45:25] (03CR) 10Daniel Kinzler: rest-gateway rate limit: add DENY policy and class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler) [10:45:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [10:46:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [10:47:49] PROBLEM - Host ssw1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [10:48:06] (03CR) 10Kamila Součková: [C:03+1] rest gateways: tests: check heathz first [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250637 (owner: 10Daniel Kinzler) [10:48:49] kubestagemaster2004 will go down for a Ganeti reboot [10:48:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
ganeti2039.codfw.wmnet [10:48:58] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: test: check for www-authenticate error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) (owner: 10Daniel Kinzler) [10:49:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [10:49:06] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:49:23] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:51:15] PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [10:51:26] (03PS1) 10Daniel Kinzler: rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254148 [10:51:41] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:52:16] !enable graceful-shutdown sender for internet BGP peerings on cr1-eqiad [10:52:24] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-eqiad [10:52:53] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254148 (owner: 10Daniel Kinzler) [10:53:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [10:54:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet [10:54:55] (03Merged) 10jenkins-bot: rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254148 (owner: 10Daniel Kinzler) [10:54:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2039.codfw.wmnet [10:55:12] FIRING: [2x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:55:19] (03PS1) 10Abijeet Patro: Enable ULS rewrite beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) [10:55:32] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:56:03] RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 32.18 ms [10:56:29] hashar: Would you mind looking again into https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/1251102 after your restart the test workflow passed, but gate-and-submit failed "10:11:44 Exception: Error cloning https://gerrit.wikimedia.org/r/mediawiki/extensions/WikimediaCampaignEvents to /workspace/src/extensions/WikimediaCampaignEvents" [10:56:41] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:56:59] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:57:51] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:58:09] !log prepend external BGP announcements from cr1-eqiad T420180 [10:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:13] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [10:58:40] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:58:58] (03PS8) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) 
[10:59:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:00:12] RESOLVED: [3x] ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:00:32] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:00:58] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:01:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [11:01:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [11:01:44] (03PS1) 10A smart kitten: Revert "Create tests for NotificationMapper::deleteByUserAndAge" [extensions/Echo] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254155 (https://phabricator.wikimedia.org/T383948) [11:01:55] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:02:43] (03PS1) 10A smart kitten: Revert "Delete old notifications of users" [extensions/Echo] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254156 (https://phabricator.wikimedia.org/T383948) [11:03:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet [11:04:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [11:04:22] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) 
[11:04:49] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:05:14] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:05:22] 10SRE-swift-storage, 10observability: Add FileBackend statsd metrics and a dashboard - https://phabricator.wikimedia.org/T217754#11717854 (10Aklapper) [11:05:50] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:06:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [11:09:08] (03CR) 10Daniel Kinzler: [C:03+2] rest gateways: tests: check heathz first [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250637 (owner: 10Daniel Kinzler) [11:09:10] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: test: check for www-authenticate error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) (owner: 10Daniel Kinzler) [11:09:23] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350) [11:09:41] jmm@cumin2002 drain-node (PID 3369380) is awaiting input [11:11:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet [11:11:12] (03Merged) 10jenkins-bot: rest gateways: tests: check heathz first [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250637 (owner: 10Daniel Kinzler) [11:11:24] (03Merged) 10jenkins-bot: rest-gateway: test: check for www-authenticate error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) (owner: 10Daniel Kinzler) [11:11:36] (03PS9) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 
(https://phabricator.wikimedia.org/T414216) [11:13:24] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [11:14:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [11:15:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [11:15:28] (03CR) 10A smart kitten: "('officially' proposing a revert on -wmf.20 per my comments on the task. when -wmf.21 comes along, I would also personally cherry-pick thi" [extensions/Echo] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254156 (https://phabricator.wikimedia.org/T383948) (owner: 10A smart kitten) [11:15:29] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [11:16:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet [11:16:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet [11:17:27] (03PS1) 10Jelto: gitlab: start ssh-gitlab service after network-online and gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) [11:17:45] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [11:18:59] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:19:31] (03CR) 10Arnaudb: [C:03+1] "looks good to me! question out of curiosity inline" [puppet] - 10https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: 10Jelto) [11:20:38] (03CR) 10Jelto: gitlab: start ssh-gitlab service after network-online and gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: 10Jelto) [11:21:35] (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:21:52] (03PS1) 10Zabe: Remove misplaced readonly from CategoryViewer::$query [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) [11:22:01] (03CR) 10Arnaudb: [C:03+1] gitlab: start ssh-gitlab service after network-online and gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: 10Jelto) [11:22:06] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Remove misplaced readonly from CategoryViewer::$query [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: 10Zabe) [11:22:49] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[11:24:40] !log reduce local-preference for BGP routes learnt from servers on cr1-eqiad T420180 [11:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:44] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [11:27:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet [11:28:15] !log failover Ganeti master in eqsin to ganeti5004 [11:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install7002.wikimedia.org [11:30:52] jmm@cumin2002 drain-node (PID 3374664) is awaiting input [11:31:11] PROBLEM - ganeti-wconfd running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:33:00] !log roll-reboot apus frontends (eqiad) for March reboots [11:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:03] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-cluster [11:34:38] (03CR) 10Clément Goubert: "Small comments, otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [11:34:44] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [11:36:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install7002.wikimedia.org [11:38:50] (03CR) 10JMeybohm: cassandra-http-gateway: new chart based on aqs-http-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [11:39:09] !log stop 
accepting external routes on ssw1-d1-eqiad from cr1-eqiad T420180 [11:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:13] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [11:39:36] (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:39:38] (03CR) 10JMeybohm: "You should bump the version in Chart.yaml to release new version." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [11:40:33] (03CR) 10JMeybohm: "LGTM so far, but we should re-run CI after the parents have been merged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [11:40:38] (03PS2) 10Abijeet Patro: Enable ULS rewrite beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) [11:41:15] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker13[00-47].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [11:41:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [11:41:43] jmm@cumin2002 drain-node (PID 3374664) is awaiting input [11:43:09] PROBLEM - Host ssw1-d1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [11:43:28] !log btullis@cumin1003 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: Reboot Presto nodes [11:43:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [11:45:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet [11:46:55] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host install6003.wikimedia.org [11:47:31] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [11:47:50] (03CR) 10JMeybohm: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: 10Blake) [11:48:15] (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:48:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6003.wikimedia.org [11:48:40] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-eqiad [11:49:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5003.wikimedia.org [11:49:54] !log btullis@cumin1003 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [11:51:48] (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:51:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [11:52:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [11:52:22] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [11:52:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254146 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [11:52:52] (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 
(https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:52:57] (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:53:22] !log reset BGP session to ssw1-d1-eqiad from lsw1-d1-eqiad T420180 [11:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:26] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [11:54:23] !log reset BGP session to ssw1-d1-eqiad from lsw1-d3-eqiad T420180 [11:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:41] !log jayme@cumin1003 conftool action : set/pooled=no; selector: service=docker-registry,name=(registry1004.eqiad.wmnet|registry2004.codfw.wmnet) [11:54:47] (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:55:32] FIRING: [3x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:55:41] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet [11:55:48] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet [11:56:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5003.wikimedia.org [11:56:22] !log reset BGP session to ssw1-d1-eqiad from lsw1-c2-eqiad T420180 [11:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4003.wikimedia.org [11:58:16] (03CR) 
10Fabfur: [C:03+2] aptrepo: new haproxy32 component for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1254146 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [11:58:35] !log reset BGP session to ssw1-d1-eqiad from lsw1-c3-eqiad T420180 [11:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:39] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [11:58:48] tnx moritzm [11:59:16] !log reset BGP session to ssw1-d1-eqiad from lsw1-c4-eqiad T420180 [11:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:29] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet [11:59:44] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet [12:00:00] !log jayme@cumin1003 conftool action : set/pooled=yes; selector: service=docker-registry,name=(registry1004.eqiad.wmnet|registry2004.codfw.wmnet) [12:00:03] !log reset BGP session to ssw1-d1-eqiad from lsw1-c6-eqiad T420180 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1200) [12:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:32] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253492 (owner: 10PipelineBot) [12:00:51] !log jayme@cumin1003 conftool action : set/pooled=no; selector: service=docker-registry,name=(registry1005.eqiad.wmnet|registry2005.codfw.wmnet) [12:01:25] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2005.codfw.wmnet [12:01:26] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1005.eqiad.wmnet [12:01:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ganeti2041.codfw.wmnet [12:01:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet [12:01:50] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) (owner: 10Muehlenhoff) [12:02:05] (03PS1) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:02:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11718024 (10DMburugu) I approve [12:03:00] !log reset BGP session to ssw1-d1-eqiad from lsw1-c7-eqiad T420180 [12:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:07] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253492 (owner: 10PipelineBot) [12:03:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4003.wikimedia.org [12:04:25] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:04:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3004.wikimedia.org [12:04:45] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:04:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:05:22] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1005.eqiad.wmnet [12:05:25] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2005.codfw.wmnet [12:05:36] !log 
jayme@cumin1003 conftool action : set/pooled=yes; selector: service=docker-registry,name=(registry1005.eqiad.wmnet|registry2005.codfw.wmnet) [12:06:05] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:06:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3004.wikimedia.org [12:06:39] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:06:50] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:06:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2005.wikimedia.org [12:07:00] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:07:19] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:08:53] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2280-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [12:09:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:10:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2042.codfw.wmnet [12:11:57] (03PS2) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:12:41] (03PS5) 10Muehlenhoff: Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 [12:13:01] !log restart BGP announcements from ssw1-d1-eqiad following change T420180 [12:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:04] T420180: Drain ssw1-d1-eqiad 
and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [12:13:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2005.wikimedia.org [12:13:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:13:58] (03CR) 10Arnaudb: [C:03+2] gerrit: dns cache wipe update [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [12:14:10] RECOVERY - Host ssw1-d1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [12:15:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [12:16:02] RECOVERY - Host ssw1-d1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [12:16:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1005.wikimedia.org [12:16:52] jmm@cumin2002 drain-node (PID 3382740) is awaiting input [12:17:36] PROBLEM - Host wikikube-worker1307 is DOWN: PING CRITICAL - Packet loss = 100% [12:19:03] (03Merged) 10jenkins-bot: gerrit: dns cache wipe update [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [12:19:47] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:20:32] FIRING: [6x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:20:38] !log roll-reboot apus frontends (codfw) for March reboots [12:20:40] !log mvernon@cumin1003 START - Cookbook 
sre.hosts.reboot-cluster [12:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:41] (03PS2) 10Muehlenhoff: Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 [12:21:33] (03CR) 10CI reject: [V:04-1] Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff) [12:21:45] (03PS3) 10Muehlenhoff: Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 [12:21:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet [12:22:05] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [12:22:30] (03CR) 10CI reject: [V:04-1] Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff) [12:22:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1005.wikimedia.org [12:23:32] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [12:24:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [12:24:41] (03PS4) 10Muehlenhoff: Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 [12:27:07] jmm@cumin2002 drain-node (PID 3384525) is awaiting input [12:27:18] (03PS3) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:29:52] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:31:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff) [12:31:24] (03CR) 
10Muehlenhoff: Remove support for PHP 7.4/8.1 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff) [12:31:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [12:32:01] (03CR) 10Muehlenhoff: "Good catch, updated" [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [12:32:50] (03PS1) 10Ayounsi: Anycast: prepend once more when peering with the core routers [homer/public] - 10https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342) [12:32:54] jouncebot: next [12:32:54] In 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1300) [12:34:33] (03PS4) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:34:43] !log powercycling ganeti2041 (stuck on reboot) [12:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:35:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [12:35:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: 10Zabe) [12:36:38] (03PS5) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:36:39] (03CR) 10Cathal Mooney: [C:03+1] "LGTM. 
I don't think it should affect the DNS hosts (they all peer with CRs, so longer path everywhere thus no diff). I am also a little " [homer/public] - 10https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342) (owner: 10Ayounsi) [12:37:38] (03CR) 10Ladsgroup: [C:04-1] "Until someone actually brings evidence that people need this change" [extensions/Echo] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254156 (https://phabricator.wikimedia.org/T383948) (owner: 10A smart kitten) [12:37:46] (03CR) 10Ladsgroup: [C:04-1] "Until someone actually brings evidence that people need this change" [extensions/Echo] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254155 (https://phabricator.wikimedia.org/T383948) (owner: 10A smart kitten) [12:38:04] !log powercycling ganeti2042 (stuck on reboot) [12:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet [12:39:36] (03PS1) 10Muehlenhoff: toolforge::services::aptly: Remove buster [puppet] - 10https://gerrit.wikimedia.org/r/1254188 [12:39:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2042.codfw.wmnet [12:40:10] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:40:33] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:41:40] (03PS1) 10Esanders: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) [12:42:01] (03PS1) 10Esanders: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) 
[12:42:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet [12:42:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet [12:42:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [12:43:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [12:44:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2043.codfw.wmnet [12:44:27] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [12:44:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [12:44:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet [12:45:58] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1015 [12:46:50] (03PS6) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:46:57] (03PS1) 10Majavah: P:toolforge: Retire package builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/1254191 (https://phabricator.wikimedia.org/T401819) [12:47:24] (03CR) 10CI reject: [V:04-1] P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 
(https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:47:41] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11718297 (10cmooney) 05Open→03Declined This won't be required now, we have res... [12:47:51] jmm@cumin2002 drain-node (PID 3389356) is awaiting input [12:48:10] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1001.eqiad.wmnet [12:48:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1015 [12:49:03] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11718300 (10cmooney) 05Open→03Declined This won't be needed now, we were... 
[12:49:16] (03PS7) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:50:04] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:50:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1254191 (https://phabricator.wikimedia.org/T401819) (owner: 10Majavah) [12:50:48] (03CR) 10Majavah: [C:03+2] P:toolforge: Retire package builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/1254191 (https://phabricator.wikimedia.org/T401819) (owner: 10Majavah) [12:51:21] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad [12:51:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad [12:51:33] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [12:51:39] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1012 [12:51:52] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 9269 [12:52:19] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:52:28] jmm@cumin2002 drain-node (PID 3389274) is awaiting input [12:52:44] jmm@cumin2002 drain-node (PID 3389356) is awaiting input [12:52:46] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:52:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1012 [12:52:54] !log 
ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9269 [12:53:13] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:53:13] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28788 [12:53:20] (03Abandoned) 10Muehlenhoff: check_timedatectl: Drop support for old systemd versions [puppet] - 10https://gerrit.wikimedia.org/r/1243824 (owner: 10Muehlenhoff) [12:53:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251610 (https://phabricator.wikimedia.org/T348255) (owner: 10C. 
Scott Ananian) [12:53:51] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 28788 [12:54:02] dse-k8s-etcd2003 will go down for a Ganeti reboot [12:54:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet [12:54:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [12:54:23] RESOLVED: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [12:54:55] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28788 [12:55:03] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1001.eqiad.wmnet [12:55:22] (03CR) 10CI reject: [V:04-1] TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [12:55:23] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet [12:55:26] PROBLEM - Host dse-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:55:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28788 [12:55:45] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 56308 [12:55:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-eqiad [12:56:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56308 [12:56:21] (03PS4) 10Daniel Kinzler: rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1250598 [12:56:56] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 214657 [12:57:06] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 214657 [12:57:23] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350) [12:59:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet [12:59:50] (03CR) 10Majavah: [C:03+1] toolforge::services::aptly: Remove buster [puppet] - 10https://gerrit.wikimedia.org/r/1254188 (owner: 10Muehlenhoff) [12:59:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2043.codfw.wmnet [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1300). [13:00:05] andre, edsanders, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:13] o/ [13:00:16] I'm going to deploy a backport and then move the train [13:00:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [13:00:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [13:00:28] RECOVERY - Host dse-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.11 ms [13:00:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aklapper@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: 10Zabe) [13:01:14] o/ [13:01:47] !log failover Ganeti masters in drmrs to ganeti6003/6004 [13:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:06] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet [13:02:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2044.codfw.wmnet [13:03:10] PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet [13:04:10] PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:04:16] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host apus-be1004.eqiad.wmnet [13:04:24] (03CR) 10Bartosz 
Wójtowicz: [C:03+1] ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:04:45] o/ [13:05:12] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:05:20] (03PS5) 10Daniel Kinzler: rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 [13:05:42] andre: are you starting? [13:05:50] edsanders, yes [13:06:11] 👍 [13:07:04] jmm@cumin2002 drain-node (PID 3393714) is awaiting input [13:07:11] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:07:50] edsanders: backport already in progress for 1254166 [13:07:54] edsanders: Plus I still need to deploy the train to group0 afterwards - not sure about the order though? 
[13:07:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet [13:08:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:08:37] (03Merged) 10jenkins-bot: Remove misplaced readonly from CategoryViewer::$query [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: 10Zabe) [13:08:40] I don't mind, as long as I can get mine done in the next hour [13:08:54] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:09:08] !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1254166|Remove misplaced readonly from CategoryViewer::$query (T420315)]] [13:09:11] T420315: Error: Cannot modify readonly property MediaWiki\Category\CategoryViewer::$query - https://phabricator.wikimedia.org/T420315 [13:09:29] i'm also in no hurry [13:09:51] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be1004.eqiad.wmnet [13:10:18] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 16509 [13:10:34] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet [13:10:37] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:10:41] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:11:10] !log aklapper@deploy2002 zabe, aklapper: Backport for [[gerrit:1254166|Remove misplaced readonly from CategoryViewer::$query (T420315)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:11:37] !log aklapper@deploy2002 zabe, aklapper: Continuing with sync [13:11:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet [13:13:17] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419859#11718437 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on... [13:13:45] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419858#11718441 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on... [13:13:58] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419857#11718445 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on... [13:14:04] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419856#11718448 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on... [13:14:10] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419855#11718451 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on... 
[13:14:18] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419854#11718454 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on... [13:15:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet [13:15:32] edsanders: I'd say you deploy your backport(s) after I'm done with my backport, because we would collide in case I need to roll back the train from group0 to the testwikis. I should be done soon with the backport [13:15:38] !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254166|Remove misplaced readonly from CategoryViewer::$query (T420315)]] (duration: 06m 31s) [13:15:42] T420315: Error: Cannot modify readonly property MediaWiki\Category\CategoryViewer::$query - https://phabricator.wikimedia.org/T420315 [13:15:48] dse-k8s-ctrl2002, kubestagemaster2003, ml-etcd2003 will go down for a Ganeti reboot [13:15:50] edsanders: Done. The stage is yours for now! [13:15:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet [13:15:55] thanks [13:15:57] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet [13:16:02] ayounsi@cumin1003 peering (PID 3908523) is awaiting input [13:16:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto an-presto cluster: Reboot Presto nodes [13:16:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet [13:16:40] (03CR) 10Kamila Součková: "Does this look more reasonable?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková) [13:16:52] cscott: you go first - I've got a CI issue that is probably meaningless, but I need to double check [13:17:01] (03PS2) 10Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) [13:17:09] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:29] edsanders: ok, thanks. i should be quick, it's just config [13:17:43] PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:05] PROBLEM - Host dse-k8s-ctrl2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:35] PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251610 (https://phabricator.wikimedia.org/T348255) (owner: 10C. Scott Ananian) [13:18:48] (03CR) 10Esanders: "recheck" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [13:19:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [13:19:37] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2280-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [13:19:39] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:19:43] (03Merged) 10jenkins-bot: Turn on postprocessing cache for all Parsoid parses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251610 (https://phabricator.wikimedia.org/T348255) (owner: 10C. 
Scott Ananian) [13:20:00] (03CR) 10Muehlenhoff: [C:03+2] toolforge::services::aptly: Remove buster [puppet] - 10https://gerrit.wikimedia.org/r/1254188 (owner: 10Muehlenhoff) [13:20:12] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:20:13] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]] [13:20:18] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [13:20:24] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker13[00-47].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [13:20:29] (03PS3) 10Eevans: cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) [13:20:29] (03PS5) 10Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) [13:20:29] (03PS6) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [13:20:35] RECOVERY - Host dse-k8s-ctrl2002 is UP: PING OK - Packet loss = 0%, RTA = 32.28 ms [13:20:42] (03CR) 10Eevans: cassandra-http-gateway: new chart based on aqs-http-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:20:51] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet [13:21:03] RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 32.12 ms [13:21:05] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.15 ms [13:21:16] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet [13:21:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2044.codfw.wmnet [13:22:19] !log cscott@deploy2002 cscott: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:23:12] (03PS13) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) [13:24:10] (03CR) 10Elukey: "Jesse: the patch series work on the new dse k8s workers,but it will require more testing of course. Lemme know if you like the idea :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [13:25:31] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11718504 (10Wellverywell) Is there some ETA on whether thumbnail cache cleanup will happen on Commons? User @Medvednikita has reported that [[ https://commons.wikimedia.org/wiki/File%... 
[13:25:31] (03Abandoned) 10Eevans: service, trafficserver: Prepare "linked-artifacts" k8s pod [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [13:25:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [13:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet [13:26:10] (03PS1) 10Fabfur: haproxy: test haproxy32 on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) [13:26:23] (03Abandoned) 10Eevans: hoarde: initial commit of chart (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249978 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:26:29] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet [13:26:41] !log cscott@deploy2002 cscott: Continuing with sync [13:26:50] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet [13:26:59] (03Abandoned) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249979 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:27:26] (03PS2) 10Fabfur: haproxy: test haproxy32 on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) [13:28:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet [13:30:44] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]] (duration: 10m 31s) [13:30:48] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [13:30:54] ok, over to you edsanders [13:30:59] ta [13:31:03] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet [13:31:08] jmm@cumin2002 drain-node (PID 3399351) is awaiting input [13:31:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [13:31:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [13:32:16] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet [13:32:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2045.codfw.wmnet [13:32:33] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host apus-be2004.codfw.wmnet [13:32:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:33:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet [13:33:35] (03Abandoned) 10Fabfur: haproxy: adding haproxy30 component and support [puppet] - 10https://gerrit.wikimedia.org/r/1041647 (https://phabricator.wikimedia.org/T366885) (owner: 10Fabfur) [13:35:09] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on logstash2023.codfw.wmnet with reason: ganeti reboot [13:35:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [13:35:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [13:36:02] jmm@cumin2002 drain-node (PID 3399694) is awaiting input [13:36:28] (03Merged) 10jenkins-bot: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [13:36:30] (03PS1) 10Kevin Bazira: ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350) [13:36:44] jmm@cumin2002 drain-node (PID 3400011) is awaiting input [13:37:11] (03Merged) 10jenkins-bot: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders) [13:37:43] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1254189|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]], [[gerrit:1254190|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]] [13:37:47] T420288: VisualEditor link tool is confusing project namespace and interwiki links - https://phabricator.wikimedia.org/T420288 [13:38:39] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be2004.codfw.wmnet [13:38:58] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [13:39:43] !log esanders@deploy2002 esanders: Backport for [[gerrit:1254189|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]], [[gerrit:1254190|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:40:23] (03PS1) 10Andrew Bogott: eqiad1: move to a newer Horizon build [puppet] - 10https://gerrit.wikimedia.org/r/1254200 (https://phabricator.wikimedia.org/T405117) [13:41:39] (03CR) 10Andrew Bogott: [C:03+2] eqiad1: move to a newer Horizon build [puppet] - 10https://gerrit.wikimedia.org/r/1254200 (https://phabricator.wikimedia.org/T405117) (owner: 10Andrew Bogott) [13:42:00] !log esanders@deploy2002 esanders: Continuing with sync [13:43:19] 06SRE, 06Infrastructure-Foundations, 10netops: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351 (10cmooney) 03NEW p:05Triage→03Medium [13:43:25] 06SRE, 06Infrastructure-Foundations, 10netops: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351#11718588 (10cmooney) [13:43:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11718589 (10cmooney) [13:43:50] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:44:29] (03CR) 10Kevin Bazira: [C:03+2] ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:44:36] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet [13:45:33] (03CR) 10Filippo Giunchedi: [C:03+1] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [13:45:52] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254189|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]], [[gerrit:1254190|TitleWidget: 
Prioritise namespace prefix over interwiki prefix (T420288)]] (duration: 08m 10s) [13:45:56] T420288: VisualEditor link tool is confusing project namespace and interwiki links - https://phabricator.wikimedia.org/T420288 [13:46:37] (03Merged) 10jenkins-bot: ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:46:56] edsanders: are you done? [13:49:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11718638 (10Papaul) [13:49:58] edsanders: I assume yes, so I'm going to deploy wmf.20 to group0 now [13:50:24] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254203 (https://phabricator.wikimedia.org/T413811) [13:50:26] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254203 (https://phabricator.wikimedia.org/T413811) (owner: 10TrainBranchBot) [13:50:43] dse-k8s-ctrl2001, aux-k8s-etcd2003 will go down for a Ganeti reboot [13:50:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet [13:50:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [13:51:01] btullis@cumin1003 reboot-workers (PID 3894227) is awaiting input [13:51:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:51:19] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254203 (https://phabricator.wikimedia.org/T413811) (owner: 10TrainBranchBot) [13:51:23] PROBLEM - PyBal backends 
health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:52:20] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet [13:53:03] PROBLEM - Host dse-k8s-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:53:05] PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:53:46] (03CR) 10Jforrester: [C:03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: 10Abijeet Patro) [13:54:43] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:55:31] RECOVERY - Host dse-k8s-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 32.09 ms [13:55:37] RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.22 ms [13:56:10] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:56:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet [13:56:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet [13:56:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:57:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node 
ganeti2046.codfw.wmnet [13:57:12] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.20 refs T413811 [13:57:16] T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811 [13:57:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [13:57:30] (03PS1) 10Ottomata: Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) [13:57:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet [13:58:04] (03CR) 10CI reject: [V:04-1] Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata) [13:58:43] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:59:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1400) [14:00:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin2003.codfw.wmnet [14:01:10] RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:01:34] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet [14:01:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:02:52] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11718729 (10Ladsgroup) That is actually unrelated to this work and is about {T360589} and {T414805}. See https://www.mediawiki.org/wiki/Common_thumbnail_sizes [14:03:56] the cloud codfw alerts are me, rolling reboots in progress [14:04:43] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:05:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:05:14] !log setting cr1-eqiad as VRRP master for all vlans T420351 [14:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:18] T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351 [14:05:30] (03PS1) 10Ottomata: eventstreams-internal - increase kafka max message size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356) [14:06:25] FIRING: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2003.codfw.wmnet [14:07:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2004.codfw.wmnet [14:07:43] RECOVERY - BFD 
status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:08:33] 10SRE-tools, 10Cumin, 06Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360 (10fgiunchedi) 03NEW [14:09:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:10:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:10:34] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet [14:10:38] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet [14:11:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2004.codfw.wmnet [14:11:25] RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:11:30] !log powercycling ganeti2046 (stuck on reboot) [14:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org [14:13:43] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:13:43] !log disable VRRP on cr2-eqiad interfaces facing ssw1-d8-eqiad T420351 [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:47] T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351 [14:14:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet [14:14:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:14:09] FIRING: [3x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2004-dev (172.20.5.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:14:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2046.codfw.wmnet [14:15:32] FIRING: [7x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:16:03] jouncebot: nowandnext [14:16:03] For the next 0 hour(s) and 13 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1400) [14:16:04] In 0 hour(s) and 13 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:16:04] In 0 hour(s) and 13 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:16:06] (03PS1) 10Majavah: cr-cloud-vrf: Narrowly scope 
(cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) [14:16:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2047.codfw.wmnet [14:16:25] FIRING: [6x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:16:43] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:16:54] FIRING: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:17:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org [14:17:35] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 6 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:19:09] FIRING: [10x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:19:36] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet [14:19:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1053 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:19:40] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2005-dev.codfw.wmnet [14:19:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet [14:19:54] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11718805 (10fgiunchedi) I also was wondering about a resumable rolling reboot feature for cookbooks and found this task, and of course I'm +1! The way I understand the feat... [14:19:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet [14:20:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:25] RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:21:57] PROBLEM - Host ml-staging-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:59] 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations: Edit puppet-merge to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org? - https://phabricator.wikimedia.org/T420184#11718810 (10ABran-WMF) [14:23:29] 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations: Edit puppet-merge to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org? 
- https://phabricator.wikimedia.org/T420184#11718811 (10ABran-WMF) p:05Triage→03Low [14:23:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet [14:24:52] 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11718825 (10ABran-WMF) [14:25:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet [14:25:29] RECOVERY - Host ml-staging-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.73 ms [14:25:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2047.codfw.wmnet [14:25:33] jouncebot: nowandnext [14:25:33] For the next 0 hour(s) and 4 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1400) [14:25:33] In 0 hour(s) and 4 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:25:33] In 0 hour(s) and 4 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:25:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2049.codfw.wmnet [14:27:16] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2005-dev.codfw.wmnet [14:27:20] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2006-dev.codfw.wmnet [14:27:29] !log de-pref internet circuits landing on cr2-eqiad to shift traffic to cr1 T420351 [14:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:32] T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - 
https://phabricator.wikimedia.org/T420351 [14:27:33] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah) [14:27:35] aux-k8s-etcd2004 will go down for a Ganeti reboot [14:27:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet [14:27:53] (03CR) 10Ayounsi: [C:03+1] cr-cloud-vrf: Narrowly scope (cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah) [14:28:03] (03CR) 10Majavah: [C:03+2] cr-cloud-vrf: Narrowly scope (cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah) [14:29:23] (03Merged) 10jenkins-bot: cr-cloud-vrf: Narrowly scope (cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah) [14:29:47] PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:30:05] Daimona: Your horoscope predicts another Create new table for the CampaignEvents extension deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430). 
[14:30:11] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2005.codfw.wmnet [14:30:32] FIRING: [9x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:30:37] RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [14:30:55] Oh wow that's a very specific horoscope [14:31:52] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [14:31:53] (03PS1) 10Muehlenhoff: apereo_cas: Drop obsolete test [puppet] - 10https://gerrit.wikimedia.org/r/1254212 [14:33:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet [14:33:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2049.codfw.wmnet [14:34:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2050.codfw.wmnet [14:34:28] !log Creating ce_event_goals DB table for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T411433 [14:34:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1006.eqiad.wmnet [14:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:32] T411433: Create new database table for event goals - https://phabricator.wikimedia.org/T411433 [14:35:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:35:09] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2006-dev.codfw.wmnet [14:35:13] !log 
filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet [14:35:58] And I'm done. [14:36:23] DB creation takes so little time. DB migration takes forever. Alas. [14:36:35] (03CR) 10Jelto: "looks good to me, one comment in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb) [14:36:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet [14:37:57] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2005.codfw.wmnet [14:38:01] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2006.codfw.wmnet [14:38:03] (03PS1) 10Majavah: definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 [14:38:09] !log bking@requestctl remove `wdqs_highest_error_rate_ever_seen` requestctl rule as it is no longer needed [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:19] (03PS1) 10Sergio Gimeno: GrowthExperiments: increase edit and thanks query limit II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) [14:38:24] (03CR) 10Ayounsi: [C:03+1] definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah) [14:38:35] (03CR) 10Majavah: [C:03+2] definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah) [14:38:40] FIRING: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:43] jouncebot: nowandnext [14:38:44] For the next 0 hour(s) and 21 minute(s): Test Kitchen Experiment Deployment 
Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:38:44] For the next 0 hour(s) and 11 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:38:44] In 0 hour(s) and 21 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500) [14:39:02] (Just checking for when the next puppet window is) [14:39:12] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, confirmed the IPs are in Netbox with the same fqdn in the "dns_name" field so I believe that should cover it, no need for a static d" [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah) [14:39:34] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [14:39:44] Actually, do want to deploy a no-op config patch [14:40:00] RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:13] (03Merged) 10jenkins-bot: definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah) [14:40:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1006.eqiad.wmnet [14:40:44] !log disabling EVPN IBGP peering from ssw1-d8-eqiad to ssw1-d1-eqiad to stop them reflecting routes T420351 [14:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:47] T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351 [14:41:14] (03PS1) 10Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) [14:41:45] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet [14:41:49] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2001-dev.codfw.wmnet [14:42:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet [14:42:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2050.codfw.wmnet [14:42:14] (03CR) 10CI reject: [V:04-1] Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz) [14:42:25] (03CR) 10Jelto: [C:03+2] gitlab: start ssh-gitlab service after network-online and gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: 10Jelto) [14:43:07] (03PS2) 10Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) [14:43:25] !log failover Ganeti master in codfw to ganeti2047 [14:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [14:44:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet [14:44:18] !log stop announcing "direct" routes to ssw1-d8-eqiad from cr2-eqiad T420351 [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:29] (03CR) 10CI reject: [V:04-1] Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz) [14:44:33] !log deploying cr firewall changes from https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1254211 [14:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [14:45:07] PROBLEM - ganeti-wconfd running on ganeti2048 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:45:08] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11718942 (10JMeybohm) I think usability wise it might be more helpful to have an argument which takes the date and time after which a reboot is expected. So something like... [14:45:30] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [14:45:33] FIRING: [5x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:46:26] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2006.codfw.wmnet [14:46:30] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2007.codfw.wmnet [14:46:54] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:47:57] PROBLEM - Host gitlab-replica-b.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:48:22] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cloudrabbit2001-dev.codfw.wmnet [14:48:25] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2002-dev.codfw.wmnet [14:48:40] (03CR) 10JMeybohm: [C:03+2] k8s-staging: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1240275 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [14:49:06] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11718960 (10RobH) The distro swap did not fix this host, it will require a mainboard swap via a procurement task (linked in) [14:49:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2034.codfw.wmnet [14:49:58] !log stop announcing routes from ssw1-d8-eqiad to external peers (cr2-eqiad, other spines) T420351 [14:49:59] (03PS2) 10Bking: dse-k8s: Add CFSSL profile for longer-lived certificates (6 mo). [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) [14:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host durum4004.ulsfo.wmnet [14:50:02] T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351 [14:51:31] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [14:51:47] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251482 (owner: 10Muehlenhoff) [14:51:50] (03CR) 10Scott French: shellbox: Setup shellbox-icu72 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková) [14:51:52] (03PS2) 10Btullis: Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 
(https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata) [14:51:54] !log stop accepting routes on ssw1-d8-eqiad from external peers (cr2-eqiad, other spines) T420351 [14:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:07] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata) [14:52:40] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2007.codfw.wmnet [14:52:44] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2008.codfw.wmnet [14:52:59] RECOVERY - Host gitlab-replica-b.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [14:53:13] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254212 (owner: 10Muehlenhoff) [14:53:40] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:53:48] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:53:59] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:54:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum4004.ulsfo.wmnet [14:54:09] (03CR) 10Muehlenhoff: [C:03+2] apereo_cas: Drop obsolete test [puppet] - 10https://gerrit.wikimedia.org/r/1254212 (owner: 10Muehlenhoff) [14:54:29] PROBLEM - Host ssw1-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:54:29] PROBLEM - Host ssw1-d8-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:46] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller 
unavailable - https://phabricator.wikimedia.org/T420228#11719008 (10Jhancock.wm) soft rebooted the idrac [14:54:50] (03CR) 10David Caro: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1247033 (owner: 10Muehlenhoff) [14:55:16] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet [14:55:20] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2003-dev.codfw.wmnet [14:55:32] FIRING: [4x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:55:57] (03CR) 10Btullis: [C:03+1] "Looks good to me. Should we run PCC against the PKI hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [14:56:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719023 (10RobH) [14:56:11] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719026 (10RobH) [14:56:21] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719028 (10RobH) [14:57:08] (03CR) 10Daniel Kinzler: rest-gateway rate limit: add BYPASS and DENY policy and class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler) [14:57:21] 06SRE, 06Traffic: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868#11719032 (10MoritzMuehlenhoff) >>! In T419868#11713955, @ssingh wrote: > That's interesting, thanks for debugging. What is weird is that a restart of anycast-healthchecker then should have fixed this in th... 
[14:57:21] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host phab1004.eqiad.wmnet [14:57:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2034.codfw.wmnet [14:57:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet [14:57:40] (03CR) 10Muehlenhoff: [C:03+2] cloudceph: Run the spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1247033 (owner: 10Muehlenhoff) [14:58:03] (03CR) 10Scott French: [C:03+1] "Thanks, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff) [14:58:18] (03CR) 10Btullis: [C:03+2] Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata) [14:58:35] (03PS3) 10Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) [14:58:38] I am getting a "Error: 502, Broken pipe" from Phabricator [14:58:43] jelto: 3 min too early? 
:) [14:58:51] taavi: maintenance planned for 3pm UTC [14:59:01] yes :) Phabricator needs a short restart [14:59:12] (03CR) 10Muehlenhoff: [C:03+2] Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff) [14:59:54] jouncebot: nowandnext [14:59:54] For the next 0 hour(s) and 0 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430) [14:59:54] In 0 hour(s) and 0 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500) [15:00:01] jouncebot: nowandnext [15:00:01] For the next 0 hour(s) and 59 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500) [15:00:01] In 0 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600) [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500). 
[15:00:26] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2008.codfw.wmnet [15:00:30] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2009.codfw.wmnet [15:00:32] FIRING: [4x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:00:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz) [15:02:05] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet [15:02:07] (My config change is a no-op to prod and needs to happen before the puppet request window) [15:02:09] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [15:02:10] (03Merged) 10jenkins-bot: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz) [15:02:18] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:02:40] !log reset BGP session to ssw1-d8-eqiad from lsw1-d4-eqiad T420180 [15:02:40] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1254217|Create dblists for wikis where CheckUser and AbuseFilter are disabled (T420063 T420062)]] [15:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:43] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180 [15:02:48] T420063: Uninstall AbuseFilter from wikis which are closed and have no AbuseLog entries - https://phabricator.wikimedia.org/T420063 [15:02:48] T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062 [15:03:20] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab1004.eqiad.wmnet [15:04:16] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7014.magru.wmnet with OS trixie [15:04:39] Phabricator maintenance finished [15:04:43] RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:04:47] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:04:51] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1254217|Create dblists for wikis where CheckUser and AbuseFilter are disabled (T420063 T420062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:05:18] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [15:05:19] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [15:05:51] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [15:06:10] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:06:54] FIRING: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:07:47] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:08:07] PROBLEM - ganeti-wconfd running on ganeti2033 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [15:08:10] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2009.codfw.wmnet [15:08:23] !log brennen@deploy2002 Started deploy [phabricator/deployment@e845707]: deploy phab2002 for T420366 [15:08:27] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [15:08:27] T420366: Deploy Phab/Phorge 2026-03-17 - https://phabricator.wikimedia.org/T420366 [15:08:31] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet [15:08:59] !log brennen@deploy2002 Finished deploy 
[phabricator/deployment@e845707]: deploy phab2002 for T420366 (duration: 00m 35s) [15:09:09] FIRING: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:09:19] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254217|Create dblists for wikis where CheckUser and AbuseFilter are disabled (T420063 T420062)]] (duration: 06m 38s) [15:09:20] !log brennen@deploy2002 Started deploy [phabricator/deployment@e845707]: deploy phab1004 for T420366 [15:09:24] T420063: Uninstall AbuseFilter from wikis which are closed and have no AbuseLog entries - https://phabricator.wikimedia.org/T420063 [15:09:24] T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062 [15:09:29] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1005.eqiad.wmnet [15:09:40] (03PS1) 10JMeybohm: Revert "k8s-staging: Switch to IPIP mode for kube-apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/1254224 [15:09:55] RECOVERY - Host ssw1-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [15:09:55] RECOVERY - Host ssw1-d8-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [15:10:07] (03PS2) 10JMeybohm: Revert "k8s-staging: Switch to IPIP mode for kube-apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/1254224 (https://phabricator.wikimedia.org/T352956) [15:10:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm) [15:10:22] !log brennen@deploy2002 Finished deploy [phabricator/deployment@e845707]: deploy phab1004 for T420366 (duration: 01m 02s) [15:10:23] RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 12 
logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:10:41] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [15:10:47] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:11:03] (03Abandoned) 10JMeybohm: Revert "k8s-staging: Switch to IPIP mode for kube-apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/1254224 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [15:11:03] (03PS3) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655 [15:11:10] RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:11:17] (03Merged) 10jenkins-bot: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm) [15:11:39] (03CR) 10CI reject: [V:04-1] systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron) [15:11:45] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244723|cleanup: Growth: Remove temporary GrowthMentorList overrides (T418518)]] [15:11:49] T418518: Remove code for legacy GrowthMentorList validator - https://phabricator.wikimedia.org/T418518 [15:11:53] (03PS2) 10Arnaudb: gerrit: cookbook to reboot gerrit primary instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) [15:12:10] FIRING: [2x] BFDdown: BFD 
session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:12:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2048.codfw.wmnet [15:12:43] (03PS4) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655 [15:13:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [15:13:44] (03CR) 10Arnaudb: gerrit: cookbook to reboot gerrit primary instance (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb) [15:13:47] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:13:52] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1244723|cleanup: Growth: Remove temporary GrowthMentorList overrides (T418518)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:14:09] 06SRE, 06Infrastructure-Foundations, 10netops: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351#11719150 (10cmooney) 05Open→03Resolved Ok this work is now complete. Only had to reset the tunnel on `lsw1-d4-eqiad` it w... 
[15:14:09] FIRING: [10x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:14:15] !log urbanecm@deploy2002 urbanecm: Continuing with sync [15:14:21] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet [15:14:52] (03PS1) 10Dreamy Jazz: maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) [15:15:57] 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11719160 (10cmooney) p:05Medium→03Low Ok all vxlan tunnels right now on row c/d leaf switches to ssw1-d1-eqiad and ssw1-d8-eqiad have a valid vxlan tunnel id. So u... [15:15:58] (03PS2) 10Dreamy Jazz: mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) [15:16:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16509 [15:16:10] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:16:25] RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:17:15] jmm@cumin2002 drain-node (PID 3419798) is awaiting input [15:17:50] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420308#11719165 (10Jhancock.wm) related to T419970. 
will clear soon [15:18:17] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244723|cleanup: Growth: Remove temporary GrowthMentorList overrides (T418518)]] (duration: 06m 32s) [15:18:21] T418518: Remove code for legacy GrowthMentorList validator - https://phabricator.wikimedia.org/T418518 [15:18:28] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:18:43] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11719169 (10Aklapper) I see an entry `ipmi_sdr_cache_open: internal IPMI error` for `phab2002` a... [15:19:20] (03PS3) 10Dreamy Jazz: mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) [15:19:29] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz) [15:20:23] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1005.eqiad.wmnet [15:20:25] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11719172 (10MoritzMuehlenhoff) [15:20:26] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1006.eqiad.wmnet [15:20:56] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2008-dev.codfw.wmnet [15:21:16] (03CR) 10Herron: systemd::timer::job: add ExecCondition support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron) [15:21:34] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:22:07] jouncebot: nowandnext [15:22:07] For the next 0 hour(s) and 37 minute(s): SRE Collaboration Services office hours 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500) [15:22:07] In 0 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600) [15:22:18] !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [15:23:59] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:25:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [15:27:14] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2008-dev.codfw.wmnet [15:27:15] 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11719221 (10RLazarus) One more axis to consider: Best-practices-wise, for alerting on Kubernetes platforms, there's a distinction between control plane and... 
[15:27:34] !log samtar@deploy2002 mwscript-k8s job started: cleanupWatchlistLabelMember.php --wiki=testwiki # T420328 [15:27:37] T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328 [15:27:41] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet with reason: host reimage [15:28:23] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1006.eqiad.wmnet [15:28:27] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1007.eqiad.wmnet [15:29:15] (03CR) 10Arnaudb: [C:03+2] gerrit: cookbook to reboot gerrit primary instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb) [15:31:54] FIRING: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:32:08] !log btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes [15:33:30] (03PS2) 10JMeybohm: kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956) [15:33:30] (03PS1) 10JMeybohm: realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) [15:33:43] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [15:33:55] !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist testwikis cleanupWatchlistLabelMember.php # T420328 [15:33:59] T420328: Run cleanupWatchlistLabelMember maintenance script - 
https://phabricator.wikimedia.org/T420328 [15:34:01] dse-k8s-etcd2002 will go down for a Ganeti reboot [15:34:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet with reason: host reimage [15:34:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet [15:34:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:34:28] (03Merged) 10jenkins-bot: gerrit: cookbook to reboot gerrit primary instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb) [15:35:30] (kudos to whoever made `mwscript-k8s` accept a dblist <3) [15:36:06] PROBLEM - Host dse-k8s-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:51] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1007.eqiad.wmnet [15:36:55] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1008.eqiad.wmnet [15:37:21] btullis@cumin1003 reimage (PID 3929598) is awaiting input [15:37:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet [15:38:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2048.codfw.wmnet [15:38:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet [15:39:23] (03CR) 10Clément Goubert: [C:03+1] rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler) [15:40:01] (03CR) 10Tchanders: [C:03+1] mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 
10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz) [15:40:37] RECOVERY - Host dse-k8s-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms [15:41:21] TheresNoTime: in more ways than one, even! :) [15:43:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2033.codfw.wmnet [15:43:40] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:44:48] !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist group0 cleanupWatchlistLabelMember.php # T420328 [15:44:52] T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328 [15:45:24] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1008.eqiad.wmnet [15:45:28] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1009.eqiad.wmnet [15:46:28] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host zuul2003.codfw.wmnet with OS trixie [15:46:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:48:40] RESOLVED: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:49:08] TheresNoTime: <3 [15:51:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2033.codfw.wmnet [15:51:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2033.codfw.wmnet [15:51:14] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11719397 (10ayounsi) a:03Gehel @Gehel as the approval of the analytics-wmde-users group, do you approve this request ? [15:51:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11719399 (10ayounsi) [15:51:28] (03CR) 10Dzahn: [C:04-1] "oops, I reversed the logic. This is supposed to exist on all servers EXCEPT the primary, but this is the opposite." [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn) [15:52:52] !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist group1 cleanupWatchlistLabelMember.php # T420328 [15:52:55] (03Abandoned) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1242255 (owner: 10Muehlenhoff) [15:52:56] T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328 [15:53:00] (03PS1) 10Ayounsi: Add benbuchenau to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1254237 (https://phabricator.wikimedia.org/T419878) [15:53:14] (03PS3) 10Bking: dse-k8s: Add CFSSL profile for longer-lived certificates (6 mo). 
[puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) [15:53:26] (03PS3) 10Dzahn: releases: remove rsync systemd units when primary server changes [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) [15:53:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [15:54:11] !log zuul2003 - reimaging with trixie [15:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:38] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1009.eqiad.wmnet [15:55:40] (03PS2) 10JMeybohm: realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) [15:55:40] (03PS3) 10JMeybohm: kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956) [15:55:52] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [15:56:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:59:54] (03CR) 10Bking: "Yes, just ran it." [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [16:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600) [16:00:05] phuedx and Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:10] \o [16:00:18] o/ [16:01:21] Dreamy_Jazz: I'll merge anything else you need? [16:01:41] (03PS1) 10Muehlenhoff: Test pki1002 on ganeti-test [puppet] - 10https://gerrit.wikimedia.org/r/1254242 (https://phabricator.wikimedia.org/T416664) [16:01:43] No should be fine to just merge [16:01:46] Thanks [16:01:54] (03CR) 10JHathaway: [C:03+2] mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz) [16:03:13] Dreamy_Jazz: done [16:03:16] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp700[5-8].magru.wmnet} and A:cp [16:03:55] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719497 (10ayounsi) @thcipriani as approval contact for the `deployment` group, do you approve this request ? @scardenasmolinar can you read and sign https://phabricator.wikimedia.org... 
[16:03:57] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7013.magru.wmnet,cp701[5-6].magru.wmnet} and A:cp [16:04:00] Thanks [16:04:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719498 (10ayounsi) [16:05:02] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7014.magru.wmnet with OS trixie [16:05:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7014.magru.wmnet with OS trixie [16:07:49] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [16:08:22] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul2003.codfw.wmnet with reason: host reimage [16:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [16:14:31] jouncebot: nowandnext [16:14:31] For the next 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600) [16:14:31] In 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1700) [16:14:43] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7005.magru.wmnet [16:15:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2003.codfw.wmnet with reason: host reimage [16:15:44] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7013.magru.wmnet [16:15:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254242 
(https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [16:16:16] (03CR) 10BCornwall: [C:03+2] hiera: Remove single_backend from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [16:16:20] (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: Set default codfw storage_elements [puppet] - 10https://gerrit.wikimedia.org/r/1253606 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [16:16:49] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:16:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:17:42] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [16:17:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [16:17:57] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1253631/8287/" [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn) [16:17:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:18:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [16:18:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [16:18:11] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [16:18:33] (03CR) 10Dzahn: [C:03+2] releases: remove rsync systemd units when primary server changes [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn) 
[16:18:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:18:45] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:20:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:37] (03CR) 10Vgutierrez: [C:03+1] realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [16:20:49] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) [16:21:28] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719644 (10ayounsi) a:03thcipriani [16:21:47] (03CR) 10Muehlenhoff: systemd::timer::job: add ExecCondition support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron) [16:23:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:24:45] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11719697 (10Fabfur) Procedure from the traffic perspective should be roughly - Depool ulsfo (around 0900UTC) and wait about 30' for all connections to... [16:25:01] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1306,1308-1311].eqiad.wmnet [16:25:03] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1306,1308-1311].eqiad.wmnet [16:25:47] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet [16:25:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [16:25:52] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on releases2003.codfw.wmnet with reason: T420246 [16:25:55] T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246 [16:26:20] (03CR) 10Dzahn: [C:03+2] "noop on releases1003 - releases2003 now has an issue with stunnel - something is not fully removed - TBD" [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn) [16:27:55] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11719717 (10MoritzMuehlenhoff) [16:28:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet [16:28:18] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [16:28:22] 
(03CR) 10Dzahn: [C:03+2] "apparently the combination of "server_uses_stunnel => true" with "ensure => absent" is an issue" [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn) [16:28:29] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7014.magru.wmnet with reason: host reimage [16:29:26] (03PS2) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) [16:29:46] (03PS4) 10Ryan Kemper: profile::pyrra: remove old wdqs SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [16:31:24] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719772 (10thcipriani) Reason for access makes sense, approved for `deployment` group membership. --- @Scardenasmolinar some additional bits for you: - Our web deploy tool [[https:/... 
[16:32:36] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [16:32:51] !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for wikikube-worker[1306,1308-1311].eqiad.wmnet [16:32:55] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker[1306,1308-1311].eqiad.wmnet [16:33:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7014.magru.wmnet with reason: host reimage [16:33:29] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul2003.codfw.wmnet with OS trixie [16:33:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:27] !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist group2 cleanupWatchlistLabelMember.php # T420328 [16:34:31] T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328 [16:35:20] btullis@cumin1003 reimage (PID 3929598) is awaiting input [16:35:23] (03CR) 10Ryan Kemper: [C:03+1] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [16:35:52] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [16:35:55] (03PS3) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) [16:36:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet [16:36:33] !log mvernon@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [16:36:36] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [16:36:48] (03CR) 10Ryan Kemper: [C:03+2] profile::pyrra: remove old wdqs SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [16:37:37] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [16:37:55] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: hw troubleshooting: Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389 (10Clement_Goubert) 03NEW p:05Triage→03Low [16:37:55] (03PS1) 10Btullis: Temporarily set dse-k8s-worker101[2,5] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1254247 (https://phabricator.wikimedia.org/T414787) [16:39:36] merged a patch, but seeming to have trouble getting onto puppetserver (looks bastion related right now). if someone merges something else before I figure this out, please merge my patch on my behalf [16:39:55] !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [16:41:11] 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11719879 (10BTullis) Thanks ever so much for getting this conversation started. I think that it's really important for us to get a good consensus on this, a... [16:41:54] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11719889 (10Jhancock.wm) got into the idrac/console and found the server as this: Booting from Hard Drive C: GRUB rebooted and went to the same screen. contacted @Papaul for consult. corrupted or missing conf... [16:42:02] (03CR) 10Bartosz Wójtowicz: [C:03+1] "thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:42:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes [16:42:33] ok, yeah bast2003 was down. went through 1003 instead [16:43:21] cgoubert@cumin1003 netbox (PID 3940865) is awaiting input [16:43:27] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:44:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [16:44:42] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [16:44:57] (03PS1) 10BCornwall: hiera: storage_elements override for cp2041/cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/1254249 [16:45:07] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [16:45:11] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [16:45:18] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:45:31] (03PS2) 10BCornwall: hiera: storage_elements override for cp2041/cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/1254249 [16:46:13] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['bast2003'] [16:46:44] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [16:47:10] !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist all cleanupWatchlistLabelMember.php # T420328 
[16:47:14] T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328 [16:47:16] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:48:23] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1254249 (owner: 10BCornwall) [16:49:30] (03CR) 10Vgutierrez: [C:03+1] hiera: storage_elements override for cp2041/cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/1254249 (owner: 10BCornwall) [16:50:17] (03CR) 10Btullis: [C:03+2] Temporarily set dse-k8s-worker101[2,5] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1254247 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [16:50:34] (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: storage_elements override for cp2041/cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/1254249 (owner: 10BCornwall) [16:52:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [16:53:15] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [16:53:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11719996 (10elukey) @Jclark-ctr I provisioned dse-k8s-worker1020 with an experimental provisioning cookbook, when you... 
[16:55:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [16:55:50] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [16:56:17] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7006.magru.wmnet [16:57:22] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7015.magru.wmnet [16:58:11] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11720040 (10Jhancock.wm) yes. that matches the time. this error can be from a firmware issue. [16:58:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [16:58:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [16:58:34] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [16:59:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7014.magru.wmnet with OS trixie [17:00:05] swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1700). 
[17:00:26] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:01:22] o/ [17:01:31] (03PS1) 10BCornwall: trafficserver: Remove outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/1254254 [17:01:31] I'll be getting started on the infra window shortly [17:01:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [17:01:39] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [17:02:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [17:02:29] (03CR) 10Scott French: [C:03+2] mw-(api-int|web): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253664 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:02:40] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [17:02:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [17:03:36] (03PS2) 10BCornwall: trafficserver: Update single_backend site comments [puppet] - 10https://gerrit.wikimedia.org/r/1254254 [17:04:22] (03PS3) 10BCornwall: trafficserver: Update single_backend site comments [puppet] - 10https://gerrit.wikimedia.org/r/1254254 [17:04:36] (03Merged) 10jenkins-bot: mw-(api-int|web): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253664 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:05:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['bast2003'] [17:06:09] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1312-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:06:24] !log jhancock@cumin2002 START - 
Cookbook sre.hosts.reimage for host bast2003.wikimedia.org with OS bookworm [17:06:38] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720075 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS bookworm [17:06:49] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:07:40] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:08:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [17:08:56] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [17:09:03] 06SRE, 10Infrastructure Security, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, and 4 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11720092 (10MLechvien-WMF) @Blake remaining hosts to reboot should be done as part of T420175 , should we dedup this... 
[17:09:04] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:09:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [17:09:38] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [17:09:50] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7014.* [17:10:26] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:11:25] RECOVERY - Host bast2003 is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [17:13:03] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:13:11] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:13:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:20] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:14:37] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:14:54] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:15:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [17:16:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [17:16:20] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:16:55] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [17:18:56] FIRING: [2x] 
ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:13] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:19:17] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS trixie [17:19:44] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:20:28] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:20:32] FIRING: [5x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:21:56] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:23:22] (03CR) 10Elukey: [C:03+1] "is that easy? Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1254242 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [17:24:08] (03PS1) 10Kamila Součková: k8s: create shellbox-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) [17:24:47] (03CR) 10CI reject: [V:04-1] k8s: create shellbox-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [17:25:28] (03PS1) 10Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) [17:25:35] (03PS2) 10Kamila Součková: k8s: create shellbox-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) [17:25:46] (03PS2) 10Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) [17:26:07] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [17:26:14] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [17:26:18] (03CR) 10CI reject: [V:04-1] k8s: create shellbox-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [17:26:32] (03CR) 10CI reject: [V:04-1] mediabackups: Enable TLS and set multitenant mode [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [17:26:50] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [reason: trixie reimaging] [17:26:51] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:27:17] !log cdobbins@cumin2002 START - Cookbook 
sre.hosts.reimage for host cp3066.esams.wmnet with OS trixie [17:27:24] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3066.esams.wmnet with OS trixie [17:27:28] (03PS3) 10Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) [17:27:39] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [17:28:06] (03CR) 10CI reject: [V:04-1] mediabackups: Enable TLS and set multitenant mode [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [17:28:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:28:30] (03PS3) 10Kamila Součková: k8s: create shellbox-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) [17:28:55] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3066.esams.wmnet with OS trixie [17:29:26] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3067.esams.wmnet [reason: trixie reimaging] [17:29:32] (03PS4) 10Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) [17:29:39] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [17:29:39] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [17:29:52] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3067.esams.wmnet with OS trixie [17:30:32] FIRING: [7x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:31:54] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:32:19] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:33:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:33:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:33:55] (03CR) 10Jcrespo: [C:03+2] mediabackups: Enable TLS and set multitenant mode [puppet] - 10https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [17:33:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:34:57] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:37:21] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:37:31] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7007.magru.wmnet [17:37:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:38:27] (03CR) 10Dzahn: [C:04-1] "per our meeting just now: this is not good - we need it to run only on one of the 2 machines..and contint1003 is it. 
should be solved by " [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:38:31] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:39:15] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7016.magru.wmnet [17:39:15] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7013.magru.wmnet,cp701[5-6].magru.wmnet} and A:cp [17:39:30] (03CR) 10Dzahn: [C:04-2] "per meeting just now: not needed - proxy config stays on old host" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:39:41] (03Abandoned) 10Dzahn: jenkins: add proxy_jenkins profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:40:09] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:40:32] FIRING: [6x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:40:33] (03CR) 10Dzahn: [C:03+1] gerrit: add a ttl on ProxyPass to jetty [puppet] - 10https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189) (owner: 10Arnaudb) [17:41:28] !log btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes [17:42:10] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:42:13] PROBLEM - Host wikikube-worker1036 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3465.18 ms [17:42:30] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS trixie [17:42:40] (03CR) 10Anne Tomasevich: [C:03+1] Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude) [17:42:55] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS trixie [17:42:56] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720298 (10MoritzMuehlenhoff) @Jhancock.wm Given that we need to reimage this server anyway, could you please reimage with trixie instead of bookworm (what it ran before)? The first new bastion (bast1004) is alre... [17:43:05] RECOVERY - Host wikikube-worker1036 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:43:28] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:43:53] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:44:29] btullis@cumin1003 reboot-workers (PID 3951063) is awaiting input [17:44:56] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:44:59] (03CR) 10VolkerE: [C:03+1] Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude) [17:45:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude) [17:46:13] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3080.esams.wmnet with OS trixie [17:49:31] (03CR) 10JMeybohm: [C:03+2] realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [17:52:00] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1015.eqiad.wmnet with OS 
bookworm [17:52:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [17:52:24] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1312-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:53:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:15] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers wikikube-worker1144.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1101.eqiad.wmne [17:54:15] ube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1358.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker114 [17:54:15] wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [17:54:31] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers 
wikikube-worker1058.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1303.eqiad.wmnet, wikikube-worker1274.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1092.eqiad.wmnet, wikikube-worker1051.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmne [17:54:31] ube-worker1144.eqiad.wmnet, wikikube-worker1289.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1257.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1122.eqiad.wmnet, wikikube-worker1068.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker108 [17:54:31] wmnet, wikikube-worker1310.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1353.eqiad.wmnet, wikikube-worker1052.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [17:54:58] (03PS1) 10Dwisehaupt: wmnet: shift fundraisingdb-read back to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1254271 (https://phabricator.wikimedia.org/T420155) [17:58:10] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350) [18:00:05] andre and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1800). [18:00:15] jouncebot: no! [18:02:25] uhh ... 
the mw-parsoid_4452 PyBal backends alert is *probably* the result of wikikube-worker-exp* restarts [18:02:30] I can take a look in a bit [18:02:54] oh no [18:02:56] thanks [18:03:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage [18:03:40] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [18:04:09] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3067.esams.wmnet with reason: host reimage [18:06:26] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11720399 (10Scardenasmolinar) > @Scardenasmolinar can you read and sign https://phabricator.wikimedia.org/L3 ? Signed! > Our web deploy tool SpiderPig also requires you request membe... [18:08:16] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [18:09:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2003.wikimedia.org with reason: host reimage [18:09:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3080.esams.wmnet with reason: host reimage [18:12:47] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [18:13:01] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3066.esams.wmnet with reason: host reimage [18:14:44] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [18:15:08] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406 (10bvibber) 03NEW [18:16:18] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [18:16:23] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3067.esams.wmnet with reason: host reimage [18:16:40] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [18:16:51] (03CR) 10Jgreen: [C:03+1] wmnet: shift fundraisingdb-read back to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1254271 (https://phabricator.wikimedia.org/T420155) (owner: 10Dwisehaupt) [18:17:01] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [18:19:03] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7008.magru.wmnet [18:19:03] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp700[5-8].magru.wmnet} and A:cp [18:19:25] (03PS1) 10Alex.sanford: Remove notice from login form in popup mode [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) [18:19:39] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on 
cp3080 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [18:20:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3080.esams.wmnet with reason: host reimage [18:22:11] sukhe: following up - mw-parsoid is now special, in that the backing pods are only allowed to run on a very limited number of k8s workers (2). the ongoing reboot run seems to have taken both nodes out of service at the same time. [18:22:30] I'll take a look at why that happened and figure out how to unstick it [18:23:01] in any case, this is not in any way a critical service [18:24:28] thanks! [18:25:37] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [18:25:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [18:30:35] (03CR) 10Dwisehaupt: [C:03+2] wmnet: shift fundraisingdb-read back to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1254271 (https://phabricator.wikimedia.org/T420155) (owner: 10Dwisehaupt) [18:30:39] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11720532 (10ssingh) >>! In T418971#11719697, @Fabfur wrote: > Procedure from the traffic perspective should be roughly > > - Depool ulsfo (around 0900... 
[18:31:04] !log dwisehaupt@dns1005 START - running authdns-update [18:31:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2003.wikimedia.org with OS bookworm [18:31:20] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS bookworm completed: - bast2003 (**WARN**) - Downtimed on Icinga/Alertman... [18:31:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:31:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford) [18:32:34] !log dwisehaupt@dns1005 END - running authdns-update [18:32:38] (03CR) 10Huei Tan: [C:03+1] "👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: 10Abijeet Patro) [18:35:29] (03CR) 10Ssingh: [C:03+1] "Trusting the script!" 
[dns] - 10https://gerrit.wikimedia.org/r/1254092 (owner: 10Slyngshede) [18:38:11] (03CR) 10Bartosz Dziewoński: "If you get build failures when trying to deploy this change (I'm not sure how the CI is set up for wmf.XX branches and whether it'll pass " [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford) [18:40:59] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3066.esams.wmnet with OS trixie [18:41:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:43:05] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3067.esams.wmnet with OS trixie [18:44:45] PROBLEM - Host an-worker1172 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:12] (03PS3) 10Arlolra: Deploy PRV to XX wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) [18:45:45] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3080 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-06-04 03:56:45 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [18:47:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3080.esams.wmnet with OS trixie [18:48:49] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:48:59] ^ there we go [18:49:15] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:49:22] !log manually uncordoned 
wikikube-worker-exp1001.eqiad.wmnet after failed reboot [18:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:51] (03CR) 10Ssingh: trafficserver: Update single_backend site comments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1254254 (owner: 10BCornwall) [18:50:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host bast2003.wikimedia.org with OS trixie [18:50:51] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS trixie [18:53:27] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3067.esams.wmnet [reason: trixie reimaging] [18:53:38] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [reason: trixie reimaging] [18:54:39] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3068.esams.wmnet [reason: trixie reimaging] [18:54:42] (03CR) 10Ssingh: ulsfo: add new LVS service IP range (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [18:55:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes [18:55:30] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3068.esams.wmnet with OS trixie [18:55:58] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3069.esams.wmnet [reason: trixie reimaging] [18:56:21] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3069.esams.wmnet with OS trixie [18:57:50] (03CR) 10Daniel Kinzler: rest-gateway: per-route jwt overrides (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [18:58:47] 
(03PS1) 10Ssingh: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - 10https://gerrit.wikimedia.org/r/1254290 (https://phabricator.wikimedia.org/T420361) [19:00:58] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3080.* [19:05:06] (03PS1) 10Dzahn: create a discovery name for new jenkins on contint machines [dns] - 10https://gerrit.wikimedia.org/r/1254292 (https://phabricator.wikimedia.org/T418521) [19:05:17] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS trixie [19:05:34] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS trixie [19:07:05] (03CR) 10Dzahn: [C:03+2] create a discovery name for new jenkins on contint machines [dns] - 10https://gerrit.wikimedia.org/r/1254292 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:07:22] !log dzahn@dns1004 START - running authdns-update [19:07:46] (03PS1) 10Ssingh: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - 10https://gerrit.wikimedia.org/r/1254293 (https://phabricator.wikimedia.org/T420361) [19:08:56] !log dzahn@dns1004 END - running authdns-update [19:09:09] (03Abandoned) 10Ssingh: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - 10https://gerrit.wikimedia.org/r/1254290 (https://phabricator.wikimedia.org/T420361) (owner: 10Ssingh) [19:10:15] (03CR) 10Catrope: rest-gateway rate limiting: add CORS headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [19:11:44] (03CR) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [19:11:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: 
host reimage [19:12:33] (03PS1) 10Dzahn: jenkins: define contint1003 as the manager_host for the jenkins role [puppet] - 10https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521) [19:16:34] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1254295/8289/" [puppet] - 10https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:17:34] (03CR) 10Dzahn: [V:03+1 C:03+1] jenkins: define contint1003 as the manager_host for the jenkins role [puppet] - 10https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:18:08] btullis@cumin1003 reboot-workers (PID 3894227) is awaiting input [19:19:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2003.wikimedia.org with reason: host reimage [19:20:26] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3068.esams.wmnet with reason: host reimage [19:21:40] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3069.esams.wmnet with reason: host reimage [19:23:50] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3068.esams.wmnet with reason: host reimage [19:26:12] (03CR) 10Dzahn: [C:03+2] "This can't work - the class is needed on both sides - it has internal logic to do appropriate things on each one." 
[puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn) [19:26:27] (03PS1) 10Dzahn: Revert "releases: remove rsync systemd units when primary server changes" [puppet] - 10https://gerrit.wikimedia.org/r/1254300 [19:28:01] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3069.esams.wmnet with reason: host reimage [19:28:45] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases1003.eqiad.wmnet with reason: T420246 [19:28:49] T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246 [19:28:54] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3081.esams.wmnet with reason: host reimage [19:29:12] (03PS1) 10Catrope: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301 [19:29:21] (03PS1) 10Catrope: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302 [19:29:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302 (owner: 10Catrope) [19:29:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301 (owner: 10Catrope) [19:32:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3081.esams.wmnet with reason: host reimage [19:33:46] (03Abandoned) 10Kamila Součková: k8s: create shellbox-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1254266 
(https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [19:38:09] (03PS2) 10Dzahn: jenkins: add envoy and config for jenkins.discovery.wmnet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521) [19:39:10] (03PS3) 10Dzahn: jenkins: add envoy and config for jenkins.discovery.wmnet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521) [19:39:43] (03Abandoned) 10Dzahn: jenkins: enable the jenkins service if using new role [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:39:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2003.wikimedia.org with OS trixie [19:40:02] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS trixie completed: - bast2003 (**PASS**) - Downtimed on Icinga/Alertmanag... [19:41:00] (03CR) 10Dzahn: [C:03+1] profile::reboot::unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [19:43:05] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720668 (10Jhancock.wm) @MoritzMuehlenhoff redid it in trixie. 
[19:46:07] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [19:46:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:47:32] (03PS1) 10Dzahn: contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) [19:48:12] (03CR) 10CI reject: [V:04-1] contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:48:46] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720684 (10MoritzMuehlenhoff) Thanks! [19:49:14] (03PS2) 10Dzahn: contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) [19:50:26] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3068.esams.wmnet with OS trixie [19:50:54] (03PS1) 10Dzahn: ci: switch jenkins proxy target to new discovery name [puppet] - 10https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521) [19:51:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:53:36] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11720695 (10Gehel) Approved. 
[19:54:05] !log T411568 rebooted `an-test-client1002`, `an-test-ui1001`, `an-test-coord1001`, `an-test-master1001` [19:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:09] T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568 [19:54:23] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3069.esams.wmnet with OS trixie [19:58:58] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3081.esams.wmnet with OS trixie [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T2000). [20:00:05] aude, alexsanford, and RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] hi [20:00:26] hey! [20:01:01] i can deploy mine (it's config so should be quick) [20:01:45] proceeding [20:01:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude) [20:03:21] (03Merged) 10jenkins-bot: Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude) [20:03:53] just a heads-up: I'm going to be monitoring how some changes we (SRE) made earlier today perform through the deployments in this window.
should all be fine, but I *might* have to ask you folks to pause between patches for me to revert something if there are surprises :) [20:03:55] !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1251309|Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster (T419163)]] [20:03:58] T419163: Opt new accounts into ReadingLists BetaFeature - https://phabricator.wikimedia.org/T419163 [20:05:40] swfrench-wmf let me know if anything doesn't look good to you with my config deploy [20:05:52] * swfrench-wmf thumbs up [20:06:01] !log aude@deploy2002 aude: Backport for [[gerrit:1251309|Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster (T419163)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:48] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3069.esams.wmnet [reason: trixie reimaging] [20:08:00] looks good from my side [20:08:03] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3068.esams.wmnet [reason: trixie reimaging] [20:08:24] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3070.esams.wmnet [reason: trixie reimaging] [20:08:38] proceeding [20:08:50] !log aude@deploy2002 aude: Continuing with sync [20:09:03] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3070.esams.wmnet with OS trixie [20:09:24] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3071.esams.wmnet with OS trixie [20:09:57] FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:11:04] (03CR) 10JHathaway: [C:03+1] "I think this looks okay, one code comment." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [20:12:48] !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251309|Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster (T419163)]] (duration: 08m 53s) [20:12:54] T419163: Opt new accounts into ReadingLists BetaFeature - https://phabricator.wikimedia.org/T419163 [20:13:27] (03CR) 10JHathaway: [C:03+1] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [20:14:57] i'm done swfrench-wmf RoanKattouw alexsanford [20:15:31] I'll continue to monitor, but I think that looked good from my end :) [20:15:36] (03PS17) 10Andrew Bogott: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) [20:16:12] !log T411568 rebooted `an-test-master1002`, `an-test-master1003`, `an-test-master1004`, `archiva1002` [20:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:16] T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568 [20:16:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [20:18:53] maybe I spoke too soon ... [20:19:19] that's a lot of NELs for the measure domains [20:19:24] what's the relation? 
[20:19:47] I'm struggling to think of a way that could possibly be related to my change earlier today [20:20:12] I'm about to deploy some more MW patches, speak up now/soon if you want me to stop/pause [20:20:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301 (owner: 10Catrope) [20:20:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302 (owner: 10Catrope) [20:20:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:02] TCP timed out, interesting [20:21:03] PROBLEM - Host an-coord1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:09] PROBLEM - Host an-web1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:19] when did the errors start? 
[20:21:45] 19:55 UTC [20:21:47] aude: looks like they've been creeping up since ~ 19:40 [20:21:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [20:22:01] (03Merged) 10jenkins-bot: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301 (owner: 10Catrope) [20:22:02] (03Merged) 10jenkins-bot: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302 (owner: 10Catrope) [20:22:08] I think Telefonica de Espanica is having issues specifically [20:22:09] so yeah, unrelated to either of our changes [20:22:16] ^ exactly [20:22:23] phew. everything looked good to me with my change [20:22:27] RECOVERY - Host an-coord1003 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [20:22:30] (03PS1) 10Ryan Kemper: wdqs: Remove old single-instance deadlock remediation cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453) [20:22:34] can't see how it is related [20:22:36] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1254301|Passwordless login: Don't display conditional auth errors]], [[gerrit:1254302|Passwordless login: Don't display conditional auth errors]] [20:22:37] RECOVERY - Host an-web1001 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [20:23:17] * swfrench-wmf returns to staring at different graphs than NEL [20:23:34] :) [20:24:38] !log catrope@deploy2002 catrope: Backport for [[gerrit:1254301|Passwordless login: Don't display conditional auth errors]], [[gerrit:1254302|Passwordless login: Don't display conditional auth errors]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [20:24:54] 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410 (10Sarmbruster) 03NEW [20:24:57] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [20:26:35] (03CR) 10Andrew Bogott: "I've created and deleted several nodes in toolsbeta with the latest version of this patch." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [20:26:41] PROBLEM - Host an-coord1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:27:27] RECOVERY - Host an-coord1004 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [20:27:33] !log catrope@deploy2002 catrope: Continuing with sync [20:30:33] (03CR) 10Ryan Kemper: [C:03+1] "0 blast radius cleanup patch, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [20:30:35] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Remove old single-instance deadlock remediation cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [20:31:32] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254301|Passwordless login: Don't display conditional auth errors]], [[gerrit:1254302|Passwordless login: Don't display conditional auth errors]] (duration: 08m 56s) [20:32:12] alexsanford: You're up! 
You can use the "deploy change" link on the deployments page to jump straight into a SpiderPig session for your patch [20:32:12] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T2000 [20:32:30] ok on it [20:34:40] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3071.esams.wmnet with reason: host reimage [20:34:44] RoanKattouw hmm looks like my developer account doesn't have enough privileges to open SpiderPig [20:34:54] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3070.esams.wmnet with reason: host reimage [20:34:56] Huh but you do have deployment access? [20:35:06] Yep [20:35:24] Ah I see, it's a separate group [20:35:35] OK for now request it at https://idm.wikimedia.org/permissions/ and then you'll have it for next time [20:35:48] will do [20:36:41] SpiderPig is really just a fancy UI around `scap backport`, so you can run `scap backport 1254280` on the deployment host and you'll get basically the same experience [20:37:05] Ok I'll try that [20:37:31] You'll just have to type Y/N instead of clicking buttons, and you won't get notifications from your browser when you need to take action [20:38:25] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3071.esams.wmnet with reason: host reimage [20:38:38] !log T411568 rebooted `an-coord1003`, `an-coord1004`, `an-tool1007`, `an-tool1008`, `an-tool1011`, `an-web1001` [20:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:41] T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568 [20:38:44] !log T411568 failed over HDFS NameNode from an-master1003 to an-master1004, then rebooted `an-master1003` [20:38:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 
(https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford) [20:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:12] PROBLEM - Host an-master1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11720846 (10VRiley-WMF) @BTullis Would you like us to order a replacement drive for this? [20:40:27] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [20:40:42] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [20:40:44] RECOVERY - Host an-master1003 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [20:40:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: hw troubleshooting: Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11720848 (10VRiley-WMF) a:03VRiley-WMF [20:43:05] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3070.esams.wmnet with reason: host reimage [20:43:37] (03PS4) 10Arlolra: Deploy PRV to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) [20:43:51] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:45:11] (03PS1) 10Bking: WIP: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) [20:45:28] (03PS3) 10Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) [20:47:49]
btullis@cumin1003 provision (PID 3971211) is awaiting input [20:48:29] (03CR) 10Dzahn: [C:03+2] Revert "releases: remove rsync systemd units when primary server changes" [puppet] - 10https://gerrit.wikimedia.org/r/1254300 (owner: 10Dzahn) [20:48:31] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:49:36] (03PS3) 10Dzahn: contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) [20:51:54] (03Merged) 10jenkins-bot: Remove notice from login form in popup mode [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford) [20:51:56] RECOVERY - Host wikikube-worker1307 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [20:52:24] !log alexsanford@deploy2002 Started scap sync-world: Backport for [[gerrit:1254280|Remove notice from login form in popup mode (T418534)]] [20:52:28] T418534: Update the design of the popup login form for use in a mobile web view - https://phabricator.wikimedia.org/T418534 [20:53:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: hw troubleshooting: Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11720875 (10VRiley-WMF) Forced the unit down, and performed a flea power drain. Booted it back up an... [20:54:26] !log alexsanford@deploy2002 alexsanford: Backport for [[gerrit:1254280|Remove notice from login form in popup mode (T418534)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:55:02] (03CR) 10Kamila Součková: shellbox: Setup shellbox-icu72 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková) [20:55:30] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1254307/8290/" [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [20:55:52] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1254307/8290/" [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [20:56:03] !log alexsanford@deploy2002 alexsanford: Continuing with sync [20:59:56] !log alexsanford@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254280|Remove notice from login form in popup mode (T418534)]] (duration: 07m 32s) [21:00:00] T418534: Update the design of the popup login form for use in a mobile web view - https://phabricator.wikimedia.org/T418534 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T2100) [21:00:26] (03PS5) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655 [21:00:28] (03PS4) 10Dzahn: jenkins: add envoy and config for jenkins.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521) [21:00:32] FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:00:44] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:00:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:01:24] my backport deployment is done [21:05:46] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3071.esams.wmnet with OS trixie [21:06:48] (03PS1) 10Dzahn: releases: remove "unless" condition around rsync data copy [puppet] - 10https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) [21:08:26] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1254332 [21:08:41] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1254332 [21:09:03] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11720994 (10Ajuanca) What's task `T419960` about? I don't have enough privileges to access it. Yes, I think a parameter with expressive reboot time is more robust than a relati... [21:09:33] (03CR) 10CI reject: [V:04-1] wip [puppet] - 10https://gerrit.wikimedia.org/r/1254332 (owner: 10Herron) [21:09:50] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3070.esams.wmnet with OS trixie [21:10:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:13:46] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3071.esams.wmnet [reason: trixie reimaging] [21:13:55] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3070.esams.wmnet [reason: trixie reimaging] [21:14:09] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3072.esams.wmnet [reason: trixie reimaging] [21:14:35] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with
OS trixie [21:14:53] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet [reason: trixie reimaging] [21:15:22] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3073.esams.wmnet with OS trixie [21:16:55] (03CR) 10JHathaway: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [21:22:29] (03CR) 10C. Scott Ananian: [C:03+1] Deploy PRV to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: 10Arlolra) [21:27:42] (03PS1) 10RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) [21:29:44] (03CR) 10CI reject: [V:04-1] function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [21:33:55] (03PS2) 10RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) [21:36:00] (03CR) 10CI reject: [V:04-1] function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [21:36:28] 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11721043 (10RKemper) 05Open→03Resolved Completed all remaining DPE-owned host reboots today (2026-03-17). All 143 reachable Bullseye hos... 
[21:38:56] !log T411568 Failed back HDFS NameNode from an-master1004 to an-master1003; cluster back to original active/standby configuration [21:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:00] T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568 [21:39:39] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [21:41:41] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3073.esams.wmnet with reason: host reimage [21:44:28] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage [21:47:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11721057 (10HSwan-WMF) Please grant this access so that Brooke can pull data. Thank you! 
[21:48:03] 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884#11721058 (10RobH) [21:48:28] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3073.esams.wmnet with reason: host reimage [22:02:18] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721097 (10RKemper) a:05RKemper→03Jclark-ctr [22:03:16] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721106 (10RKemper) Switched this to a HW failure ticket, given `racadm getsel` revealed a backplane issue [22:04:16] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721109 (10RKemper) [22:05:20] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases2003.codfw.wmnet with reason: T420246 [22:05:24] T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246 [22:05:46] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases1003.eqiad.wmnet with reason: T420246 [22:06:56] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721122 (10RKemper) p:05Triage→03Medium [22:07:18] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721124 (10RKemper) 
[22:10:46] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3072.esams.wmnet with OS trixie [22:11:57] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3072.esams.wmnet [reason: trixie reimaging] [22:15:17] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3073.esams.wmnet with OS trixie [22:15:32] PROBLEM - statsv Varnishkafka log producer on cp3071 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:16:32] RECOVERY - statsv Varnishkafka log producer on cp3071 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:20:21] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet [reason: trixie reimaging] [22:35:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: 10Arlolra) [22:49:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:54:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:29] !log brett@puppetserver1001 
conftool action : set/pooled=yes; selector: name=cp3081.* [23:00:49] (03PS1) 10Jforrester: [DNM] Make abstractwiki a multi-lingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) [23:07:18] 06SRE: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. - https://phabricator.wikimedia.org/T420422#11721260 (10Reedy) [23:38:30] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [23:44:14] !log btullis@cumin1003 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [23:46:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11721339 (10BTullis) a:05VRiley-WMF→03BTullis Thanks, @VRiley-WMF. I'm not sure that it's worth it to purchase a replacement. The disks aren't actuall... [23:48:49] (03CR) 10Scott French: [C:03+1] "Thanks, Raine!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková) [23:50:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721344 (10BTullis) I confirmed that the backplane was giving this error. {F72973897,width=3...