[00:00:05] <wikibugs>	 (Merged) jenkins-bot: Enable languages in main menu on Russian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1251158 (https://phabricator.wikimedia.org/T419730) (owner: Jdlrobson)
[00:00:44] <logmsgbot>	 !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1251158|Enable languages in main menu on Russian Wikipedia (T419730)]]
[00:02:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:02:39] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1251158|Enable languages in main menu on Russian Wikipedia (T419730)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:02:43] <stashbot>	 T419730: Vector 2022 should support duplication of languages in header and sidebar - https://phabricator.wikimedia.org/T419730
[00:03:49] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Continuing with sync
[00:04:49] <dduvall>	 RoanKattouw: ok, thank you for that. perhaps it's a race condition between the messages/actions taken by gerrit and spiderpig. please file a bug and we will have a closer look
[00:04:50] <dduvall>	  Jdlrobson: are you unblocked for your deployment now?
[00:06:38] <wikibugs>	 (PS12) Bartosz Dziewoński: rest-gateway rate limiting: add CORS headers [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler)
[00:06:48] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] "Looks right, as far as I can tell." [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler)
[00:07:41] <logmsgbot>	 !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251158|Enable languages in main menu on Russian Wikipedia (T419730)]] (duration: 06m 57s)
[00:07:45] <stashbot>	 T419730: Vector 2022 should support duplication of languages in header and sidebar - https://phabricator.wikimedia.org/T419730
[00:08:01] <Jdlrobson>	 ok all done. Thanks for the troubleshooting dduvall RoanKattouw 
[00:08:16] <Jdlrobson>	 RoanKattouw: would you be able to raise a bug since it seems like you have a good handle on what happened here?
[00:08:49] <RoanKattouw>	 I gotta run but I'll file one tomorrow
[00:09:38] <dduvall>	 Jdlrobson: good to hear! RoanKattouw: thank you!
[00:20:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:23] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6009.drmrs.wmnet with OS trixie
[00:26:11] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6009.*
[00:32:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:32:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Cable cleanup in rack - https://phabricator.wikimedia.org/T420266#11716635 (VRiley-WMF) Open→Resolved This is completed
[00:35:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:39:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1253686
[00:39:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1253686 (owner: TrainBranchBot)
[00:42:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:43:52] <wikibugs>	 (CR) Ssingh: [C:+1] hiera: Remove single_backend from codfw [puppet] - https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: BCornwall)
[00:44:35] <wikibugs>	 (CR) Ssingh: [C:+1] hiera: Set default codfw storage_elements [puppet] - https://gerrit.wikimedia.org/r/1253606 (https://phabricator.wikimedia.org/T401832) (owner: BCornwall)
[00:52:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1253686 (owner: TrainBranchBot)
[01:08:59] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253707
[01:08:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253707 (owner: TrainBranchBot)
[01:15:32] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:20:22] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:20:22] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:21:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:21:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:26:07] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1253707 (owner: TrainBranchBot)
[01:42:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:43:45] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:48:45] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:49:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[01:52:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:57:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:57:45] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0200)
[02:00:49] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:02:45] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:05:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:08:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:52] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.20 [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1253715 (https://phabricator.wikimedia.org/T413811)
[02:08:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.20 [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1253715 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[02:09:00] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 10s)
[02:10:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:12:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:20:45] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.20 [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1253715 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[02:22:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:24:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:29:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:33:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:35:00] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:38:40] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:52:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0300)
[03:02:06] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1253737 (https://phabricator.wikimedia.org/T413811)
[03:02:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1253737 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[03:02:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:03:05] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1253737 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[03:03:36] <logmsgbot>	 !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.20  refs T413811
[03:03:40] <stashbot>	 T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[03:05:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:13:38] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:43:10] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.20  refs T413811 (duration: 39m 34s)
[03:43:14] <stashbot>	 T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0400)
[04:01:19] <logmsgbot>	 !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.17 (duration: 01m 17s)
[04:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:23:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:28:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:34:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:35:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:37:34] <wikibugs>	 (PS1) Dwisehaupt: Point fundraising read handle back at the origin server [dns] - https://gerrit.wikimedia.org/r/1253750 (https://phabricator.wikimedia.org/T420155)
[04:39:36] <wikibugs>	 (CR) Dwisehaupt: [C:+2] Point fundraising read handle back at the origin server [dns] - https://gerrit.wikimedia.org/r/1253750 (https://phabricator.wikimedia.org/T420155) (owner: Dwisehaupt)
[04:39:58] <logmsgbot>	 !log dwisehaupt@dns1005 START - running authdns-update
[04:41:23] <logmsgbot>	 !log dwisehaupt@dns1005 END - running authdns-update
[04:44:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:48:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:53:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:00:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:26] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420308 (phaultfinder) NEW
[05:15:47] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:49:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[05:55:03] <kart_>	 Deploying cxserver.
[05:55:10] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2026-03-16-071247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1253260 (https://phabricator.wikimedia.org/T420004) (owner: KartikMistry)
[05:57:23] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2026-03-16-071247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1253260 (https://phabricator.wikimedia.org/T420004) (owner: KartikMistry)
[05:58:34] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:58:59] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0600).
[06:04:48] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:05:19] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:06:37] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:07:13] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:08:09] <kart_>	 !log Updated cxserver to 2026-03-16-071247-production (T420004)
[06:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:12] <stashbot>	 T420004: Translating an article with ContentTranslation may fail with Critical error - https://phabricator.wikimedia.org/T420004
[06:35:00] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:54:23] <wikibugs>	 (CR) Ayounsi: "lgtm but leaving the last call to the traffic team" [puppet] - https://gerrit.wikimedia.org/r/1253538 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[06:55:09] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: ProxyTimeout shorter than Jetty's idle timeout [puppet] - https://gerrit.wikimedia.org/r/1241048 (https://phabricator.wikimedia.org/T246763) (owner: Hashar)
[06:55:28] <wikibugs>	 (CR) Ayounsi: [C:+2] decom cookbook: add --homer parameter [cookbooks] - https://gerrit.wikimedia.org/r/1251099 (owner: Ayounsi)
[06:55:53] <wikibugs>	 (CR) Ayounsi: [C:+2] "Not for now, but eventually yes." [cookbooks] - https://gerrit.wikimedia.org/r/1251099 (owner: Ayounsi)
[06:59:40] <wikibugs>	 (CR) Ayounsi: [C:+1] cr-cloud: allow cumin/cloudcumin traffic [homer/public] - https://gerrit.wikimedia.org/r/1253574 (https://phabricator.wikimedia.org/T419996) (owner: Filippo Giunchedi)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bookworm
[07:00:59] <wikibugs>	 (Merged) jenkins-bot: decom cookbook: add --homer parameter [cookbooks] - https://gerrit.wikimedia.org/r/1251099 (owner: Ayounsi)
[07:01:03] <wikibugs>	 ops-esams, SRE, DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11717077 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti3005.esams.wmnet with OS bookworm
[07:03:46] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11717082 (ayounsi) I created a meeting to not forget, and invited you both just in case.
[07:05:26] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for lmixter [puppet] - https://gerrit.wikimedia.org/r/1253996
[07:08:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove LDAP access for lmixter [puppet] - https://gerrit.wikimedia.org/r/1253996 (owner: Muehlenhoff)
[07:13:38] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:19:06] <wikibugs>	 (CR) Muehlenhoff: systemd::timer::job: add ExecCondition support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1253655 (owner: Herron)
[07:21:25] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[07:23:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
[07:24:43] <wikibugs>	 (CR) Ryan Kemper: [C:+1] alerts(blazegraph): reduce severity of CategoriesQueryServiceUpdateLagTooHigh to warning [alerts] - https://gerrit.wikimedia.org/r/1253552 (https://phabricator.wikimedia.org/T420235) (owner: Gehel)
[07:25:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
[07:25:21] <moritzm>	 ml-etcd2001 will go down for a Ganeti reboot
[07:25:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[07:27:28] <icinga-wm>	 PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:29:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3005.esams.wmnet with reason: host reimage
[07:30:56] <icinga-wm>	 RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.69 ms
[07:31:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
[07:31:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
[07:32:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet
[07:32:04] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2033.codfw.wmnet
[07:32:34] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gerrit.restart-gerrit Restarting Gerrit on gerrit2003
[07:32:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
[07:33:28] <icinga-wm>	 RECOVERY - Host wikikube-worker1291 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[07:34:23] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.restart-gerrit (exit_code=0) Restarting Gerrit on gerrit2003
[07:35:32] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:35:55] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3322888) is awaiting input
[07:37:49] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1285-1289,1291-1299].eqiad.wmnet
[07:37:53] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1285-1289,1291-1299].eqiad.wmnet
[07:38:24] <wikibugs>	 (CR) Ayounsi: "This should be able to be pushed anytime, and then followed up by a cleanup patch for the older ranges." [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: Ayounsi)
[07:41:42] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3322888) is awaiting input
[07:41:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] cr-cloud: allow cumin/cloudcumin traffic [homer/public] - https://gerrit.wikimedia.org/r/1253574 (https://phabricator.wikimedia.org/T419996) (owner: Filippo Giunchedi)
[07:46:00] <wikibugs>	 (PS1) Arnaudb: gerrit: fix typo on isRegex for alerting downtime [cookbooks] - https://gerrit.wikimedia.org/r/1254035
[07:46:07] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: fix typo on isRegex for alerting downtime [cookbooks] - https://gerrit.wikimedia.org/r/1254035 (owner: Arnaudb)
[07:46:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org
[07:46:40] <wikibugs>	 (CR) Arnaudb: [C:+1] "thanks for the change, looks good to me" [puppet] - https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[07:48:18] <wikibugs>	 (CR) Arnaudb: [C:+1] miscweb: add wmf-navigator values - empty httpd [deployment-charts] - https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[07:50:36] <wikibugs>	 (Merged) jenkins-bot: gerrit: fix typo on isRegex for alerting downtime [cookbooks] - https://gerrit.wikimedia.org/r/1254035 (owner: Arnaudb)
[07:51:20] <wikibugs>	 (PS16) Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[07:51:38] <wikibugs>	 (CR) Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[07:51:44] <wikibugs>	 (CR) Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[07:52:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3005.esams.wmnet with OS bookworm
[07:52:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org
[07:52:39] <wikibugs>	 ops-esams, SRE, DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11717168 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti3005.esams.wmnet with OS bookworm completed: - ganeti3005 (**PASS**)   - Downtimed on I...
[07:54:14] <wikibugs>	 (CR) Elukey: [C:+2] sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: Elukey)
[07:55:29] <wikibugs>	 (CR) Elukey: [C:+2] P:kafka::broker::monitoring: Fix legacy facts [puppet] - https://gerrit.wikimedia.org/r/1251539 (https://phabricator.wikimedia.org/T420034) (owner: Majavah)
[07:55:45] <wikibugs>	 (CR) Elukey: [C:+2] confluent: kafka::broker: Fix legacy facts [puppet] - https://gerrit.wikimedia.org/r/1251540 (https://phabricator.wikimedia.org/T420034) (owner: Majavah)
[07:57:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org
[08:00:05] <jouncebot>	 andre and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T0800).
[08:02:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[08:04:55] <wikibugs>	 (PS1) Slyngshede: geo-maps: update Meta geo mapping [dns] - https://gerrit.wikimedia.org/r/1254092
[08:08:41] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:09:26] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:12:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet
[08:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:02] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host parsoidtest1001.eqiad.wmnet
[08:14:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3005.esams.wmnet to cluster esams03 and group B
[08:14:25] <moritzm>	 !log powercycling bast2003 (stuck on reboot)
[08:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3005.esams.wmnet to cluster esams03 and group B
[08:18:00] <icinga-wm>	 PROBLEM - Host bast2003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:18:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
[08:19:56] <icinga-wm>	 PROBLEM - Host ml-staging-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:55] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parsoidtest1001.eqiad.wmnet
[08:21:04] <wikibugs>	 ops-codfw, SRE, DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320 (MoritzMuehlenhoff) NEW
[08:21:31] <wikibugs>	 (CR) Mszwarc: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[08:24:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
[08:24:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
[08:25:33] <wikibugs>	 (CR) Gehel: [C:+2] alerts(blazegraph): reduce severity of CategoriesQueryServiceUpdateLagTooHigh to warning [alerts] - https://gerrit.wikimedia.org/r/1253552 (https://phabricator.wikimedia.org/T420235) (owner: Gehel)
[08:26:10] <icinga-wm>	 RECOVERY - Host ml-staging-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 31.42 ms
[08:27:42] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host contint1002.wikimedia.org
[08:28:59] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc1001.eqiad.wmnet
[08:29:05] <hashar>	 the CI host (Jenkins/Zuul) is being restarted
[08:30:27] <wikibugs>	 ops-esams, SRE, DC-Ops: ganeti3005 didn't come up after reboot - https://phabricator.wikimedia.org/T420229#11717310 (MoritzMuehlenhoff) Open→Resolved p:Triage→Medium I've reimaged ganeti3005 and re-added it to the cluster.
[08:31:41] <wikibugs>	 (PS5) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[08:31:55] <wikibugs>	 (CR) CI reject: [V:-1] sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: Elukey)
[08:32:17] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:32:36] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:34:18] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint1002.wikimedia.org
[08:34:22] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet
[08:34:36] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gerrit.restart-gerrit Restarting Gerrit on gerrit2002
[08:35:47] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.restart-gerrit (exit_code=0) Restarting Gerrit on gerrit2002
[08:36:04] <wikibugs>	 (PS17) Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[08:36:10] <wikibugs>	 (CR) Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[08:38:55] <wikibugs>	 (PS17) Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[08:40:33] <kostajh>	 hashar: can I sync a config patch? 
[08:40:47] <kostajh>	 or is the train running now? cc andre 
[08:40:57] <andre>	 kostajh, train is blocked, go ahead
[08:40:58] <hashar>	 the train will run in 20 minutes
[08:41:06] <hashar>	 the deployment calendar is off by one hour because of the DST confusion time
[08:41:09] <andre>	 no, Daylight Confusion Time
[08:41:19] <kostajh>	 ok, I'm starting
[08:41:34] <wikibugs>	 (PS18) Kosta Harlan: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[08:41:42] <andre>	 hashar: Deployment Calendar is bound to SF/PST time, so train window started 40min ago, AFAIK
[08:41:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[08:41:56] <wikibugs>	 (PS3) Elukey: sre.hosts.provision: refactor bios if/else branches [cookbooks] - https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216)
[08:41:56] <wikibugs>	 (PS2) Elukey: sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216)
[08:41:56] <wikibugs>	 (PS6) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[08:42:28] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:42:45] <wikibugs>	 (Merged) jenkins-bot: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend [mediawiki-config] - https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[08:42:47] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:43:34] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1250575|hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (T419125)]]
[08:43:39] <stashbot>	 T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125
[08:44:44] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host bast2003.wikimedia.org
[08:45:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
[08:45:29] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:45:35] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:45:35] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:45:35] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:46:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:46:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (208.80.153.205) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:47:13] <wikibugs>	 SRE, Data-Platform-SRE: Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11717376 (Gehel) I don't think this alert should have been paging. The workloads we run on k8s are all supposed to be able to be down for extended periods.
[08:47:28] <wikibugs>	 SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11717377 (Gehel)
[08:48:59] <logmsgbot>	 !log kharlan@deploy2002 harroyo-wmf, kharlan: Backport for [[gerrit:1250575|hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (T419125)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:49:02] <stashbot>	 T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125
[08:49:28] <wikibugs>	 (PS1) Effie Mouzeli: site.pp: switch mc-misc to nftables [puppet] - https://gerrit.wikimedia.org/r/1254111 (https://phabricator.wikimedia.org/T398611)
[08:49:52] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3338663) is awaiting input
[08:50:17] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] hieradata: migrate eqiad memcached cluster to nftables [puppet] - https://gerrit.wikimedia.org/r/1253514 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[08:50:57] <wikibugs>	 SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11717388 (MoritzMuehlenhoff) >>! In T420264#11717376, @Gehel wrote: > I don't think this alert should have been paging. The workloads we run on k8s are al...
[08:51:10] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:51:39] <jinxer-wm>	 FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:52:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
[08:52:47] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-canary
[08:53:02] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1254111 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[08:53:33] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] site.pp: switch mc-misc to nftables [puppet] - https://gerrit.wikimedia.org/r/1254111 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[08:54:10] <logmsgbot>	 !log kharlan@deploy2002 Sync cancelled.
[08:54:30] <wikibugs>	 (PS1) Kosta Harlan: Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254114
[08:54:59] <wikibugs>	 (PS2) Kosta Harlan: Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125)
[08:55:06] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125) (owner: Kosta Harlan)
[08:55:11] <wikibugs>	 (CR) TrainBranchBot: "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125) (owner: Kosta Harlan)
[08:56:56] <wikibugs>	 (Merged) jenkins-bot: Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254114 (https://phabricator.wikimedia.org/T419125) (owner: Kosta Harlan)
[08:57:07] <moritzm>	 !log rebuilt the trixie d-i image for the 13.4 point release T420240
[08:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:11] <stashbot>	 T420240: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240
[08:57:23] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1254114|Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]]
[08:57:27] <stashbot>	 T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125
[08:57:37] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11717402 (MoritzMuehlenhoff)
[08:57:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
[08:57:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
[08:58:39] <wikibugs>	 (PS1) Arnaudb: gerrit: cookbook to reboot gerrit primary instance [cookbooks] - https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194)
[08:58:52] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-canary
[09:00:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:47] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1254114|Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:02:51] <stashbot>	 T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125
[09:03:51] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[09:05:13] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe
[09:06:08] <topranks>	 !log increase VRRP priority on eqiad vlans on CR2 to shift active gateway to cr2-eqiad T420180 
[09:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:12] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[09:06:29] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:10:00] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254114|Revert "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] (duration: 12m 36s)
[09:10:05] <stashbot>	 T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125
[09:11:10] <jinxer-wm>	 RESOLVED: [6x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:11:39] <jinxer-wm>	 RESOLVED: [6x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:11:41] <wikibugs>	 (PS1) Effie Mouzeli: site.pp: switch mc-wf hosts to nftables [puppet] - https://gerrit.wikimedia.org/r/1254115 (https://phabricator.wikimedia.org/T398611)
[09:15:06] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc1001.eqiad.wmnet
[09:15:39] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:20:22] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1001.eqiad.wmnet
[09:20:25] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc1002.eqiad.wmnet
[09:21:38] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-eqiad
[09:25:36] <wikibugs>	 (CR) MVernon: [C:-1] "Hi," [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T419663) (owner: Neriah)
[09:25:39] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:41] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc1002.eqiad.wmnet
[09:25:45] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc2001.codfw.wmnet
[09:27:48] <wikibugs>	 (PS1) JavierMonton: stream: mw-content-history-reconcile-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1254118 (https://phabricator.wikimedia.org/T408918)
[09:30:39] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:31:02] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2001.codfw.wmnet
[09:31:07] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[09:36:40] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1254115 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[09:37:17] <wikibugs>	 (CR) Jaime Nuche: [C:+1] releases: remove rsync systemd units when primary server changes [puppet] - https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[09:38:13] <moritzm>	 !log installing openssl bugfix updates on trixie hosts
[09:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:16] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] site.pp: switch mc-wf hosts to nftables [puppet] - https://gerrit.wikimedia.org/r/1254115 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[09:39:39] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable gpu_memory_utilization flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254124 (https://phabricator.wikimedia.org/T418350)
[09:40:05] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[09:42:03] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet
[09:42:12] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:43:34] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-content-history-reconcile-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1254118 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[09:43:34] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+1] "LGTM <3" [deployment-charts] - https://gerrit.wikimedia.org/r/1254124 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[09:45:19] <wikibugs>	 (Merged) jenkins-bot: stream: mw-content-history-reconcile-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1254118 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[09:47:56] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet
[09:47:59] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet
[09:48:20] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update gpt isvc image to one that supports configurable gpu_memory_utilization flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254124 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[09:48:23] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:49:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[09:50:08] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:50:21] <wikibugs>	 (Merged) jenkins-bot: ml-services: update gpt isvc image to one that supports configurable gpu_memory_utilization flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254124 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[09:51:07] <wikibugs>	 SRE-tools, Infrastructure-Foundations: offboard-user: Migrate Phabricator API access from user.query() to user.search() - https://phabricator.wikimedia.org/T420324 (MoritzMuehlenhoff) NEW
[09:52:06] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:52:31] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11717551 (MoritzMuehlenhoff)
[09:53:23] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:53:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
[09:54:08] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet
[09:54:12] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet
[09:54:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2003.codfw.wmnet
[09:54:53] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:55:31] <icinga-wm>	 ACKNOWLEDGEMENT - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2026-03-17 00:36:31 is 3 MiB, but the previous one was 3 MiB, a change of +30.0 % Jcrespo expected - The acknowledgement expires at: 2026-03-24 09:55:13. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[09:55:31] <icinga-wm>	 ACKNOWLEDGEMENT - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2026-03-17 00:40:04 is 3 MiB, but the previous one was 3 MiB, a change of +29.9 % Jcrespo expected - The acknowledgement expires at: 2026-03-24 09:55:13. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[09:56:28] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe
[09:56:44] <topranks>	 !log shift traffic from codfw to eqiad off Arelion CCT to Lumen 
[09:56:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:19] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[09:57:29] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[09:57:54] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:58:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2003.codfw.wmnet
[09:58:37] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:58:50] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3353921) is awaiting input
[09:59:01] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:59:24] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:59:52] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:59:56] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1000)
[10:00:19] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet
[10:00:22] <wikibugs>	 (PS8) Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130)
[10:00:23] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet
[10:00:58] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: per-route jwt overrides (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[10:01:34] <wikibugs>	 (CR) Daniel Kinzler: "I completely changed how this works, no Lua involved anymore." [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[10:01:36] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:01:37] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:01:47] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:02:54] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:03:06] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:03:58] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:04:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:05:16] <wikibugs>	 (PS1) JavierMonton: stream: mw-content-history-reconcile-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918)
[10:06:39] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Phabricator: offboard-user: Migrate Phabricator API access from user.query() to user.search() - https://phabricator.wikimedia.org/T420324#11717595 (Aklapper) FYI pretty similar tasks: https://phabricator.wikimedia.org/maniphest/query/lV7c54v0tL3z/#R
[10:06:42] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet
[10:07:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:08:04] <wikibugs>	 (PS1) JavierMonton: stream: mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918)
[10:08:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet
[10:08:22] <wikibugs>	 (PS1) Arnaudb: gerrit: add a ttl on ProxyPass to jetty [puppet] - https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189)
[10:08:22] <wikibugs>	 (CR) Arnaudb: "no cc for Traffic here; because the issue seem to come from the internal reverse proxy, or less likely from jetty's config. Yesterday, @dz" [puppet] - https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[10:09:11] <moritzm>	 ml-etcd2002 and dse-k8s-ctrl will go down for a Ganeti reboot
[10:09:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
[10:10:53] <icinga-wm>	 PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:12:01] <icinga-wm>	 PROBLEM - Host dse-k8s-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:12:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
[10:12:31] <icinga-wm>	 PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 6 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[10:12:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1003.eqiad.wmnet
[10:13:19] <topranks>	 ^^^ VRRP status is me, it is fine 
[10:13:26] <wikibugs>	 (CR) TChin: [C:+1] stream: mw-content-history-reconcile-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[10:13:35] <wikibugs>	 (CR) TChin: [C:+1] stream: mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[10:14:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
[10:15:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
[10:15:29] <icinga-wm>	 RECOVERY - Host dse-k8s-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 31.91 ms
[10:15:32] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:15:44] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[10:15:55] <icinga-wm>	 RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 32.20 ms
[10:16:14] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service dse-k8s-ctrl2001:6443 has failed probes (http_dse_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#dse-k8s-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:16:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet
[10:16:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1003.eqiad.wmnet
[10:17:14] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-content-history-reconcile-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[10:19:01] <effie>	 !incidents
[10:19:01] <sirenbot>	 7767 (UNACKED)  [2x] ProbeDown sre (dse-k8s-ctrl2001:6443 probes/custom codfw)
[10:19:07] <wikibugs>	 (Merged) jenkins-bot: stream: mw-content-history-reconcile-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254132 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[10:19:11] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:19:11] <effie>	 !ack 7767 
[10:19:12] <sirenbot>	 7767 (ACKED)  [2x] ProbeDown sre (dse-k8s-ctrl2001:6443 probes/custom codfw)
[10:19:18] <urbanecm>	 !log Delete `job/growthexperiments-listtaskcounts-29513771` from mw-cron (job stuck for more than a month)
[10:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:41] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3359051) is awaiting input
[10:20:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:20:32] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:20:51] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:20:51] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudgw2003-dev.codfw.wmnet
[10:21:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:14] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service dse-k8s-ctrl2001:6443 has failed probes (http_dse_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#dse-k8s-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:21:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:38] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:21:52] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[10:24:35] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:25:34] <topranks>	 !log disable EVPN IBGP peering between ssw1-d1-eqiad and ssw1-d8-eqiad T420180
[10:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:38] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[10:26:53] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:27:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:27:14] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2003-dev.codfw.wmnet
[10:27:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:27:38] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudgw2004-dev.codfw.wmnet
[10:28:27] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:29:26] <topranks>	 !log stop announcing directly connected routes to L3 switches from cr1-eqiad T420180 
[10:29:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and ssw1-d8-eqiad (10.64.128.18) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=ssw1-d1-eqiad:9804&var-bgp_group=ibgp_evpn&var-bgp_neighbor=ssw1-d8-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:31:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet
[10:31:53] <topranks>	 ^^^ that core bgp one is me too... silencing 
[10:33:11] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[10:33:38] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2004-dev.codfw.wmnet
[10:34:59] <wikibugs>	 (Merged) jenkins-bot: stream: mw-page-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254135 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[10:35:00] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:35:44] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3359051) is awaiting input
[10:35:49] <topranks>	 ^^ this is not me but a possible spanner in the works 
[10:36:35] <topranks>	 actually that is the as-yet unconfigured 100G circuit, so not an issue 
[10:36:56] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3363050) is awaiting input
[10:37:20] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:37:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[10:38:03] <moritzm>	 aux-k8s-etcd2005 will go down for a Ganeti reboot
[10:38:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet
[10:39:03] <wikibugs>	 (PS4) Elukey: sre.hosts.provision: refactor bios if/else branches [cookbooks] - https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216)
[10:39:03] <wikibugs>	 (PS3) Elukey: sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216)
[10:39:04] <wikibugs>	 (PS7) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[10:39:32] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:39:49] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:39:53] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:40:07] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2005 is DOWN: PING CRITICAL - Packet loss = 100%
[10:40:22] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:40:29] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2005 is UP: PING OK - Packet loss = 0%, RTA = 32.14 ms
[10:40:48] <wikibugs>	 (PS4) Muehlenhoff: Remove obsolete Icinga check [puppet] - https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694)
[10:41:30] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:41:53] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:42:18] <topranks>	 !log cease announcing routed networks from ssw1-d1-eqiad to cr1-eqiad in BGP T420180
[10:42:21] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest-gateway: handle trust level C with invalid token. [deployment-charts] - https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: Daniel Kinzler)
[10:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:22] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[10:43:07] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:43:19] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:43:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet
[10:43:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet
[10:43:42] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: test: check for www-authenticate error [deployment-charts] - https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034)
[10:43:46] <wikibugs>	 (PS1) Fabfur: aptrepo: new haproxy32 component for trixie [puppet] - https://gerrit.wikimedia.org/r/1254146 (https://phabricator.wikimedia.org/T419825)
[10:43:47] <wikibugs>	 (PS2) Daniel Kinzler: rest gateways: tests: check heathz first [deployment-charts] - https://gerrit.wikimedia.org/r/1250637
[10:44:35] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway rate limit: add DENY policy and class (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1250598 (owner: Daniel Kinzler)
[10:44:35] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: handle trust level C with invalid token. [deployment-charts] - https://gerrit.wikimedia.org/r/1252658 (https://phabricator.wikimedia.org/T420106) (owner: Daniel Kinzler)
[10:44:45] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) (owner: Muehlenhoff)
[10:45:04] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:45:12] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:45:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet
[10:45:25] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway rate limit: add DENY policy and class (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1250598 (owner: Daniel Kinzler)
[10:45:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet
[10:46:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet
[10:47:49] <icinga-wm>	 PROBLEM - Host ssw1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[10:48:06] <wikibugs>	 (CR) Kamila Součková: [C:+1] rest gateways: tests: check heathz first [deployment-charts] - https://gerrit.wikimedia.org/r/1250637 (owner: Daniel Kinzler)
[10:48:49] <moritzm>	 kubestagemaster2004 will go down for a Ganeti reboot
[10:48:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet
[10:48:58] <wikibugs>	 (CR) Kamila Součková: [C:+1] rest-gateway: test: check for www-authenticate error [deployment-charts] - https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) (owner: Daniel Kinzler)
[10:49:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet
[10:49:06] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:49:23] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:51:15] <icinga-wm>	 PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100%
[10:51:26] <wikibugs>	 (PS1) Daniel Kinzler: rest gateway: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1254148
[10:51:41] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:52:16] <topranks>	 !enable graceful-shutdown sender for internet BGP peerings on cr1-eqiad 
[10:52:24] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-eqiad
[10:52:53] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest gateway: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1254148 (owner: Daniel Kinzler)
[10:53:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet
[10:54:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet
[10:54:55] <wikibugs>	 (Merged) jenkins-bot: rest gateway: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1254148 (owner: Daniel Kinzler)
[10:54:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2039.codfw.wmnet
[10:55:12] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:55:19] <wikibugs>	 (PS1) Abijeet Patro: Enable ULS rewrite beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187)
[10:55:32] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:56:03] <icinga-wm>	 RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 32.18 ms
[10:56:29] <physikerwelt>	 hashar: Would you mind looking again into https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/1251102 after your restart the test workflow passed, but gate-and-submit failed "10:11:44 Exception: Error cloning https://gerrit.wikimedia.org/r/mediawiki/extensions/WikimediaCampaignEvents to /workspace/src/extensions/WikimediaCampaignEvents"
[10:56:41] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:56:59] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:57:51] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:58:09] <topranks>	 !log prepend external BGP announcements from cr1-eqiad T420180
[10:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:13] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[10:58:40] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:58:58] <wikibugs>	 (PS8) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[10:59:34] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[11:00:12] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:00:32] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:00:58] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:01:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet
[11:01:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet
[11:01:44] <wikibugs>	 (PS1) A smart kitten: Revert "Create tests for NotificationMapper::deleteByUserAndAge" [extensions/Echo] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254155 (https://phabricator.wikimedia.org/T383948)
[11:01:55] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:02:43] <wikibugs>	 (PS1) A smart kitten: Revert "Delete old notifications of users" [extensions/Echo] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254156 (https://phabricator.wikimedia.org/T383948)
[11:03:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet
[11:04:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet
[11:04:22] <wikibugs>	 (CR) CI reject: [V:-1] sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: Elukey)
[11:04:49] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:05:14] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[11:05:22] <wikibugs>	 SRE-swift-storage, observability: Add FileBackend statsd metrics and a dashboard - https://phabricator.wikimedia.org/T217754#11717854 (Aklapper)
[11:05:50] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:06:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet
[11:09:08] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest gateways: tests: check heathz first [deployment-charts] - https://gerrit.wikimedia.org/r/1250637 (owner: Daniel Kinzler)
[11:09:10] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest-gateway: test: check for www-authenticate error [deployment-charts] - https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) (owner: Daniel Kinzler)
[11:09:23] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350)
[11:09:41] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3369380) is awaiting input
[11:11:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet
[11:11:12] <wikibugs>	 (Merged) jenkins-bot: rest gateways: tests: check heathz first [deployment-charts] - https://gerrit.wikimedia.org/r/1250637 (owner: Daniel Kinzler)
[11:11:24] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: test: check for www-authenticate error [deployment-charts] - https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) (owner: Daniel Kinzler)
[11:11:36] <wikibugs>	 (PS9) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[11:13:24] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+1] ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[11:14:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet
[11:15:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet
[11:15:28] <wikibugs>	 (CR) A smart kitten: "('officially' proposing a revert on -wmf.20 per my comments on the task. when -wmf.21 comes along, I would also personally cherry-pick thi" [extensions/Echo] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254156 (https://phabricator.wikimedia.org/T383948) (owner: A smart kitten)
[11:15:29] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[11:16:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet
[11:16:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet
[11:17:27] <wikibugs>	 (PS1) Jelto: gitlab: start ssh-gitlab service after network-online and gitlab [puppet] - https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164)
[11:17:45] <wikibugs>	 (Merged) jenkins-bot: ml-services: update gpt isvc image to one that supports configurable max_model_len flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254157 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[11:18:59] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:19:31] <wikibugs>	 (CR) Arnaudb: [C:+1] "looks good to me! question out of curiosity inline" [puppet] - https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: Jelto)
[11:20:38] <wikibugs>	 (CR) Jelto: gitlab: start ssh-gitlab service after network-online and gitlab (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: Jelto)
[11:21:35] <wikibugs>	 (CR) Btullis: wikidata-platform: wdqs-queryhammer helmfile deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[11:21:52] <wikibugs>	 (PS1) Zabe: Remove misplaced readonly from CategoryViewer::$query [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315)
[11:22:01] <wikibugs>	 (CR) Arnaudb: [C:+1] gitlab: start ssh-gitlab service after network-online and gitlab (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: Jelto)
[11:22:06] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C:+1] Remove misplaced readonly from CategoryViewer::$query [core] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: Zabe)
[11:22:49] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:24:40] <topranks>	 !log reduce local-preference for BGP routes learnt from servers on cr1-eqiad T420180
[11:24:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:44] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[11:27:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet
[11:28:15] <moritzm>	 !log failover Ganeti master in eqsin to ganeti5004
[11:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install7002.wikimedia.org
[11:30:52] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3374664) is awaiting input
[11:31:11] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[11:33:00] <Emperor>	 !log roll-reboot apus frontends (eqiad) for March reboots
[11:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:03] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-cluster
[11:34:38] <wikibugs>	 (03CR) 10Clément Goubert: "Small comments, otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[11:34:44] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[11:36:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install7002.wikimedia.org
[11:38:50] <wikibugs>	 (03CR) 10JMeybohm: cassandra-http-gateway: new chart based on aqs-http-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[11:39:09] <topranks>	 !log stop accepting external routes on ssw1-d1-eqiad from cr1-eqiad T420180
[11:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:13] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[11:39:36] <wikibugs>	 (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[11:39:38] <wikibugs>	 (03CR) 10JMeybohm: "You should bump the version in Chart.yaml to release new version." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[11:40:33] <wikibugs>	 (03CR) 10JMeybohm: "LGTM so far, but we should re-run CI after the parents have been merged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[11:40:38] <wikibugs>	 (03PS2) 10Abijeet Patro: Enable ULS rewrite beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187)
[11:41:15] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker13[00-47].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[11:41:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet
[11:41:43] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3374664) is awaiting input
[11:43:09] <icinga-wm>	 PROBLEM - Host ssw1-d1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: Reboot Presto nodes
[11:43:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet
[11:45:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet
[11:46:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6003.wikimedia.org
[11:47:31] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[11:47:50] <wikibugs>	 (03CR) 10JMeybohm: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: 10Blake)
[11:48:15] <wikibugs>	 (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[11:48:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6003.wikimedia.org
[11:48:40] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-eqiad
[11:49:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5003.wikimedia.org
[11:49:54] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[11:51:48] <wikibugs>	 (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[11:51:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet
[11:52:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet
[11:52:22] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[11:52:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254146 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[11:52:52] <wikibugs>	 (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[11:52:57] <wikibugs>	 (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[11:53:22] <topranks>	 !log reset BGP session to ssw1-d1-eqiad from lsw1-d1-eqiad T420180
[11:53:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:26] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[11:54:23] <topranks>	 !log reset BGP session to ssw1-d1-eqiad from lsw1-d3-eqiad T420180
[11:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:41] <logmsgbot>	 !log jayme@cumin1003 conftool action : set/pooled=no; selector: service=docker-registry,name=(registry1004.eqiad.wmnet|registry2004.codfw.wmnet)
[11:54:47] <wikibugs>	 (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[11:55:32] <jinxer-wm>	 FIRING: [3x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:55:41] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet
[11:55:48] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet
[11:56:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5003.wikimedia.org
[11:56:22] <topranks>	 !log reset BGP session to ssw1-d1-eqiad from lsw1-c2-eqiad T420180
[11:56:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4003.wikimedia.org
[11:58:16] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] aptrepo: new haproxy32 component for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1254146 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[11:58:35] <topranks>	 !log reset BGP session to ssw1-d1-eqiad from lsw1-c3-eqiad T420180
[11:58:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:39] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[11:58:48] <fabfur>	 tnx moritzm 
[11:59:16] <topranks>	 !log reset BGP session to ssw1-d1-eqiad from lsw1-c4-eqiad T420180
[11:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:29] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet
[11:59:44] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet
[12:00:00] <logmsgbot>	 !log jayme@cumin1003 conftool action : set/pooled=yes; selector: service=docker-registry,name=(registry1004.eqiad.wmnet|registry2004.codfw.wmnet)
[12:00:03] <topranks>	 !log reset BGP session to ssw1-d1-eqiad from lsw1-c6-eqiad T420180
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1200)
[12:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:32] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253492 (owner: 10PipelineBot)
[12:00:51] <logmsgbot>	 !log jayme@cumin1003 conftool action : set/pooled=no; selector: service=docker-registry,name=(registry1005.eqiad.wmnet|registry2005.codfw.wmnet)
[12:01:25] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry2005.codfw.wmnet
[12:01:26] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host registry1005.eqiad.wmnet
[12:01:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2041.codfw.wmnet
[12:01:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet
[12:01:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1251080 (https://phabricator.wikimedia.org/T350694) (owner: 10Muehlenhoff)
[12:02:05] <wikibugs>	 (03PS1) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:02:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11718024 (10DMburugu) I approve
[12:03:00] <topranks>	 !log reset BGP session to ssw1-d1-eqiad from lsw1-c7-eqiad T420180
[12:03:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:07] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253492 (owner: 10PipelineBot)
[12:03:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4003.wikimedia.org
[12:04:25] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:04:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3004.wikimedia.org
[12:04:45] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:04:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:05:22] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1005.eqiad.wmnet
[12:05:25] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2005.codfw.wmnet
[12:05:36] <logmsgbot>	 !log jayme@cumin1003 conftool action : set/pooled=yes; selector: service=docker-registry,name=(registry1005.eqiad.wmnet|registry2005.codfw.wmnet)
[12:06:05] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:06:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3004.wikimedia.org
[12:06:39] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[12:06:50] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:06:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2005.wikimedia.org
[12:07:00] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:07:19] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:08:53] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2280-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[12:09:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:10:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2042.codfw.wmnet
[12:11:57] <wikibugs>	 (03PS2) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:12:41] <wikibugs>	 (03PS5) 10Muehlenhoff: Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756
[12:13:01] <topranks>	 !log restart BGP announcements from ssw1-d1-eqiad following change T420180
[12:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:04] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[12:13:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2005.wikimedia.org
[12:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:58] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: dns cache wipe update [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb)
[12:14:10] <icinga-wm>	 RECOVERY - Host ssw1-d1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[12:15:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff)
[12:16:02] <icinga-wm>	 RECOVERY - Host ssw1-d1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[12:16:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1005.wikimedia.org
[12:16:52] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3382740) is awaiting input
[12:17:36] <icinga-wm>	 PROBLEM - Host wikikube-worker1307 is DOWN: PING CRITICAL - Packet loss = 100%
[12:19:03] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: dns cache wipe update [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb)
[12:19:47] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:20:32] <jinxer-wm>	 FIRING: [6x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:20:38] <Emperor>	 !log roll-reboot apus frontends (codfw) for March reboots
[12:20:40] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-cluster
[12:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:41] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620
[12:21:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff)
[12:21:45] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620
[12:21:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet
[12:22:05] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[12:22:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff)
[12:22:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1005.wikimedia.org
[12:23:32] <icinga-wm>	 RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[12:24:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet
[12:24:41] <wikibugs>	 (03PS4) 10Muehlenhoff: Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620
[12:27:07] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3384525) is awaiting input
[12:27:18] <wikibugs>	 (03PS3) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:29:52] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:31:21] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff)
[12:31:24] <wikibugs>	 (03CR) 10Muehlenhoff: Remove support for PHP 7.4/8.1 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff)
[12:31:35] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[12:32:01] <wikibugs>	 (03CR) 10Muehlenhoff: "Good catch, updated" [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff)
[12:32:50] <wikibugs>	 (03PS1) 10Ayounsi: Anycast: prepend once more when peering with the core routers [homer/public] - 10https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342)
[12:32:54] <MatmaRex>	 jouncebot: next
[12:32:54] <jouncebot>	 In 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1300)
[12:34:33] <wikibugs>	 (03PS4) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:34:43] <moritzm>	 !log powercycling ganeti2041 (stuck on reboot)
[12:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:45] <jinxer-wm>	 RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:35:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet
[12:35:47] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: 10Zabe)
[12:36:38] <wikibugs>	 (03PS5) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:36:39] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM.  I don't think it should affect the DNS hosts (they all peer with CRs, so longer path everywhere thus no diff).  I am also a little " [homer/public] - 10https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342) (owner: 10Ayounsi)
[12:37:38] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "Until someone actually brings an evidence that people need this change" [extensions/Echo] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254156 (https://phabricator.wikimedia.org/T383948) (owner: 10A smart kitten)
[12:37:46] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "Until someone actually brings an evidence that people need this change" [extensions/Echo] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254155 (https://phabricator.wikimedia.org/T383948) (owner: 10A smart kitten)
[12:38:04] <moritzm>	 !log powercycling ganeti2042 (stuck on reboot)
[12:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet
[12:39:36] <wikibugs>	 (03PS1) 10Muehlenhoff: toolforge::services::aptly: Remove buster [puppet] - 10https://gerrit.wikimedia.org/r/1254188
[12:39:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2042.codfw.wmnet
[12:40:10] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:40:33] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[12:41:40] <wikibugs>	 (03PS1) 10Esanders: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288)
[12:42:01] <wikibugs>	 (03PS1) 10Esanders: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288)
[12:42:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet
[12:42:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet
[12:42:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[12:43:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[12:44:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2043.codfw.wmnet
[12:44:27] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[12:44:48] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[12:44:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet
[12:45:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1015
[12:46:50] <wikibugs>	 (03PS6) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:46:57] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: Retire package builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/1254191 (https://phabricator.wikimedia.org/T401819)
[12:47:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:47:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT - https://phabricator.wikimedia.org/T420158#11718297 (10cmooney) 05Open→03Declined This won't be required now, we have res...
[12:47:51] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3389356) is awaiting input
[12:48:10] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1001.eqiad.wmnet
[12:48:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:42] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1015
[12:49:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT - https://phabricator.wikimedia.org/T420159#11718300 (10cmooney) 05Open→03Declined This won't be needed now, we were...
[12:49:16] <wikibugs>	 (03PS7) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:50:04] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:50:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1254191 (https://phabricator.wikimedia.org/T401819) (owner: 10Majavah)
[12:50:48] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge: Retire package builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/1254191 (https://phabricator.wikimedia.org/T401819) (owner: 10Majavah)
[12:51:21] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
[12:51:28] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad
[12:51:33] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[12:51:39] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1012
[12:51:52] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 9269
[12:52:19] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:52:28] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3389274) is awaiting input
[12:52:44] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3389356) is awaiting input
[12:52:46] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:52:48] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1012
[12:52:54] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9269
[12:53:13] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:53:13] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28788
[12:53:20] <wikibugs>	 (03Abandoned) 10Muehlenhoff: check_timedatectl: Drop support for old systemd versions [puppet] - 10https://gerrit.wikimedia.org/r/1243824 (owner: 10Muehlenhoff)
[12:53:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251610 (https://phabricator.wikimedia.org/T348255) (owner: 10C. Scott Ananian)
[12:53:51] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 28788
[12:54:02] <moritzm>	 dse-k8s-etcd2003 will go down for a Ganeti reboot
[12:54:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet
[12:54:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet
[12:54:23] <jinxer-wm>	 RESOLVED: GnmiTargetDown: lsw1-e8-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[12:54:55] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 28788
[12:55:03] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1001.eqiad.wmnet
[12:55:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[12:55:23] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet
[12:55:26] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:55:38] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28788
[12:55:45] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 56308
[12:55:59] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-eqiad
[12:56:12] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56308
[12:56:21] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598
[12:56:56] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 214657
[12:57:06] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 214657
[12:57:23] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350)
[12:59:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet
[12:59:50] <wikibugs>	 (03CR) 10Majavah: [C:03+1] toolforge::services::aptly: Remove buster [puppet] - 10https://gerrit.wikimedia.org/r/1254188 (owner: 10Muehlenhoff)
[12:59:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2043.codfw.wmnet
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1300).
[13:00:05] <jouncebot>	 andre, edsanders, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <andre>	 o/
[13:00:16] <andre>	 I'm going to deploy a backport and then move the train
[13:00:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet
[13:00:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet
[13:00:28] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.11 ms
[13:00:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by aklapper@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: 10Zabe)
[13:01:14] <edsanders>	 o/
[13:01:47] <moritzm>	 !log failover Ganeti masters in drmrs to ganeti6003/6004
[13:01:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:06] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet
[13:02:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2044.codfw.wmnet
[13:03:10] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[13:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet
[13:04:10] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[13:04:16] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host apus-be1004.eqiad.wmnet
[13:04:24] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:04:45] <cscott>	 o/
[13:05:12] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:05:20] <wikibugs>	 (03PS5) 10Daniel Kinzler: rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598
[13:05:42] <edsanders>	 andre: are you starting?
[13:05:50] <andre>	 edsanders, yes
[13:06:11] <edsanders>	 👍
[13:07:04] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3393714) is awaiting input
[13:07:11] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports configurable block_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254193 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:07:50] <andre>	 edsanders: backport already in progress for 1254166
[13:07:54] <andre>	 edsanders: Plus I still need to deploy the train to group0 afterwards - not sure about the order though?
[13:07:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet
[13:08:10] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:08:37] <wikibugs>	 (03Merged) 10jenkins-bot: Remove misplaced readonly from CategoryViewer::$query [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254166 (https://phabricator.wikimedia.org/T420315) (owner: 10Zabe)
[13:08:40] <edsanders>	 I don't mind, as long as I can get mine done in the next hour
[13:08:54] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:09:08] <logmsgbot>	 !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1254166|Remove misplaced readonly from CategoryViewer::$query (T420315)]]
[13:09:11] <stashbot>	 T420315: Error: Cannot modify readonly property MediaWiki\Category\CategoryViewer::$query - https://phabricator.wikimedia.org/T420315
[13:09:29] <cscott>	 i'm also in no hurry
[13:09:51] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be1004.eqiad.wmnet
[13:10:18] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 16509
[13:10:34] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet
[13:10:37] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:10:41] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:11:10] <logmsgbot>	 !log aklapper@deploy2002 zabe, aklapper: Backport for [[gerrit:1254166|Remove misplaced readonly from CategoryViewer::$query (T420315)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:11:37] <logmsgbot>	 !log aklapper@deploy2002 zabe, aklapper: Continuing with sync
[13:11:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet
[13:13:17] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419859#11718437 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:13:45] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419858#11718441 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:13:58] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419857#11718445 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:14:04] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419856#11718448 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:14:10] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419855#11718451 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:14:18] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419854#11718454 (10ayounsi) 05Open→03Invalid I go through the karma dashboard from time to time. I prefer to have the peering sessions on...
[13:15:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet
[13:15:32] <andre>	 edsanders: I'd say you deploy your backport(s) after I'm done with my backport, because we would collide in case I need to roll back the train from group0 to the testwikis. I should be done soon with the backport
[13:15:38] <logmsgbot>	 !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254166|Remove misplaced readonly from CategoryViewer::$query (T420315)]] (duration: 06m 31s)
[13:15:42] <stashbot>	 T420315: Error: Cannot modify readonly property MediaWiki\Category\CategoryViewer::$query - https://phabricator.wikimedia.org/T420315
[13:15:48] <moritzm>	 dse-k8s-ctrl2002, kubestagemaster2003, ml-etcd2003 will go down for a Ganeti reboot
[13:15:50] <andre>	 edsanders: Done. The stage is yours for now!
[13:15:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet
[13:15:55] <edsanders>	 thanks
[13:15:57] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet
[13:16:02] <logmsgbot>	 ayounsi@cumin1003 peering (PID 3908523) is awaiting input
[13:16:10] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto an-presto cluster: Reboot Presto nodes
[13:16:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet
[13:16:40] <wikibugs>	 (03CR) 10Kamila Součková: "Does this look more reasonable?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková)
[13:16:52] <edsanders>	 cscott: you go first - I've got a CI issue that is probably meaningless, but I need to double check
[13:17:01] <wikibugs>	 (03PS2) 10Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548)
[13:17:09] <icinga-wm>	 PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:17:29] <cscott>	 edsanders: ok, thanks. i should be quick, it's just config
[13:17:43] <icinga-wm>	 PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:05] <icinga-wm>	 PROBLEM - Host dse-k8s-ctrl2002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:35] <icinga-wm>	 PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251610 (https://phabricator.wikimedia.org/T348255) (owner: 10C. Scott Ananian)
[13:18:48] <wikibugs>	 (03CR) 10Esanders: "recheck" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[13:19:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet
[13:19:37] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2280-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[13:19:39] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:19:43] <wikibugs>	 (03Merged) 10jenkins-bot: Turn on postprocessing cache for all Parsoid parses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251610 (https://phabricator.wikimedia.org/T348255) (owner: 10C. Scott Ananian)
[13:20:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] toolforge::services::aptly: Remove buster [puppet] - 10https://gerrit.wikimedia.org/r/1254188 (owner: 10Muehlenhoff)
[13:20:12] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:20:13] <logmsgbot>	 !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]]
[13:20:18] <stashbot>	 T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[13:20:24] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker13[00-47].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[13:20:29] <wikibugs>	 (03PS3) 10Eevans: cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112)
[13:20:29] <wikibugs>	 (03PS5) 10Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112)
[13:20:29] <wikibugs>	 (03PS6) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112)
[13:20:35] <icinga-wm>	 RECOVERY - Host dse-k8s-ctrl2002 is UP: PING OK - Packet loss = 0%, RTA = 32.28 ms
[13:20:42] <wikibugs>	 (03CR) 10Eevans: cassandra-http-gateway: new chart based on aqs-http-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:20:51] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet
[13:21:03] <icinga-wm>	 RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 32.12 ms
[13:21:05] <icinga-wm>	 RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.15 ms
[13:21:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet
[13:21:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2044.codfw.wmnet
[13:22:19] <logmsgbot>	 !log cscott@deploy2002 cscott: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:23:12] <wikibugs>	 (03PS13) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969)
[13:24:10] <wikibugs>	 (03CR) 10Elukey: "Jesse: the patch series works on the new dse k8s workers, but it will require more testing of course. Lemme know if you like the idea :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey)
[13:25:31] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11718504 (10Wellverywell) Is there some ETA on whether thumbnail cache cleanup will happen on Commons? User @Medvednikita has reported that [[ https://commons.wikimedia.org/wiki/File%...
[13:25:31] <wikibugs>	 (03Abandoned) 10Eevans: service, trafficserver: Prepare "linked-artifacts" k8s pod [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto)
[13:25:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[13:25:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet
[13:26:10] <wikibugs>	 (03PS1) 10Fabfur: haproxy: test haproxy32 on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825)
[13:26:23] <wikibugs>	 (03Abandoned) 10Eevans: hoarde: initial commit of chart (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249978 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:26:29] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet
[13:26:41] <logmsgbot>	 !log cscott@deploy2002 cscott: Continuing with sync
[13:26:50] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet
[13:26:59] <wikibugs>	 (03Abandoned) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249979 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:27:26] <wikibugs>	 (03PS2) 10Fabfur: haproxy: test haproxy32 on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825)
[13:28:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet
[13:30:44] <logmsgbot>	 !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]] (duration: 10m 31s)
[13:30:48] <stashbot>	 T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[13:30:54] <cscott>	 ok, over to you edsanders 
[13:30:59] <edsanders>	 ta
[13:31:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet
[13:31:08] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3399351) is awaiting input
[13:31:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[13:31:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[13:32:16] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet
[13:32:19] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2045.codfw.wmnet
[13:32:33] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host apus-be2004.codfw.wmnet
[13:32:41] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[13:33:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet
[13:33:35] <wikibugs>	 (03Abandoned) 10Fabfur: haproxy: adding haproxy30 component and support [puppet] - 10https://gerrit.wikimedia.org/r/1041647 (https://phabricator.wikimedia.org/T366885) (owner: 10Fabfur)
[13:35:09] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on logstash2023.codfw.wmnet with reason: ganeti reboot
[13:35:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[13:35:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[13:36:02] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3399694) is awaiting input
[13:36:28] <wikibugs>	 (03Merged) 10jenkins-bot: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254189 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[13:36:30] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350)
[13:36:44] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3400011) is awaiting input
[13:37:11] <wikibugs>	 (03Merged) 10jenkins-bot: TitleWidget: Prioritise namespace prefix over interwiki prefix [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254190 (https://phabricator.wikimedia.org/T420288) (owner: 10Esanders)
[13:37:43] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1254189|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]], [[gerrit:1254190|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]]
[13:37:47] <stashbot>	 T420288: VisualEditor link tool is confusing project namespace and interwiki links - https://phabricator.wikimedia.org/T420288
[13:38:39] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be2004.codfw.wmnet
[13:38:58] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet
[13:39:43] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1254189|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]], [[gerrit:1254190|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:40:23] <wikibugs>	 (03PS1) 10Andrew Bogott: eqiad1: move to a newer Horizon build [puppet] - 10https://gerrit.wikimedia.org/r/1254200 (https://phabricator.wikimedia.org/T405117)
[13:41:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] eqiad1: move to a newer Horizon build [puppet] - 10https://gerrit.wikimedia.org/r/1254200 (https://phabricator.wikimedia.org/T405117) (owner: 10Andrew Bogott)
[13:42:00] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[13:43:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351 (10cmooney) 03NEW p:05Triage→03Medium
[13:43:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351#11718588 (10cmooney)
[13:43:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11718589 (10cmooney)
[13:43:50] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:44:29] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:44:36] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet
[13:45:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff)
[13:45:52] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254189|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]], [[gerrit:1254190|TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288)]] (duration: 08m 10s)
[13:45:56] <stashbot>	 T420288: VisualEditor link tool is confusing project namespace and interwiki links - https://phabricator.wikimedia.org/T420288
[13:46:37] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: disable AMDGCN_USE_BUFFER_OPS in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254198 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:46:56] <andre>	 edsanders: are you done?
[13:49:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11718638 (10Papaul)
[13:49:58] <andre>	 edsanders: I assume yes, so I'm going to deploy wmf.20 to group0 now
[13:50:24] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254203 (https://phabricator.wikimedia.org/T413811)
[13:50:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254203 (https://phabricator.wikimedia.org/T413811) (owner: 10TrainBranchBot)
[13:50:43] <moritzm>	 dse-k8s-ctrl2001, aux-k8s-etcd2003 will go down for a Ganeti reboot
[13:50:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
[13:50:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet
[13:51:01] <logmsgbot>	 btullis@cumin1003 reboot-workers (PID 3894227) is awaiting input
[13:51:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:51:19] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254203 (https://phabricator.wikimedia.org/T413811) (owner: 10TrainBranchBot)
[13:51:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:52:20] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet
[13:53:03] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:53:05] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:53:46] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: 10Abijeet Patro)
[13:54:43] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:55:31] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 32.09 ms
[13:55:37] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.22 ms
[13:56:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:56:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
[13:56:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet
[13:56:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:57:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet
[13:57:12] <logmsgbot>	 !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.20  refs T413811
[13:57:16] <stashbot>	 T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[13:57:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet
[13:57:30] <wikibugs>	 (03PS1) 10Ottomata: Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495)
[13:57:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet
[13:58:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata)
[13:58:43] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
[14:00:05] <jouncebot>	 Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1400)
[14:00:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin2003.codfw.wmnet
[14:01:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:01:34] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
[14:01:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:02:52] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11718729 (10Ladsgroup) That is actually unrelated to this work and is about {T360589} and {T414805}. See https://www.mediawiki.org/wiki/Common_thumbnail_sizes
[14:03:56] <godog>	 the cloud codfw alerts are me, rolling reboots in progress
[14:04:43] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:05:09] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:05:14] <topranks>	 !log setting cr1-eqiad as VRRP master for all vlans T420351
[14:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:18] <stashbot>	 T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351
[14:05:30] <wikibugs>	 (03PS1) 10Ottomata: eventstreams-internal - increase kafka max message size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254210 (https://phabricator.wikimedia.org/T420356)
[14:06:25] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:06:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2003.codfw.wmnet
[14:07:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2004.codfw.wmnet
[14:07:43] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:08:33] <wikibugs>	 10SRE-tools, 10Cumin, 06Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360 (10fgiunchedi) 03NEW
[14:09:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:10:09] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:10:34] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
[14:10:38] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
[14:11:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2004.codfw.wmnet
[14:11:25] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:11:30] <moritzm>	 !log powercycling ganeti2046 (stuck on reboot)
[14:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org
[14:13:43] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:13:43] <topranks>	 !log disable VRRP on cr2-eqiad interfaces facing ssw1-d8-eqiad T420351
[14:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:47] <stashbot>	 T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351
[14:14:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet
[14:14:09] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:14:09] <jinxer-wm>	 FIRING: [3x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2004-dev (172.20.5.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:14:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2046.codfw.wmnet
[14:15:32] <jinxer-wm>	 FIRING: [7x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:16:03] <TheresNoTime>	 jouncebot: nowandnext
[14:16:03] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1400)
[14:16:04] <jouncebot>	 In 0 hour(s) and 13 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:16:04] <jouncebot>	 In 0 hour(s) and 13 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:16:06] <wikibugs>	 (03PS1) 10Majavah: cr-cloud-vrf: Narrowly scope (cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996)
[14:16:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2047.codfw.wmnet
[14:16:25] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:16:43] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:16:54] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:17:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org
[14:17:35] <icinga-wm>	 PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 6 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[14:19:09] <jinxer-wm>	 FIRING: [10x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:19:36] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
[14:19:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:19:40] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2005-dev.codfw.wmnet
[14:19:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet
[14:19:54] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11718805 (10fgiunchedi) I also was wondering about a resumable rolling reboot feature for cookbooks and found this task, and of course I'm +1! The way I understand the feat...
[14:19:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet
[14:20:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:25] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:21:57] <icinga-wm>	 PROBLEM - Host ml-staging-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:22:59] <wikibugs>	 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations: Edit puppet-merge to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org? - https://phabricator.wikimedia.org/T420184#11718810 (10ABran-WMF)
[14:23:29] <wikibugs>	 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations: Edit puppet-merge to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org? - https://phabricator.wikimedia.org/T420184#11718811 (10ABran-WMF) p:05Triage→03Low
[14:23:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet
[14:24:52] <wikibugs>	 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11718825 (10ABran-WMF)
[14:25:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet
[14:25:29] <icinga-wm>	 RECOVERY - Host ml-staging-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.73 ms
[14:25:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2047.codfw.wmnet
[14:25:33] <urbanecm>	 jouncebot: nowandnext
[14:25:33] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1400)
[14:25:33] <jouncebot>	 In 0 hour(s) and 4 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:25:33] <jouncebot>	 In 0 hour(s) and 4 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:25:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2049.codfw.wmnet
[14:27:16] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2005-dev.codfw.wmnet
[14:27:20] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2006-dev.codfw.wmnet
[14:27:29] <topranks>	 !log de-pref internet circuits landing on cr2-eqiad to shift traffic to cr1 T420351
[14:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:32] <stashbot>	 T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351
[14:27:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah)
[14:27:35] <moritzm>	 aux-k8s-etcd2004 will go down for a Ganeti reboot
[14:27:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet
[14:27:53] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] cr-cloud-vrf: Narrowly scope (cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah)
[14:28:03] <wikibugs>	 (03CR) 10Majavah: [C:03+2] cr-cloud-vrf: Narrowly scope (cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah)
[14:29:23] <wikibugs>	 (03Merged) 10jenkins-bot: cr-cloud-vrf: Narrowly scope (cloud)cumin firewall exemption [homer/public] - 10https://gerrit.wikimedia.org/r/1254211 (https://phabricator.wikimedia.org/T419996) (owner: 10Majavah)
[14:29:47] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:30:05] <jouncebot>	 Daimona: Your horoscope predicts another Create new table for the CampaignEvents extension deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430).
[14:30:11] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2005.codfw.wmnet
[14:30:32] <jinxer-wm>	 FIRING: [9x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:30:37] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms
[14:30:55] <Daimona>	 Oh wow that's a very specific horoscope
[14:31:52] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org
[14:31:53] <wikibugs>	 (03PS1) 10Muehlenhoff: apereo_cas: Drop obsolete test [puppet] - 10https://gerrit.wikimedia.org/r/1254212
[14:33:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet
[14:33:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2049.codfw.wmnet
[14:34:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2050.codfw.wmnet
[14:34:28] <Daimona>	 !log Creating ce_event_goals DB table for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T411433
[14:34:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1006.eqiad.wmnet
[14:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:32] <stashbot>	 T411433: Create new database table for event goals - https://phabricator.wikimedia.org/T411433
[14:35:00] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:35:09] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2006-dev.codfw.wmnet
[14:35:13] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet
[14:35:58] <Daimona>	 And I'm done.
[14:36:23] <James_F>	 DB creation takes so little time. DB migration takes forever. Alas.
[14:36:35] <wikibugs>	 (03CR) 10Jelto: "looks good to me, one comment in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb)
[14:36:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet
[14:37:57] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2005.codfw.wmnet
[14:38:01] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2006.codfw.wmnet
[14:38:03] <wikibugs>	 (03PS1) 10Majavah: definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215
[14:38:09] <inflatador>	 !log bking@requestctl remove `wdqs_highest_error_rate_ever_seen` requestctl rule as it is no longer needed
[14:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:19] <wikibugs>	 (03PS1) 10Sergio Gimeno: GrowthExperiments: increase edit and thanks query limit II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599)
[14:38:24] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah)
[14:38:35] <wikibugs>	 (03CR) 10Majavah: [C:03+2] definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah)
[14:38:40] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:43] <Dreamy_Jazz>	 jouncebot: nowandnext
[14:38:44] <jouncebot>	 For the next 0 hour(s) and 21 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:38:44] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:38:44] <jouncebot>	 In 0 hour(s) and 21 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500)
[14:39:02] <Dreamy_Jazz>	 (Just checking for when the next puppet window is)
[14:39:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM, confirmed the IPs are in Netbox with the same fqdn in the "dns_name" field so I believe that should cover it, no need for a static d" [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah)
[14:39:34] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org
[14:39:44] <Dreamy_Jazz>	 Actually, I do want to deploy a no-op config patch
[14:40:00] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:13] <wikibugs>	 (03Merged) 10jenkins-bot: definitions: Remove duplicate definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1254215 (owner: 10Majavah)
[14:40:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1006.eqiad.wmnet
[14:40:44] <topranks>	 !log disabling EVPN IBGP peering from ssw1-d8-eqiad to ssw1-d1-eqiad to stop them reflecting routes T420351
[14:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:47] <stashbot>	 T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351
[14:41:14] <wikibugs>	 (03PS1) 10Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063)
[14:41:45] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet
[14:41:49] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2001-dev.codfw.wmnet
[14:42:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet
[14:42:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2050.codfw.wmnet
[14:42:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz)
[14:42:25] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab: start ssh-gitlab service after network-online and gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1254162 (https://phabricator.wikimedia.org/T420164) (owner: 10Jelto)
[14:43:07] <wikibugs>	 (03PS2) 10Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063)
[14:43:25] <moritzm>	 !log failover Ganeti master in codfw to ganeti2047
[14:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet
[14:44:17] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet
[14:44:18] <topranks>	 !log stop announcing "direct" routes to ssw1-d8-eqiad from cr2-eqiad T420351
[14:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz)
[14:44:33] <taavi>	 !log deploying cr firewall changes from https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1254211
[14:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet
[14:45:07] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2048 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[14:45:08] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11718942 (10JMeybohm) I think usability wise it might be more helpful to have an argument which takes the date and time after which a reboot is expected. So something like...
[14:45:30] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org
[14:45:33] <jinxer-wm>	 FIRING: [5x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:46:26] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2006.codfw.wmnet
[14:46:30] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2007.codfw.wmnet
[14:46:54] <jinxer-wm>	 FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:47:57] <icinga-wm>	 PROBLEM - Host gitlab-replica-b.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:22] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet
[14:48:25] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2002-dev.codfw.wmnet
[14:48:40] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] k8s-staging: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1240275 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm)
[14:49:06] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11718960 (10RobH) The distro swap did not fix this host, it will require a mainboard swap via a procurement task (linked in)
[14:49:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2034.codfw.wmnet
[14:49:58] <topranks>	 !log stop announcing routes from ssw1-d8-eqiad to external peers (cr2-eqiad, other spines) T420351
[14:49:59] <wikibugs>	 (03PS2) 10Bking: dse-k8s: Add CFSSL profile for longer-lived certificates (6 mo). [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289)
[14:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host durum4004.ulsfo.wmnet
[14:50:02] <stashbot>	 T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351
[14:51:31] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org
[14:51:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251482 (owner: 10Muehlenhoff)
[14:51:50] <wikibugs>	 (03CR) 10Scott French: shellbox: Setup shellbox-icu72 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková)
[14:51:52] <wikibugs>	 (03PS2) 10Btullis: Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata)
[14:51:54] <topranks>	 !log stop accepting routes on ssw1-d8-eqiad from external peers (cr2-eqiad, other spines) T420351
[14:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:07] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata)
[14:52:40] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2007.codfw.wmnet
[14:52:44] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2008.codfw.wmnet
[14:52:59] <icinga-wm>	 RECOVERY - Host gitlab-replica-b.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[14:53:13] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254212 (owner: 10Muehlenhoff)
[14:53:40] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:53:48] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[14:53:59] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[14:54:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum4004.ulsfo.wmnet
[14:54:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] apereo_cas: Drop obsolete test [puppet] - 10https://gerrit.wikimedia.org/r/1254212 (owner: 10Muehlenhoff)
[14:54:29] <icinga-wm>	 PROBLEM - Host ssw1-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:29] <icinga-wm>	 PROBLEM - Host ssw1-d8-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:46] <wikibugs>	 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11719008 (10Jhancock.wm) soft rebooted the idrac
[14:54:50] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1247033 (owner: 10Muehlenhoff)
[14:55:16] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet
[14:55:20] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2003-dev.codfw.wmnet
[14:55:32] <jinxer-wm>	 FIRING: [4x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:55:57] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me. Should we run PCC against the PKI hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking)
[14:56:05] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719023 (10RobH)
[14:56:11] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719026 (10RobH)
[14:56:21] <wikibugs>	 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11719028 (10RobH)
[14:57:08] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway rate limit: add BYPASS and DENY policy and class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler)
[14:57:21] <wikibugs>	 06SRE, 06Traffic: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868#11719032 (10MoritzMuehlenhoff) >>! In T419868#11713955, @ssingh wrote: > That's interesting, thanks for debugging. What is weird is that a restart of anycast-healthchecker then should have fixed this in th...
[14:57:21] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host phab1004.eqiad.wmnet
[14:57:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2034.codfw.wmnet
[14:57:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet
[14:57:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] cloudceph: Run the spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1247033 (owner: 10Muehlenhoff)
[14:58:03] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff)
[14:58:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Increase the kafka-jumbo maximum message size to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1254205 (https://phabricator.wikimedia.org/T419495) (owner: 10Ottomata)
[14:58:35] <wikibugs>	 (03PS3) 10Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063)
[14:58:38] <taavi>	 I am getting a "Error: 502, Broken pipe" from Phabricator
[14:58:43] <XioNoX>	 jelto:  3 min too early? :)
[14:58:51] <XioNoX>	 taavi: maintenance planned for 3pm UTC
[14:59:01] <jelto>	 yes :) Phabricator needs a short restart 
[14:59:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff)
[14:59:54] <Dreamy_Jazz>	 jouncebot: nowandnext
[14:59:54] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1430)
[14:59:54] <jouncebot>	 In 0 hour(s) and 0 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500)
[15:00:01] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:00:01] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500)
[15:00:01] <jouncebot>	 In 0 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600)
[15:00:05] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500).
[15:00:26] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2008.codfw.wmnet
[15:00:30] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be2009.codfw.wmnet
[15:00:32] <jinxer-wm>	 FIRING: [4x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:00:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz)
[15:02:05] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet
[15:02:07] <Dreamy_Jazz>	 (My config change is a no-op to prod and needs to happen before the puppet request window)
[15:02:09] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet
[15:02:10] <wikibugs>	 (03Merged) 10jenkins-bot: Create dblists for wikis where CheckUser and AbuseFilter are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: 10Dreamy Jazz)
[15:02:18] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:02:40] <topranks>	 !log reset BGP session to ssw1-d8-eqiad from lsw1-d4-eqiad T420180
[15:02:40] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1254217|Create dblists for wikis where CheckUser and AbuseFilter are disabled (T420063 T420062)]]
[15:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:43] <stashbot>	 T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[15:02:48] <stashbot>	 T420063: Uninstall AbuseFilter from wikis which are closed and have no AbuseLog entries - https://phabricator.wikimedia.org/T420063
[15:02:48] <stashbot>	 T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062
[15:03:20] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab1004.eqiad.wmnet
[15:04:16] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7014.magru.wmnet with OS trixie
[15:04:39] <jelto>	 Phabricator maintenance finished
[15:04:43] <icinga-wm>	 RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[15:04:47] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:04:51] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1254217|Create dblists for wikis where CheckUser and AbuseFilter are disabled (T420063 T420062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:05:18] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy
[15:05:19] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[15:05:51] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy
[15:06:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:06:54] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:07:47] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:08:07] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2033 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[15:08:10] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2009.codfw.wmnet
[15:08:23] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@e845707]: deploy phab2002 for T420366
[15:08:27] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet
[15:08:27] <stashbot>	 T420366: Deploy Phab/Phorge 2026-03-17 - https://phabricator.wikimedia.org/T420366
[15:08:31] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet
[15:08:59] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@e845707]: deploy phab2002 for T420366 (duration: 00m 35s)
[15:09:09] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:09:19] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254217|Create dblists for wikis where CheckUser and AbuseFilter are disabled (T420063 T420062)]] (duration: 06m 38s)
[15:09:20] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@e845707]: deploy phab1004 for T420366
[15:09:24] <stashbot>	 T420063: Uninstall AbuseFilter from wikis which are closed and have no AbuseLog entries - https://phabricator.wikimedia.org/T420063
[15:09:24] <stashbot>	 T420062: Uninstall PSI extensions on closed wikis which are not needed - https://phabricator.wikimedia.org/T420062
[15:09:29] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1005.eqiad.wmnet
[15:09:40] <wikibugs>	 (03PS1) 10JMeybohm: Revert "k8s-staging: Switch to IPIP mode for kube-apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/1254224
[15:09:55] <icinga-wm>	 RECOVERY - Host ssw1-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[15:09:55] <icinga-wm>	 RECOVERY - Host ssw1-d8-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[15:10:07] <wikibugs>	 (03PS2) 10JMeybohm: Revert "k8s-staging: Switch to IPIP mode for kube-apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/1254224 (https://phabricator.wikimedia.org/T352956)
[15:10:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm)
[15:10:22] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@e845707]: deploy phab1004 for T420366 (duration: 01m 02s)
[15:10:23] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:10:41] <icinga-wm>	 RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[15:10:47] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:11:03] <wikibugs>	 (03Abandoned) 10JMeybohm: Revert "k8s-staging: Switch to IPIP mode for kube-apiserver" [puppet] - 10https://gerrit.wikimedia.org/r/1254224 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm)
[15:11:03] <wikibugs>	 (03PS3) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655
[15:11:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:11:17] <wikibugs>	 (03Merged) 10jenkins-bot: cleanup: Growth: Remove temporary GrowthMentorList overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244723 (https://phabricator.wikimedia.org/T418518) (owner: 10Urbanecm)
[15:11:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron)
[15:11:45] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244723|cleanup: Growth: Remove temporary GrowthMentorList overrides (T418518)]]
[15:11:49] <stashbot>	 T418518: Remove code for legacy GrowthMentorList validator - https://phabricator.wikimedia.org/T418518
[15:11:53] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: cookbook to reboot gerrit primary instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194)
[15:12:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:12:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2048.codfw.wmnet
[15:12:43] <wikibugs>	 (03PS4) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655
[15:13:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[15:13:44] <wikibugs>	 (03CR) 10Arnaudb: gerrit: cookbook to reboot gerrit primary instance (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb)
[15:13:47] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:13:52] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1244723|cleanup: Growth: Remove temporary GrowthMentorList overrides (T418518)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:14:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420351#11719150 (10cmooney) 05Open→03Resolved Ok this work is now complete.  Only had to reset the tunnel on `lsw1-d4-eqiad` it w...
[15:14:09] <jinxer-wm>	 FIRING: [10x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:14:15] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[15:14:21] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet
[15:14:52] <wikibugs>	 (03PS1) 10Dreamy Jazz: maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052)
[15:15:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11719160 (10cmooney) p:05Medium→03Low Ok all vxlan tunnels right now on row c/d leaf switches to ssw1-d1-eqiad and ssw1-d8-eqiad have a valid vxlan tunnel id.  So u...
[15:15:58] <wikibugs>	 (03PS2) 10Dreamy Jazz: mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052)
[15:16:09] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16509
[15:16:10] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:16:25] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:17:15] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3419798) is awaiting input
[15:17:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420308#11719165 (10Jhancock.wm) related to T419970. will clear soon
[15:18:17] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244723|cleanup: Growth: Remove temporary GrowthMentorList overrides (T418518)]] (duration: 06m 32s)
[15:18:21] <stashbot>	 T418518: Remove code for legacy GrowthMentorList validator - https://phabricator.wikimedia.org/T418518
[15:18:28] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:18:43] <wikibugs>	 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11719169 (10Aklapper) I see an entry `ipmi_sdr_cache_open: internal IPMI error` for `phab2002` a...
[15:19:20] <wikibugs>	 (03PS3) 10Dreamy Jazz: mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052)
[15:19:29] <wikibugs>	 (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz)
[15:20:23] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1005.eqiad.wmnet
[15:20:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11719172 (10MoritzMuehlenhoff)
[15:20:26] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1006.eqiad.wmnet
[15:20:56] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2008-dev.codfw.wmnet
[15:21:16] <wikibugs>	 (03CR) 10Herron: systemd::timer::job: add ExecCondition support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron)
[15:21:34] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:22:07] <TheresNoTime>	 jouncebot: nowandnext
[15:22:07] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1500)
[15:22:07] <jouncebot>	 In 0 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600)
[15:22:18] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad
[15:23:59] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:25:49] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[15:27:14] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2008-dev.codfw.wmnet
[15:27:15] <wikibugs>	 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11719221 (10RLazarus) One more axis to consider: Best-practices-wise, for alerting on Kubernetes platforms, there's a distinction between control plane and...
[15:27:34] <logmsgbot>	 !log samtar@deploy2002 mwscript-k8s job started: cleanupWatchlistLabelMember.php --wiki=testwiki  # T420328
[15:27:37] <stashbot>	 T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328
[15:27:41] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet with reason: host reimage
[15:28:23] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1006.eqiad.wmnet
[15:28:27] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1007.eqiad.wmnet
[15:29:15] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: cookbook to reboot gerrit primary instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb)
[15:31:54] <jinxer-wm>	 FIRING: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:32:08] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes
[15:33:30] <wikibugs>	 (03PS2) 10JMeybohm: kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956)
[15:33:30] <wikibugs>	 (03PS1) 10JMeybohm: realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956)
[15:33:43] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm)
[15:33:55] <logmsgbot>	 !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist testwikis cleanupWatchlistLabelMember.php  # T420328
[15:33:59] <stashbot>	 T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328
[15:34:01] <moritzm>	 dse-k8s-etcd2002 will go down for a Ganeti reboot
[15:34:04] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet with reason: host reimage
[15:34:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet
[15:34:09] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[15:34:28] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: cookbook to reboot gerrit primary instance [cookbooks] - 10https://gerrit.wikimedia.org/r/1254113 (https://phabricator.wikimedia.org/T420194) (owner: 10Arnaudb)
[15:35:30] <TheresNoTime>	 (kudos to whoever made `mwscript-k8s` accept a dblist <3)
[15:36:06] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:51] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1007.eqiad.wmnet
[15:36:55] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1008.eqiad.wmnet
[15:37:21] <logmsgbot>	 btullis@cumin1003 reimage (PID 3929598) is awaiting input
[15:37:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet
[15:38:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2048.codfw.wmnet
[15:38:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet
[15:39:23] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler)
[15:40:01] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz)
[15:40:37] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms
[15:41:21] <urbanecm>	 TheresNoTime: in more ways than one, even! :)
[15:43:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2033.codfw.wmnet
[15:43:40] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:44:48] <logmsgbot>	 !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist group0 cleanupWatchlistLabelMember.php  # T420328
[15:44:52] <stashbot>	 T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328
[15:45:24] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1008.eqiad.wmnet
[15:45:28] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host thanos-be1009.eqiad.wmnet
[15:46:28] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host zuul2003.codfw.wmnet with OS trixie
[15:46:51] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[15:48:40] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:49:08] <rzl>	 TheresNoTime: <3
[15:51:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2033.codfw.wmnet
[15:51:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2033.codfw.wmnet
[15:51:14] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11719397 (10ayounsi) a:03Gehel @Gehel as the approver of the analytics-wmde-users group, do you approve this request?
[15:51:23] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11719399 (10ayounsi)
[15:51:28] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "oops, I reversed the logic. This is supposed to exist on all servers EXCEPT the primary, but this is the opposite." [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn)
[15:52:52] <logmsgbot>	 !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist group1 cleanupWatchlistLabelMember.php  # T420328
[15:52:55] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1242255 (owner: 10Muehlenhoff)
[15:52:56] <stashbot>	 T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328
[15:53:00] <wikibugs>	 (03PS1) 10Ayounsi: Add benbuchenau to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1254237 (https://phabricator.wikimedia.org/T419878)
[15:53:14] <wikibugs>	 (03PS3) 10Bking: dse-k8s: Add CFSSL profile for longer-lived certificates (6 mo). [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289)
[15:53:26] <wikibugs>	 (03PS3) 10Dzahn: releases: remove rsync systemd units when primary server changes [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246)
[15:53:37] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking)
[15:54:11] <mutante>	 !log zuul2003 - reimaging with trixie 
[15:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:38] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1009.eqiad.wmnet
[15:55:40] <wikibugs>	 (03PS2) 10JMeybohm: realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956)
[15:55:40] <wikibugs>	 (03PS3) 10JMeybohm: kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956)
[15:55:52] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm)
[15:56:51] <jinxer-wm>	 RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[15:59:54] <wikibugs>	 (03CR) 10Bking: "Yes, just ran it." [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking)
[16:00:05] <jouncebot>	 jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600)
[16:00:05] <jouncebot>	 phuedx and Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:10] <Dreamy_Jazz>	 \o
[16:00:18] <jhathaway>	 o/
[16:01:21] <jhathaway>	 Dreamy_Jazz: I'll merge; anything else you need?
[16:01:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Test pki1002 on ganeti-test [puppet] - 10https://gerrit.wikimedia.org/r/1254242 (https://phabricator.wikimedia.org/T416664)
[16:01:43] <Dreamy_Jazz>	 No, should be fine to just merge
[16:01:46] <Dreamy_Jazz>	 Thanks
[16:01:54] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] mw::maintenance: Disable scripts for closed wikis on various extensions [puppet] - 10https://gerrit.wikimedia.org/r/1254225 (https://phabricator.wikimedia.org/T420052) (owner: 10Dreamy Jazz)
[16:03:13] <jhathaway>	 Dreamy_Jazz: done
[16:03:16] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp700[5-8].magru.wmnet} and A:cp
[16:03:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719497 (ayounsi) @thcipriani as approval contact for the `deployment` group, do you approve this request ?  @scardenasmolinar can you read and sign https://phabricator.wikimedia.org...
[16:03:57] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7013.magru.wmnet,cp701[5-6].magru.wmnet} and A:cp
[16:04:00] <Dreamy_Jazz>	 Thanks
[16:04:25] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719498 (ayounsi)
[16:05:02] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7014.magru.wmnet with OS trixie
[16:05:26] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7014.magru.wmnet with OS trixie
[16:07:49] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[16:08:22] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul2003.codfw.wmnet with reason: host reimage
[16:08:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet
[16:14:31] <TheresNoTime>	 jouncebot: nowandnext
[16:14:31] <jouncebot>	 For the next 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1600)
[16:14:31] <jouncebot>	 In 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1700)
[16:14:43] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7005.magru.wmnet
[16:15:03] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2003.codfw.wmnet with reason: host reimage
[16:15:44] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7013.magru.wmnet
[16:15:55] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254242 (https://phabricator.wikimedia.org/T416664) (owner: Muehlenhoff)
[16:16:16] <wikibugs>	 (CR) BCornwall: [C:+2] hiera: Remove single_backend from codfw [puppet] - https://gerrit.wikimedia.org/r/1253605 (https://phabricator.wikimedia.org/T401832) (owner: BCornwall)
[16:16:20] <wikibugs>	 (CR) BCornwall: [V:+1 C:+2] hiera: Set default codfw storage_elements [puppet] - https://gerrit.wikimedia.org/r/1253606 (https://phabricator.wikimedia.org/T401832) (owner: BCornwall)
[16:16:49] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:16:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:17:42] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet
[16:17:54] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet
[16:17:57] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1253631/8287/" [puppet] - https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[16:17:59] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:18:03] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad
[16:18:05] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet
[16:18:11] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet
[16:18:33] <wikibugs>	 (CR) Dzahn: [C:+2] releases: remove rsync systemd units when primary server changes [puppet] - https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[16:18:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:18:45] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:20:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:37] <wikibugs>	 (CR) Vgutierrez: [C:+1] realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: JMeybohm)
[16:20:49] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350)
[16:21:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719644 (ayounsi) a:thcipriani
[16:21:47] <wikibugs>	 (CR) Muehlenhoff: systemd::timer::job: add ExecCondition support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1253655 (owner: Herron)
[16:23:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-esams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr2-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:24:45] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11719697 (Fabfur) Procedure from the traffic perspective should be roughly   - Depool ulsfo (around 0900UTC) and wait about 30' for all connections to...
[16:25:01] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1306,1308-1311].eqiad.wmnet
[16:25:03] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1306,1308-1311].eqiad.wmnet
[16:25:47] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet
[16:25:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet
[16:25:52] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on releases2003.codfw.wmnet with reason: T420246
[16:25:55] <stashbot>	 T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246
[16:26:20] <wikibugs>	 (CR) Dzahn: [C:+2] "noop on releases1003 - releases2003 now has an issue with stunnel - something is not fully removed - TBD" [puppet] - https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[16:27:55] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11719717 (MoritzMuehlenhoff)
[16:28:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet
[16:28:18] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet
[16:28:22] <wikibugs>	 (CR) Dzahn: [C:+2] "apparently the combination of "server_uses_stunnel => true" with "ensure => absent" is an issue" [puppet] - https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[16:28:29] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7014.magru.wmnet with reason: host reimage
[16:29:26] <wikibugs>	 (PS2) Kevin Bazira: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350)
[16:29:46] <wikibugs>	 (PS4) Ryan Kemper: profile::pyrra: remove old wdqs SLO configs [puppet] - https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) (owner: Elukey)
[16:31:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11719772 (thcipriani) Reason for access makes sense, approved for `deployment` group membership.  ---  @Scardenasmolinar some additional bits for you:  - Our web deploy tool [[https:/...
[16:32:36] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) (owner: Elukey)
[16:32:51] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for wikikube-worker[1306,1308-1311].eqiad.wmnet
[16:32:55] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker[1306,1308-1311].eqiad.wmnet
[16:33:16] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7014.magru.wmnet with reason: host reimage
[16:33:29] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul2003.codfw.wmnet with OS trixie
[16:33:40] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:27] <logmsgbot>	 !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist group2 cleanupWatchlistLabelMember.php  # T420328
[16:34:31] <stashbot>	 T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328
[16:35:20] <logmsgbot>	 btullis@cumin1003 reimage (PID 3929598) is awaiting input
[16:35:23] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "PCC looks good" [puppet] - https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: Bking)
[16:35:52] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet
[16:35:55] <wikibugs>	 (PS3) Kevin Bazira: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350)
[16:36:12] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet
[16:36:33] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet
[16:36:36] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet
[16:36:48] <wikibugs>	 (CR) Ryan Kemper: [C:+2] profile::pyrra: remove old wdqs SLO configs [puppet] - https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) (owner: Elukey)
[16:37:37] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox
[16:37:55] <wikibugs>	 ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: hw troubleshooting:  Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389 (Clement_Goubert) NEW p:Triage→Low
[16:37:55] <wikibugs>	 (PS1) Btullis: Temporarily set dse-k8s-worker101[2,5] into insetup mode [puppet] - https://gerrit.wikimedia.org/r/1254247 (https://phabricator.wikimedia.org/T414787)
[16:39:36] <ryankemper>	 merged a patch, but seem to be having trouble getting onto puppetserver (looks bastion-related right now). if someone merges something else before I figure this out, please merge my patch on my behalf
[16:39:55] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[16:41:11] <wikibugs>	 SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11719879 (BTullis) Thanks ever so much for getting this conversation started. I think that it's really important for us to get a good consensus on this, a...
[16:41:54] <wikibugs>	 ops-codfw, SRE, DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11719889 (Jhancock.wm) got into the idrac/console and found the server as this:   Booting from Hard Drive C:   GRUB rebooted and went to the same screen.  contacted @Papaul for consult. corrupted or missing conf...
[16:42:02] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+1] "thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[16:42:06] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes
[16:42:33] <ryankemper>	 ok, yeah bast2003 was down. went through 1003 instead
[16:43:21] <logmsgbot>	 cgoubert@cumin1003 netbox (PID 3940865) is awaiting input
[16:43:27] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[16:44:36] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet
[16:44:42] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet
[16:44:57] <wikibugs>	 (PS1) BCornwall: hiera: storage_elements override for cp2041/cp2042 [puppet] - https://gerrit.wikimedia.org/r/1254249
[16:45:07] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet
[16:45:11] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet
[16:45:18] <wikibugs>	 (Merged) jenkins-bot: ml-services: update gpt isvc image to one that supports AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1254245 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[16:45:31] <wikibugs>	 (PS2) BCornwall: hiera: storage_elements override for cp2041/cp2042 [puppet] - https://gerrit.wikimedia.org/r/1254249
[16:46:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['bast2003']
[16:46:44] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[16:47:10] <logmsgbot>	 !log samtar@deploy2002 mwscript-k8s job started: foreachwikiindblist all cleanupWatchlistLabelMember.php  # T420328
[16:47:14] <stashbot>	 T420328: Run cleanupWatchlistLabelMember maintenance script - https://phabricator.wikimedia.org/T420328
[16:47:16] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:48:23] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1254249 (owner: BCornwall)
[16:49:30] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: storage_elements override for cp2041/cp2042 [puppet] - https://gerrit.wikimedia.org/r/1254249 (owner: BCornwall)
[16:50:17] <wikibugs>	 (CR) Btullis: [C:+2] Temporarily set dse-k8s-worker101[2,5] into insetup mode [puppet] - https://gerrit.wikimedia.org/r/1254247 (https://phabricator.wikimedia.org/T414787) (owner: Btullis)
[16:50:34] <wikibugs>	 (CR) BCornwall: [V:+1 C:+2] hiera: storage_elements override for cp2041/cp2042 [puppet] - https://gerrit.wikimedia.org/r/1254249 (owner: BCornwall)
[16:52:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet
[16:53:15] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet
[16:53:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11719996 (elukey) @Jclark-ctr I provisioned dse-k8s-worker1020 with an experimental provisioning cookbook, when you...
[16:55:47] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet
[16:55:50] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet
[16:56:17] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7006.magru.wmnet
[16:57:22] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7015.magru.wmnet
[16:58:11] <wikibugs>	 ops-codfw, collaboration-services, DC-Ops, Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11720040 (Jhancock.wm) yes. that matches the time. this error can be from a firmware issue.
[16:58:18] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[16:58:20] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[16:58:34] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[16:59:22] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7014.magru.wmnet with OS trixie
[17:00:05] <jouncebot>	 swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1700).
[17:00:26] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[17:01:22] <swfrench-wmf>	 o/
[17:01:31] <wikibugs>	 (PS1) BCornwall: trafficserver: Remove outdated comment [puppet] - https://gerrit.wikimedia.org/r/1254254
[17:01:31] <swfrench-wmf>	 I'll be getting started on the infra window shortly
[17:01:34] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[17:01:39] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet
[17:02:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet
[17:02:29] <wikibugs>	 (CR) Scott French: [C:+2] mw-(api-int|web): Use envoy drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1253664 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[17:02:40] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet
[17:02:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet
[17:03:36] <wikibugs>	 (PS2) BCornwall: trafficserver: Update single_backend site comments [puppet] - https://gerrit.wikimedia.org/r/1254254
[17:04:22] <wikibugs>	 (PS3) BCornwall: trafficserver: Update single_backend site comments [puppet] - https://gerrit.wikimedia.org/r/1254254
[17:04:36] <wikibugs>	 (Merged) jenkins-bot: mw-(api-int|web): Use envoy drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1253664 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[17:05:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['bast2003']
[17:06:09] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1312-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:06:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host bast2003.wikimedia.org with OS bookworm
[17:06:38] <wikibugs>	 ops-codfw, SRE, DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720075 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS bookworm
[17:06:49] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:07:40] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:08:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet
[17:08:56] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet
[17:09:03] <wikibugs>	 SRE, Infrastructure Security, ServiceOps new, ServiceOps-Upgrades-Hardware, and 4 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11720092 (MLechvien-WMF) @Blake remaining hosts to reboot should be done as part of T420175 , should we dedup this...
[17:09:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:09:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet
[17:09:38] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet
[17:09:50] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7014.*
[17:10:26] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:11:25] <icinga-wm>	 RECOVERY - Host bast2003 is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms
[17:13:03] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[17:13:11] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[17:13:56] <jinxer-wm>	 FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:14:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:14:37] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:14:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:15:26] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm
[17:16:15] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet
[17:16:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:16:55] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet
[17:18:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:19:13] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:19:17] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS trixie
[17:19:44] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:20:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:20:32] <jinxer-wm>	 FIRING: [5x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:21:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:23:22] <wikibugs>	 (CR) Elukey: [C:+1] "is that easy? Nice!" [puppet] - https://gerrit.wikimedia.org/r/1254242 (https://phabricator.wikimedia.org/T416664) (owner: Muehlenhoff)
[17:24:08] <wikibugs>	 (PS1) Kamila Součková: k8s: create shellbox-icu72 [puppet] - https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049)
[17:24:47] <wikibugs>	 (CR) CI reject: [V:-1] k8s: create shellbox-icu72 [puppet] - https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[17:25:28] <wikibugs>	 (PS1) Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028)
[17:25:35] <wikibugs>	 (PS2) Kamila Součková: k8s: create shellbox-icu72 [puppet] - https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049)
[17:25:46] <wikibugs>	 (PS2) Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028)
[17:26:07] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[17:26:14] <wikibugs>	 (CR) Kamila Součková: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[17:26:18] <wikibugs>	 (CR) CI reject: [V:-1] k8s: create shellbox-icu72 [puppet] - https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[17:26:32] <wikibugs>	 (CR) CI reject: [V:-1] mediabackups: Enable TLS and set multitenant mode [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[17:26:50] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [reason: trixie reimaging]
[17:26:51] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:27:17] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3066.esams.wmnet with OS trixie
[17:27:24] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp3066.esams.wmnet with OS trixie
[17:27:28] <wikibugs>	 (PS3) Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028)
[17:27:39] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[17:28:06] <wikibugs>	 (CR) CI reject: [V:-1] mediabackups: Enable TLS and set multitenant mode [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[17:28:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:28:30] <wikibugs>	 (PS3) Kamila Součková: k8s: create shellbox-icu72 [puppet] - https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049)
[17:28:55] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3066.esams.wmnet with OS trixie
[17:29:26] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3067.esams.wmnet [reason: trixie reimaging]
[17:29:32] <wikibugs>	 (PS4) Jcrespo: mediabackups: Enable TLS and set multitenant mode [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028)
[17:29:39] <wikibugs>	 (CR) Kamila Součková: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[17:29:39] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[17:29:52] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3067.esams.wmnet with OS trixie
[17:30:32] <jinxer-wm>	 FIRING: [7x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:31:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[17:32:19] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[17:33:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[17:33:29] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[17:33:55] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackups: Enable TLS and set multitenant mode [puppet] - https://gerrit.wikimedia.org/r/1254267 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[17:33:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[17:34:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[17:37:21] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:37:31] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7007.magru.wmnet
[17:37:49] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:38:27] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "per our meeting just now: this is not good - we need it to run only on one of the 2 machines..and contint1003 is it.  should be solved by " [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[17:38:31] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:39:15] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7016.magru.wmnet
[17:39:15] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7013.magru.wmnet,cp701[5-6].magru.wmnet} and A:cp
[17:39:30] <wikibugs>	 (03CR) 10Dzahn: [C:04-2] "per meeting just now: not needed - proxy config stays on old host" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[17:39:41] <wikibugs>	 (03Abandoned) 10Dzahn: jenkins: add proxy_jenkins profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[17:40:09] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:40:32] <jinxer-wm>	 FIRING: [6x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:40:33] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: add a ttl on ProxyPass to jetty [puppet] - 10https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189) (owner: 10Arnaudb)
[17:41:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes
[17:42:10] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:42:13] <icinga-wm>	 PROBLEM - Host wikikube-worker1036 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3465.18 ms
[17:42:30] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS trixie
[17:42:40] <wikibugs>	 (03CR) 10Anne Tomasevich: [C:03+1] Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude)
[17:42:55] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS trixie
[17:42:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720298 (10MoritzMuehlenhoff) @Jhancock.wm Given that we need to reimage this server anyway, could you please reimage with trixie instead of bookworm (what it ran before)? The first new bastion (bast1004) is alre...
[17:43:05] <icinga-wm>	 RECOVERY - Host wikikube-worker1036 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:43:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:43:53] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:44:29] <logmsgbot>	 btullis@cumin1003 reboot-workers (PID 3951063) is awaiting input
[17:44:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:44:59] <wikibugs>	 (03CR) 10VolkerE: [C:03+1] Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude)
[17:45:36] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude)
[17:46:13] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3080.esams.wmnet with OS trixie
[17:49:31] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] realserver::ipip: Only write ferm rules if there are IPIP services [puppet] - 10https://gerrit.wikimedia.org/r/1254232 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm)
[17:52:00] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[17:52:19] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[17:52:24] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1312-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:53:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:54:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers wikikube-worker1144.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1101.eqiad.wmne
[17:54:15] <icinga-wm>	 ube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1358.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker114
[17:54:15] <icinga-wm>	 wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal
[17:54:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers wikikube-worker1058.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1303.eqiad.wmnet, wikikube-worker1274.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1092.eqiad.wmnet, wikikube-worker1051.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmne
[17:54:31] <icinga-wm>	 ube-worker1144.eqiad.wmnet, wikikube-worker1289.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1257.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1122.eqiad.wmnet, wikikube-worker1068.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker108
[17:54:31] <icinga-wm>	 wmnet, wikikube-worker1310.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, wikikube-worker1353.eqiad.wmnet, wikikube-worker1052.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal
[17:54:58] <wikibugs>	 (03PS1) 10Dwisehaupt: wmnet: shift fundraisingdb-read back to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1254271 (https://phabricator.wikimedia.org/T420155)
[17:58:10] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350)
[18:00:05] <jouncebot>	 andre and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T1800).
[18:00:15] <andre>	 jouncebot: no!
[18:02:25] <swfrench-wmf>	 uhh ... the mw-parsoid_4452 PyBal backends alert is *probably* the result of wikikube-worker-exp* restarts
[18:02:30] <swfrench-wmf>	 I can take a look in a bit
[18:02:54] <sukhe>	 oh no
[18:02:56] <sukhe>	 thanks
[18:03:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[18:03:40] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3066.esams.wmnet with reason: host reimage
[18:04:09] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3067.esams.wmnet with reason: host reimage
[18:06:26] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11720399 (10Scardenasmolinar) > @Scardenasmolinar can you read and sign https://phabricator.wikimedia.org/L3 ?  Signed!   > Our web deploy tool SpiderPig also requires you request membe...
[18:08:16] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[18:09:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[18:09:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3080.esams.wmnet with reason: host reimage
[18:12:47] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[18:13:01] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3066.esams.wmnet with reason: host reimage
[18:14:44] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that applies vLLM AITER/Inductor compilation optimizations. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254273 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[18:15:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406 (10bvibber) 03NEW
[18:16:18] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[18:16:23] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3067.esams.wmnet with reason: host reimage
[18:16:40] <logmsgbot>	 !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[18:16:51] <wikibugs>	 (03CR) 10Jgreen: [C:03+1] wmnet: shift fundraisingdb-read back to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1254271 (https://phabricator.wikimedia.org/T420155) (owner: 10Dwisehaupt)
[18:17:01] <logmsgbot>	 !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[18:19:03] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7008.magru.wmnet
[18:19:03] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp700[5-8].magru.wmnet} and A:cp
[18:19:25] <wikibugs>	 (03PS1) 10Alex.sanford: Remove notice from login form in popup mode [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534)
[18:19:39] <icinga-wm>	 PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3080 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[18:20:17] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3080.esams.wmnet with reason: host reimage
[18:22:11] <swfrench-wmf>	 sukhe: following up - mw-parsoid is now special, in that the backing pods are only allowed to run on a very limited number of k8s workers (2). the ongoing reboot run seems to have taken both nodes out of service at the same time.
[18:22:30] <swfrench-wmf>	 I'll take a look at why that happened and figure out how to unstick it
[18:23:01] <swfrench-wmf>	 in any case, this is not in any way a critical service
[18:24:28] <sukhe>	 thanks!
[18:25:37] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[18:25:56] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[18:30:35] <wikibugs>	 (03CR) 10Dwisehaupt: [C:03+2] wmnet: shift fundraisingdb-read back to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1254271 (https://phabricator.wikimedia.org/T420155) (owner: 10Dwisehaupt)
[18:30:39] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11720532 (10ssingh) >>! In T418971#11719697, @Fabfur wrote: > Procedure from the traffic perspective should be roughly  >  > - Depool ulsfo (around 0900...
[18:31:04] <logmsgbot>	 !log dwisehaupt@dns1005 START - running authdns-update
[18:31:11] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2003.wikimedia.org with OS bookworm
[18:31:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS bookworm completed: - bast2003 (**WARN**)   - Downtimed on Icinga/Alertman...
[18:31:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:31:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford)
[18:32:34] <logmsgbot>	 !log dwisehaupt@dns1005 END - running authdns-update
[18:32:38] <wikibugs>	 (03CR) 10Huei Tan: [C:03+1] "👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: 10Abijeet Patro)
[18:35:29] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Trusting the script!" [dns] - 10https://gerrit.wikimedia.org/r/1254092 (owner: 10Slyngshede)
[18:38:11] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "If you get build failures when trying to deploy this change (I'm not sure how the CI is set up for wmf.XX branches and whether it'll pass " [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford)
[18:40:59] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3066.esams.wmnet with OS trixie
[18:41:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:43:05] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3067.esams.wmnet with OS trixie
[18:44:45] <icinga-wm>	 PROBLEM - Host an-worker1172 is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:12] <wikibugs>	 (03PS3) 10Arlolra: Deploy PRV to XX wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273)
[18:45:45] <icinga-wm>	 RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3080 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-06-04 03:56:45 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS
[18:47:12] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3080.esams.wmnet with OS trixie
[18:48:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:48:59] <swfrench-wmf>	 ^ there we go
[18:49:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:49:22] <swfrench-wmf>	 !log manually uncordoned wikikube-worker-exp1001.eqiad.wmnet after failed reboot
[18:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:51] <wikibugs>	 (03CR) 10Ssingh: trafficserver: Update single_backend site comments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1254254 (owner: 10BCornwall)
[18:50:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host bast2003.wikimedia.org with OS trixie
[18:50:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS trixie
[18:53:27] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3067.esams.wmnet [reason: trixie reimaging]
[18:53:38] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [reason: trixie reimaging]
[18:54:39] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3068.esams.wmnet [reason: trixie reimaging]
[18:54:42] <wikibugs>	 (03CR) 10Ssingh: ulsfo: add new LVS service IP range (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi)
[18:55:13] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes
[18:55:30] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3068.esams.wmnet with OS trixie
[18:55:58] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3069.esams.wmnet [reason: trixie reimaging]
[18:56:21] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3069.esams.wmnet with OS trixie
[18:57:50] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: per-route jwt overrides (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[18:58:47] <wikibugs>	 (03PS1) 10Ssingh: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - 10https://gerrit.wikimedia.org/r/1254290 (https://phabricator.wikimedia.org/T420361)
[19:00:58] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3080.*
[19:05:06] <wikibugs>	 (03PS1) 10Dzahn: create a discovery name for new jenkins on contint machines [dns] - 10https://gerrit.wikimedia.org/r/1254292 (https://phabricator.wikimedia.org/T418521)
[19:05:17] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3081.esams.wmnet with OS trixie
[19:05:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3081.esams.wmnet with OS trixie
[19:07:05] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] create a discovery name for new jenkins on contint machines [dns] - 10https://gerrit.wikimedia.org/r/1254292 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[19:07:22] <logmsgbot>	 !log dzahn@dns1004 START - running authdns-update
[19:07:46] <wikibugs>	 (03PS1) 10Ssingh: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - 10https://gerrit.wikimedia.org/r/1254293 (https://phabricator.wikimedia.org/T420361)
[19:08:56] <logmsgbot>	 !log dzahn@dns1004 END - running authdns-update
[19:09:09] <wikibugs>	 (03Abandoned) 10Ssingh: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - 10https://gerrit.wikimedia.org/r/1254290 (https://phabricator.wikimedia.org/T420361) (owner: 10Ssingh)
[19:10:15] <wikibugs>	 (03CR) 10Catrope: rest-gateway rate limiting: add CORS headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler)
[19:11:44] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler)
[19:11:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[19:12:33] <wikibugs>	 (03PS1) 10Dzahn: jenkins: define contint1003 as the manager_host for the jenkins role [puppet] - 10https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521)
[19:16:34] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1254295/8289/" [puppet] - 10https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[19:17:34] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] jenkins: define contint1003 as the manager_host for the jenkins role [puppet] - 10https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[19:18:08] <logmsgbot>	 btullis@cumin1003 reboot-workers (PID 3894227) is awaiting input
[19:19:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[19:20:26] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3068.esams.wmnet with reason: host reimage
[19:21:40] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3069.esams.wmnet with reason: host reimage
[19:23:50] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3068.esams.wmnet with reason: host reimage
[19:26:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "This can't work - the class is needed on both sides - it has internal logic to do appropriate things on each one." [puppet] - 10https://gerrit.wikimedia.org/r/1253631 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn)
[19:26:27] <wikibugs>	 (03PS1) 10Dzahn: Revert "releases: remove rsync systemd units when primary server changes" [puppet] - 10https://gerrit.wikimedia.org/r/1254300
[19:28:01] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3069.esams.wmnet with reason: host reimage
[19:28:45] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases1003.eqiad.wmnet with reason: T420246
[19:28:49] <stashbot>	 T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246
[19:28:54] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3081.esams.wmnet with reason: host reimage
[19:29:12] <wikibugs>	 (03PS1) 10Catrope: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301
[19:29:21] <wikibugs>	 (03PS1) 10Catrope: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302
[19:29:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302 (owner: 10Catrope)
[19:29:39] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301 (owner: 10Catrope)
[19:32:34] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3081.esams.wmnet with reason: host reimage
[19:33:46] <wikibugs>	 (03Abandoned) 10Kamila Součková: k8s: create shellbox-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1254266 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková)
[19:38:09] <wikibugs>	 (03PS2) 10Dzahn: jenkins: add envoy and config for jenkins.discovery.wmnet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521)
[19:39:10] <wikibugs>	 (03PS3) 10Dzahn: jenkins: add envoy and config for jenkins.discovery.wmnet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521)
[19:39:43] <wikibugs>	 (03Abandoned) 10Dzahn: jenkins: enable the jenkins service if using new role [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[19:39:56] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2003.wikimedia.org with OS trixie
[19:40:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720664 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host bast2003.wikimedia.org with OS trixie completed: - bast2003 (**PASS**)   - Downtimed on Icinga/Alertmanag...
[19:41:00] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] profile::reboot::unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto)
[19:43:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720668 (10Jhancock.wm) @MoritzMuehlenhoff redid it in trixie.
[19:46:07] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[19:46:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:47:32] <wikibugs>	 (03PS1) 10Dzahn: contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521)
[19:48:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[19:48:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11720684 (10MoritzMuehlenhoff) Thanks!
[19:49:14] <wikibugs>	 (03PS2) 10Dzahn: contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521)
[19:50:26] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3068.esams.wmnet with OS trixie
[19:50:54] <wikibugs>	 (03PS1) 10Dzahn: ci: switch jenkins proxy target to new discovery name [puppet] - 10https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521)
[19:51:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:53:36] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11720695 (10Gehel) Approved.
[19:54:05] <ryankemper>	 !log T411568 rebooted `an-test-client1002`, `an-test-ui1001`, `an-test-coord1001`, `an-test-master1001`
[19:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:09] <stashbot>	 T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568
[19:54:23] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3069.esams.wmnet with OS trixie
[19:58:58] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3081.esams.wmnet with OS trixie
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T2000).
[20:00:05] <jouncebot>	 aude, alexsanford, and RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <aude>	 hi
[20:00:26] <alexsanford>	 hey!
[20:01:01] <aude>	 i can deploy mine (it's config so should be quick)
[20:01:45] <aude>	 proceeding
[20:01:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude)
[20:03:21] <wikibugs>	 (03Merged) 10jenkins-bot: Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) (owner: 10Aude)
[20:03:53] <swfrench-wmf>	 just a heads-up: I'm going to be monitoring how some changes we (SRE) made earlier today perform through the deployments in this window. should all be fine, but I *might* have to ask you folks to pause between patches for me to revert something if there are surprises :)
[20:03:55] <logmsgbot>	 !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1251309|Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster (T419163)]]
[20:03:58] <stashbot>	 T419163: Opt new accounts into ReadingLists BetaFeature - https://phabricator.wikimedia.org/T419163
[20:05:40] <aude>	 swfrench-wmf let me know if anything doesn't look good to you with my config deploy
[20:05:52] * swfrench-wmf thumbs up
[20:06:01] <logmsgbot>	 !log aude@deploy2002 aude: Backport for [[gerrit:1251309|Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster (T419163)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:07:48] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3069.esams.wmnet [reason: trixie reimaging]
[20:08:00] <aude>	 looks good from my side
[20:08:03] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3068.esams.wmnet [reason: trixie reimaging]
[20:08:24] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3070.esams.wmnet [reason: trixie reimaging]
[20:08:38] <aude>	 proceeding
[20:08:50] <logmsgbot>	 !log aude@deploy2002 aude: Continuing with sync
[20:09:03] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3070.esams.wmnet with OS trixie
[20:09:24] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3071.esams.wmnet with OS trixie
[20:09:57] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[20:11:04] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] "I think this looks okay, one code comment." [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey)
[20:12:48] <logmsgbot>	 !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1251309|Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster (T419163)]] (duration: 08m 53s)
[20:12:54] <stashbot>	 T419163: Opt new accounts into ReadingLists BetaFeature - https://phabricator.wikimedia.org/T419163
[20:13:27] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff)
[20:14:57] <aude>	 i'm done swfrench-wmf RoanKattouw alexsanford 
[20:15:31] <swfrench-wmf>	 I'll continue to monitor, but I think that looked good from my end :)
[20:15:36] <wikibugs>	 (03PS17) 10Andrew Bogott: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237)
[20:16:12] <ryankemper>	 !log T411568 rebooted `an-test-master1002`, `an-test-master1003`, `an-test-master1004`, `archiva1002`
[20:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:16] <stashbot>	 T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568
[20:16:58] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[20:18:53] <swfrench-wmf>	 maybe I spoke too soon ...
[20:19:19] <sukhe>	 that's a lot of NELs for the measure domains
[20:19:24] <sukhe>	 what's the relation?
[20:19:47] <swfrench-wmf>	 I'm struggling to think of a way that could possibly be related to my change earlier today
[20:20:12] <RoanKattouw>	 I'm about to deploy some more MW patches, speak up now/soon if you want me to stop/pause
[20:20:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301 (owner: 10Catrope)
[20:20:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302 (owner: 10Catrope)
[20:20:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:02] <sukhe>	 TCP timed out, interesting
[20:21:03] <icinga-wm>	 PROBLEM - Host an-coord1003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:21:09] <icinga-wm>	 PROBLEM - Host an-web1001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:21:19] <aude>	 when did the errors start?
[20:21:45] <cdanis>	 19:55 UTC
[20:21:47] <swfrench-wmf>	 aude: looks like they've been creeping up since ~ 19:40
[20:21:58] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[20:22:01] <wikibugs>	 (03Merged) 10jenkins-bot: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254301 (owner: 10Catrope)
[20:22:02] <wikibugs>	 (03Merged) 10jenkins-bot: Passwordless login: Don't display conditional auth errors [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254302 (owner: 10Catrope)
[20:22:08] <cdanis>	 I think Telefónica de España is having issues specifically
[20:22:09] <swfrench-wmf>	 so yeah, unrelated to either of our changes
[20:22:16] <swfrench-wmf>	 ^ exactly
[20:22:23] <aude>	 phew.   everything looked good to me with my change
[20:22:27] <icinga-wm>	 RECOVERY - Host an-coord1003 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:22:30] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: Remove old single-instance deadlock remediation cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453)
[20:22:34] <aude>	 can't see how it is related
[20:22:36] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1254301|Passwordless login: Don't display conditional auth errors]], [[gerrit:1254302|Passwordless login: Don't display conditional auth errors]]
[20:22:37] <icinga-wm>	 RECOVERY - Host an-web1001 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[20:23:17] * swfrench-wmf returns to staring at different graphs than NEL
[20:23:34] <sukhe>	 :)
[20:24:38] <logmsgbot>	 !log catrope@deploy2002 catrope: Backport for [[gerrit:1254301|Passwordless login: Don't display conditional auth errors]], [[gerrit:1254302|Passwordless login: Don't display conditional auth errors]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:24:54] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410 (10Sarmbruster) 03NEW
[20:24:57] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper)
[20:26:35] <wikibugs>	 (03CR) 10Andrew Bogott: "I've created and deleted several nodes in toolsbeta with the latest version of this patch." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott)
[20:26:41] <icinga-wm>	 PROBLEM - Host an-coord1004 is DOWN: PING CRITICAL - Packet loss = 100%
[20:27:27] <icinga-wm>	 RECOVERY - Host an-coord1004 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[20:27:33] <logmsgbot>	 !log catrope@deploy2002 catrope: Continuing with sync
[20:30:33] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] "0 blast radius cleanup patch, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper)
[20:30:35] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: Remove old single-instance deadlock remediation cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1254314 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper)
[20:31:32] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254301|Passwordless login: Don't display conditional auth errors]], [[gerrit:1254302|Passwordless login: Don't display conditional auth errors]] (duration: 08m 56s)
[20:32:12] <RoanKattouw>	 alexsanford: You're up! You can use the "deploy change" link on the deployments page to jump straight into a SpiderPig session for your patch
[20:32:12] <RoanKattouw>	 https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T2000
[20:32:30] <alexsanford>	 ok on it
[20:34:40] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3071.esams.wmnet with reason: host reimage
[20:34:44] <alexsanford>	 RoanKattouw hmm looks like my developer account doesn't have enough privileges to open SpiderPig
[20:34:54] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3070.esams.wmnet with reason: host reimage
[20:34:56] <RoanKattouw>	 Huh but you do have deployment access?
[20:35:06] <alexsanford>	 Yep
[20:35:24] <RoanKattouw>	 Ah I see, it's a separate group
[20:35:35] <RoanKattouw>	 OK for now request it at https://idm.wikimedia.org/permissions/ and then you'll have it for next time
[20:35:48] <alexsanford>	 will do
[20:36:41] <RoanKattouw>	 SpiderPig is really just a fancy UI around `scap backport`, so you can run `scap backport 1254280` on the deployment host and you'll get basically the same experience
[20:37:05] <alexsanford>	 Ok I'll try that
[20:37:31] <RoanKattouw>	 You'll just have to type Y/N instead of clicking buttons, and you won't get notifications from your browser when you need to take action
[20:38:25] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3071.esams.wmnet with reason: host reimage
[20:38:38] <ryankemper>	 !log T411568 rebooted `an-coord1003`, `an-coord1004`, `an-tool1007`, `an-tool1008`, `an-tool1011`, `an-web1001`
[20:38:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:41] <stashbot>	 T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568
[20:38:44] <ryankemper>	 !log T411568 failed over HDFS NameNode from an-master1003 to an-master1004, then rebooted `an-master1003`
[20:38:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford)
[20:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:12] <icinga-wm>	 PROBLEM - Host an-master1003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:40:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11720846 (10VRiley-WMF) @BTullis Would you like us to order a replacement drive for this?
[20:40:27] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[20:40:42] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[20:40:44] <icinga-wm>	 RECOVERY - Host an-master1003 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[20:40:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: hw troubleshooting:  Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11720848 (10VRiley-WMF) a:03VRiley-WMF
[20:43:05] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3070.esams.wmnet with reason: host reimage
[20:43:37] <wikibugs>	 (03PS4) 10Arlolra: Deploy PRV to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273)
[20:43:51] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:45:11] <wikibugs>	 (03PS1) 10Bking: WIP: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041)
[20:45:28] <wikibugs>	 (03PS3) 10Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548)
[20:47:49] <logmsgbot>	 btullis@cumin1003 provision (PID 3971211) is awaiting input
[20:48:29] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "releases: remove rsync systemd units when primary server changes" [puppet] - 10https://gerrit.wikimedia.org/r/1254300 (owner: 10Dzahn)
[20:48:31] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:49:36] <wikibugs>	 (03PS3) 10Dzahn: contint/jenkins: make the jenkins host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521)
[20:51:54] <wikibugs>	 (03Merged) 10jenkins-bot: Remove notice from login form in popup mode [skins/MinervaNeue] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254280 (https://phabricator.wikimedia.org/T418534) (owner: 10Alex.sanford)
[20:51:56] <icinga-wm>	 RECOVERY - Host wikikube-worker1307 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[20:52:24] <logmsgbot>	 !log alexsanford@deploy2002 Started scap sync-world: Backport for [[gerrit:1254280|Remove notice from login form in popup mode (T418534)]]
[20:52:28] <stashbot>	 T418534: Update the design of the popup login form for use in a mobile web view - https://phabricator.wikimedia.org/T418534
[20:53:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: hw troubleshooting:  Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11720875 (10VRiley-WMF) Forced the unit down, and performed a flea power drain. Booted it back up an...
[20:54:26] <logmsgbot>	 !log alexsanford@deploy2002 alexsanford: Backport for [[gerrit:1254280|Remove notice from login form in popup mode (T418534)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:55:02] <wikibugs>	 (03CR) 10Kamila Součková: shellbox: Setup shellbox-icu72 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková)
[20:55:30] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1254307/8290/" [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[20:55:52] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1254307/8290/" [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[20:56:03] <logmsgbot>	 !log alexsanford@deploy2002 alexsanford: Continuing with sync
[20:59:56] <logmsgbot>	 !log alexsanford@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254280|Remove notice from login form in popup mode (T418534)]] (duration: 07m 32s)
[21:00:00] <stashbot>	 T418534: Update the design of the popup login form for use in a mobile web view - https://phabricator.wikimedia.org/T418534
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260317T2100)
[21:00:26] <wikibugs>	 (03PS5) 10Herron: systemd::timer::job: add ExecCondition support [puppet] - 10https://gerrit.wikimedia.org/r/1253655
[21:00:28] <wikibugs>	 (03PS4) 10Dzahn: jenkins: add envoy and config for jenkins.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521)
[21:00:32] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:00:44] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1254307 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[21:00:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:01:24] <alexsanford>	 my backport deployment is done
[21:05:46] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3071.esams.wmnet with OS trixie
[21:06:48] <wikibugs>	 (03PS1) 10Dzahn: releases: remove "unless" condition around rsync data copy [puppet] - 10https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246)
[21:08:26] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1254332
[21:08:41] <wikibugs>	 (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1254332
[21:09:03] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11720994 (10Ajuanca) What's task `T419960` about? I don't have enough privileges to access it. Yes, I think a parameter with expressive reboot time is more robust than a relati...
[21:09:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wip [puppet] - 10https://gerrit.wikimedia.org/r/1254332 (owner: 10Herron)
[21:09:50] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3070.esams.wmnet with OS trixie
[21:10:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:13:46] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3071.esams.wmnet [reason: trixie reimaging]
[21:13:55] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3070.esams.wmnet [reason: trixie reimaging]
[21:14:09] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3072.esams.wmnet [reason: trixie reimaging]
[21:14:35] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3072.esams.wmnet with OS trixie
[21:14:53] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet [reason: trixie reimaging]
[21:15:22] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3073.esams.wmnet with OS trixie
[21:16:55] <wikibugs>	 (03CR) 10JHathaway: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey)
[21:22:29] <wikibugs>	 (03CR) 10C. Scott Ananian: [C:03+1] Deploy PRV to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: 10Arlolra)
[21:27:42] <wikibugs>	 (03PS1) 10RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880)
[21:29:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus)
[21:33:55] <wikibugs>	 (03PS2) 10RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880)
[21:36:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus)
[21:36:28] <wikibugs>	 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11721043 (10RKemper) 05Open→03Resolved Completed all remaining DPE-owned host reboots today (2026-03-17). All 143 reachable Bullseye hos...
[21:38:56] <ryankemper>	 !log T411568 Failed back HDFS NameNode from an-master1004 to an-master1003; cluster back to original active/standby configuration
[21:38:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:00] <stashbot>	 T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568
[21:39:39] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[21:41:41] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3073.esams.wmnet with reason: host reimage
[21:44:28] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3072.esams.wmnet with reason: host reimage
[21:47:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11721057 (10HSwan-WMF) Please grant this access so that Brooke can pull data. Thank you!
[21:48:03] <wikibugs>	 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884#11721058 (10RobH)
[21:48:28] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3073.esams.wmnet with reason: host reimage
[22:02:18] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721097 (10RKemper) a:05RKemper→03Jclark-ctr
[22:03:16] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721106 (10RKemper) Switched this to a HW failure ticket, given `racadm getsel` revealed a backplane issue
[22:04:16] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721109 (10RKemper)
[22:05:20] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases2003.codfw.wmnet with reason: T420246
[22:05:24] <stashbot>	 T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246
[22:05:46] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases1003.eqiad.wmnet with reason: T420246
[22:06:56] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721122 (10RKemper) p:05Triage→03Medium
[22:07:18] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721124 (10RKemper)
[22:10:46] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3072.esams.wmnet with OS trixie
[22:11:57] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3072.esams.wmnet [reason: trixie reimaging]
[22:15:17] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3073.esams.wmnet with OS trixie
[22:15:32] <icinga-wm>	 PROBLEM - statsv Varnishkafka log producer on cp3071 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:16:32] <icinga-wm>	 RECOVERY - statsv Varnishkafka log producer on cp3071 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:20:21] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet [reason: trixie reimaging]
[22:35:36] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: 10Arlolra)
[22:49:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:54:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:55:29] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3081.*
[23:00:49] <wikibugs>	 (03PS1) 10Jforrester: [DNM] Make abstractwiki a multi-lingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420)
[23:07:18] <wikibugs>	 06SRE: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. - https://phabricator.wikimedia.org/T420422#11721260 (10Reedy)
[23:38:30] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[23:44:14] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[23:46:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11721339 (10BTullis) a:05VRiley-WMF→03BTullis Thanks, @VRiley-WMF. I'm not sure that it's worth it to purchase a replacement.  The disks aren't actuall...
[23:48:49] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Raine!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková)
[23:50:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11721344 (10BTullis) I confirmed that the backplane was giving this error. {F72973897,width=3...