[00:12:39] <icinga-wm>	 PROBLEM - MD RAID on aqs1010 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:12:40] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on aqs1010 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T420867 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:12:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867 (ops-monitoring-bot) NEW
[00:34:54] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:39:25] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1258375
[00:39:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1258375 (owner: TrainBranchBot)
[00:47:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:52:11] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1258375 (owner: TrainBranchBot)
[00:52:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:53:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:09:16] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1258387
[01:09:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1258387 (owner: TrainBranchBot)
[01:13:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:17:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:19:54] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1258387 (owner: TrainBranchBot)
[01:37:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:42:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:47:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[02:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[02:00:52] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:02:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:07:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:08:47] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 55s)
[02:09:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:34:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:38:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:49:45] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: DDesouza)
[02:50:02] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: DDesouza)
[02:50:13] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: DDesouza)
[03:08:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:13:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:03:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:34:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:44:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:48:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260322T0700)
[05:00:05] <jouncebot>	 arnaudb : #bothumor My software never has bugs. It just develops random features. Rise for Gerrit. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T0500).
[05:08:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:12:03] <icinga-wm>	 PROBLEM - MegaRAID on db1170 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:12:05] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1170 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T420873 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:12:12] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873 (ops-monitoring-bot) NEW
[05:19:54] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:24:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[06:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[06:02:54] <wikibugs>	 (CR) KartikMistry: [C:+1] Enable ULS rewrite beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: Abijeet Patro)
[06:05:32] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: Wire mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[06:17:57] <wikibugs>	 (PS2) Arnaudb: gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256445 (https://phabricator.wikimedia.org/T420189)
[06:18:04] <wikibugs>	 (PS2) Arnaudb: gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256446 (https://phabricator.wikimedia.org/T420189)
[06:18:44] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256445 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[06:34:19] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:59:00] <wikibugs>	 (PS1) Kevin Bazira: ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1258647 (https://phabricator.wikimedia.org/T418350)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T0700). nyaa~
[07:00:05] <jouncebot>	 abijeet and hector-arroyo: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:03] <abijeet>	 hello
[07:01:11] <kart_>	 I can deploy abijeet's change.
[07:01:16] <kart_>	 abijeet: should we start?
[07:01:35] <abijeet>	 kart_, sure
[07:02:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: Abijeet Patro)
[07:03:01] <wikibugs>	 (Merged) jenkins-bot: Enable ULS rewrite beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: Abijeet Patro)
[07:03:50] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1254149|Enable ULS rewrite beta feature (T418187 T253303)]]
[07:04:00] <stashbot>	 T418187: Define rollout strategy for the ULS rewrite - https://phabricator.wikimedia.org/T418187
[07:04:00] <stashbot>	 T253303: Basic support for a responsive language selector - https://phabricator.wikimedia.org/T253303
[07:11:05] <kart_>	 abijeet: things seem slow. still on k8s images build/push stage..
[07:15:13] <abijeet>	 kart_, ok
[07:15:20] <abijeet>	 kart_, let me know when it's ready to test
[07:16:41] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[07:17:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[07:18:13] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:18:51] <kart_>	 abijeet: sure
[07:18:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:20:51] <wikibugs>	 (CR) Brouberol: [C:+2] kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:21:46] <wikibugs>	 (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:22:50] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1254149|Enable ULS rewrite beta feature (T418187 T253303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:22:59] <stashbot>	 T418187: Define rollout strategy for the ULS rewrite - https://phabricator.wikimedia.org/T418187
[07:22:59] <stashbot>	 T253303: Basic support for a responsive language selector - https://phabricator.wikimedia.org/T253303
[07:23:18] <kart_>	 abijeet: ready to test
[07:25:39] <abijeet>	 kart_, on it
[07:26:47] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:26:58] <wikibugs>	 (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:04] <wikibugs>	 (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:10] <wikibugs>	 (CR) CI reject: [V:-1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:16] <wikibugs>	 (CR) CI reject: [V:-1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:17] <wikibugs>	 (CR) CI reject: [V:-1] aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:50] <wikibugs>	 (CR) Brouberol: [C:+2] kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:59] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:27:59] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:28:03] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:28:03] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:28:06] <wikibugs>	 (PS4) Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407)
[07:28:46] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:29:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:29:43] <abijeet>	 kart_, all ok
[07:30:14] <kart_>	 cool. 
[07:30:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:30:22] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Continuing with sync
[07:33:23] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:36:34] <wikibugs>	 (PS5) Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[07:38:55] <wikibugs>	 (PS1) Brouberol: Revert "kafka-main-eqiad: disable mirroring to kafka-main-codfw" [puppet] - https://gerrit.wikimedia.org/r/1258677
[07:39:05] <wikibugs>	 (PS1) Brouberol: Revert "kafka-main-codfw: disable mirroring to kafka-main-eqiad" [puppet] - https://gerrit.wikimedia.org/r/1258679
[07:41:09] <wikibugs>	 (CR) Brouberol: [C:+2] Revert "kafka-main-codfw: disable mirroring to kafka-main-eqiad" [puppet] - https://gerrit.wikimedia.org/r/1258679 (owner: Brouberol)
[07:41:17] <wikibugs>	 (CR) Brouberol: [C:+2] Revert "kafka-main-eqiad: disable mirroring to kafka-main-codfw" [puppet] - https://gerrit.wikimedia.org/r/1258677 (owner: Brouberol)
[07:42:39] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:45:20] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254149|Enable ULS rewrite beta feature (T418187 T253303)]] (duration: 41m 30s)
[07:45:26] <stashbot>	 T418187: Define rollout strategy for the ULS rewrite - https://phabricator.wikimedia.org/T418187
[07:45:26] <stashbot>	 T253303: Basic support for a responsive language selector - https://phabricator.wikimedia.org/T253303
[07:47:20] <wikibugs>	 (PS6) Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[07:47:20] <wikibugs>	 (PS5) Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407)
[07:47:20] <wikibugs>	 (PS1) Brouberol: aux-k8s/kafka-mirrormaker: fix values by not overriding the app config [deployment-charts] - https://gerrit.wikimedia.org/r/1258688 (https://phabricator.wikimedia.org/T417407)
[07:47:38] <kart_>	 abijeet: done.
[07:50:44] <wikibugs>	 (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: fix values by not overriding the app config [deployment-charts] - https://gerrit.wikimedia.org/r/1258688 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:54:53] <wikibugs>	 (Merged) jenkins-bot: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[08:09:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: Kosta Harlan)
[08:10:25] <wikibugs>	 (PS1) Brouberol: site: install the aux-k8s-worker1006-9 hosts [puppet] - https://gerrit.wikimedia.org/r/1258704 (https://phabricator.wikimedia.org/T393053)
[08:14:15] <wikibugs>	 (CR) Dpogorzelski: [C:+1] ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1258647 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:15:54] <wikibugs>	 (CR) Dpogorzelski: [C:+2] ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1258647 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:18:22] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2005-2006,2011-2018,2033-2039,2041-2042,2044,2046,2049-2051,2055-2062,2064-2065,2067-2078,2087-2095,2102-2115,2124-2179,2184-2199].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[08:19:00] <wikibugs>	 (CR) MVernon: [C:+1] admin: Add mpostoronca shell access and deployment membership [puppet] - https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) (owner: Scott French)
[08:19:23] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache::haproxy: remove hotfix for traffic class [puppet] - https://gerrit.wikimedia.org/r/1258714
[08:19:39] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:20:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: Kosta Harlan)
[08:21:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to  superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11736952 (Alice.moutinho) Hello @Aklapper, @Scott_French,   i now have an LDAP acount linked to my Phabricator account.  @KFrancis i just saw the NDA agreement in my inbox this morning...
[08:23:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8316/co" [puppet] - https://gerrit.wikimedia.org/r/1258714 (owner: Giuseppe Lavagetto)
[08:29:22] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:31:02] <wikibugs>	 (Merged) jenkins-bot: hcaptcha: Use the global edit key for MobileFrontend edits if present [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: Kosta Harlan)
[08:31:22] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1255736|hcaptcha: Use the global edit key for MobileFrontend edits if present (T420574)]]
[08:31:27] <stashbot>	 T420574: hcaptcha: Make edits coming from the MobileFrontend use the sitekey for edits - https://phabricator.wikimedia.org/T420574
[08:32:05] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) (owner: DCausse)
[08:34:13] <wikibugs>	 SRE-swift-storage, Commons, Wikimedia-production-error: uploadstash-exception: Could not store upload in the stash while uploading PDF file - https://phabricator.wikimedia.org/T420786#11736989 (MatthewVernon) I'm guessing you don't have an exact timestamp for the error? I'm afraid it's going to be al...
[08:34:31] <wikibugs>	 (Merged) jenkins-bot: airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) (owner: DCausse)
[08:35:08] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[08:35:35] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[08:36:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[08:37:16] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1255736|hcaptcha: Use the global edit key for MobileFrontend edits if present (T420574)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:37:21] <stashbot>	 T420574: hcaptcha: Make edits coming from the MobileFrontend use the sitekey for edits - https://phabricator.wikimedia.org/T420574
[08:37:47] <wikibugs>	 SRE-swift-storage, Commons: Server error 500 after uploading chunk - https://phabricator.wikimedia.org/T340917#11736996 (MatthewVernon) Open→Resolved a:MatthewVernon Thanks, I'm going to optimistically close this ticket then :)
[08:37:55] <wikibugs>	 (CR) Kgraessle: [C:+1] PersonalDashboard: Add config for Active Discussions [mediawiki-config] - https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: Scardenasmolinar)
[08:37:58] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:38:40] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:39:34] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[08:40:41] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:40:42] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:40:51] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:41:18] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet
[08:42:58] <wikibugs>	 SRE-tools, Infrastructure-Foundations, serviceops-deprecated: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737005 (JMeybohm) Resolved→Open This additional confirmation thing is making bigger reboots pretty annoying since one has to come back and...
[08:43:06] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:43:52] <icinga-wm>	 PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:08] <icinga-wm>	 PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:22] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:44:32] <icinga-wm>	 RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.38 ms
[08:44:42] <icinga-wm>	 RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.75 ms
[08:46:04] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255736|hcaptcha: Use the global edit key for MobileFrontend edits if present (T420574)]] (duration: 14m 42s)
[08:46:09] <stashbot>	 T420574: hcaptcha: Make edits coming from the MobileFrontend use the sitekey for edits - https://phabricator.wikimedia.org/T420574
[08:47:21] <wikibugs>	 SRE-tools, ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737009 (JMeybohm)
[08:50:09] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet
[08:58:25] <wikibugs>	 (PS2) Fabfur: cache::haproxy: remove hotfix for traffic class [puppet] - https://gerrit.wikimedia.org/r/1258714 (https://phabricator.wikimedia.org/T415007) (owner: Giuseppe Lavagetto)
[08:58:33] <wikibugs>	 (CR) Fabfur: [C:+1] cache::haproxy: remove hotfix for traffic class [puppet] - https://gerrit.wikimedia.org/r/1258714 (https://phabricator.wikimedia.org/T415007) (owner: Giuseppe Lavagetto)
[08:59:25] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet
[08:59:33] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet
[08:59:51] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw for section test-s4
[08:59:55] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from eqiad to codfw for section test-s4
[09:00:02] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet
[09:00:40] <federico3>	 !log starting T416706
[09:00:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:46] <stashbot>	 T416706: Enable eqiad -> codfw replication - https://phabricator.wikimedia.org/T416706
[09:01:05] <wikibugs>	 SRE-tools, ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737043 (MLechvien-WMF) Good point. IMO it feels more intuitive/predictable to have the careful version as the default, and add a `--force` flag which bypasses all confirmation.  If it's...
[09:01:12] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[09:02:40] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[09:04:54] <wikibugs>	 SRE-tools, ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737062 (JMeybohm) I'm not a huge 'confirmation-fan' in general, but sgtm. When you're at it you could also make the cookbooks that call 'pool-depool-node' call it with `--force`
[09:05:36] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:05:38] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:08:12] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] cache::haproxy: remove hotfix for traffic class [puppet] - https://gerrit.wikimedia.org/r/1258714 (https://phabricator.wikimedia.org/T415007) (owner: Giuseppe Lavagetto)
[09:08:26] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet
[09:09:07] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x3
[09:10:36] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x3
[09:10:43] <wikibugs>	 (PS10) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[09:11:22] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:11:30] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:14:23] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] prometheus: adjust join in PrometheusZombieSeriesDetected rule [alerts] - https://gerrit.wikimedia.org/r/1256451 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[09:15:28] <icinga-wm>	 PROBLEM - Ensure acme-chief-api is running on acmechief2002 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[09:16:17] <wikibugs>	 (Merged) jenkins-bot: prometheus: adjust join in PrometheusZombieSeriesDetected rule [alerts] - https://gerrit.wikimedia.org/r/1256451 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[09:16:25] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es6
[09:16:28] <icinga-wm>	 RECOVERY - Ensure acme-chief-api is running on acmechief2002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[09:17:50] <wikibugs>	 (PS1) Majavah: P:toolforge: k8s: haproxy: Add option for sending traffic to Istio [puppet] - https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356)
[09:17:52] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es6
[09:18:52] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8317/co" [puppet] - https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[09:19:15] <logmsgbot>	 jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input
[09:19:54] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:06] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply
[09:22:17] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply
[09:22:57] <wikibugs>	 (PS1) Giuseppe Lavagetto: haproxy: temporarily re-add the lua file to avoid race conditions [puppet] - https://gerrit.wikimedia.org/r/1258949
[09:22:57] <wikibugs>	 (PS1) Giuseppe Lavagetto: haproxy: remove the traffic_class.lua file for good [puppet] - https://gerrit.wikimedia.org/r/1258950
[09:23:08] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es7
[09:24:16] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es7
[09:24:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:25:31] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply
[09:25:38] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply
[09:25:57] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] haproxy: temporarily re-add the lua file to avoid race conditions [puppet] - https://gerrit.wikimedia.org/r/1258949 (owner: Giuseppe Lavagetto)
[09:26:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11737117 (Daria-WMDE) Hello @KFrancis could you please resend the NDA? Was out of office EOD Friday, and now the link has expired
[09:29:19] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1172.eqiad.wmnet with OS bullseye
[09:29:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737122 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin...
[09:29:35] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye
[09:29:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737123 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c...
[09:32:07] <wikibugs>	 (PS1) Jcrespo: mariadb: Update grants for new hosts ms-backup[12]00[34], which replaces [12] [puppet] - https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464)
[09:32:09] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s6
[09:33:18] <wikibugs>	 (Abandoned) Jgiannelos: beta: Fix duplicate definition of site.v1.json [mediawiki-config] - https://gerrit.wikimedia.org/r/1234940 (owner: Jgiannelos)
[09:33:21] <wikibugs>	 (CR) Jcrespo: "There is no rush on deploying this, it can wait until maintenance freeze happens, despite only affecting backup dbs." [puppet] - https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[09:33:45] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s6
[09:34:01] <wikibugs>	 (Abandoned) Jgiannelos: pcs: Block RB traffic for all domains [deployment-charts] - https://gerrit.wikimedia.org/r/1145828 (owner: Jgiannelos)
[09:35:36] <wikibugs>	 (PS1) Trueg: wdqs-queryhammer: Deployment fixes [deployment-charts] - https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415)
[09:36:32] <wikibugs>	 (PS6) Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715)
[09:37:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:39:55] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1172.eqiad.wmnet with OS bullseye
[09:40:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737211 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin...
[09:40:13] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye
[09:40:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737214 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c...
[09:41:20] <wikibugs>	 (CR) Clément Goubert: rest-gateway: Add core API support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: Clément Goubert)
[09:42:43] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s5
[09:42:57] <wikibugs>	 (PS3) Clément Goubert: rest-gateway: Add core API support [deployment-charts] - https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146)
[09:44:08] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet
[09:44:17] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet
[09:44:29] <wikibugs>	 (CR) Blake: [C:+2] mw-web: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[09:44:33] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s5
[09:44:41] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet
[09:45:39] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: Cathal Mooney)
[09:46:54] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] haproxy: remove the traffic_class.lua file for good [puppet] - https://gerrit.wikimedia.org/r/1258950 (owner: Giuseppe Lavagetto)
[09:47:01] <wikibugs>	 (Merged) jenkins-bot: mw-web: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[09:48:04] <wikibugs>	 (PS1) Giuseppe Lavagetto: haproxy: well, actually remove the file :P [puppet] - https://gerrit.wikimedia.org/r/1258962
[09:48:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] haproxy: well, actually remove the file :P [puppet] - https://gerrit.wikimedia.org/r/1258962 (owner: Giuseppe Lavagetto)
[09:48:38] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11737267 (Daria-WMDE)
[09:49:00] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[09:49:29] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[09:49:31] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[09:49:31] <wikibugs>	 (PS7) Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715)
[09:49:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11737270 (Daria-WMDE) Hi @Scott_French, I added a developer account to the task and linked it with the Phabricator account and the Wikimedia Global Account
[09:49:56] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[09:49:59] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s2
[09:50:55] <wikibugs>	 (CR) Brouberol: [C:+1] "LG!" [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: Btullis)
[09:52:24] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s2
[09:53:05] <wikibugs>	 (PS1) Blake: geo-maps: update map default to list eqiad first [dns] - https://gerrit.wikimedia.org/r/1244621 (https://phabricator.wikimedia.org/T413974)
[09:53:31] <wikibugs>	 (PS1) Blake: debug: reorder debug backends for eqiad switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/1244628 (https://phabricator.wikimedia.org/T413974)
[09:53:38] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet
[09:53:52] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:53:56] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:54:08] <logmsgbot>	 btullis@cumin1003 reimage (PID 1408579) is awaiting input
[09:57:27] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s3
[09:57:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11737346 (AnnieKim_WMDE) Linked my LDAP account. Thanks everyone for your help.
[09:57:48] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: Cathal Mooney)
[09:58:22] <wikibugs>	 (PS11) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[09:58:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T420896 (kera_wmde) NEW
[09:58:48] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:58:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737371 (kera_wmde)
[09:58:53] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[09:58:55] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s3
[10:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[10:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1000)
[10:01:46] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] P:toolforge: k8s: haproxy: Add option for sending traffic to Istio [puppet] - https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[10:02:11] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:toolforge: k8s: haproxy: Add option for sending traffic to Istio [puppet] - https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[10:03:17] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[10:04:27] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s7
[10:04:30] <logmsgbot>	 jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input
[10:04:58] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet
[10:05:06] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet
[10:05:58] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s7
[10:06:19] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 #page on db1253 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:37] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:37] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:45] <federico3>	 silencing it again
[10:06:59] <XioNoX>	 federico3: what's up?
[10:07:03] <Emperor>	 !incidents
[10:07:04] <sirenbot>	 7784 (UNACKED)  db1253 (paged)/MariaDB Replica IO: s7 (paged)
[10:07:04] <sirenbot>	 7785 (UNACKED)  db1253 (paged)/MariaDB Replica Lag: s7 (paged)
[10:07:04] <sirenbot>	 7786 (UNACKED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[10:07:07] <Emperor>	 !ack
[10:07:07] <sirenbot>	 Could not ack the alert. Please check the parameters.
[10:07:16] <Emperor>	 I thought that was meant to work now?
[10:07:20] <Emperor>	 !ack 7784
[10:07:21] <sirenbot>	 7784 (ACKED)  db1253 (paged)/MariaDB Replica IO: s7 (paged)
[10:07:21] <federico3>	 it's due to a cookbook removing the silence while running I think
[10:07:22] <Emperor>	 !ack 7785
[10:07:23] <sirenbot>	 7785 (ACKED)  db1253 (paged)/MariaDB Replica Lag: s7 (paged)
[10:07:26] <Emperor>	 !ack 7786
[10:07:26] <sirenbot>	 7786 (ACKED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[10:08:10] <logmsgbot>	 jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input
[10:08:11] <federico3>	 on alertmanager I see only silenced alerts tho
[10:08:12] <wikibugs>	 (CR) Ayounsi: [C:+1] Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: Cathal Mooney)
[10:08:36] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet
[10:08:40] <federico3>	 ( a side effect of https://phabricator.wikimedia.org/T416706 )
[10:09:13] <Emperor>	 federico3: we got email from nagios as well as the p.ages
[10:09:21] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1172.eqiad.wmnet with OS bullseye
[10:09:25] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] rabbitmq: set pause_minority for cluster_partition_handling [puppet] - https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) (owner: Filippo Giunchedi)
[10:09:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737429 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin...
[10:10:41] <wikibugs>	 (PS1) Matthieulec: Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537)
[10:11:34] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s8
[10:13:02] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s8
[10:15:09] <wikibugs>	 (PS1) Majavah: P:toolforge: k8s: haproxy: Use HTTP/1.1 for health checks [puppet] - https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356)
[10:15:10] <wikibugs>	 (PS1) Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420189)
[10:15:57] <wikibugs>	 (CR) Ayounsi: [C:+1] wikimedia6 prefix-list: add wikidough anycast range [homer/public] - https://gerrit.wikimedia.org/r/1257195 (https://phabricator.wikimedia.org/T420820) (owner: Cathal Mooney)
[10:18:36] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s4
[10:18:40] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[10:20:38] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s4
[10:21:05] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8319/co" [puppet] - https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[10:21:05] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:toolforge: k8s: haproxy: Use HTTP/1.1 for health checks [puppet] - https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356) (owner: Majavah)
[10:22:05] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet
[10:22:29] <wikibugs>	 (CR) Ayounsi: [C:+2] Update DHCP server in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1256335 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[10:23:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1172.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[10:24:04] <logmsgbot>	 !log ayounsi@dns1004 START - running authdns-update
[10:24:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:24:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:25:37] <logmsgbot>	 !log ayounsi@dns1004 END - running authdns-update
[10:25:44] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s1
[10:27:02] <logmsgbot>	 btullis@cumin1003 provision (PID 1455852) is awaiting input
[10:27:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s1
[10:28:50] <topranks>	 !log disable puppet on routed-ganeti hosts to test nftables update on specific nodes T420715
[10:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:30:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rabbit: apply cluster_partition_handling to rabbitmq4 [puppet] - 10https://gerrit.wikimedia.org/r/1258990 (https://phabricator.wikimedia.org/T418444)
[10:30:16] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Point proxy in ulsfo to install4004 [dns] - 10https://gerrit.wikimedia.org/r/1256324 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[10:30:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] "Self merging since the related change for rabbitmq3 was approved" [puppet] - 10https://gerrit.wikimedia.org/r/1258990 (https://phabricator.wikimedia.org/T418444) (owner: 10Filippo Giunchedi)
[10:30:22] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet
[10:30:30] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet
[10:31:23] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2126-2139].codfw.wmnet
[10:32:22] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] ulsfo: update dhcp server to install4004 [homer/public] - 10https://gerrit.wikimedia.org/r/1258994 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[10:33:13] <wikibugs>	 (03Merged) 10jenkins-bot: ulsfo: update dhcp server to install4004 [homer/public] - 10https://gerrit.wikimedia.org/r/1258994 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[10:37:38] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: k8s: haproxy: Fix istio-gateway health checks [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356)
[10:38:01] <logmsgbot>	 btullis@cumin1003 provision (PID 1455852) is awaiting input
[10:38:29] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply
[10:38:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11737597 (10Volans) That's what's in puppetdb and what's reported by facter on the host though: ` $ sudo facter -p...
[10:38:33] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737598 (10Aklapper) @kera_wmde: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LDAP User' accou...
[10:38:33] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:38:37] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:38:38] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply
[10:39:28] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8320/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah)
[10:41:24] <wikibugs>	 (03PS1) 10Cathal Mooney: Routed-ganeti: fix syntax error in new forward rule [puppet] - 10https://gerrit.wikimedia.org/r/1259004 (https://phabricator.wikimedia.org/T420715)
[10:41:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737603 (10WMDE-leszek) I approve this request on WMDE's end. Thank you
[10:42:22] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737605 (10ayounsi)
[10:42:40] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Routed-ganeti: fix syntax error in new forward rule [puppet] - 10https://gerrit.wikimedia.org/r/1259004 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[10:43:04] <wikibugs>	 (03CR) 10David Caro: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah)
[10:43:34] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:43:37] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Routed-ganeti: fix syntax error in new forward rule [puppet] - 10https://gerrit.wikimedia.org/r/1259004 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[10:43:38] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:43:46] <wikibugs>	 10ops-codfw, 06DC-Ops: Power Supply - Status - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T420905 (10phaultfinder) 03NEW
[10:44:07] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2126-2139].codfw.wmnet
[10:45:48] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: k8s: haproxy: Fix istio-gateway health checks [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah)
[10:46:53] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737645 (10ayounsi)
[10:48:03] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[10:48:12] <wikibugs>	 (03PS2) 10Muehlenhoff: Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993)
[10:48:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:49:35] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[10:52:26] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737660 (10kera_wmde) Link in my account confirmed! Thank you!  >>! In T420896#11737598, @Aklapper wrote: > @kera_wmde: Please also [link your LDAP account to your Phabricator account]...
[10:53:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:54:54] <logmsgbot>	 jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input
[10:55:21] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2126-2139].codfw.wmnet
[10:55:29] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2126-2139].codfw.wmnet
[10:55:57] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2140-2153].codfw.wmnet
[10:57:26] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:58:31] <wikibugs>	 (03PS1) 10Cathal Mooney: routed-ganeti nftables forward chain: correct syntax [puppet] - 10https://gerrit.wikimedia.org/r/1259018 (https://phabricator.wikimedia.org/T420715)
[11:00:16] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts install4003.wikimedia.org
[11:01:53] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] routed-ganeti nftables forward chain: correct syntax [puppet] - 10https://gerrit.wikimedia.org/r/1259018 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[11:02:31] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] wikimedia6 prefix-list: add wikidough anycast range [homer/public] - 10https://gerrit.wikimedia.org/r/1257195 (https://phabricator.wikimedia.org/T420820) (owner: 10Cathal Mooney)
[11:05:06] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[11:08:48] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2140-2153].codfw.wmnet
[11:08:53] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003"
[11:09:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job squid in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:09:53] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003"
[11:09:53] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:09:54] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install4003.wikimedia.org
[11:10:05] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737707 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1003 for hosts: `install4003.wikimedia.org` -...
[11:13:41] <wikibugs>	 (03CR) 10Volans: sre.k8s: Add cookbook to print network topology details of nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm)
[11:15:20] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host bast4006.wikimedia.org
[11:15:21] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[11:18:14] <icinga-wm>	 PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:18:22] <icinga-wm>	 PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[11:18:50] <wikibugs>	 (03PS2) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666)
[11:18:52] <icinga-wm>	 RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms
[11:19:00] <icinga-wm>	 RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.68 ms
[11:19:02] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4006.wikimedia.org - ayounsi@cumin1003"
[11:19:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4006.wikimedia.org - ayounsi@cumin1003"
[11:19:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:19:08] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache bast4006.wikimedia.org on all recursors
[11:19:12] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast4006.wikimedia.org on all recursors
[11:19:42] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast4006.wikimedia.org - ayounsi@cumin1003"
[11:19:47] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast4006.wikimedia.org - ayounsi@cumin1003"
[11:20:03] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2140-2153].codfw.wmnet
[11:20:03] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS bookworm
[11:20:11] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2140-2153].codfw.wmnet
[11:20:18] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1003 for host bast4006.wikimedia.org w...
[11:23:27] <wikibugs>	 (03CR) 10Genoveva Galarza: "Done! Thanks a lot for the references and the examples, super helpful." [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza)
[11:23:31] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2154-2167].codfw.wmnet
[11:23:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:26:01] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909)
[11:27:06] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb)
[11:27:10] <wikibugs>	 (03PS12) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[11:27:16] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[11:28:09] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey)
[11:28:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:29:14] <wikibugs>	 10SRE-tools, 10Cumin, 06Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360#11737762 (10Volans) p:05Triage→03Medium a:03Volans
[11:29:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job squid in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:29:52] <wikibugs>	 (03PS5) 10Elukey: sre.hosts.provision: refactor bios if/else branches [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216)
[11:29:52] <wikibugs>	 (03PS4) 10Elukey: sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216)
[11:29:52] <wikibugs>	 (03PS13) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[11:30:06] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: refactor bios if/else branches (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey)
[11:31:47] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2154-2167].codfw.wmnet
[11:34:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops-deprecated, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11737771 (10elukey) 05Resolved→03Open Re-opening this one since something weird happens when running provisioning:  ` 2026-03...
[11:38:05] <wikibugs>	 (03PS1) 10Sergio Gimeno: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722)
[11:38:07] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: Add core API support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[11:38:28] <wikibugs>	 (03PS1) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722)
[11:39:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job squid in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:40:36] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2154-2167].codfw.wmnet
[11:40:45] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2154-2167].codfw.wmnet
[11:43:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[11:43:16] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[11:43:49] <logmsgbot>	 jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input
[11:44:30] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2168-2179,2184-2185].codfw.wmnet
[11:44:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] fix(WelcomeSurveyHooks): ensure accountJustCreated is always added [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[11:46:36] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "lgtm, but might be worth dropping the cookie stripping" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[11:47:52] <wikibugs>	 (03CR) 10Clément Goubert: rest-gateway: Add core API support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[11:52:47] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2168-2179,2184-2185].codfw.wmnet
[12:00:20] <wikibugs>	 (03PS4) 10Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148)
[12:00:20] <wikibugs>	 (03PS4) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146)
[12:04:15] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2168-2179,2184-2185].codfw.wmnet
[12:04:24] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2168-2179,2184-2185].codfw.wmnet
[12:06:01] <wikibugs>	 (03Abandoned) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[12:06:14] <wikibugs>	 (03PS3) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666)
[12:06:40] <wikibugs>	 (03Restored) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[12:07:09] <wikibugs>	 (03PS2) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722)
[12:07:41] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2186-2199].codfw.wmnet
[12:07:44] <wikibugs>	 (03PS1) 10Sergio Gimeno: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722)
[12:08:37] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm
[12:08:42] <wikibugs>	 (03CR) 10Sergio Gimeno: "recheck, git unrelated `fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/TemplateData/': GnuTLS recv error (-5" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[12:09:39] <wikibugs>	 (03PS1) 10Ayounsi: Add bast4006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1259049 (https://phabricator.wikimedia.org/T418993)
[12:10:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[12:10:25] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[12:10:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[12:11:05] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1259049 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[12:11:11] <wikibugs>	 (03PS5) 10Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148)
[12:11:11] <wikibugs>	 (03PS5) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146)
[12:11:15] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add bast4006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1259049 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[12:14:49] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4006.wikimedia.org with reason: host reimage
[12:16:38] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2186-2199].codfw.wmnet
[12:18:52] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4006.wikimedia.org with reason: host reimage
[12:21:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:22:47] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage
[12:28:43] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage
[12:30:08] <logmsgbot>	 jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input
[12:30:12] <claime>	 .46
[12:34:08] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:34:54] <wikibugs>	 (03PS1) 10Cathal Mooney: ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715)
[12:35:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[12:37:27] <wikibugs>	 (03PS2) 10Cathal Mooney: ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715)
[12:38:37] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast4006.wikimedia.org with OS bookworm
[12:38:37] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast4006.wikimedia.org
[12:38:50] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1003 for host bast4006.wikimedia.org with OS bookworm completed:...
[12:41:19] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[12:42:51] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2006.codfw.wmnet with OS bookworm
[12:42:53] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "bast4006 - ayounsi@cumin1003"
[12:43:09] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "bast4006 - ayounsi@cumin1003"
[12:45:11] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145)
[12:46:01] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737967 (10ayounsi)
[12:48:52] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[12:51:52] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[12:52:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] routed-ganeti nftables forward chain: correct syntax [puppet] - 10https://gerrit.wikimedia.org/r/1259018 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[12:55:30] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145)
[12:55:33] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert)
[12:55:54] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] tests: Make many things static for PHPUnit 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1258300 (https://phabricator.wikimedia.org/T420844) (owner: 10Reedy)
[12:56:49] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] phpunit.xml: Update configuration for PHPUnit 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1258301 (https://phabricator.wikimedia.org/T420844) (owner: 10Reedy)
[12:57:21] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148)
[12:58:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11738021 (10VRiley-WMF) a:03VRiley-WMF
[12:58:44] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148)
[12:59:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11738023 (10VRiley-WMF) a:03VRiley-WMF
[13:00:01] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1300). nyaa~
[13:00:05] <jouncebot>	 hector-arroyo and Sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147)
[13:00:39] <sergi0>	 o/
[13:00:53] <sergi0>	 I can self-deploy
[13:02:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[13:02:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[13:02:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[13:03:08] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert)
[13:03:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Initial entries for cloudcephosd105[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/1256392 (https://phabricator.wikimedia.org/T416394) (owner: 10Andrew Bogott)
[13:03:56] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146)
[13:03:59] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146)
[13:04:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcephosd105[56] - https://phabricator.wikimedia.org/T419892#11738055 (10Andrew) a:05Andrew→03None
[13:04:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11738056 (10Andrew) a:05Andrew→03None
[13:04:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11738058 (10Andrew) a:05Andrew→03None
[13:05:07] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert)
[13:05:30] <wikibugs>	 (03Merged) 10jenkins-bot: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[13:05:33] <wikibugs>	 (03Merged) 10jenkins-bot: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[13:05:33] <wikibugs>	 (03PS1) 10Majavah: cloudnfs: Remove Huggle project config [puppet] - 10https://gerrit.wikimedia.org/r/1259079
[13:05:35] <wikibugs>	 (03Merged) 10jenkins-bot: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno)
[13:05:55] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1259035|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added (T420722)]], [[gerrit:1259036|tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect (T420722)]], [[gerrit:1259046|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 (T420722)]]
[13:05:59] <stashbot>	 T420722: accountJustCreated flag not properly added on WelcomeSurvey redirections - https://phabricator.wikimedia.org/T420722
[13:06:33] <wikibugs>	 (03PS1) 10Cathal Mooney: nftables: support nftables::rules definitions targetting prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715)
[13:07:41] <wikibugs>	 (03CR) 10Majavah: [V:03+2 C:03+2] Add toolsbeta-acme-chief private key [labs/private] - 10https://gerrit.wikimedia.org/r/1240325 (owner: 10Majavah)
[13:07:42] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1259035|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added (T420722)]], [[gerrit:1259036|tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect (T420722)]], [[gerrit:1259046|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 (T420722)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:07:50] <wikibugs>	 (03CR) 10Majavah: [V:03+2 C:03+2] Add fake metricsinfra Grafana admin password [labs/private] - 10https://gerrit.wikimedia.org/r/1240326 (owner: 10Majavah)
[13:08:01] <wikibugs>	 (03CR) 10Majavah: [V:03+2 C:03+2] Add fake Docker registry passwrod for cloudinfra [labs/private] - 10https://gerrit.wikimedia.org/r/1245297 (owner: 10Majavah)
[13:08:26] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2186-2199].codfw.wmnet
[13:08:26] * sergi0 testing
[13:08:34] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2186-2199].codfw.wmnet
[13:08:34] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2005-2006,2011-2018,2033-2039,2041-2042,2044,2046,2049-2051,2055-2062,2064-2065,2067-2078,2087-2095,2102-2115,2124-2179,2184-2199].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[13:09:45] <wikibugs>	 (03PS1) 10Jgreen: Switch fundraising default bastion back to eqiad after kernel update. [dns] - 10https://gerrit.wikimedia.org/r/1259081
[13:11:01] <wikibugs>	 (03CR) 10Andrew Bogott: "*nudge*" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott)
[13:11:21] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[13:11:27] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[13:11:38] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:11:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11738116 (10VRiley-WMF) a:03VRiley-WMF
[13:13:29] <wikibugs>	 (03PS1) 10JMeybohm: k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082
[13:13:34] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:14:15] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:16:12] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:16:24] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909)
[13:17:39] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259035|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added (T420722)]], [[gerrit:1259036|tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect (T420722)]], [[gerrit:1259046|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 (T420722)]] (duration: 11m 43s)
[13:17:43] <stashbot>	 T420722: accountJustCreated flag not properly added on WelcomeSurvey redirections - https://phabricator.wikimedia.org/T420722
[13:18:44] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet
[13:19:00] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudlb1001.eqiad.wmnet
[13:19:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm)
[13:19:52] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:19:54] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:06] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet
[13:20:09] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] sre.k8s: Add cookbook to print network topology details of nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm)
[13:20:11] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudlb1001.eqiad.wmnet
[13:20:31] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet
[13:21:30] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[13:21:39] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2332-2336].codfw.wmnet
[13:21:41] <logmsgbot>	 !log jforrester@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/maintenance/createExtensionTables.php --wiki=abstractwiki translate  # T420656
[13:21:46] <stashbot>	 T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656
[13:22:21] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec)
[13:22:52] <wikibugs>	 (03PS1) 10Jforrester: [abstractwiki] Enable the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656)
[13:23:40] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:23:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester)
[13:24:11] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2003.codfw.wmnet
[13:24:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:25:19] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2332-2336].codfw.wmnet
[13:26:40] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:28:52] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet
[13:29:07] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet
[13:29:23] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudlb1002.eqiad.wmnet
[13:29:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:30:03] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet
[13:30:30] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2003.codfw.wmnet
[13:30:51] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2332-2336].codfw.wmnet
[13:30:55] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2332-2336].codfw.wmnet
[13:31:07] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2337-2341].codfw.wmnet
[13:32:07] <wikibugs>	 (03CR) 10Matthieulec: [C:03+1] Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec)
[13:32:40] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:34:09] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:34:10] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec)
[13:36:00] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2337-2341].codfw.wmnet
[13:36:23] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2011.codfw.wmnet
[13:36:32] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2012.codfw.wmnet
[13:36:38] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:38:26] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1002.eqiad.wmnet
[13:39:09] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:39:09] <wikibugs>	 (03Merged) 10jenkins-bot: Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec)
[13:39:36] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Wed 08 Apr 2026 01:39:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[13:41:50] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2011.codfw.wmnet
[13:41:54] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2012.codfw.wmnet
[13:42:22] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye
[13:42:31] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1172.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:42:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11738262 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cu...
[13:43:23] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2337-2341].codfw.wmnet
[13:43:27] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2337-2341].codfw.wmnet
[13:43:38] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2342-2346].codfw.wmnet
[13:44:48] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1244621 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake)
[13:47:19] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2342-2346].codfw.wmnet
[13:47:33] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha1001.wikimedia.org
[13:48:51] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:50:01] <wikibugs>	 (03CR) 10Blake: [C:03+1] "Change seems good, though it looks like the pass ought to be removed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm)
[13:50:11] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:51:33] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha1001.wikimedia.org
[13:51:53] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha1002.wikimedia.org
[13:52:13] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738332 (10wiki_willy)
[13:52:35] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2342-2346].codfw.wmnet
[13:52:39] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2342-2346].codfw.wmnet
[13:52:51] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2347-2351].codfw.wmnet
[13:54:35] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738340 (10Jclark-ctr) a:03Jclark-ctr
[13:54:36] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738342 (10wiki_willy) Adding the ops-eqiad tag and removing ops-eqdfw.  @Jclark-ctr will take a look at it a bit later today.
[13:55:08] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1172.eqiad.wmnet with reason: host reimage
[13:55:51] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha1002.wikimedia.org
[13:56:36] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:56:37] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha2001.wikimedia.org
[13:57:07] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2347-2351].codfw.wmnet
[13:59:22] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1172.eqiad.wmnet with reason: host reimage
[14:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[14:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[14:00:25] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha2001.wikimedia.org
[14:00:37] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha2002.wikimedia.org
[14:02:23] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:03:51] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2347-2351].codfw.wmnet
[14:03:54] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2347-2351].codfw.wmnet
[14:04:06] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2352-2356].codfw.wmnet
[14:04:28] <wikibugs>	 (03PS2) 10FNegri: conftool-data: move s3, x3 to new hosts (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/1256417 (https://phabricator.wikimedia.org/T409557)
[14:04:28] <wikibugs>	 (03PS1) 10FNegri: conftool-data: move s3, x3 to new hosts (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557)
[14:04:30] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add new active-active discovery service for dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:04:39] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha2002.wikimedia.org
[14:06:31] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 #page on db1253 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:06:41] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:06:41] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:07:23] <federico3>	 !ack
[14:07:23] <sirenbot>	 All incidents are already acked.
[14:07:45] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2352-2356].codfw.wmnet
[14:07:58] <wikibugs>	 (03CR) 10Bking: [C:03+2] Add new active-active discovery service for dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:09:25] <jhathaway>	 !incidents
[14:09:25] <sirenbot>	 7787 (ACKED)  db1253 (paged)/MariaDB Replica IO: s7 (paged)
[14:09:25] <sirenbot>	 7788 (UNACKED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[14:09:26] <sirenbot>	 7786 (RESOLVED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[14:09:26] <sirenbot>	 7785 (RESOLVED)  db1253 (paged)/MariaDB Replica Lag: s7 (paged)
[14:09:26] <sirenbot>	 7784 (RESOLVED)  db1253 (paged)/MariaDB Replica IO: s7 (paged)
[14:09:43] <jhathaway>	 !ack 7788
[14:09:44] <sirenbot>	 7788 (ACKED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[14:09:57] <Raine>	 silence gone again?
[14:10:27] <wikibugs>	 (03PS2) 10JMeybohm: k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082
[14:10:47] <wikibugs>	 (03CR) 10Blake: [C:03+1] k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm)
[14:11:00] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1001.eqiad.wmnet
[14:11:13] <federico3>	 that was removed as a side effect of a cookbook but then it had been created again
[14:11:47] <Raine>	 ok, thanks federico3 
[14:13:00] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on db1253.eqiad.wmnet with reason: Under repair
[14:13:08] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738465 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5ce7e720-a20c-4ad5-a612-bbf5c41ccd0a) set by fceratto@cumin1003 for 14 days, 0:00:00 on 1 host(s) and their services with...
[14:14:29] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2352-2356].codfw.wmnet
[14:14:32] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2352-2356].codfw.wmnet
[14:14:32] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[14:14:41] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148)
[14:15:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy)
[14:16:50] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148)
[14:17:05] <Reedy>	 jouncebot: nowandnext
[14:17:06] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 12 minute(s)
[14:17:06] <jouncebot>	 In 0 hour(s) and 12 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1430)
[14:17:13] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1001.eqiad.wmnet
[14:17:20] <icinga-wm>	 PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:46] <icinga-wm>	 PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:18:10] <Daimona>	 Hi folks, could I get a beta-only config change deployed? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1259120
[14:18:10] <wikibugs>	 (03PS2) 10FNegri: conftool-data: move s3, x3 to new hosts (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557)
[14:18:35] <Daimona>	 (Assuming it's fine to do outside of normal deployment windows, since it's beta-only)
[14:18:48] <icinga-wm>	 RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.05 ms
[14:18:52] <icinga-wm>	 RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms
[14:20:07] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:21:21] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet
[14:22:03] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1018.eqiad.wmnet
[14:22:04] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1018.eqiad.wmnet
[14:22:32] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Rebooting clouddb1018 T419960
[14:23:12] <logmsgbot>	 jclark@cumin1003 reimage (PID 1496331) is awaiting input
[14:24:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11738524 (10Eevans) The failed device is `/dev/sdh` (fourth/last device on the second controller?), and `lsblk` thinks its serial number is `KN09N7919I0509R4C`.  If we're confident in which drive to pull, it sh...
[14:27:20] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1002.eqiad.wmnet
[14:30:04] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1430)
[14:30:35] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:30:36] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1172.eqiad.wmnet with OS bullseye
[14:30:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11738557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1...
[14:31:52] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-dse-aa,name=codfw
[14:32:02] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:32:22] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:32:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11738573 (10Jclark-ctr) @BTullis , I was able to reimage it. The an-workers always seem to ha...
[14:32:51] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet
[14:33:27] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1018.eqiad.wmnet
[14:33:28] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1018.eqiad.wmnet
[14:33:34] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1002.eqiad.wmnet
[14:33:57] <logmsgbot>	 !log sukhe@dns1004 FAIL - running authdns-update
[14:33:59] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Rebooting clouddb1019 T419960
[14:34:41] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet
[14:36:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache k8s-ingress-dse-aa.discovery.wmnet on all recursors
[14:36:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) k8s-ingress-dse-aa.discovery.wmnet on all recursors
[14:36:52] <wikibugs>	 (03PS4) 10Clément Goubert: wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos)
[14:36:53] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: proxy Gitiles traffic to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595)
[14:36:53] <wikibugs>	 (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1259121/6169/" [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb)
[14:37:02] <wikibugs>	 (03CR) 10Clément Goubert: wikifeeds: Add request definition for page analytics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos)
[14:37:46] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-dse-aa,name=eqiad
[14:38:42] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:39:50] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.eqiad.wmnet
[14:40:05] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738596 (10Jclark-ctr) I physically checked that the power cables were seated properly; nothing was loose. iDRAC looked healthy. I went through and updated multiple firmwares: 800w Delta psu from 00.1B.53 to...
[14:40:09] <logmsgbot>	 !log sukhe@dns1004 FAIL - running authdns-update
[14:43:34] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:44:59] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:45:15] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-dse-aa,name=eqiad
[14:46:01] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.eqiad.wmnet
[14:47:52] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:47:53] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add new active/active discovery records for dse-k8s opensearch test ns [dns] - 10https://gerrit.wikimedia.org/r/1250063 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:48:05] <wikibugs>	 (03CR) 10Bking: [C:03+2] Add new active/active discovery records for dse-k8s opensearch test ns [dns] - 10https://gerrit.wikimedia.org/r/1250063 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:48:12] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:48:51] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:49:41] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:49:47] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:49:50] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:50:12] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:50:16] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:50:16] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738636 (10FCeratto-WMF) Thank you @Jclark-ctr - is there anything else to be done on your side or can I claim the task?
[14:50:26] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:50:27] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:51:43] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:51:44] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:52:17] <wikibugs>	 (03PS2) 10Milimetric: testKitchen: Add custom stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050)
[14:52:46] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:52:47] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:53:06] <claime>	 jouncebot: nowandnext
[14:53:06] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1430)
[14:53:06] <jouncebot>	 In 0 hour(s) and 36 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1530)
[14:53:15] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert)
[14:54:40] <wikibugs>	 (03PS2) 10Bking: Add new active/active discovery records for dse-k8s opensearch prod ns [dns] - 10https://gerrit.wikimedia.org/r/1250068 (https://phabricator.wikimedia.org/T417698)
[14:54:44] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738684 (10Jclark-ctr) BIOS required a second restart. Just finished, should be good now. I double-checked the logs again just now; still looks good. @FCeratto-WMF Feel free to message me if anyt...
[14:54:58] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add new active/active discovery records for dse-k8s opensearch prod ns [dns] - 10https://gerrit.wikimedia.org/r/1250068 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:55:17] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-eqiad
[14:55:26] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:55:37] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet
[14:55:44] <wikibugs>	 (03CR) 10Bking: [C:03+2] Add new active/active discovery records for dse-k8s opensearch prod ns [dns] - 10https://gerrit.wikimedia.org/r/1250068 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking)
[14:55:48] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:55:59] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1019.eqiad.wmnet
[14:56:00] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1019.eqiad.wmnet
[14:56:04] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert)
[14:56:06] <wikibugs>	 (03PS1) 10Sergio Gimeno: GrowthExperiments: scale edit and thanks query limit to more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259132 (https://phabricator.wikimedia.org/T341599)
[14:56:36] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Rebooting clouddb1020 T419960
[14:56:52] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:56:56] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:57:05] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:57:08] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet
[14:58:08] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric)
[14:58:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache opensearch-test.discovery.wmnet on all recursors
[14:58:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) opensearch-test.discovery.wmnet on all recursors
[14:58:28] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:58:40] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:58:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache opensearch-ipoid.discovery.wmnet on all recursors
[14:59:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) opensearch-ipoid.discovery.wmnet on all recursors
[14:59:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11738720 (10VRiley-WMF) Opened up a Dell ticket to have a replacement drive sent out. SR224226231
[14:59:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:00:26] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:00:33] <wikibugs>	 (03PS3) 10Jdlrobson: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran)
[15:00:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran)
[15:01:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:01:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:02:07] <wikibugs>	 (03PS1) 10Majavah: cloudlb: Merge http-service-by-host to main http-service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134
[15:02:13] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11738727 (10dancy) https://wikitech.wikimedia.org/wiki/Bastion currently shows bast4005.wikimedia.org crossed out.
[15:02:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudlb: Merge http-service-by-host to main http-service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[15:03:00] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-dse-aa,name=eqiad
[15:03:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache opensearch-ipoid.discovery.wmnet on all recursors
[15:03:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) opensearch-ipoid.discovery.wmnet on all recursors
[15:03:40] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:03:55] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1172.eqiad.wmnet
[15:04:38] <wikibugs>	 (03PS1) 10Kosta Harlan: EventStreamConfig: Add performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740)
[15:05:31] <wikibugs>	 (03PS2) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134
[15:05:46] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1172.eqiad.wmnet
[15:06:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[15:06:25] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:04-2] "We set these manually for server side instrumentation, so this would break that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan)
[15:06:33] <wikibugs>	 (03PS1) 10Btullis: Revert "Temporarily set an-worker1172 into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1259138
[15:06:55] <wikibugs>	 (03PS3) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134
[15:07:44] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11738735 (10ayounsi) Thanks, updated.
[15:09:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11738741 (10cmooney) 05Open→03Resolved a:03cmooney Ok this should no longer be an issue after updating the `wikimedia6` prefix...
[15:09:30] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Temporarily set an-worker1172 into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1259138 (owner: 10Btullis)
[15:11:50] <wikibugs>	 (03PS1) 10Btullis: dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553)
[15:12:16] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738758 (10FCeratto-WMF) a:05Jclark-ctr→03FCeratto-WMF @Jclark-ctr thank you.
[15:13:38] <wikibugs>	 (03PS4) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134
[15:14:29] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:50] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet
[15:14:58] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1020.eqiad.wmnet
[15:14:59] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1020.eqiad.wmnet
[15:19:29] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:21:07] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:21:08] <wikibugs>	 (03PS2) 10Kosta Harlan: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740)
[15:21:52] <wikibugs>	 (03PS3) 10Kosta Harlan: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740)
[15:22:17] <wikibugs>	 (03PS5) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134
[15:23:04] <wikibugs>	 (03CR) 10Jforrester: Enable view urls in abstract.wikipedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza)
[15:23:22] <wikibugs>	 (03CR) 10Dreamy Jazz: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan)
[15:23:30] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan)
[15:24:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:26:10] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8325/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[15:26:29] <wikibugs>	 (03PS6) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134
[15:28:52] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm but I'm not that familiar with our nftables setup." [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[15:29:02] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8326/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[15:29:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:30:05] <jouncebot>	 jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1530).
[15:31:51] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1172.eqiad.wmnet
[15:33:02] <wikibugs>	 (03CR) 10BPirkle: [C:03+1] "looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester)
[15:34:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:37:50] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] nftables: support nftables::rules definitions targetting prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[15:39:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:39:50] <wikibugs>	 (03PS1) 10Ebernhardson: search: Add codfw semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259143
[15:39:54] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:44:10] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:45:37] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service restbase1034-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:49:10] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:49:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11738970 (10Jgreen)
[15:50:00] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:50:37] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:50:57] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:51:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11738984 (10VRiley-WMF) Thanks @Eevans Admittedly, I think it would be safest to shut down the server in order to have it verified which disk we are replacing. We have a spare on standby for this. If you wanted...
[15:52:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11738990 (10Jgreen) @Jclark-ctr I just noticed it looks like these were configured in the frack-fundraising1-c-eqiad vlan, looks like I missed updating the install details when the task was create...
[15:52:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:52:31] <wikibugs>	 10ops-codfw, 06cloud-services-team, 06DC-Ops: Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T420948 (10Andrew) 03NEW
[15:53:33] <topranks>	 !log disabling puppet for nftables-enabled machines to validate new ruleset on selected hosts before wider rollout T420715
[15:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:36] <wikibugs>	 (03PS4) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666)
[15:55:18] <wikibugs>	 (03CR) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza)
[15:55:37] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:56:50] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:57:23] <icinga-wm>	 PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:57:31] <jinxer-wm>	 RESOLVED: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:57:35] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:57:45] <icinga-wm>	 PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:57:48] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1025.eqiad.wmnet with reason: Rebooting clouddb1025 T419960
[15:58:05] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1025.eqiad.wmnet
[15:59:07] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] nftables: support nftables::rules definitions targetting prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney)
[15:59:11] <wikibugs>	 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11739070 (10ABran-WMF) 05Open→03In progress p:05Triage→03Medium
[16:00:52] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:02:10] <wikibugs>	 (03CR) 10Dwisehaupt: [C:03+1] Switch fundraising default bastion back to eqiad after kernel update. [dns] - 10https://gerrit.wikimedia.org/r/1259081 (owner: 10Jgreen)
[16:02:37] <wikibugs>	 (03PS1) 10Jdlrobson: Address FIXME and drop not selector for section headings [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259147 (https://phabricator.wikimedia.org/T420085)
[16:02:53] <icinga-wm>	 RECOVERY - Host ps1-b7-codfw is UP: PING WARNING - Packet loss = 71%, RTA = 31.03 ms
[16:02:55] <icinga-wm>	 RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.60 ms
[16:03:05] <logmsgbot>	 !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on aqs1010.eqiad.wmnet with reason: Shutting down for SSD replacement — T420867
[16:03:06] <wikibugs>	 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11739121 (10Dzahn) @Jhancock.wm This server is currently not the active production Phabricator....
[16:03:11] <stashbot>	 T420867: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867
[16:03:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739122 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5399a138-5392-45c7-819b-6efa3f7d322a) set by eevans@cumin1003 for 8:00:00 on 1 host(s) and their services with reason: Shutting down...
[16:03:21] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:03:49] <wikibugs>	 (03PS1) 10Elukey: mcrouter: ease testing new cli parameters [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223)
[16:04:50] <urandom>	 !log stopping aqs1010 for SSD replacement — T420867
[16:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:11] <wikibugs>	 (03CR) 10Elukey: "I would like to test the following:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey)
[16:05:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739147 (10Eevans) >>! In T420867#11738984, @VRiley-WMF wrote: > Thanks @Eevans Admittedly, I think it would be safest to shut down the server in order to have it verified which disk we are replacing. We have...
[16:05:52] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:09:19] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1025.eqiad.wmnet
[16:09:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:38] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1025.eqiad.wmnet
[16:09:39] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1025.eqiad.wmnet
[16:10:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11739173 (10elukey) @Jclark-ctr all hosts provisioned! The new cookbook is not merged, but I thought to unblock you :)
[16:10:38] <wikibugs>	 (03PS1) 10Clément Goubert: rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151
[16:10:52] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:11:01] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1023.eqiad.wmnet with reason: Rebooting clouddb1023 T419960
[16:13:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739209 (10VRiley-WMF) Shut down the unit. Verified the disk location, and brought it back up. Once it was up, I performed the swap. This should be good to go!
[16:13:09] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy)
[16:14:14] <wikibugs>	 (03CR) 10Dzahn: gerrit: add Envoy TLS termination for the CDN path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb)
[16:14:32] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151 (owner: 10Clément Goubert)
[16:15:20] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151 (owner: 10Clément Goubert)
[16:15:52] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:15:55] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:17:23] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151 (owner: 10Clément Goubert)
[16:17:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:18:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:18:49] <wikibugs>	 (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256748 (https://phabricator.wikimedia.org/T420704) (owner: 10Codename Noreste)
[16:19:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:19:22] <wikibugs>	 (03CR) 10Jgreen: [C:03+2] Switch fundraising default bastion back to eqiad after kernel update. [dns] - 10https://gerrit.wikimedia.org/r/1259081 (owner: 10Jgreen)
[16:19:37] <logmsgbot>	 !log jgreen@dns1004 START - running authdns-update
[16:20:28] <wikibugs>	 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11739302 (10OKryva-WMF)
[16:20:52] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:55] <wikibugs>	 (03PS1) 10Clément Goubert: rest-gateway: fix cidr [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259154
[16:21:09] <logmsgbot>	 !log jgreen@dns1004 END - running authdns-update
[16:22:03] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:23:47] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: fix cidr [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259154 (owner: 10Clément Goubert)
[16:24:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739322 (10Eevans) >>! In T420867#11739209, @VRiley-WMF wrote: > Shut down the unit. Verified the disk location, and brought it back up. Once it was up, I performed the swap. This should be good to go!  Thanks...
[16:24:22] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1010.eqiad.wmnet
[16:24:23] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1010.eqiad.wmnet
[16:25:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739351 (10VRiley-WMF) 05Open→03Resolved
[16:25:52] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:26:06] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: fix cidr [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259154 (owner: 10Clément Goubert)
[16:27:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:28:11] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:29:16] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1023.eqiad.wmnet
[16:29:17] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1023.eqiad.wmnet
[16:30:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[16:30:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester)
[16:30:41] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 90%, RTA = 6741.95 ms
[16:30:52] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:30:55] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:31:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[16:31:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11739396 (10VRiley-WMF) Hey @jcrespo we got this ticket to replace a drive on this unit. We can do this as soon as today if you're ready. Since this is under warranty, we're going to use one that is fro...
[16:32:19] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[16:32:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[16:34:11] <wikibugs>	 (03PS14) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[16:34:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:39] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet
[16:34:39] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:34:43] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:35:11] <wikibugs>	 (03PS1) 10Btullis: Update dse-k8s-eqiad to k8s 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484)
[16:35:21] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis)
[16:35:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[16:35:52] <jinxer-wm>	 FIRING: [18x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:36:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:38:07] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[16:38:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[16:38:27] <wikibugs>	 (03PS6) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146)
[16:38:30] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet
[16:40:17] <wikibugs>	 (03CR) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[16:40:44] <wikibugs>	 (03CR) 10Clément Goubert: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[16:40:52] <jinxer-wm>	 FIRING: [18x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:41:19] <wikibugs>	 (03CR) 10Btullis: "Do not merge until the maintenance window on March 26th." [puppet] - 10https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis)
[16:41:41] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:42:02] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11739454 (10hnowlan) For the immediate future I think for the moment we're fine with current thanos-swift capacity. We'll experiment with SSD storage elsewhere but for now we do...
[16:42:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11739456 (10jcrespo) That's an s7 core host, it is for @FCeratto-WMF to make the call.
[16:45:26] <wikibugs>	 (03CR) 10Btullis: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[16:45:52] <jinxer-wm>	 FIRING: [18x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:46:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11739476 (10AnnieKim_WMDE) Encountering an error when I try to log into Superset: "Authentication Failure. Service access denied due to missing privileges." C...
[16:46:59] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:47:30] <wikibugs>	 (03CR) 10Trueg: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[16:49:22] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[16:50:01] <wikibugs>	 (03PS1) 10Btullis: Update dse-k8s-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484)
[16:50:34] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:50:52] <jinxer-wm>	 FIRING: [17x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:53] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[16:52:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:52:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:53:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[16:54:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11739508 (10BTullis) 05Open→03Resolved I believe that this is now fixed. Thanks @Jclar...
[16:55:40] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet
[16:55:47] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh7004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[16:55:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:55] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:55] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for 14 hosts
[16:55:55] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:56:04] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts
[16:56:37] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway)
[16:56:56] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:57:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update dse-k8s-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis)
[16:58:35] <wikibugs>	 (03CR) 10Trueg: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[16:58:39] <wikibugs>	 (03PS1) 10Clément Goubert: rest-gateway: fix mobileapps cluster for core [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259160
[16:59:11] <wikibugs>	 (03PS2) 10Trueg: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415)
[16:59:19] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1700)
[17:00:05] <jouncebot>	 ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1700).
[17:00:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet
[17:00:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:01:57] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: fix mobileapps cluster for core [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259160 (owner: 10Clément Goubert)
[17:02:03] <wikibugs>	 (03PS1) 10Scott French: mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245)
[17:02:15] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:03:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[17:03:59] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: fix mobileapps cluster for core [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259160 (owner: 10Clément Goubert)
[17:04:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[17:04:23] <wikibugs>	 (03PS2) 10Scott French: mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245)
[17:04:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[17:05:14] <wikibugs>	 (03PS1) 10JHathaway: WIP: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1259162
[17:05:41] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add api.w.o device-analytics support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert)
[17:05:51] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[17:05:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:06:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[17:06:45] <wikibugs>	 (03PS2) 10JHathaway: WIP: do not merge, test 2 [puppet] - 10https://gerrit.wikimedia.org/r/1259162
[17:07:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[17:07:59] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Add api.w.o device-analytics support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert)
[17:08:13] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11739553 (10bd808) Load has been spiky over the last 7 days with increased spike frequency on 2026-03-22 for sure. {F73533258,size=full} We likely have either a new range that b...
[17:08:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[17:08:33] <swfrench-wmf>	 o/
[17:08:33] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[17:08:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[17:09:11] <swfrench-wmf>	 FYI, as part of this infra window, I'll be applying a change to mw-web in a little bit
[17:09:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[17:09:56] <wikibugs>	 (03PS3) 10JHathaway: WIP: do not merge, test 2 [puppet] - 10https://gerrit.wikimedia.org/r/1259162
[17:10:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[17:12:36] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[17:12:49] <logmsgbot>	 !log bd808@deploy2002 Started deploy [releng/jenkins-deploy@f47af21] (releasing): jobs: Use TZ=UTC in branchMWSingleVersion.groovy trigger (T404399)
[17:12:54] <stashbot>	 T404399: wmf/next branch cut job on releases-jenkins and systemd timer on deployment server times overlap - https://phabricator.wikimedia.org/T404399
[17:13:36] <logmsgbot>	 !log bd808@deploy2002 Finished deploy [releng/jenkins-deploy@f47af21] (releasing): jobs: Use TZ=UTC in branchMWSingleVersion.groovy trigger (T404399) (duration: 01m 36s)
[17:13:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[17:14:46] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[17:15:55] <jinxer-wm>	 FIRING: [11x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:16:34] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:17:01] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:17:39] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:18:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:19:54] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:20:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:20:41] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[17:20:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:21:05] <wikibugs>	 (03PS1) 10Tiziano Fogli: thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1259168 (https://phabricator.wikimedia.org/T410152)
[17:21:45] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:21:54] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:22:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:23:41] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:24:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:25:18] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: 10Scardenasmolinar)
[17:25:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:26:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:26:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle)
[17:27:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:29:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:30:01] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1100.eqiad.wmnet [reason: trixie reimaging]
[17:30:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:31:08] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS trixie
[17:31:28] <logmsgbot>	 brett@cumin2002 reimage (PID 1072326) is awaiting input
[17:32:40] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1259168 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli)
[17:33:09] <wikibugs>	 (03PS1) 10Hnowlan: prometheus: add recording rules for the appservers RED dashboard [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663)
[17:33:14] <Dreamy_Jazz>	 jouncebot: nowandnext
[17:33:14] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1700)
[17:33:14] <jouncebot>	 In 2 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2000)
[17:33:38] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:34:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:34:26] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1101.eqiad.wmnet [reason: trixie reimaging]
[17:34:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan)
[17:34:49] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS trixie
[17:35:13] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:35:26] <swfrench-wmf>	 ?? why is there a backport happening
[17:35:37] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan)
[17:35:48] <Dreamy_Jazz>	 Stopping scap
[17:35:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:35:54] <Dreamy_Jazz>	 Thought the window wasn't being used
[17:36:17] <Dreamy_Jazz>	 (additionally because the change is a no-op)
[17:36:17] <swfrench-wmf>	 Dreamy_Jazz: ah, got it - I'll be done checking on things in ~ 10 mins or so
[17:37:00] <Dreamy_Jazz>	 Thanks, apologies
[17:38:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975 (10cmooney) 03NEW p:05Triage→03Medium
[17:39:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[17:39:54] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:40:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975#11739860 (10cmooney)
[17:40:52] <jinxer-wm>	 FIRING: [22x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:41:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran)
[17:42:33] <swfrench-wmf>	 Dreamy_Jazz: alright, things look good. all yours :)
[17:42:56] <Dreamy_Jazz>	 Thanks, and apologies again (should have seen your message from above about using the window but missed it)
[17:43:27] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1259136|EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream (T418740)]]
[17:43:32] <stashbot>	 T418740: Special:CheckUser: Conditionally show a link to "SI cases" - https://phabricator.wikimedia.org/T418740
[17:43:43] <swfrench-wmf>	 a lot of noise in here today!
[17:45:17] <logmsgbot>	 !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Backport for [[gerrit:1259136|EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream (T418740)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:45:38] <logmsgbot>	 !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Continuing with sync
[17:45:52] <jinxer-wm>	 FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:49:19] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:49:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[17:49:56] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259136|EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream (T418740)]] (duration: 06m 28s)
[17:50:01] <stashbot>	 T418740: Special:CheckUser: Conditionally show a link to "SI cases" - https://phabricator.wikimedia.org/T418740
[17:50:03] <Dreamy_Jazz>	 I'm done with scap
[17:50:52] <jinxer-wm>	 FIRING: [15x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:50:55] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:53:13] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase-eqiad
[17:54:03] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1115.eqiad.wmnet with OS trixie
[17:54:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:54:48] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[17:55:52] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:59:40] <wikibugs>	 (03PS4) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu)
[18:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[18:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[18:00:52] <jinxer-wm>	 RESOLVED: [14x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:04:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:05:55] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{aqs[1011,1014,1016-1022]*} and P{P:Cassandra}
[18:10:40] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:10:42] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie
[18:10:52] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:10:55] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:10:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[18:14:59] <wikibugs>	 (03CR) 10Catrope: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu)
[18:15:29] <wikibugs>	 (03CR) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu)
[18:15:48] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:15:52] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:16:16] <sukhe>	 ^ bird is expected, trying to move traffic over
[18:17:16] <wikibugs>	 06SRE, 06SRE-OnFire, 10Observability-Alerting: vopsbot !ack and !resolve without incident numbers aren't working - https://phabricator.wikimedia.org/T420982 (10RLazarus) 03NEW p:05Triage→03Medium
[18:20:40] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:20:52] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:20:55] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:22:03] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu)
[18:22:04] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu)
[18:25:48] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:25:52] <jinxer-wm>	 RESOLVED: [10x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:30:40] <jinxer-wm>	 FIRING: [5x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:31:32] <wikibugs>	 (03PS1) 10Aaron Schulz: Add Analytics APIs to the RestSandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259183 (https://phabricator.wikimedia.org/T419429)
[18:35:51] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: depooled host (soon to be decomed)
[18:35:52] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:35:59] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740152 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7de31a58-f28e-43d7-99e1-e30cec213330) set by sukhe@cumin1003 for 3 days, 0:00:00...
[18:36:12] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: depooled host (soon to be decomed)
[18:36:20] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740153 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c4c9ce4c-f2e1-4e09-a892-c11aee00f6ea) set by sukhe@cumin1003 for 3 days, 0:00:00...
[18:40:52] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:41:07] <wikibugs>	 (03CR) 10Rubah Hitam Vukova: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste)
[18:41:11] <wikibugs>	 (03CR) 10Rubah Hitam Vukova: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste)
[18:41:13] <wikibugs>	 (03CR) 10Rubah Hitam Vukova: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste)
[18:42:36] <wikibugs>	 (03PS1) 10AKhatun: stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225)
[18:45:52] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:48:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740224 (10ssingh) Thanks for all the work here @cmooney and for mentioning this, something that I had most certainly overlooked at least. I will think a bit...
[18:49:42] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie
[18:50:03] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[18:50:52] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:53:32] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1100.eqiad.wmnet with OS trixie
[18:53:49] <wikibugs>	 (03PS9) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112)
[18:53:49] <wikibugs>	 (03PS1) 10Eevans: charts/cassandra-http-gateway: configurable Cassandra keyspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259188 (https://phabricator.wikimedia.org/T414112)
[18:54:11] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS trixie
[18:55:43] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1101.eqiad.wmnet with OS trixie
[18:55:52] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:55:55] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:56:07] <wikibugs>	 (03CR) 10Bking: [C:03+2] "You are actually correct, we will be flying blind until we can get on the new chart (if we have to...we have also discussed making a separ" [puppet] - 10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking)
[18:57:05] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS trixie
[18:59:32] <wikibugs>	 (03CR) 10JavierMonton: [C:03+1] stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun)
[18:59:46] <inflatador>	 !log bking@deploy2002 restarting opensearch-semantic-search eqiad to renew certs
[18:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:52] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs1017-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:07:39] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun)
[19:08:31] <wikibugs>	 (03CR) 10AKhatun: [C:03+2] stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun)
[19:10:38] <wikibugs>	 (03Merged) 10jenkins-bot: stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun)
[19:10:41] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage
[19:10:52] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:12:16] <wikibugs>	 (03CR) 10Brouberol: [C:04-1] "LGTM! Setting a -1 so this does not get merged before the maintenance window" [puppet] - 10https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis)
[19:12:51] <wikibugs>	 (03CR) 10Brouberol: [C:04-1] "LGTM! Setting a -1 so this does not get merged before the maintenance window" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: 10Btullis)
[19:13:05] <logmsgbot>	 !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[19:13:20] <logmsgbot>	 !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[19:13:44] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage
[19:14:26] <wikibugs>	 (03PS7) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134
[19:14:44] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage
[19:14:58] <wikibugs>	 (03CR) 10Brouberol: "I think that (looking at the CI logs) you also need to set `installCRDs: false` in `helmfile.d/admin_ng/cert-manager/cert-manager-values.y" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis)
[19:15:52] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:17:07] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8328/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[19:17:26] <wikibugs>	 (03PS1) 10Ayounsi: anycast: don't prepent last AS in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1259199
[19:17:47] <wikibugs>	 (03PS2) 10Ayounsi: anycast: don't prepend last AS in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1259199
[19:18:01] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage
[19:18:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740353 (10cmooney) Thanks @ssingh.  I think a cookbook that takes down doh and durum simultaneously at a site (I assume by changing bird?) would solve this p...
[19:19:47] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "I can confirm the behaviour we are seeing, not sure about the syntax but I trust you know it so looks good!" [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi)
[19:20:52] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:21:15] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi)
[19:23:50] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] anycast: don't prepend last AS in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi)
[19:25:31] <wikibugs>	 (03Merged) 10jenkins-bot: anycast: don't prepend last AS in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi)
[19:25:52] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:29:38] <wikibugs>	 (03CR) 10Scott French: "Thanks, Matthew!" [puppet] - 10https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) (owner: 10Scott French)
[19:30:01] <wikibugs>	 (03CR) 10Scott French: [C:03+2] admin: Add mpostoronca shell access and deployment membership [puppet] - 10https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) (owner: 10Scott French)
[19:30:32] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie
[19:30:52] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:34:56] <wikibugs>	 (03PS7) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948)
[19:35:52] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis)
[19:35:52] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:37:55] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1100.eqiad.wmnet with OS trixie
[19:38:30] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11740519 (10Scott_French) 05Open→03Resolved a:03Scott_French @MPostoronca-WMF - Thanks for your patience. This should be rolling out over the next 30 minutes or so.
[19:39:02] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1100.eqiad.wmnet [reason: trixie reimaging]
[19:40:20] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[19:40:28] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[19:40:30] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1101.eqiad.wmnet with OS trixie
[19:40:31] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1102.eqiad.wmnet [reason: trixie reimaging]
[19:41:01] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS trixie
[19:41:30] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1101.eqiad.wmnet [reason: trixie reimaging]
[19:42:15] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1103.eqiad.wmnet [reason: trixie reimaging]
[19:42:34] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS trixie
[19:44:29] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on P{aqs[1011,1014,1016-1022]*} and P{P:Cassandra}
[19:44:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[19:46:02] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza)
[19:46:02] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[19:46:44] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy4003.wikimedia.org
[19:47:33] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy4003.wikimedia.org
[19:47:58] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie
[19:48:47] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740581 (10Scott_French) @AnnieKim_WMDE - Thanks for creating your LDAP account (having one is a prerequisite for gaining the privileges sought here). I'll f...
[19:49:56] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-18-023444 to 2026-03-23-124102 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259205 (https://phabricator.wikimedia.org/T418150)
[19:50:21] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy4004.wikimedia.org
[19:50:23] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-18-023444 to 2026-03-23-124102 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259205 (https://phabricator.wikimedia.org/T418150) (owner: 10Jforrester)
[19:50:51] <logmsgbot>	 !log cdobbins@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS trixie
[19:51:01] <logmsgbot>	 !log cdobbins@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1102.eqiad.wmnet with OS trixie
[19:51:09] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy4004.wikimedia.org
[19:52:05] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740594 (10Scott_French)
[19:52:27] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-18-023444 to 2026-03-23-124102 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259205 (https://phabricator.wikimedia.org/T418150) (owner: 10Jforrester)
[19:54:01] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[19:54:28] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[19:54:29] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[19:57:33] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[19:58:07] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS trixie
[19:58:10] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[19:58:15] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[19:58:47] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[19:59:06] <wikibugs>	 (03PS1) 10Scott French: admin: Add anniekimwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1259208 (https://phabricator.wikimedia.org/T420500)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2000).
[20:00:05] <jouncebot>	 alexsanford, RoanKattouw, danisztls, James_F, milimetric, and cmelo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <James_F>	 Hey.
[20:00:21] <alexsanford>	 Hey!
[20:00:28] <alexsanford>	 I can start with mine. I need to deploy a private file first, and then do the config update 
[20:00:33] <Reedy>	 deployment confusion time
[20:00:35] <James_F>	 Ack.
[20:00:48] <danisztls>	 o/
[20:01:27] <milimetric>	 hi here
[20:01:31] <cmelo>	 \o/
[20:02:01] <milimetric>	 my config update is very isolated if anyone wants to merge it with theirs
[20:02:34] <cmelo>	 same
[20:04:54] <wikibugs>	 (03CR) 10CDanis: "I think this is fine, but, I'll note that you could also do this in the CDN directly with some extra mappings in `hieradata/common/profile" [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb)
[20:07:47] <alexsanford>	 !log Deployed mitigation for T419605
[20:07:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:04] <alexsanford>	 (doing config change next)
[20:08:27] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS trixie
[20:08:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: 10Alex.sanford)
[20:08:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740702 (10Scott_French)
[20:09:34] <wikibugs>	 (03Merged) 10jenkins-bot: Reduce reauth timeout for editing site JS to 10 minutes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: 10Alex.sanford)
[20:09:52] <logmsgbot>	 !log alexsanford@deploy2002 Started scap sync-world: Backport for [[gerrit:1256472|Reduce reauth timeout for editing site JS to 10 minutes (T419605)]]
[20:09:53] <danisztls>	 cmelo, milimetric: if James_F doesn't mind I can batch yours with my deployment
[20:10:11] <milimetric>	 danisztls: thank you, that'd be great
[20:10:17] <James_F>	 danisztls: Sure!
[20:10:29] <milimetric>	 (do I need to +2 it or you do that?)
[20:10:40] <James_F>	 milimetric: danisztls will do that.
[20:10:52] <milimetric>	 (sorry thx)
[20:11:08] <James_F>	 Never a problem. :-)
[20:11:31] <cmelo>	 thanks danisztls
[20:11:41] <danisztls>	 I'll let SpiderPig do the 'dirty' work.
[20:11:42] <logmsgbot>	 !log alexsanford@deploy2002 alexsanford: Backport for [[gerrit:1256472|Reduce reauth timeout for editing site JS to 10 minutes (T419605)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:13:06] <logmsgbot>	 !log alexsanford@deploy2002 alexsanford: Continuing with sync
[20:14:45] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage
[20:17:24] <logmsgbot>	 !log alexsanford@deploy2002 Finished scap sync-world: Backport for [[gerrit:1256472|Reduce reauth timeout for editing site JS to 10 minutes (T419605)]] (duration: 07m 32s)
[20:17:51] <alexsanford>	 Ok, mine is all good :)
[20:18:23] <James_F>	 alexsanford: Are you doing RoanKattouw's patch too? Or is it over to danisztls?
[20:18:24] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] admin: Add anniekimwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1259208 (https://phabricator.wikimedia.org/T420500) (owner: 10Scott French)
[20:19:10] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage
[20:19:23] <alexsanford>	 Over to danisztls
[20:20:34] <danisztls>	 alexsanford: thanks
[20:20:36] <danisztls>	 proceeding
[20:21:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza)
[20:21:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza)
[20:21:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza)
[20:21:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric)
[20:21:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy)
[20:21:19] <wikibugs>	 (03PS1) 10Bking: discovery: Replace soon-to-be-expired intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1259216 (https://phabricator.wikimedia.org/T420993)
[20:21:42] <James_F>	 Not my one? :-)
[20:21:55] <James_F>	 (I can also self-deploy, no worries.)
[20:22:22] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[20:22:27] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy participant recruitment survey on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza)
[20:22:28] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[20:22:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza)
[20:22:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza)
[20:22:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11740758 (10VRiley-WMF)
[20:22:34] <wikibugs>	 (03Merged) 10jenkins-bot: testKitchen: Add custom stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric)
[20:22:37] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy)
[20:23:08] <danisztls>	 need to rebase
[20:23:20] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie
[20:24:31] <wikibugs>	 (03PS3) 10DDesouza: Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275)
[20:24:56] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[20:26:09] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza)
[20:27:17] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza)
[20:27:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza)
[20:27:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza)
[20:28:53] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[20:30:37] <wikibugs>	 (03PS2) 10Bking: discovery: Replace soon-to-be-expired intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1259216 (https://phabricator.wikimedia.org/T420993)
[20:30:59] <wikibugs>	 (03PS2) 10DDesouza: Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778)
[20:31:14] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy4001.wikimedia.org
[20:33:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza)
[20:33:39] <cmelo>	 All good with mine, thank you!!!
[20:33:51] <James_F>	 cmelo: Yours isn't deployed yet.
[20:33:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11740859 (10Jgreen) @Jclark-ctr we don't have the prod management password, only the frack one and a temporary one the other DC-Ops use for us. Can you reset these too?
[20:33:57] <danisztls>	 cmelo: haven't deployed any yey
[20:33:58] <James_F>	 Just merged.
[20:34:00] <danisztls>	 *yet
[20:34:20] <danisztls>	 sorry about the delay, my patches needed to be rebased
[20:34:21] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza)
[20:34:43] <logmsgbot>	 !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1254448|Undeploy participant recruitment survey on ptwiki (T419275)]], [[gerrit:1254450|Undeploy participant recruitment survey on trwiki (T419275)]], [[gerrit:1254452|Undeploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1255763|testKitchen: Add custom stream name (T417050)]], [[gerrit:1259120|Enable wgCampaignEventsEnableEventGoals in beta wikis (T414148)]]
[20:34:51] <stashbot>	 T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275
[20:34:51] <stashbot>	 T419778: Deploy QuickSurvey for research participant registration drive on frwiki - https://phabricator.wikimedia.org/T419778
[20:34:52] <stashbot>	 T417050: Attribution Research: Instrument pageviews - https://phabricator.wikimedia.org/T417050
[20:34:52] <stashbot>	 T414148: Enable event goals in beta - https://phabricator.wikimedia.org/T414148
[20:35:40] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[20:36:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11740893 (10Jgreen) a:05Jgreen→03Jclark-ctr
[20:36:38] <logmsgbot>	 !log dani@deploy2002 milimetric, daimona, dani: Backport for [[gerrit:1254448|Undeploy participant recruitment survey on ptwiki (T419275)]], [[gerrit:1254450|Undeploy participant recruitment survey on trwiki (T419275)]], [[gerrit:1254452|Undeploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1255763|testKitchen: Add custom stream name (T417050)]], [[gerrit:1259120|Enable wgCampaignEventsEnableEventGoals in beta wikis (T414148)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:36:41] <cmelo>	 No worries, I can already see the changes available in beta
[20:36:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11740899 (10Jgreen) a:05Jgreen→03Jclark-ctr
[20:36:52] <danisztls>	 cmelo: great!
[20:36:56] <danisztls>	 milimetric: can you test?
[20:37:36] <milimetric>	 mine isn't testable until a deployment tomorrow, it's just preparing for that
[20:37:46] <danisztls>	 milimetric: ok
[20:37:49] <milimetric>	 nothing that uses that config is broken on debug servers, so all good
[20:37:52] <logmsgbot>	 !log dani@deploy2002 milimetric, daimona, dani: Continuing with sync
[20:39:51] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003"
[20:40:37] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003"
[20:40:37] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:40:39] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy4001.wikimedia.org
[20:40:45] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740918 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `hcaptcha-proxy4001.wikimedia.org` - hcaptcha-proxy40...
[20:41:55] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy4002.wikimedia.org
[20:42:07] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1102.eqiad.wmnet with OS trixie
[20:42:09] <logmsgbot>	 !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254448|Undeploy participant recruitment survey on ptwiki (T419275)]], [[gerrit:1254450|Undeploy participant recruitment survey on trwiki (T419275)]], [[gerrit:1254452|Undeploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1255763|testKitchen: Add custom stream name (T417050)]], [[gerrit:1259120|Enable wgCampaignEventsEnableEventGoals in beta wikis (T414148)]] (duration: 07m 26s)
[20:42:17] <stashbot>	 T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275
[20:42:18] <stashbot>	 T419778: Deploy QuickSurvey for research participant registration drive on frwiki - https://phabricator.wikimedia.org/T419778
[20:42:18] <stashbot>	 T417050: Attribution Research: Instrument pageviews - https://phabricator.wikimedia.org/T417050
[20:42:19] <stashbot>	 T414148: Enable event goals in beta - https://phabricator.wikimedia.org/T414148
[20:42:21] <danisztls>	 RoanKattouw, James_F: I'm done
[20:42:26] <James_F>	 OK.
[20:42:29] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] Abstract Wikipedia: Fix API call to get page info [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1256394 (https://phabricator.wikimedia.org/T420725) (owner: 10Jforrester)
[20:42:44] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] [abstractwiki] Enable the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester)
[20:42:46] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester)
[20:42:56] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1102.eqiad.wmnet [reason: trixie reimaging]
[20:43:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1256394 (https://phabricator.wikimedia.org/T420725) (owner: 10Jforrester)
[20:43:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester)
[20:43:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester)
[20:43:19] <wikibugs>	 (03PS1) 10RLazarus: cache.mcrouter: Copy 1.3.4 to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259221
[20:43:19] <wikibugs>	 (03PS1) 10RLazarus: cache.mcrouter: Add replica.remote_read option [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259222 (https://phabricator.wikimedia.org/T411807)
[20:44:24] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740929 (10Ottomata) Approved.
[20:44:38] <wikibugs>	 (03Merged) 10jenkins-bot: [abstractwiki] Enable the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester)
[20:44:42] <wikibugs>	 (03Merged) 10jenkins-bot: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester)
[20:45:09] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[20:46:22] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[20:47:19] <wikibugs>	 (03Merged) 10jenkins-bot: Abstract Wikipedia: Fix API call to get page info [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1256394 (https://phabricator.wikimedia.org/T420725) (owner: 10Jforrester)
[20:47:40] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1256394|Abstract Wikipedia: Fix API call to get page info (T420725)]], [[gerrit:1259085|[abstractwiki] Enable the Translate extension (T420656)]], [[gerrit:1250113|Move testwiki-only Attribution REST API definition to IS]]
[20:47:46] <stashbot>	 T420725: Abstract Wikipedia allows creation of existing articles - https://phabricator.wikimedia.org/T420725
[20:47:47] <stashbot>	 T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656
[20:47:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:49:10] <wikibugs>	 (03CR) 10Scott French: [C:03+2] admin: Add anniekimwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1259208 (https://phabricator.wikimedia.org/T420500) (owner: 10Scott French)
[20:49:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:50:10] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003"
[20:50:25] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003"
[20:50:25] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:50:30] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy4002.wikimedia.org
[20:50:40] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740956 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `hcaptcha-proxy4002.wikimedia.org` - hcaptcha-proxy40...
[20:51:25] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1103.eqiad.wmnet with OS trixie
[20:51:34] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740957 (10ssingh) hcaptcha-proxy400[12], on the old Ganeti setup are now decommissioned. I think these were the last two VMs that had to be moved.
[20:53:34] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1256394|Abstract Wikipedia: Fix API call to get page info (T420725)]], [[gerrit:1259085|[abstractwiki] Enable the Translate extension (T420656)]], [[gerrit:1250113|Move testwiki-only Attribution REST API definition to IS]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:53:40] <stashbot>	 T420725: Abstract Wikipedia allows creation of existing articles - https://phabricator.wikimedia.org/T420725
[20:53:40] <stashbot>	 T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656
[20:53:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740960 (10Scott_French) 05Open→03Resolved a:03Scott_French Thanks, all!  @AnnieKim_WMDE - Your [[ http...
[20:54:19] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job mtail in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:54:31] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[20:56:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11740964 (10Scott_French)
[20:58:52] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1256394|Abstract Wikipedia: Fix API call to get page info (T420725)]], [[gerrit:1259085|[abstractwiki] Enable the Translate extension (T420656)]], [[gerrit:1250113|Move testwiki-only Attribution REST API definition to IS]] (duration: 11m 12s)
[20:58:58] <stashbot>	 T420725: Abstract Wikipedia allows creation of existing articles - https://phabricator.wikimedia.org/T420725
[20:58:58] <James_F>	 All done, just in time.
[20:58:58] <stashbot>	 T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2100).
[21:01:02] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11740987 (10Scott_French) @Daria-WMDE - Great, thank you! Once the NDA comes through, I believe that should be everything we need to en...
[21:03:03] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1103.eqiad.wmnet [reason: trixie reimaging]
[21:03:15] <RoanKattouw>	 maryum: You're doing the security deploy I think? Once you're done I have another patch that I forgot to do during the previous window
[21:03:15] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11740996 (10Scott_French)
[21:04:12] <wikibugs>	 (03PS2) 10Jforrester: Move GrowthExperiments REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250114
[21:04:38] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1104.eqiad.wmnet [reason: trixie reimaging]
[21:05:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11741000 (10Jgreen) Working on frqueue1005: * disabled the "embedded" NICs * set serial port address: COM2 * set console redirection after boot: enabled * switched boot method...
[21:05:10] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS trixie
[21:05:18] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1106.eqiad.wmnet [reason: trixie reimaging]
[21:05:26] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1106.eqiad.wmnet [reason: trixie reimaging]
[21:05:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11741004 (10Scott_French) @Alice.moutinho - Great, thank you - I see alicem LDAP account was created. Once the NDA comes through, I believe that should be everything...
[21:08:24] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11741014 (10Scott_French)
[21:08:36] <maryum>	 RoanKattouw: yes, getting started with security deploys now
[21:08:51] <maryum>	 RoanKattouw: are you deploying anything?
[21:10:15] <RoanKattouw>	 maryum: Yes a patch from the backport window (previous hour) that I didn't get to
[21:10:32] <maryum>	 RoanKattouw: if you want you can go ahead and do that now
[21:11:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255847 (owner: 10Catrope)
[21:12:33] <wikibugs>	 (03Merged) 10jenkins-bot: testwiki: Add temporary groups for security testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255847 (owner: 10Catrope)
[21:12:52] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1255847|testwiki: Add temporary groups for security testing]]
[21:13:23] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11741045 (10Scott_French) @WMDE-leszek - Thank you!  @kera_wmde - Just to confirm, from the title of this task, it sounds like you are requesting "level 1" access [[ https://wikitech.wi...
[21:13:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11741047 (10Scott_French)
[21:14:59] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741049 (10Scott_French) @bvibber - Just to signal boost in case it got lost in the noise:   >>! In T420406#11722329, @ayounsi wrote: > @bvibber...
[21:18:43] <logmsgbot>	 !log catrope@deploy2002 catrope: Backport for [[gerrit:1255847|testwiki: Add temporary groups for security testing]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:19:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:21:08] <logmsgbot>	 !log catrope@deploy2002 catrope: Continuing with sync
[21:22:14] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie
[21:25:25] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255847|testwiki: Add temporary groups for security testing]] (duration: 12m 33s)
[21:28:07] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741074 (10bvibber) and..... signed :D thx!
[21:28:50] <wikibugs>	 (03Abandoned) 10JHathaway: WIP: do not merge, test 2 [puppet] - 10https://gerrit.wikimedia.org/r/1259162 (owner: 10JHathaway)
[21:29:12] <maryum>	 preparing to run scap
[21:34:14] <wikibugs>	 (03PS9) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:34:30] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741106 (10Scott_French)
[21:34:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:35:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[21:37:28] <wikibugs>	 (03CR) 10JHathaway: "Apologies for the wait @taavi@wikimedia.org. I made an attempt at iterating on your good work to further reproduce the duplication in logi" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:39:23] <wikibugs>	 (03PS10) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:40:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:41:32] <maryum>	 !log Deployed security fix for T419168
[21:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:41] <maryum>	 first of three patches deployed
[21:43:07] <wikibugs>	 (03PS11) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:43:09] <maryum>	 running scap for second patch
[21:43:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:44:10] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741125 (10Scott_French) @bvibber - Great, thanks! One last question: I see that the SSH public key you've provided here is different from [[ htt...
[21:50:24] <wikibugs>	 (03PS3) 10Scott French: admin: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi)
[21:50:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741139 (10bvibber) @Scott_French ah I misread the instructions I think. :D Ok to provide the same key as for other wikimedia production servers,...
[21:51:28] <wikibugs>	 (03CR) 10Scott French: "Manual rebase to absorb changes to `analytics_privatedata_users`." [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi)
[21:53:03] <wikibugs>	 (03PS12) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[21:53:17] <maryum>	 !log Deployed security fix for T419192
[21:53:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:27] <maryum>	 preparing to run scap for the 3rd and final security patch
[21:54:53] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[21:56:14] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741162 (Scott_French) @bvibber - Thanks! Yes, exactly - you can continue to use your existing production SSH public key as usual (i.e., the on...
[21:56:43] <wikibugs>	 (PS1) Daimona Eaytoy: Enable the CampaignEvents extension on all wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597)
[21:56:58] <wikibugs>	 (PS1) Bking: trixie: Add component/opensearch2 [puppet] - https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759)
[21:57:08] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741165 (bvibber)
[21:57:27] <wikibugs>	 (CR) CI reject: [V:-1] trixie: Add component/opensearch2 [puppet] - https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) (owner: Bking)
[21:57:33] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741168 (bvibber) @Scott_French thanks done! Same ol' public key ;)
[21:57:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11741170 (Jgreen) frmx1002's management interface isn't accessible, doesn't respond to ping
[21:58:49] <wikibugs>	 (PS2) Bking: trixie: Add component/opensearch2 [puppet] - https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759)
[21:59:06] <wikibugs>	 (PS1) Daimona Eaytoy: [WIP] Enable CampaignEvents on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1259233
[22:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[22:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[22:02:41] <wikibugs>	 (CR) RLazarus: [C:+1] admin: Add bvibber to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: Ayounsi)
[22:03:07] <wikibugs>	 (Abandoned) Daimona Eaytoy: [WIP] Enable CampaignEvents on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1259233 (owner: Daimona Eaytoy)
[22:04:06] <wikibugs>	 (CR) Scott French: [C:+2] admin: Add bvibber to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: Ayounsi)
[22:04:35] <wikibugs>	 (PS1) Daimona Eaytoy: Enable $wgCampaignEventsEnableEventGoals in prod wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149)
[22:04:56] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597) (owner: Daimona Eaytoy)
[22:05:32] <maryum>	 !log Deployed security fix for T415584
[22:05:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:42] <maryum>	 Security deploy is finished
[22:05:49] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149) (owner: Daimona Eaytoy)
[22:07:00] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie
[22:08:47] <wikibugs>	 ops-eqiad, DC-Ops: firmware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007 (BCornwall) NEW
[22:19:03] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741336 (Scott_French) Open→Resolved a:Scott_French Alright, I think that should do it!  @bvibber - The c...
[22:25:52] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1104.eqiad.wmnet with OS trixie
[22:28:03] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host an-worker1172.eqiad.wmnet
[22:31:55] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:36:52] <wikibugs>	 (CR) Majavah: [C:-1] "Unfortunately the latest PS seems to be re-introducing T351094. See, e.g. here: https://puppet-compiler.wmflabs.org/output/1212097/6172/cl" [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[22:38:01] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:44:50] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[22:49:25] <wikibugs>	 (PS1) LorenMora: Transition reading list experiment to instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368)
[22:51:31] <rzl>	 !log root@apt1002:~# reprepro --noskipold --restrict vopsbot update bookworm-wikimedia
[22:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2300)
[23:04:49] <rzl>	 !incidents
[23:04:49] <sirenbot>	 7787 (ACKED)  db1253 (paged)/MariaDB Replica IO: s7 (paged)
[23:04:50] <sirenbot>	 7788 (ACKED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[23:04:50] <sirenbot>	 7786 (RESOLVED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[23:04:50] <sirenbot>	 7785 (RESOLVED)  db1253 (paged)/MariaDB Replica Lag: s7 (paged)
[23:04:50] <sirenbot>	 7784 (RESOLVED)  db1253 (paged)/MariaDB Replica IO: s7 (paged)
[23:04:57] <rzl>	 !resolve
[23:04:58] <sirenbot>	 7787 (RESOLVED)  db1253 (paged)/MariaDB Replica IO: s7 (paged)
[23:04:58] <sirenbot>	 7788 (RESOLVED)  db1253 (paged)/MariaDB Replica SQL: s7 (paged)
[23:05:02] <rzl>	 \o/
[23:07:02] <wikibugs>	 SRE, SRE-OnFire, Observability-Alerting: vopsbot !ack and !resolve without incident numbers aren't working - https://phabricator.wikimedia.org/T420982#11741518 (RLazarus) Open→Resolved
[23:08:28] <wikibugs>	 (CR) Aude: "This looks good though think we need to wait until https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1251505 is full" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368) (owner: LorenMora)
[23:18:14] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741525 (bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/5d4298ce7a31d1650f6741e2b9051b82e9661c8a%5E%21/#F0 ` diff --git a/deployment-prep/_.yam...
[23:35:59] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741537 (bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/b304067573816dedc4548607ca96202083700afd%5E%21/#F0 ` diff --git a/deployment-prep/_.yam...
[23:36:31] <logmsgbot>	 brett@cumin2002 reimage (PID 1146748) is awaiting input
[23:39:20] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741539 (bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/af98ae0206c2602a25a4d88414d77291788c7f0f%5E%21/#F0 ` diff --git a/deployment-prep/_.yam...
[23:39:23] <wikibugs>	 (PS3) Andrea Denisse: grafana: Hide version number for the anonymous role [puppet] - https://gerrit.wikimedia.org/r/1259254 (https://phabricator.wikimedia.org/T402844)
[23:46:16] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741541 (bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/bda711470077c72c5c1d40f9b34a1f036bbd3981%5E%21/#F0 ` diff --git a/deployment-prep/_.yam...
[23:47:27] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741542 (bd808)
[23:59:50] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741568 (bd808) That's most of the really active networks Beta has seen in the last 24 hours blocked. Let's see what the 15 minute load graph looks like over the next couple...