[00:12:39] PROBLEM - MD RAID on aqs1010 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:12:40] ACKNOWLEDGEMENT - MD RAID on aqs1010 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T420867 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:12:47] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867 (ops-monitoring-bot) NEW
[00:34:54] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:39:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1258375
[00:39:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1258375 (owner: TrainBranchBot)
[00:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:52:11] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1258375 (owner: TrainBranchBot)
[00:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:09:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1258387
[01:09:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1258387 (owner: TrainBranchBot)
[01:13:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:15] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1258387 (owner: TrainBranchBot)
[01:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:42:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:47:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[02:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[02:00:52] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:08:47] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 55s)
[02:09:19] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:34:19] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:49:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: DDesouza)
[02:50:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: DDesouza)
[02:50:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: DDesouza)
[03:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:03:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:44:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260322T0700)
[05:00:05] arnaudb : #bothumor My software never has bugs. It just develops random features. Rise for Gerrit. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T0500).
[05:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:12:03] PROBLEM - MegaRAID on db1170 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:12:05] ACKNOWLEDGEMENT - MegaRAID on db1170 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T420873 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:12:12] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873 (ops-monitoring-bot) NEW
[05:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:24:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[06:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[06:02:54] (CR) KartikMistry: [C:+1] Enable ULS rewrite beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: Abijeet Patro)
[06:05:32] (CR) Arnaudb: [C:+2] gerrit: Wire mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[06:17:57] (PS2) Arnaudb: gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256445 (https://phabricator.wikimedia.org/T420189)
[06:18:04] (PS2) Arnaudb: gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256446 (https://phabricator.wikimedia.org/T420189)
[06:18:44] (CR) Arnaudb: [C:+2] gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256445 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[06:34:19] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:59:00] (PS1) Kevin Bazira: ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1258647 (https://phabricator.wikimedia.org/T418350)
[07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T0700). nyaa~
[07:00:05] abijeet and hector-arroyo: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:03] hello
[07:01:11] I can deploy abijeet's change.
[07:01:16] abijeet: should we start?
[07:01:35] kart_, sure
[07:02:05] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: Abijeet Patro)
[07:03:01] (Merged) jenkins-bot: Enable ULS rewrite beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/1254149 (https://phabricator.wikimedia.org/T418187) (owner: Abijeet Patro)
[07:03:50] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1254149|Enable ULS rewrite beta feature (T418187 T253303)]]
[07:04:00] T418187: Define rollout strategy for the ULS rewrite - https://phabricator.wikimedia.org/T418187
[07:04:00] T253303: Basic support for a responsive language selector - https://phabricator.wikimedia.org/T253303
[07:11:05] abijeet: things seems slow. still on k8s images build/push stage..
[07:15:13] kart_, ok
[07:15:20] kart_, let me know when its ready to test
[07:16:41] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[07:17:15] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[07:18:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:18:51] abijeet: sure
[07:18:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:20:51] (CR) Brouberol: [C:+2] kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:21:46] (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:22:50] !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1254149|Enable ULS rewrite beta feature (T418187 T253303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:22:59] T418187: Define rollout strategy for the ULS rewrite - https://phabricator.wikimedia.org/T418187
[07:22:59] T253303: Basic support for a responsive language selector - https://phabricator.wikimedia.org/T253303
[07:23:18] abijeet: ready to test
[07:25:39] kart_, on it
[07:26:47] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:26:58] (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:04] (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:10] (CR) CI reject: [V:-1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:16] (CR) CI reject: [V:-1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:17] (CR) CI reject: [V:-1] aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:50] (CR) Brouberol: [C:+2] kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:27:59] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:27:59] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:28:03] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:28:03] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[07:28:06] (PS4) Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407)
[07:28:46] (CR) Brouberol: [V:+2 C:+2] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:29:16] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:29:43] kart_, all ok
[07:30:14] cool.
[07:30:19] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:30:22] !log kartik@deploy2002 kartik, abi: Continuing with sync
[07:33:23] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:36:34] (PS5) Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[07:38:55] (PS1) Brouberol: Revert "kafka-main-eqiad: disable mirroring to kafka-main-codfw" [puppet] - https://gerrit.wikimedia.org/r/1258677
[07:39:05] (PS1) Brouberol: Revert "kafka-main-codfw: disable mirroring to kafka-main-eqiad" [puppet] - https://gerrit.wikimedia.org/r/1258679
[07:41:09] (CR) Brouberol: [C:+2] Revert "kafka-main-codfw: disable mirroring to kafka-main-eqiad" [puppet] - https://gerrit.wikimedia.org/r/1258679 (owner: Brouberol)
[07:41:17] (CR) Brouberol: [C:+2] Revert "kafka-main-eqiad: disable mirroring to kafka-main-codfw" [puppet] - https://gerrit.wikimedia.org/r/1258677 (owner: Brouberol)
[07:42:39] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[07:45:20] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254149|Enable ULS rewrite beta feature (T418187 T253303)]] (duration: 41m 30s)
[07:45:26] T418187: Define rollout strategy for the ULS rewrite - https://phabricator.wikimedia.org/T418187
[07:45:26] T253303: Basic support for a responsive language selector - https://phabricator.wikimedia.org/T253303
[07:47:20] (PS6) Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[07:47:20] (PS5) Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407)
[07:47:20] (PS1) Brouberol: aux-k8s/kafka-mirrormaker: fix values by not overriding the app config [deployment-charts] - https://gerrit.wikimedia.org/r/1258688 (https://phabricator.wikimedia.org/T417407)
[07:47:38] abijeet: done.
[07:50:44] (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: fix values by not overriding the app config [deployment-charts] - https://gerrit.wikimedia.org/r/1258688 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[07:54:53] (Merged) jenkins-bot: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[08:09:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: Kosta Harlan)
[08:10:25] (PS1) Brouberol: site: install the aux-k8s-worker1006-9 hosts [puppet] - https://gerrit.wikimedia.org/r/1258704 (https://phabricator.wikimedia.org/T393053)
[08:14:15] (CR) Dpogorzelski: [C:+1] ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1258647 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:15:54] (CR) Dpogorzelski: [C:+2] ml-services: bump up k8s resources in experimental ns to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1258647 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:18:22] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2005-2006,2011-2018,2033-2039,2041-2042,2044,2046,2049-2051,2055-2062,2064-2065,2067-2078,2087-2095,2102-2115,2124-2179,2184-2199].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[08:19:00] (CR) MVernon: [C:+1] admin: Add mpostoronca shell access and deployment membership [puppet] - https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) (owner: Scott French)
[08:19:23] (PS1) Giuseppe Lavagetto: cache::haproxy: remove hotfix for traffic class [puppet] - https://gerrit.wikimedia.org/r/1258714
[08:19:39] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:20:45] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: Kosta Harlan)
[08:21:24] SRE, SRE-Access-Requests: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11736952 (Alice.moutinho) Hello @Aklapper, @Scott_French, i now have an LDAP acount linked to my Phabricator account. @KFrancis i just saw the NDA agreement in my inbox this morning...
[08:23:07] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8316/co" [puppet] - https://gerrit.wikimedia.org/r/1258714 (owner: Giuseppe Lavagetto)
[08:29:22] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:31:02] (Merged) jenkins-bot: hcaptcha: Use the global edit key for MobileFrontend edits if present [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: Kosta Harlan)
[08:31:22] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1255736|hcaptcha: Use the global edit key for MobileFrontend edits if present (T420574)]]
[08:31:27] T420574: hcaptcha: Make edits coming from the MobileFrontend use the sitekey for edits - https://phabricator.wikimedia.org/T420574
[08:32:05] (CR) Brouberol: [C:+2] airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) (owner: DCausse)
[08:34:13] SRE-swift-storage, Commons, Wikimedia-production-error: uploadstash-exception: Could not store upload in the stash while uploading PDF file - https://phabricator.wikimedia.org/T420786#11736989 (MatthewVernon) I'm guessing you don't have an exact timestamp for the error? I'm afraid it's going to be al...
[08:34:31] (Merged) jenkins-bot: airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) (owner: DCausse)
[08:35:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[08:35:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[08:36:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[08:37:16] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1255736|hcaptcha: Use the global edit key for MobileFrontend edits if present (T420574)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:37:21] T420574: hcaptcha: Make edits coming from the MobileFrontend use the sitekey for edits - https://phabricator.wikimedia.org/T420574
[08:37:47] SRE-swift-storage, Commons: Server error 500 after uploading chunk - https://phabricator.wikimedia.org/T340917#11736996 (MatthewVernon) Open→Resolved a:MatthewVernon Thanks, I'm going to optimistically close this ticket then :)
[08:37:55] (CR) Kgraessle: [C:+1] PersonalDashboard: Add config for Active Discussions [mediawiki-config] - https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: Scardenasmolinar)
[08:37:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:38:40] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:39:34] !log kharlan@deploy2002 kharlan: Continuing with sync
[08:40:41] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:40:42] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:40:51] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2005-2006,2011-2018,2033-2037].codfw.wmnet
[08:41:18] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet
[08:42:58] SRE-tools, Infrastructure-Foundations, serviceops-deprecated: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737005 (JMeybohm) Resolved→Open This additional confirmation thing is making bigger reboots pretty annoying since one has to come back and...
[08:43:06] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:43:52] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:08] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:22] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:44:32] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.38 ms [08:44:42] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.75 ms [08:46:04] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255736|hcaptcha: Use the global edit key for MobileFrontend edits if present (T420574)]] (duration: 14m 42s) [08:46:09] T420574: hcaptcha: Make edits coming from the MobileFrontend use the sitekey for edits - https://phabricator.wikimedia.org/T420574 [08:47:21] 10SRE-tools, 06ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737009 (10JMeybohm) [08:50:09] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet [08:58:25] (03PS2) 10Fabfur: cache::haproxy: remove hotfix for traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1258714 (https://phabricator.wikimedia.org/T415007) (owner: 10Giuseppe Lavagetto) [08:58:33] (03CR) 10Fabfur: [C:03+1] cache::haproxy: remove hotfix for traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1258714 (https://phabricator.wikimedia.org/T415007) (owner: 10Giuseppe Lavagetto) [08:59:25] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet [08:59:33] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2038-2039,2041-2042,2044,2046,2049-2051,2055-2059].codfw.wmnet [08:59:51] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw for section test-s4 [08:59:55] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from eqiad to codfw for section test-s4 [09:00:02] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node 
depool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet [09:00:40] !log starting T416706 [09:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:46] T416706: Enable eqiad -> codfw replication - https://phabricator.wikimedia.org/T416706 [09:01:05] 10SRE-tools, 06ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737043 (10MLechvien-WMF) Good point. IMO it feels more intuitive/predictable to have the careful version as the default, and add a `--force` flag which bypasses all confirmation. If it's... [09:01:12] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1 [09:02:40] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1 [09:04:54] 10SRE-tools, 06ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11737062 (10JMeybohm) I'm not a huge 'confirmation-fan' in general, but sgtm. 
When you're at it you could also make the cookbooks that call 'pool-depool-node' call it with `--force` [09:05:36] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:05:38] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:08:12] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: remove hotfix for traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1258714 (https://phabricator.wikimedia.org/T415007) (owner: 10Giuseppe Lavagetto) [09:08:26] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet [09:09:07] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x3 [09:10:36] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x3 [09:10:43] (03PS10) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [09:11:22] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:11:30] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:14:23] (03CR) 10Tiziano Fogli: [C:03+2] prometheus: adjust join in PrometheusZombieSeriesDetected rule [alerts] - 10https://gerrit.wikimedia.org/r/1256451 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [09:15:28] PROBLEM - Ensure acme-chief-api is running on acmechief2002 is CRITICAL: PROCS CRITICAL: 3 processes with args 
/usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [09:16:17] (03Merged) 10jenkins-bot: prometheus: adjust join in PrometheusZombieSeriesDetected rule [alerts] - 10https://gerrit.wikimedia.org/r/1256451 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [09:16:25] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es6 [09:16:28] RECOVERY - Ensure acme-chief-api is running on acmechief2002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [09:17:50] (03PS1) 10Majavah: P:toolforge: k8s: haproxy: Add option for sending traffic to Istio [puppet] - 10https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) [09:17:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es6 [09:18:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8317/co" [puppet] - 10https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [09:19:15] jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input [09:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:22:06] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [09:22:17] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [09:22:57] (03PS1) 10Giuseppe Lavagetto: haproxy: temporarily re-add the lua file 
to avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/1258949 [09:22:57] (03PS1) 10Giuseppe Lavagetto: haproxy: remove the traffic_class.lua file for good [puppet] - 10https://gerrit.wikimedia.org/r/1258950 [09:23:08] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es7 [09:24:16] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es7 [09:24:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:31] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [09:25:38] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [09:25:57] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: temporarily re-add the lua file to avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/1258949 (owner: 10Giuseppe Lavagetto) [09:26:13] 06SRE, 10SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11737117 (10Daria-WMDE) Hello @KFrancis could you please resend the NDA? Was out of office EOD Friday, and now the link has expired [09:29:19] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1172.eqiad.wmnet with OS bullseye [09:29:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737122 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin... 
[09:29:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [09:29:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c... [09:32:07] (03PS1) 10Jcrespo: mariadb: Update grants for new hosts ms-backup[12]00[34], which replaces [12] [puppet] - 10https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464) [09:32:09] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s6 [09:33:18] (03Abandoned) 10Jgiannelos: beta: Fix duplicate definition of site.v1.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1234940 (owner: 10Jgiannelos) [09:33:21] (03CR) 10Jcrespo: "There is no rush on deploying this, it can wait until maintenance freeze happens, despite only affecting backup dbs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1258954 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [09:33:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s6 [09:34:01] (03Abandoned) 10Jgiannelos: pcs: Block RB traffic for all domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145828 (owner: 10Jgiannelos) [09:35:36] (03PS1) 10Trueg: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) [09:36:32] (03PS6) 10Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) [09:37:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:39:55] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1172.eqiad.wmnet with OS bullseye [09:40:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin... [09:40:13] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [09:40:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c... 
[09:41:20] (03CR) 10Clément Goubert: rest-gateway: Add core API support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [09:42:43] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s5 [09:42:57] (03PS3) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) [09:44:08] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet [09:44:17] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2060-2062,2064-2065,2067-2075].codfw.wmnet [09:44:29] (03CR) 10Blake: [C:03+2] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [09:44:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s5 [09:44:41] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet [09:45:39] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [09:46:54] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: remove the traffic_class.lua file for good [puppet] - 10https://gerrit.wikimedia.org/r/1258950 (owner: 10Giuseppe Lavagetto) [09:47:01] (03Merged) 10jenkins-bot: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [09:48:04] (03PS1) 10Giuseppe Lavagetto: haproxy: well, actually remove the file :P [puppet] - 
10https://gerrit.wikimedia.org/r/1258962 [09:48:24] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: well, actually remove the file :P [puppet] - 10https://gerrit.wikimedia.org/r/1258962 (owner: 10Giuseppe Lavagetto) [09:48:38] 06SRE, 10SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11737267 (10Daria-WMDE) [09:49:00] !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [09:49:29] !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [09:49:31] !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [09:49:31] (03PS7) 10Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) [09:49:41] 06SRE, 10SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11737270 (10Daria-WMDE) Hi @Scott_French, I added a developer account to the task and linked it with the Phabricator account and the Wikimedia Global Account [09:49:56] !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [09:49:59] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s2 [09:50:55] (03CR) 10Brouberol: [C:03+1] "LG!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [09:52:24] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s2 [09:53:05] (03PS1) 10Blake: geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1244621 (https://phabricator.wikimedia.org/T413974) [09:53:31] (03PS1) 10Blake: debug: reorder debug backends for eqiad switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244628 (https://phabricator.wikimedia.org/T413974) [09:53:38] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet [09:53:52] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:53:56] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:54:08] btullis@cumin1003 reimage (PID 1408579) is awaiting input [09:57:27] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s3 [09:57:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11737346 (10AnnieKim_WMDE) Linked my LDAP account. Thanks everyone for your help. 
[09:57:48] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [09:58:22] (03PS11) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [09:58:29] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T420896 (10kera_wmde) 03NEW [09:58:48] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:58:52] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737371 (10kera_wmde) [09:58:53] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:58:55] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s3 [10:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [10:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1000) [10:01:46] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge: k8s: haproxy: Add option for sending traffic 
to Istio [puppet] - 10https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:02:11] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: k8s: haproxy: Add option for sending traffic to Istio [puppet] - 10https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:03:17] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1258948 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:04:27] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s7 [10:04:30] jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input [10:04:58] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet [10:05:06] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2076-2078,2087-2095,2102-2103].codfw.wmnet [10:05:58] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s7 [10:06:19] PROBLEM - MariaDB Replica IO: s7 #page on db1253 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:37] PROBLEM - MariaDB Replica SQL: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:37] PROBLEM - MariaDB Replica Lag: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:45] silencing it again [10:06:59] federico3: what's u ? 
[10:07:00] p [10:07:03] !incidents [10:07:04] 7784 (UNACKED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [10:07:04] 7785 (UNACKED) db1253 (paged)/MariaDB Replica Lag: s7 (paged) [10:07:04] 7786 (UNACKED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [10:07:07] !ack [10:07:07] Could not ack the alert. Please check the parameters. [10:07:16] I thought that was meant to work now? [10:07:20] !ack 7784 [10:07:21] 7784 (ACKED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [10:07:21] it's due to a cookbook removing the silence while running I think [10:07:22] !ack 7785 [10:07:23] 7785 (ACKED) db1253 (paged)/MariaDB Replica Lag: s7 (paged) [10:07:26] !ack 7786 [10:07:26] 7786 (ACKED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [10:08:10] jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input [10:08:11] on alertmanager I see only silenced alerts tho [10:08:12] (03CR) 10Ayounsi: [C:03+1] Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [10:08:36] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet [10:08:40] ( a side effect of https://phabricator.wikimedia.org/T416706 ) [10:09:13] federico3: we got email from nagios as well as the p.ages [10:09:21] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1172.eqiad.wmnet with OS bullseye [10:09:25] (03CR) 10Filippo Giunchedi: [C:03+2] rabbitmq: set pause_minority for cluster_partition_handling [puppet] - 10https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) (owner: 10Filippo Giunchedi) [10:09:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11737429 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin... [10:10:41] (03PS1) 10Matthieulec: Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) [10:11:34] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s8 [10:13:02] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s8 [10:15:09] (03PS1) 10Majavah: P:toolforge: k8s: haproxy: Use HTTP/1.1 for health checks [puppet] - 10https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356) [10:15:10] (03PS1) 10Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420189) [10:15:57] (03CR) 10Ayounsi: [C:03+1] wikimedia6 prefix-list: add wikidough anycast range [homer/public] - 10https://gerrit.wikimedia.org/r/1257195 (https://phabricator.wikimedia.org/T420820) (owner: 10Cathal Mooney) [10:18:36] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s4 [10:18:40] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:20:38] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s4 [10:21:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8319/co" [puppet] - 10https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:21:05] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: k8s: haproxy: Use 
HTTP/1.1 for health checks [puppet] - 10https://gerrit.wikimedia.org/r/1258980 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:22:05] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet [10:22:29] (03CR) 10Ayounsi: [C:03+2] Update DHCP server in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1256335 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:23:58] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1172.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:24:04] !log ayounsi@dns1004 START - running authdns-update [10:24:19] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:25:37] !log ayounsi@dns1004 END - running authdns-update [10:25:44] !log fceratto@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s1 [10:27:02] btullis@cumin1003 provision (PID 1455852) is awaiting input [10:27:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s1 [10:28:50] !log disable puppet on routed-ganeti hosts to test nftables update on specific nodes T420715 [10:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:29:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:16] (03PS1) 10Filippo Giunchedi: rabbit: apply cluster_partition_handling to rabbitmq4 [puppet] - 10https://gerrit.wikimedia.org/r/1258990 (https://phabricator.wikimedia.org/T418444) [10:30:16] (03CR) 10Ayounsi: [C:03+2] Point proxy in ulsfo to install4004 [dns] - 10https://gerrit.wikimedia.org/r/1256324 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:30:16] (03CR) 10Filippo Giunchedi: [C:03+2] "Self merging since the related change for rabbitmq3 was approved" [puppet] - 10https://gerrit.wikimedia.org/r/1258990 (https://phabricator.wikimedia.org/T418444) (owner: 10Filippo Giunchedi) [10:30:22] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet [10:30:30] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2104-2115,2124-2125].codfw.wmnet [10:31:23] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2126-2139].codfw.wmnet [10:32:22] (03CR) 10Cathal Mooney: [C:03+2] ulsfo: update dhcp server to install4004 [homer/public] - 10https://gerrit.wikimedia.org/r/1258994 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [10:33:13] (03Merged) 10jenkins-bot: ulsfo: update dhcp server to install4004 [homer/public] - 10https://gerrit.wikimedia.org/r/1258994 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [10:37:38] (03PS1) 10Majavah: P:toolforge: k8s: haproxy: Fix istio-gateway health checks [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356) [10:38:01] 
btullis@cumin1003 provision (PID 1455852) is awaiting input [10:38:29] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [10:38:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11737597 (10Volans) That's what's in puppetdb and what's reported by facter on the host though: ` $ sudo facter -p... [10:38:33] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737598 (10Aklapper) @kera_wmde: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LDAP User' accou... [10:38:33] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:38:37] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:38:38] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [10:39:28] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8320/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:41:24] (03PS1) 10Cathal Mooney: Routed-ganeti: fix syntax error in new forward rule [puppet] - 10https://gerrit.wikimedia.org/r/1259004 (https://phabricator.wikimedia.org/T420715) [10:41:28] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737603 (10WMDE-leszek) I approve this request on WMDE's end. 
Thank you [10:42:22] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737605 (10ayounsi) [10:42:40] (03CR) 10Ayounsi: [C:03+1] Routed-ganeti: fix syntax error in new forward rule [puppet] - 10https://gerrit.wikimedia.org/r/1259004 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [10:43:04] (03CR) 10David Caro: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:43:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:43:37] (03CR) 10Cathal Mooney: [C:03+2] Routed-ganeti: fix syntax error in new forward rule [puppet] - 10https://gerrit.wikimedia.org/r/1259004 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [10:43:38] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:43:46] 10ops-codfw, 06DC-Ops: Power Supply - Status - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T420905 (10phaultfinder) 03NEW [10:44:07] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2126-2139].codfw.wmnet [10:45:48] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: k8s: haproxy: Fix istio-gateway health checks [puppet] - 10https://gerrit.wikimedia.org/r/1259000 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [10:46:53] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737645 (10ayounsi) [10:48:03] (03CR) 10Ayounsi: [C:03+2] Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 
10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:48:12] (03PS2) 10Muehlenhoff: Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) [10:48:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:49:35] (03CR) 10Ayounsi: [C:03+2] Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:52:26] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11737660 (10kera_wmde) Link in my account confirmed! Thank you! >>! In T420896#11737598, @Aklapper wrote: > @kera_wmde: Please also [link your LDAP account to your Phabricator account]... 
[10:53:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:54:54] jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input [10:55:21] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2126-2139].codfw.wmnet [10:55:29] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2126-2139].codfw.wmnet [10:55:57] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2140-2153].codfw.wmnet [10:57:26] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:58:31] (03PS1) 10Cathal Mooney: routed-ganeti nftables forward chain: correct syntax [puppet] - 10https://gerrit.wikimedia.org/r/1259018 (https://phabricator.wikimedia.org/T420715) [11:00:16] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts install4003.wikimedia.org [11:01:53] (03CR) 10Ayounsi: [C:03+1] routed-ganeti nftables forward chain: correct syntax [puppet] - 10https://gerrit.wikimedia.org/r/1259018 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [11:02:31] (03CR) 10Cathal Mooney: [C:03+2] wikimedia6 prefix-list: add wikidough anycast range [homer/public] - 10https://gerrit.wikimedia.org/r/1257195 (https://phabricator.wikimedia.org/T420820) (owner: 10Cathal Mooney) [11:05:06] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [11:08:48] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2140-2153].codfw.wmnet [11:08:53] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: install4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [11:09:19] FIRING: [3x] JobUnavailable: Reduced availability for job squid in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:53] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4003.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [11:09:53] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:09:54] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install4003.wikimedia.org [11:10:05] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737707 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1003 for hosts: `install4003.wikimedia.org` -... 
[11:13:41] (03CR) 10Volans: sre.k8s: Add cookbook to print network topology details of nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [11:15:20] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host bast4006.wikimedia.org [11:15:21] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [11:18:14] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:18:22] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [11:18:50] (03PS2) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) [11:18:52] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [11:19:00] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.68 ms [11:19:02] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4006.wikimedia.org - ayounsi@cumin1003" [11:19:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast4006.wikimedia.org - ayounsi@cumin1003" [11:19:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:19:08] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache bast4006.wikimedia.org on all recursors [11:19:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast4006.wikimedia.org on all recursors [11:19:42] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast4006.wikimedia.org - ayounsi@cumin1003" [11:19:47] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast4006.wikimedia.org - ayounsi@cumin1003" [11:20:03] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2140-2153].codfw.wmnet [11:20:03] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS bookworm [11:20:11] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2140-2153].codfw.wmnet [11:20:18] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1003 for host bast4006.wikimedia.org w... [11:23:27] (03CR) 10Genoveva Galarza: "Done! Thanks a lot for the references and the examples, super helpful." [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza) [11:23:31] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2154-2167].codfw.wmnet [11:23:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:26:01] (03PS2) 10Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) [11:27:06] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [11:27:10] (03PS12) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 
(https://phabricator.wikimedia.org/T414216) [11:27:16] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:28:09] (03CR) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [11:28:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:29:14] 10SRE-tools, 10Cumin, 06Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360#11737762 (10Volans) p:05Triage→03Medium a:03Volans [11:29:19] FIRING: [3x] JobUnavailable: Reduced availability for job squid in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:29:52] (03PS5) 10Elukey: sre.hosts.provision: refactor bios if/else branches [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) [11:29:52] (03PS4) 10Elukey: sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - 10https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216) [11:29:52] (03PS13) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [11:30:06] (03CR) 10Elukey: sre.hosts.provision: refactor bios if/else branches (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1253412 
(https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [11:31:47] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2154-2167].codfw.wmnet [11:34:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops-deprecated, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11737771 (10elukey) 05Resolved→03Open Re-opening this one since something weird happens when running provisioning: ` 2026-03... [11:38:05] (03PS1) 10Sergio Gimeno: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) [11:38:07] (03CR) 10Daniel Kinzler: rest-gateway: Add core API support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [11:38:28] (03PS1) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) [11:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job squid in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:40:36] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2154-2167].codfw.wmnet [11:40:45] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2154-2167].codfw.wmnet [11:43:09] (03CR) 10CI reject: [V:04-1] tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [11:43:16] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:43:49] jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input [11:44:30] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2168-2179,2184-2185].codfw.wmnet [11:44:35] (03CR) 10CI reject: [V:04-1] fix(WelcomeSurveyHooks): ensure accountJustCreated is always added [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [11:46:36] (03CR) 10Hnowlan: [C:03+1] "lgtm, but might be worth dropping the cookie stripping" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [11:47:52] (03CR) 10Clément Goubert: rest-gateway: Add core API support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [11:52:47] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2168-2179,2184-2185].codfw.wmnet [12:00:20] (03PS4) 10Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) [12:00:20] (03PS4) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) [12:04:15] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2168-2179,2184-2185].codfw.wmnet [12:04:24] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for 
host wikikube-worker[2168-2179,2184-2185].codfw.wmnet [12:06:01] (03Abandoned) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [12:06:14] (03PS3) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) [12:06:40] (03Restored) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [12:07:09] (03PS2) 10Sergio Gimeno: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) [12:07:41] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2186-2199].codfw.wmnet [12:07:44] (03PS1) 10Sergio Gimeno: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722) [12:08:37] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm [12:08:42] (03CR) 10Sergio Gimeno: "recheck, git unrelated `fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/TemplateData/': GnuTLS recv error (-5" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [12:09:39] (03PS1) 10Ayounsi: Add bast4006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1259049 
(https://phabricator.wikimedia.org/T418993) [12:10:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [12:10:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [12:10:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [12:11:05] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1259049 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:11:11] (03PS5) 10Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) [12:11:11] (03PS5) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) [12:11:15] (03CR) 10Ayounsi: [C:03+2] Add bast4006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1259049 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:14:49] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4006.wikimedia.org with reason: host reimage [12:16:38] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2186-2199].codfw.wmnet [12:18:52] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4006.wikimedia.org with reason: host reimage [12:21:48] FIRING: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:22:47] !log cmooney@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [12:28:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [12:30:08] jayme@cumin1003 reboot-nodes (PID 1359444) is awaiting input [12:30:12] .46 [12:34:08] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:34:54] (03PS1) 10Cathal Mooney: ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) [12:35:31] (03CR) 10CI reject: [V:04-1] ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [12:37:27] (03PS2) 10Cathal Mooney: ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) [12:38:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast4006.wikimedia.org with OS bookworm [12:38:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast4006.wikimedia.org [12:38:50] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1003 for host bast4006.wikimedia.org with OS bookworm completed:... 
[12:41:19] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [12:42:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2006.codfw.wmnet with OS bookworm [12:42:53] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "bast4006 - ayounsi@cumin1003" [12:43:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "bast4006 - ayounsi@cumin1003" [12:45:11] (03PS1) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) [12:46:01] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11737967 (10ayounsi) [12:48:52] (03CR) 10Ayounsi: [C:03+1] ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [12:51:52] (03CR) 10Cathal Mooney: [C:03+2] ganeti-routed nftables: adjust notrack setup for VM traffic [puppet] - 10https://gerrit.wikimedia.org/r/1259064 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [12:52:10] (03CR) 10Cathal Mooney: [C:03+2] routed-ganeti nftables forward chain: correct syntax [puppet] - 10https://gerrit.wikimedia.org/r/1259018 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [12:55:30] (03PS2) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) [12:55:33] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 
10Clément Goubert) [12:55:54] (03CR) 10Jforrester: [C:03+1] tests: Make many things static for PHPUnit 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1258300 (https://phabricator.wikimedia.org/T420844) (owner: 10Reedy) [12:56:49] (03CR) 10Jforrester: [C:03+1] phpunit.xml: Update configuration for PHPUnit 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1258301 (https://phabricator.wikimedia.org/T420844) (owner: 10Reedy) [12:57:21] (03PS1) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [12:58:22] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11738021 (10VRiley-WMF) a:03VRiley-WMF [12:58:44] (03PS2) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [12:59:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11738023 (10VRiley-WMF) a:03VRiley-WMF [13:00:01] (03PS1) 10Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1300). nyaa~ [13:00:05] hector-arroyo and Sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:13] (03PS2) 10Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147) [13:00:39] o/ [13:00:53] I can self-deploy [13:02:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [13:02:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [13:02:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [13:03:08] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [13:03:27] (03CR) 10Andrew Bogott: [C:03+2] Initial entries for cloudcephosd105[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/1256392 (https://phabricator.wikimedia.org/T416394) (owner: 10Andrew Bogott) [13:03:56] (03PS1) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) [13:03:59] (03PS1) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146) [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware), 13Patch-For-Review: Q3:rack/setup/install cloudcephosd105[56] - https://phabricator.wikimedia.org/T419892#11738055 (10Andrew) a:05Andrew→03None [13:04:39] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11738056 (10Andrew) a:05Andrew→03None [13:04:47] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11738058 (10Andrew) a:05Andrew→03None [13:05:07] (03CR) 10Kamila Součková: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [13:05:30] (03Merged) 10jenkins-bot: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259035 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [13:05:33] (03Merged) 10jenkins-bot: tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259036 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [13:05:33] (03PS1) 10Majavah: cloudnfs: Remove Huggle project config [puppet] - 10https://gerrit.wikimedia.org/r/1259079 [13:05:35] (03Merged) 10jenkins-bot: fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259046 (https://phabricator.wikimedia.org/T420722) (owner: 10Sergio Gimeno) [13:05:55] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1259035|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added (T420722)]], 
[[gerrit:1259036|tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect (T420722)]], [[gerrit:1259046|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 (T420722)]] [13:05:59] T420722: accountJustCreated flag not properly added on WelcomeSurvey redirections - https://phabricator.wikimedia.org/T420722 [13:06:33] (03PS1) 10Cathal Mooney: nftables: support nftables::rules definitions targetting prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) [13:07:41] (03CR) 10Majavah: [V:03+2 C:03+2] Add toolsbeta-acme-chief private key [labs/private] - 10https://gerrit.wikimedia.org/r/1240325 (owner: 10Majavah) [13:07:42] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1259035|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added (T420722)]], [[gerrit:1259036|tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect (T420722)]], [[gerrit:1259046|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 (T420722)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:07:50] (03CR) 10Majavah: [V:03+2 C:03+2] Add fake metricsinfra Grafana admin password [labs/private] - 10https://gerrit.wikimedia.org/r/1240326 (owner: 10Majavah) [13:08:01] (03CR) 10Majavah: [V:03+2 C:03+2] Add fake Docker registry passwrod for cloudinfra [labs/private] - 10https://gerrit.wikimedia.org/r/1245297 (owner: 10Majavah) [13:08:26] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2186-2199].codfw.wmnet [13:08:26] * sergi0 testing [13:08:34] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2186-2199].codfw.wmnet [13:08:34] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2005-2006,2011-2018,2033-2039,2041-2042,2044,2046,2049-2051,2055-2062,2064-2065,2067-2078,2087-2095,2102-2115,2124-2179,2184-2199].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [13:09:45] (03PS1) 10Jgreen: Switch fundraising default bastion back to eqiad after kernel update. [dns] - 10https://gerrit.wikimedia.org/r/1259081 [13:11:01] (03CR) 10Andrew Bogott: "*nudge*" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [13:11:21] !log sgimeno@deploy2002 sgimeno: Continuing with sync [13:11:27] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [13:11:38] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[13:11:47] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11738116 (10VRiley-WMF) a:03VRiley-WMF [13:13:29] (03PS1) 10JMeybohm: k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 [13:13:34] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:14:15] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:16:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:16:24] (03PS3) 10Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) [13:17:39] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259035|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added (T420722)]], [[gerrit:1259036|tests: add coverage for WelcomeSurveyHooks::onCentralAuthPostLoginRedirect (T420722)]], [[gerrit:1259046|fix(WelcomeSurveyHooks): ensure accountJustCreated is always added 2 (T420722)]] (duration: 11m 43s) [13:17:43] T420722: accountJustCreated flag not properly added on WelcomeSurvey redirections - https://phabricator.wikimedia.org/T420722 [13:18:44] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet [13:19:00] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudlb1001.eqiad.wmnet [13:19:14] (03CR) 10CI reject: [V:04-1] k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm) [13:19:52] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[13:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:06] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet [13:20:09] (03CR) 10JMeybohm: [C:03+2] sre.k8s: Add cookbook to print network topology details of nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: 10JMeybohm) [13:20:11] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudlb1001.eqiad.wmnet [13:20:31] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet [13:21:30] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [13:21:39] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2332-2336].codfw.wmnet [13:21:41] !log jforrester@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/maintenance/createExtensionTables.php --wiki=abstractwiki translate # T420656 [13:21:46] T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656 [13:22:21] (03CR) 10JMeybohm: [C:03+1] Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [13:22:52] (03PS1) 10Jforrester: [abstractwiki] Enable the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) [13:23:40] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:23:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester) [13:24:11] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2003.codfw.wmnet [13:24:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:25:19] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2332-2336].codfw.wmnet [13:26:40] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:28:52] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet [13:29:07] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet [13:29:23] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudlb1002.eqiad.wmnet [13:29:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:30:03] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet [13:30:30] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2003.codfw.wmnet [13:30:51] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2332-2336].codfw.wmnet [13:30:55] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2332-2336].codfw.wmnet [13:31:07] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2337-2341].codfw.wmnet [13:32:07] (03CR) 10Matthieulec: [C:03+1] Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [13:32:40] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:34:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:34:10] (03CR) 10Kamila Součková: [C:03+2] Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [13:36:00] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2337-2341].codfw.wmnet [13:36:23] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2011.codfw.wmnet [13:36:32] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2012.codfw.wmnet [13:36:38] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:38:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1002.eqiad.wmnet [13:39:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:39:09] (03Merged) 10jenkins-bot: Add --force flag to sre.k8s.pool-depool-node cookbook and callers to bypass confirmation. [cookbooks] - 10https://gerrit.wikimedia.org/r/1258952 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [13:39:36] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Wed 08 Apr 2026 01:39:00 PM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [13:41:50] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2011.codfw.wmnet [13:41:54] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2012.codfw.wmnet [13:42:22] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [13:42:31] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1172.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:42:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11738262 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cu... [13:43:23] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2337-2341].codfw.wmnet [13:43:27] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2337-2341].codfw.wmnet [13:43:38] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2342-2346].codfw.wmnet [13:44:48] (03CR) 10Ssingh: [C:03+1] geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1244621 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [13:47:19] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2342-2346].codfw.wmnet [13:47:33] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha1001.wikimedia.org [13:48:51] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[13:50:01] (03CR) 10Blake: [C:03+1] "Change seems good, though it looks like the pass ought to be removed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm) [13:50:11] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:51:33] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha1001.wikimedia.org [13:51:53] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha1002.wikimedia.org [13:52:13] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738332 (10wiki_willy) [13:52:35] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2342-2346].codfw.wmnet [13:52:39] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2342-2346].codfw.wmnet [13:52:51] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2347-2351].codfw.wmnet [13:54:35] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738340 (10Jclark-ctr) a:03Jclark-ctr [13:54:36] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738342 (10wiki_willy) Adding the ops-eqiad tag and removing ops-eqdfw. @Jclark-ctr will take a look at it a bit later today. [13:55:08] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1172.eqiad.wmnet with reason: host reimage [13:55:51] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha1002.wikimedia.org [13:56:36] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:56:37] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha2001.wikimedia.org [13:57:07] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2347-2351].codfw.wmnet [13:59:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1172.eqiad.wmnet with reason: host reimage [14:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [14:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [14:00:25] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha2001.wikimedia.org [14:00:37] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha2002.wikimedia.org [14:02:23] (03CR) 10Ssingh: [C:03+1] Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:03:51] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2347-2351].codfw.wmnet [14:03:54] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2347-2351].codfw.wmnet [14:04:06] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2352-2356].codfw.wmnet [14:04:28] (03PS2) 10FNegri: conftool-data: move s3, x3 to new hosts (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/1256417 
(https://phabricator.wikimedia.org/T409557) [14:04:28] (03PS1) 10FNegri: conftool-data: move s3, x3 to new hosts (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) [14:04:30] (03CR) 10Ssingh: [C:03+1] Add new active-active discovery service for dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:04:39] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha2002.wikimedia.org [14:06:31] PROBLEM - MariaDB Replica IO: s7 #page on db1253 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:06:41] PROBLEM - MariaDB Replica Lag: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:06:41] PROBLEM - MariaDB Replica SQL: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:07:23] !ack [14:07:23] All incidents are already acked. 
[14:07:45] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2352-2356].codfw.wmnet [14:07:58] (03CR) 10Bking: [C:03+2] Add new active-active discovery service for dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:09:25] !incidents [14:09:25] 7787 (ACKED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [14:09:25] 7788 (UNACKED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [14:09:26] 7786 (RESOLVED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [14:09:26] 7785 (RESOLVED) db1253 (paged)/MariaDB Replica Lag: s7 (paged) [14:09:26] 7784 (RESOLVED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [14:09:43] !ack 7788 [14:09:44] 7788 (ACKED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [14:09:57] silence gone again? [14:10:27] (03PS2) 10JMeybohm: k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 [14:10:47] (03CR) 10Blake: [C:03+1] k8s.print-network-topology: Prevent SAL logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1259082 (owner: 10JMeybohm) [14:11:00] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1001.eqiad.wmnet [14:11:13] that was removed as a side effect of a cookbook but then it had been created again [14:11:47] ok, thanks federico3 [14:13:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on db1253.eqiad.wmnet with reason: Under repair [14:13:08] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738465 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5ce7e720-a20c-4ad5-a612-bbf5c41ccd0a) set by fceratto@cumin1003 for 14 days, 0:00:00 on 1 host(s) and their services with... 
[14:14:29] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2352-2356].codfw.wmnet [14:14:32] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2352-2356].codfw.wmnet [14:14:32] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [14:14:41] (03PS1) 10Daimona Eaytoy: Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) [14:15:34] (03CR) 10CI reject: [V:04-1] Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy) [14:16:50] (03PS2) 10Daimona Eaytoy: Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) [14:17:05] jouncebot: nowandnext [14:17:06] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [14:17:06] In 0 hour(s) and 12 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1430) [14:17:13] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1001.eqiad.wmnet [14:17:20] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:17:46] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:18:10] Hi folks, could I get a beta-only config change deployed? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1259120 [14:18:10] (03PS2) 10FNegri: conftool-data: move s3, x3 to new hosts (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1259113 (https://phabricator.wikimedia.org/T409557) [14:18:35] (Assuming it's fine to do outside of normal deployment windows, since it's beta-only) [14:18:48] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.05 ms [14:18:52] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms [14:20:07] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:21:21] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet [14:22:03] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1018.eqiad.wmnet [14:22:04] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1018.eqiad.wmnet [14:22:32] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Rebooting clouddb1018 T419960 [14:23:12] jclark@cumin1003 reimage (PID 1496331) is awaiting input [14:24:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11738524 (10Eevans) The failed device is `/dev/sdh` (fourth/last device on the second controller?), and `lsblk` thinks its serial number is `KN09N7919I0509R4C`. If we're confident in which drive to pull, it sh... 
[14:27:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1002.eqiad.wmnet [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1430) [14:30:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:30:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1172.eqiad.wmnet with OS bullseye [14:30:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11738557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1... [14:31:52] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-dse-aa,name=codfw [14:32:02] (03CR) 10Ssingh: [C:03+2] Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:32:22] !log sukhe@dns1004 START - running authdns-update [14:32:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11738573 (10Jclark-ctr) @BTullis , I was able to reimage it. The an-workers always seem to ha... 
[14:32:51] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet [14:33:27] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1018.eqiad.wmnet [14:33:28] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1018.eqiad.wmnet [14:33:34] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1002.eqiad.wmnet [14:33:57] !log sukhe@dns1004 FAIL - running authdns-update [14:33:59] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Rebooting clouddb1019 T419960 [14:34:41] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet [14:36:32] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache k8s-ingress-dse-aa.discovery.wmnet on all recursors [14:36:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) k8s-ingress-dse-aa.discovery.wmnet on all recursors [14:36:52] (03PS4) 10Clément Goubert: wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [14:36:53] (03PS1) 10Arnaudb: gerrit: proxy Gitiles traffic to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) [14:36:53] (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1259121/6169/" [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [14:37:02] (03CR) 10Clément Goubert: wikifeeds: Add request definition for page analytics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [14:37:46] !log bking@cumin2002 conftool action : set/pooled=true; selector: 
dnsdisc=k8s-ingress-dse-aa,name=eqiad [14:38:42] !log sukhe@dns1004 START - running authdns-update [14:39:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.eqiad.wmnet [14:40:05] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738596 (10Jclark-ctr) I physically checked power cables seated properly nothing loose. Idrac looked healthy. i went though and updated multiple firmwares 800w Delta psu from 00.1B.53 To... [14:40:09] !log sukhe@dns1004 FAIL - running authdns-update [14:43:34] !log sukhe@dns1004 START - running authdns-update [14:44:59] !log sukhe@dns1004 END - running authdns-update [14:45:15] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-dse-aa,name=eqiad [14:46:01] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.eqiad.wmnet [14:47:52] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:47:53] (03CR) 10Ssingh: [C:03+1] Add new active/active discovery records for dse-k8s opensearch test ns [dns] - 10https://gerrit.wikimedia.org/r/1250063 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:48:05] (03CR) 10Bking: [C:03+2] Add new active/active discovery records for dse-k8s opensearch test ns [dns] - 10https://gerrit.wikimedia.org/r/1250063 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:48:12] !log sukhe@dns1004 START - running authdns-update [14:48:51] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:49:41] !log sukhe@dns1004 END - running authdns-update [14:49:47] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:49:50] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:50:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:50:16] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:50:16] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738636 (10FCeratto-WMF) Thank you @Jclark-ctr - is there anything else to be done on your side or can I claim the task? [14:50:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:50:27] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:51:43] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:51:44] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:52:17] (03PS2) 10Milimetric: testKitchen: Add custom stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) [14:52:46] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:52:47] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:53:06] jouncebot: nowandnext [14:53:06] For the next 0 hour(s) and 6 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1430) [14:53:06] In 0 hour(s) and 36 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1530) [14:53:15] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [14:54:40] (03PS2) 10Bking: Add new active/active discovery records for dse-k8s opensearch prod ns [dns] - 10https://gerrit.wikimedia.org/r/1250068 (https://phabricator.wikimedia.org/T417698) [14:54:44] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738684 (10Jclark-ctr) BIOS required a second restart. Just finished—should be good now. I double-checked the logs again just now still looks good. @FCeratto-WMF Feel free to Message me if anyt... [14:54:58] (03CR) 10Ssingh: [C:03+1] Add new active/active discovery records for dse-k8s opensearch prod ns [dns] - 10https://gerrit.wikimedia.org/r/1250068 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:55:17] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-eqiad [14:55:26] !log sukhe@dns1004 START - running authdns-update [14:55:37] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet [14:55:44] (03CR) 10Bking: [C:03+2] Add new active/active discovery records for dse-k8s opensearch prod ns [dns] - 10https://gerrit.wikimedia.org/r/1250068 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [14:55:48] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:55:59] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1019.eqiad.wmnet [14:56:00] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1019.eqiad.wmnet [14:56:04] (03Merged) 10jenkins-bot: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [14:56:06] (03PS1) 10Sergio Gimeno: GrowthExperiments: scale edit and thanks query limit to more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259132 (https://phabricator.wikimedia.org/T341599) [14:56:36] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Rebooting clouddb1020 T419960 [14:56:52] !log sukhe@dns1004 END - running authdns-update [14:56:56] !log sukhe@dns1004 START - running authdns-update [14:57:05] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:57:08] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet [14:58:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric) [14:58:18] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache opensearch-test.discovery.wmnet on all recursors [14:58:22] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) opensearch-test.discovery.wmnet on all recursors [14:58:28] !log sukhe@dns1004 END - running authdns-update [14:58:40] FIRING: [4x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:59] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache opensearch-ipoid.discovery.wmnet on all recursors [14:59:03] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) opensearch-ipoid.discovery.wmnet on all recursors [14:59:13] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11738720 (10VRiley-WMF) Opened up a Dell ticket to have a replacment drive sent out. 
SR224226231 [14:59:44] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:00:26] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:00:33] (03PS3) 10Jdlrobson: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [15:00:46] (03CR) 10CI reject: [V:04-1] Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [15:01:40] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:01:54] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:02:07] (03PS1) 10Majavah: cloudlb: Merge http-service-by-host to main http-service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 [15:02:13] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11738727 (10dancy) https://wikitech.wikimedia.org/wiki/Bastion currently shows bast4005.wikimedia.org crossed out. 
[15:02:38] (03CR) 10CI reject: [V:04-1] cloudlb: Merge http-service-by-host to main http-service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [15:03:00] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-dse-aa,name=eqiad [15:03:24] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache opensearch-ipoid.discovery.wmnet on all recursors [15:03:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) opensearch-ipoid.discovery.wmnet on all recursors [15:03:40] RESOLVED: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:55] !log btullis@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1172.eqiad.wmnet [15:04:38] (03PS1) 10Kosta Harlan: EventStreamConfig: Add performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) [15:05:31] (03PS2) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 [15:05:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1172.eqiad.wmnet [15:06:10] (03CR) 10CI reject: [V:04-1] cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [15:06:25] (03CR) 10Dreamy Jazz: [C:04-2] "We set these manually for server side instrumentation, so this would break that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan) [15:06:33] (03PS1) 10Btullis: Revert "Temporarily set an-worker1172 into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1259138 [15:06:55] (03PS3) 
10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 [15:07:44] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11738735 (10ayounsi) Thanks, updated. [15:09:09] 06SRE, 06Infrastructure-Foundations, 10netops: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11738741 (10cmooney) 05Open→03Resolved a:03cmooney Ok this should no longer be an issue after updating the `wikimedia6` prefix... [15:09:30] (03CR) 10Btullis: [C:03+2] Revert "Temporarily set an-worker1172 into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1259138 (owner: 10Btullis) [15:11:50] (03PS1) 10Btullis: dse-k8s-eqiad: Set cert-manager leader election namespace to cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) [15:12:16] 10ops-eqiad, 06DBA, 06DC-Ops: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041#11738758 (10FCeratto-WMF) a:05Jclark-ctr→03FCeratto-WMF @Jclark-ctr thank you. 
[15:13:38] (03PS4) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 [15:14:29] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:44] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:50] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet [15:14:58] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1020.eqiad.wmnet [15:14:59] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1020.eqiad.wmnet [15:19:29] RESOLVED: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:55] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:21:07] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:21:08] (03PS2) 10Kosta Harlan: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) [15:21:52] (03PS3) 10Kosta Harlan: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) [15:22:17] (03PS5) 10Majavah: cloudlb: Merge http-by-host to 
main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 [15:23:04] (03CR) 10Jforrester: Enable view urls in abstract.wikipedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza) [15:23:22] (03CR) 10Dreamy Jazz: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan) [15:23:30] (03CR) 10Dreamy Jazz: [C:03+1] EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan) [15:24:44] FIRING: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:10] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8325/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [15:26:29] (03PS6) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 [15:28:52] (03CR) 10Ayounsi: [C:03+1] "lgtm but I'm not that familiar with nout nftables setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [15:29:02] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8326/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [15:29:44] RESOLVED: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:05] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1530). [15:31:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1172.eqiad.wmnet [15:33:02] (03CR) 10BPirkle: [C:03+1] "looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester) [15:34:44] FIRING: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:50] (03CR) 10JHathaway: [C:03+1] nftables: support nftables::rules definitions targetting prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [15:39:44] RESOLVED: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:50] (03PS1) 10Ebernhardson: search: Add codfw semanticsearch cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1259143 [15:39:54] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:10] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:45:37] FIRING: [3x] ProbeDown: Service restbase1034-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:10] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:49:33] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11738970 (10Jgreen) [15:50:00] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:50:37] FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:51:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11738984 (10VRiley-WMF) Thanks @Eevans Admittedly, I think it would be safest to shut down the server in order to have it verified which disk we are replacing. We have a spare on standby for this. If you wanted... 
[15:52:19] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11738990 (10Jgreen) @Jclark-ctr I just noticed it looks like these were configured in the frack-fundraising1-c-eqiad vlan, looks like I missed updating the install details when the task was create... [15:52:31] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:31] 10ops-codfw, 06cloud-services-team, 06DC-Ops: Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T420948 (10Andrew) 03NEW [15:53:33] !log disabling puppet for nftables-enabled machines to validate new ruleset on selected hosts before wider rollout T420715 [15:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:36] (03PS4) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) [15:55:18] (03CR) 10Genoveva Galarza: Enable view urls in abstract.wikipedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza) [15:55:37] RESOLVED: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:56:50] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:57:23] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:31] RESOLVED: ProbeDown: Service 
gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:35] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:57:45] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:57:48] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1025.eqiad.wmnet with reason: Rebooting clouddb1025 T419960 [15:58:05] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1025.eqiad.wmnet [15:59:07] (03CR) 10Cathal Mooney: [C:03+2] nftables: support nftables::rules definitions targetting prerouting [puppet] - 10https://gerrit.wikimedia.org/r/1259080 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [15:59:11] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11739070 (10ABran-WMF) 05Open→03In progress p:05Triage→03Medium [16:00:52] FIRING: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:10] (03CR) 10Dwisehaupt: [C:03+1] Switch fundraising default bastion back to eqiad after kernel update. 
[dns] - 10https://gerrit.wikimedia.org/r/1259081 (owner: 10Jgreen) [16:02:37] (03PS1) 10Jdlrobson: Address FIXME and drop not selector for section headings [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1259147 (https://phabricator.wikimedia.org/T420085) [16:02:53] RECOVERY - Host ps1-b7-codfw is UP: PING WARNING - Packet loss = 71%, RTA = 31.03 ms [16:02:55] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.60 ms [16:03:05] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on aqs1010.eqiad.wmnet with reason: Shutting down for SSD replacement — T420867 [16:03:06] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11739121 (10Dzahn) @Jhancock.wm This server is currently not the active production Phabricator.... [16:03:11] T420867: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867 [16:03:11] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739122 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5399a138-5392-45c7-819b-6efa3f7d322a) set by eevans@cumin1003 for 8:00:00 on 1 host(s) and their services with reason: Shutting down... 
[16:03:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:03:49] (03PS1) 10Elukey: mcrouter: ease testing new cli parameters [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) [16:04:50] !log stopping aqs1010 for SSD replacement — T420867 [16:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:11] (03CR) 10Elukey: "I would like to test the following:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey) [16:05:31] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739147 (10Eevans) >>! In T420867#11738984, @VRiley-WMF wrote: > Thanks @Eevans Admittedly, I think it would be safest to shut down the server in order to have it verified which disk we are replacing. We have... 
[16:05:52] RESOLVED: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:19] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1025.eqiad.wmnet [16:09:19] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:38] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1025.eqiad.wmnet [16:09:39] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1025.eqiad.wmnet [16:10:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11739173 (10elukey) @Jclark-ctr all hosts provisioned! The new cookbook is not merged, but I thought to unblock you :) [16:10:38] (03PS1) 10Clément Goubert: rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151 [16:10:52] FIRING: [12x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:01] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1023.eqiad.wmnet with reason: Rebooting clouddb1023 T419960 [16:13:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739209 (10VRiley-WMF) Shut down the unit. 
Verified the disk location, and brought it back up. Once it was up, I performed the swap. This should be good to go! [16:13:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy) [16:14:14] (03CR) 10Dzahn: gerrit: add Envoy TLS termination for the CDN path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [16:14:32] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151 (owner: 10Clément Goubert) [16:15:20] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151 (owner: 10Clément Goubert) [16:15:52] FIRING: [18x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:55] FIRING: [20x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:23] (03Merged) 10jenkins-bot: rest-gateway: Fix linkrecommendation definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259151 (owner: 10Clément Goubert) [16:17:54] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:18:10] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:18:49] (03CR) 10Anzx: "recheck" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1256748 (https://phabricator.wikimedia.org/T420704) (owner: 10Codename Noreste) [16:19:19] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:22] (03CR) 10Jgreen: [C:03+2] Switch fundraising default bastion back to eqiad after kernel update. [dns] - 10https://gerrit.wikimedia.org/r/1259081 (owner: 10Jgreen) [16:19:37] !log jgreen@dns1004 START - running authdns-update [16:20:28] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11739302 (10OKryva-WMF) [16:20:52] FIRING: [20x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:55] (03PS1) 10Clément Goubert: rest-gateway: fix cidr [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259154 [16:21:09] !log jgreen@dns1004 END - running authdns-update [16:22:03] FIRING: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:23:47] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: fix cidr [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259154 (owner: 10Clément Goubert) [16:24:01] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739322 (10Eevans) >>! In T420867#11739209, @VRiley-WMF wrote: > Shut down the unit. Verified the disk location, and brought it back up. 
Once it was up, I performed the swap. This should be good to go! Thanks... [16:24:22] !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1010.eqiad.wmnet [16:24:23] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1010.eqiad.wmnet [16:25:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1010 - https://phabricator.wikimedia.org/T420867#11739351 (10VRiley-WMF) 05Open→03Resolved [16:25:52] FIRING: [20x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:06] (03Merged) 10jenkins-bot: rest-gateway: fix cidr [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259154 (owner: 10Clément Goubert) [16:27:59] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:28:11] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:29:16] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1023.eqiad.wmnet [16:29:17] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1023.eqiad.wmnet [16:30:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:30:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester) [16:30:41] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 90%, RTA = 6741.95 ms [16:30:52] FIRING: [21x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:55] FIRING: [21x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:31:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:31:41] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11739396 (10VRiley-WMF) Hey @jcrespo we got this ticket to replace a drive on this unit. We can do this as soon as today if you're ready. Since this is under warranty, we're going to use one that is fro... [16:32:19] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [16:32:26] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [16:34:11] (03PS14) 10Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) [16:34:19] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:39] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet [16:34:39] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:34:43] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:35:11] (03PS1) 10Btullis: Update dse-k8s-eqiad to k8s 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1259155 
(https://phabricator.wikimedia.org/T414484) [16:35:21] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [16:35:41] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [16:35:52] FIRING: [18x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:36:48] RESOLVED: PuppetFailure: Puppet has failed on ganeti2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:38:07] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [16:38:18] (03CR) 10CI reject: [V:04-1] rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [16:38:27] (03PS6) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) [16:38:30] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet [16:40:17] (03CR) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [16:40:44] (03CR) 10Clément Goubert: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [16:40:52] FIRING: [18x] ProbeDown: Ripe Atlas anchor 
atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:41:19] (03CR) 10Btullis: "Do not merge until the maintenance window on March 26th." [puppet] - 10https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [16:41:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:42:02] 06SRE, 10SRE-swift-storage, 10Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11739454 (10hnowlan) For the immediate future I think for the moment we're fine with current thanos-swift capacity. We'll experiment with SSD storage elsewhere but for now we do... [16:42:32] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11739456 (10jcrespo) That's an s7 core host, it is for @FCeratto-WMF to make the call. [16:45:26] (03CR) 10Btullis: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [16:45:52] FIRING: [18x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:46:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11739476 (10AnnieKim_WMDE) Encountering an error when I try to log into Superset: "Authentication Failure. Service access denied due to missing privileges." C... 
[16:46:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:47:30] (03CR) 10Trueg: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [16:49:22] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [16:50:01] (03PS1) 10Btullis: Update dse-k8s-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) [16:50:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:50:52] FIRING: [17x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:53] (03Merged) 10jenkins-bot: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [16:52:36] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:52:49] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:53:10] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:54:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11739508 (10BTullis) 05Open→03Resolved I believe that this is now fixed. 
Thanks @Jclar... [16:55:40] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [16:55:47] RECOVERY - Bird Internet Routing Daemon on doh7004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:55:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:55] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:55] !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for 14 hosts [16:55:55] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:56:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts [16:56:37] (03CR) 10JHathaway: [C:03+2] run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [16:56:56] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:57:46] (03CR) 10CI reject: [V:04-1] Update dse-k8s-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [16:58:35] (03CR) 10Trueg: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [16:58:39] (03PS1) 10Clément Goubert: rest-gateway: fix 
mobileapps cluster for core [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259160 [16:59:11] (03PS2) 10Trueg: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) [16:59:19] RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1700) [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1700). [17:00:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet [17:00:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:01:57] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: fix mobileapps cluster for core [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259160 (owner: 10Clément Goubert) [17:02:03] (03PS1) 10Scott French: mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) [17:02:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:03:41] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:03:59] (03Merged) 10jenkins-bot: rest-gateway: fix 
mobileapps cluster for core [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259160 (owner: 10Clément Goubert) [17:04:17] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:04:23] (03PS2) 10Scott French: mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) [17:04:39] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:05:14] (03PS1) 10JHathaway: WIP: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/1259162 [17:05:41] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add api.w.o device-analytics support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert) [17:05:51] (03CR) 10RLazarus: [C:03+1] mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:05:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:06:27] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:06:45] (03PS2) 10JHathaway: WIP: do not merge, test 2 [puppet] - 10https://gerrit.wikimedia.org/r/1259162 [17:07:01] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:07:59] (03Merged) 10jenkins-bot: rest-gateway: Add api.w.o device-analytics support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert) [17:08:13] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11739553 
(10bd808) Load has been spiky over the last 7 days with increased spike frequency on 2026-03-22 for sure. {F73533258,size=full} We likely have either a new range that b... [17:08:26] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:08:33] o/ [17:08:33] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:08:42] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:09:11] FYI, as part of this infra window, I'll be applying a change to mw-web in a little bit [17:09:17] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:09:56] (03PS3) 10JHathaway: WIP: do not merge, test 2 [puppet] - 10https://gerrit.wikimedia.org/r/1259162 [17:10:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:14] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:12:36] (03CR) 10Scott French: [C:03+2] mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:12:49] !log bd808@deploy2002 Started deploy [releng/jenkins-deploy@f47af21] (releasing): jobs: Use TZ=UTC in branchMWSingleVersion.groovy trigger (T404399) [17:12:54] T404399: wmf/next branch cut job on releases-jenkins and systemd timer on deployment server times overlap - https://phabricator.wikimedia.org/T404399 [17:13:36] !log bd808@deploy2002 Finished deploy [releng/jenkins-deploy@f47af21] (releasing): jobs: Use TZ=UTC in branchMWSingleVersion.groovy trigger (T404399) (duration: 01m 36s) [17:13:41] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:14:46] (03Merged) 
10jenkins-bot: mw-web: Reenable envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259161 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:15:55] FIRING: [11x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:16:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:17:01] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:17:39] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:18:30] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:05] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:20:41] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [17:20:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:21:05] (03PS1) 10Tiziano Fogli: thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1259168 (https://phabricator.wikimedia.org/T410152) [17:21:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:21:54] !log elukey@cumin1003 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:22:05] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:23:41] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:24:16] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:25:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: 10Scardenasmolinar) [17:25:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:16] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:26:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [17:27:56] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:29:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:30:01] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1100.eqiad.wmnet [reason: trixie reimaging] [17:30:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor 
atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:31:08] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS trixie [17:31:28] brett@cumin2002 reimage (PID 1072326) is awaiting input [17:32:40] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1259168 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [17:33:09] (03PS1) 10Hnowlan: prometheus: add recording rules for the appservers RED dashboard [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663) [17:33:14] jouncebot: nowandnext [17:33:14] For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T1700) [17:33:14] In 2 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2000) [17:33:38] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:34:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:34:26] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1101.eqiad.wmnet [reason: trixie reimaging] [17:34:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan) [17:34:49] !log cdobbins@cumin2002 
START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS trixie [17:35:13] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:35:26] ?? why is there a backport happening [17:35:37] (03Merged) 10jenkins-bot: EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259136 (https://phabricator.wikimedia.org/T418740) (owner: 10Kosta Harlan) [17:35:48] Stopping scap [17:35:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:54] Thought the window wasn't being used [17:36:17] (additionally because the change is a no-op) [17:36:17] Dreamy_Jazz: ah, got it - I'll be done checking on things in ~ 10 mins or so [17:37:00] Thanks, apologies [17:38:46] 06SRE, 06Infrastructure-Foundations, 10netops: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975 (10cmooney) 03NEW p:05Triage→03Medium [17:39:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:39:54] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:40:26] 06SRE, 06Infrastructure-Foundations, 10netops: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975#11739860 
(10cmooney) [17:40:52] FIRING: [22x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:41:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [17:42:33] Dreamy_Jazz: alright, things look good. all yours :) [17:42:56] Thanks, and apologies again (should have seen your message from above about using the window but missed it) [17:43:27] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1259136|EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream (T418740)]] [17:43:32] T418740: Special:CheckUser: Conditionally show a link to "SI cases" - https://phabricator.wikimedia.org/T418740 [17:43:43] a lot of noise in here today! [17:45:17] !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Backport for [[gerrit:1259136|EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream (T418740)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:45:38] !log dreamyjazz@deploy2002 kharlan, dreamyjazz: Continuing with sync [17:45:52] FIRING: [16x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:19] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:49:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:49:56] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259136|EventStreamConfig: Document not adding performer attributes to SI interaction v2 stream (T418740)]] (duration: 06m 28s) [17:50:01] T418740: Special:CheckUser: Conditionally show a link to "SI cases" - https://phabricator.wikimedia.org/T418740 [17:50:03] I'm done with scap [17:50:52] FIRING: [15x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:55] FIRING: [16x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:13] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) 
rolling reboot on A:restbase-eqiad [17:54:03] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1115.eqiad.wmnet with OS trixie [17:54:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:54:48] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [17:55:52] FIRING: [14x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:40] (03PS4) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu) [18:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [18:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [18:00:52] RESOLVED: [14x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:05:55] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{aqs[1011,1014,1016-1022]*} and P{P:Cassandra} [18:10:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:10:42] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [18:10:52] FIRING: [14x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:10:55] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:10:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [18:14:59] (03CR) 10Catrope: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu) [18:15:29] (03CR) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu) [18:15:48] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:15:52] RESOLVED: [6x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:16] ^ bird is expected, trying to move traffic over [18:17:16] 06SRE, 06SRE-OnFire, 10Observability-Alerting: vopsbot !ack and !resolve without incident numbers aren't working - https://phabricator.wikimedia.org/T420982 (10RLazarus) 03NEW p:05Triage→03Medium [18:20:40] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:20:52] FIRING: [8x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:20:55] FIRING: [9x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:03] (03CR) 10Ssingh: [C:03+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu) [18:22:04] (03CR) 10Ssingh: [C:03+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu) [18:25:48] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:25:52] RESOLVED: [10x] ProbeDown: Service aqs1011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:30:40] FIRING: [5x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:31:32] (03PS1) 10Aaron Schulz: Add Analytics APIs to the RestSandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259183 (https://phabricator.wikimedia.org/T419429) [18:35:51] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on hcaptcha-proxy4001.wikimedia.org with reason: depooled host (soon to be decomed) [18:35:52] FIRING: [8x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:35:59] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740152 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7de31a58-f28e-43d7-99e1-e30cec213330) set by sukhe@cumin1003 for 3 days, 0:00:00... [18:36:12] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: depooled host (soon to be decomed) [18:36:20] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740153 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c4c9ce4c-f2e1-4e09-a892-c11aee00f6ea) set by sukhe@cumin1003 for 3 days, 0:00:00... 
[18:40:52] RESOLVED: [8x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:41:07] (03CR) 10Rubah Hitam Vukova: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [18:41:11] (03CR) 10Rubah Hitam Vukova: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [18:41:13] (03CR) 10Rubah Hitam Vukova: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [18:42:36] (03PS1) 10AKhatun: stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) [18:45:52] FIRING: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:48:54] 06SRE, 06Infrastructure-Foundations, 10netops: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740224 (10ssingh) Thanks for all the work here @cmooney and for mentioning this, something that I had most certainly overlooked at least. I will think a bit... 
[18:49:42] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [18:50:03] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [18:50:52] RESOLVED: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:32] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1100.eqiad.wmnet with OS trixie [18:53:49] (03PS9) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [18:53:49] (03PS1) 10Eevans: charts/cassandra-http-gateway: configurable Cassandra keyspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259188 (https://phabricator.wikimedia.org/T414112) [18:54:11] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS trixie [18:55:43] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1101.eqiad.wmnet with OS trixie [18:55:52] FIRING: [9x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:55] FIRING: [9x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:56:07] (03CR) 10Bking: [C:03+2] "You are actually correct, we will be flying blind until we can get on the new chart (if we have to...we have also discussed making a separ" [puppet] - 
10https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [18:57:05] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS trixie [18:59:32] (03CR) 10JavierMonton: [C:03+1] stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:59:46] !log bking@deploy2002 restarting opensearch-semantic-search eqiad to renew certs [18:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:52] RESOLVED: [8x] ProbeDown: Service aqs1017-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:07:39] (03CR) 10Ottomata: [C:03+1] stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [19:08:31] (03CR) 10AKhatun: [C:03+2] stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [19:10:38] (03Merged) 10jenkins-bot: stream: mw-page-edit-type-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259186 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [19:10:41] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage [19:10:52] FIRING: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:12:16] (03CR) 10Brouberol: [C:04-1] "LGTM! 
Setting a -1 so this does not get merged before the maintenance window" [puppet] - 10https://gerrit.wikimedia.org/r/1259155 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [19:12:51] (03CR) 10Brouberol: [C:04-1] "LGTM! Setting a -1 so this does not get merged before the maintenance window" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259141 (https://phabricator.wikimedia.org/T383553) (owner: 10Btullis) [19:13:05] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [19:13:20] !log akhatun@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [19:13:44] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [19:14:26] (03PS7) 10Majavah: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 [19:14:44] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage [19:14:58] (03CR) 10Brouberol: "I think that (looking at the CI logs) you also need to set `installCRDs: false` in `helmfile.d/admin_ng/cert-manager/cert-manager-values.y" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [19:15:52] RESOLVED: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:17:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8328/co" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [19:17:26] (03PS1) 10Ayounsi: anycast: don't prepent last AS in ulsfo 
[homer/public] - 10https://gerrit.wikimedia.org/r/1259199 [19:17:47] (03PS2) 10Ayounsi: anycast: don't prepend last AS in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 [19:18:01] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [19:18:46] 06SRE, 06Infrastructure-Foundations, 10netops: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740353 (10cmooney) Thanks @ssingh. I think a cookbook that takes down doh and durum simultaneously at a site (I assume by changing bird?) would solve this p... [19:19:47] (03CR) 10Ssingh: [C:03+1] "I can confirm the behaviour we are seeing, not sure about the syntax but I trust you know it so looks good!" [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi) [19:20:52] FIRING: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:15] (03CR) 10Cathal Mooney: [C:03+1] "LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi) [19:23:50] (03CR) 10Ayounsi: [C:03+2] anycast: don't prepend last AS in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi) [19:25:31] (03Merged) 10jenkins-bot: anycast: don't prepend last AS in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1259199 (owner: 10Ayounsi) [19:25:52] RESOLVED: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:38] (03CR) 10Scott French: "Thanks, Matthew!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) (owner: 10Scott French) [19:30:01] (03CR) 10Scott French: [C:03+2] admin: Add mpostoronca shell access and deployment membership [puppet] - 10https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) (owner: 10Scott French) [19:30:32] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [19:30:52] FIRING: [11x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:34:56] (03PS7) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [19:35:52] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [19:35:52] RESOLVED: [8x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:55] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1100.eqiad.wmnet with OS trixie [19:38:30] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11740519 (10Scott_French) 05Open→03Resolved a:03Scott_French @MPostoronca-WMF - Thanks for your patience. This should be rolling out over the next 30 minutes or so. 
[19:39:02] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1100.eqiad.wmnet [reason: trixie reimaging] [19:40:20] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:40:28] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:40:30] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1101.eqiad.wmnet with OS trixie [19:40:31] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1102.eqiad.wmnet [reason: trixie reimaging] [19:41:01] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS trixie [19:41:30] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1101.eqiad.wmnet [reason: trixie reimaging] [19:42:15] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1103.eqiad.wmnet [reason: trixie reimaging] [19:42:34] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS trixie [19:44:29] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on P{aqs[1011,1014,1016-1022]*} and P{P:Cassandra} [19:44:43] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [19:46:02] (03CR) 10Jforrester: [C:03+1] Enable view urls in abstract.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza) [19:46:02] (03CR) 10Gmodena: [C:03+1] wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [19:46:44] !log 
ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy4003.wikimedia.org [19:47:33] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy4003.wikimedia.org [19:47:58] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [19:48:47] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740581 (10Scott_French) @AnnieKim_WMDE - Thanks for creating your LDAP account (having one is a prerequisite for gaining the privileges sought here). I'll f... [19:49:56] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-18-023444 to 2026-03-23-124102 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259205 (https://phabricator.wikimedia.org/T418150) [19:50:21] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy4004.wikimedia.org [19:50:23] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-18-023444 to 2026-03-23-124102 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259205 (https://phabricator.wikimedia.org/T418150) (owner: 10Jforrester) [19:50:51] !log cdobbins@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS trixie [19:51:01] !log cdobbins@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1102.eqiad.wmnet with OS trixie [19:51:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy4004.wikimedia.org [19:52:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740594 (10Scott_French) [19:52:27] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-18-023444 to 2026-03-23-124102 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1259205 (https://phabricator.wikimedia.org/T418150) (owner: 10Jforrester) [19:54:01] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:54:28] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [19:54:29] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:57:33] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:58:07] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS trixie [19:58:10] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [19:58:15] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:58:47] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:59:06] (03PS1) 10Scott French: admin: Add anniekimwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1259208 (https://phabricator.wikimedia.org/T420500) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2000). [20:00:05] alexsanford, RoanKattouw, danisztls, James_F, milimetric, and cmelo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] Hey. [20:00:21] Hey! [20:00:28] I can start with mine. I need to deploy a private file first, and then do the config update [20:00:33] deployment confusion time [20:00:35] Ack. 
[20:00:48] o/ [20:01:27] hi here [20:01:31] \o/ [20:02:01] my config update is very isolated if anyone wants to merge it with theirs [20:02:34] same [20:04:54] (03CR) 10CDanis: "I think this is fine, but, I'll note that you could also do this in the CDN directly with some extra mappings in `hieradata/common/profile" [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [20:07:47] !log Deployed mitigation for T419605 [20:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:04] (doing config change next) [20:08:27] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS trixie [20:08:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: 10Alex.sanford) [20:08:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740702 (10Scott_French) [20:09:34] (03Merged) 10jenkins-bot: Reduce reauth timeout for editing site JS to 10 minutes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: 10Alex.sanford) [20:09:52] !log alexsanford@deploy2002 Started scap sync-world: Backport for [[gerrit:1256472|Reduce reauth timeout for editing site JS to 10 minutes (T419605)]] [20:09:53] cmelo, milimetric: if James_F don't mind I can batch yours with my deployment [20:10:11] danisztls: thank you, that'd be great [20:10:17] danisztls: Sure! [20:10:29] (do I need to +2 it or you do that?) [20:10:40] milimetric: danisztls will do that. [20:10:52] (sorry thx) [20:11:08] Never a problem. :-) [20:11:31] thanks danisztls [20:11:41] I'll let SpiderPig do the 'dirty' work. 
[20:11:42] !log alexsanford@deploy2002 alexsanford: Backport for [[gerrit:1256472|Reduce reauth timeout for editing site JS to 10 minutes (T419605)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:13:06] !log alexsanford@deploy2002 alexsanford: Continuing with sync [20:14:45] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage [20:17:24] !log alexsanford@deploy2002 Finished scap sync-world: Backport for [[gerrit:1256472|Reduce reauth timeout for editing site JS to 10 minutes (T419605)]] (duration: 07m 32s) [20:17:51] Ok, mine is all good :) [20:18:23] alexsanford: Are you doing RoanKattouw's patch too? Or is it over to danisztls? [20:18:24] (03CR) 10RLazarus: [C:03+1] admin: Add anniekimwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1259208 (https://phabricator.wikimedia.org/T420500) (owner: 10Scott French) [20:19:10] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage [20:19:23] Over to danisztls [20:20:34] alexsanford: thanks [20:20:36] proceeding [20:21:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:21:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:21:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:21:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric) [20:21:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy) [20:21:19] (03PS1) 10Bking: discovery: Replace soon-to-be-expired intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1259216 (https://phabricator.wikimedia.org/T420993) [20:21:42] Not my one? :-) [20:21:55] (I can also self-deploy, no worries.) [20:22:22] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [20:22:27] (03Merged) 10jenkins-bot: Undeploy participant recruitment survey on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:22:28] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [20:22:30] (03CR) 10CI reject: [V:04-1] Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:22:31] (03CR) 10CI reject: [V:04-1] Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:22:34] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11740758 (10VRiley-WMF) [20:22:34] (03Merged) 10jenkins-bot: testKitchen: Add custom stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric) [20:22:37] (03Merged) 
10jenkins-bot: Enable wgCampaignEventsEnableEventGoals in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259120 (https://phabricator.wikimedia.org/T414148) (owner: 10Daimona Eaytoy) [20:23:08] need to rebase [20:23:20] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [20:24:31] (03PS3) 10DDesouza: Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) [20:24:56] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [20:26:09] (03CR) 10DDesouza: [C:03+2] Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:27:17] (03Merged) 10jenkins-bot: Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:27:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:27:38] (03CR) 10CI reject: [V:04-1] Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:28:53] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [20:30:37] (03PS2) 10Bking: discovery: Replace soon-to-be-expired intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1259216 (https://phabricator.wikimedia.org/T420993) [20:30:59] (03PS2) 10DDesouza: Undeploy participant recruitment survey on frwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) [20:31:14] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy4001.wikimedia.org [20:33:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:33:39] All good with mine, thank you!!! [20:33:51] cmelo: Yours isn't deployed yet. [20:33:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11740859 (10Jgreen) @Jclark-ctr we don't have the prod management password, only the frack one and a temporary one the other DC-Ops use for us. Can you reset these too? [20:33:57] cmelo: haven't deployed any yey [20:33:58] Just merged. [20:34:00] *yet [20:34:20] sorry about the delay, my patches needed to be rebased [20:34:21] (03Merged) 10jenkins-bot: Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) (owner: 10DDesouza) [20:34:43] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1254448|Undeploy participant recruitment survey on ptwiki (T419275)]], [[gerrit:1254450|Undeploy participant recruitment survey on trwiki (T419275)]], [[gerrit:1254452|Undeploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1255763|testKitchen: Add custom stream name (T417050)]], [[gerrit:1259120|Enable wgCampaignEventsEnableEventGoals in beta wikis (T414148)]] [20:34:51] T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275 [20:34:51] T419778: Deploy QuickSurvey for research participant registration drive on frwiki - https://phabricator.wikimedia.org/T419778 [20:34:52] T417050: Attribution Research: Instrument 
pageviews - https://phabricator.wikimedia.org/T417050 [20:34:52] T414148: Enable event goals in beta - https://phabricator.wikimedia.org/T414148 [20:35:40] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [20:36:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11740893 (10Jgreen) a:05Jgreen→03Jclark-ctr [20:36:38] !log dani@deploy2002 milimetric, daimona, dani: Backport for [[gerrit:1254448|Undeploy participant recruitment survey on ptwiki (T419275)]], [[gerrit:1254450|Undeploy participant recruitment survey on trwiki (T419275)]], [[gerrit:1254452|Undeploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1255763|testKitchen: Add custom stream name (T417050)]], [[gerrit:1259120|Enable wgCampaignEventsEnableEventGoals in beta wikis (T414148)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:36:41] No worries, I can already see the changes available in beta [20:36:51] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11740899 (10Jgreen) a:05Jgreen→03Jclark-ctr [20:36:52] cmelo: great! [20:36:56] milimetric: can you test? 
[20:37:36] mine isn't testable until a deployment tomorrow, it's just preparing for that [20:37:46] milimetric: ok [20:37:49] nothing that uses that config is broken on debug servers, so all good [20:37:52] !log dani@deploy2002 milimetric, daimona, dani: Continuing with sync [20:39:51] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [20:40:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [20:40:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:40:39] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy4001.wikimedia.org [20:40:45] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740918 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `hcaptcha-proxy4001.wikimedia.org` - hcaptcha-proxy40... 
[20:41:55] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy4002.wikimedia.org [20:42:07] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1102.eqiad.wmnet with OS trixie [20:42:09] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254448|Undeploy participant recruitment survey on ptwiki (T419275)]], [[gerrit:1254450|Undeploy participant recruitment survey on trwiki (T419275)]], [[gerrit:1254452|Undeploy participant recruitment survey on frwiki (T419778)]], [[gerrit:1255763|testKitchen: Add custom stream name (T417050)]], [[gerrit:1259120|Enable wgCampaignEventsEnableEventGoals in beta wikis (T414148)]] (duration: 07m 26s) [20:42:17] T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275 [20:42:18] T419778: Deploy QuickSurvey for research participant registration drive on frwiki - https://phabricator.wikimedia.org/T419778 [20:42:18] T417050: Attribution Research: Instrument pageviews - https://phabricator.wikimedia.org/T417050 [20:42:19] T414148: Enable event goals in beta - https://phabricator.wikimedia.org/T414148 [20:42:21] RoanKattouw, James_F: I'm done [20:42:26] OK. 
[20:42:29] (03CR) 10Jforrester: [C:03+2] Abstract Wikipedia: Fix API call to get page info [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1256394 (https://phabricator.wikimedia.org/T420725) (owner: 10Jforrester) [20:42:44] (03CR) 10Jforrester: [C:03+2] [abstractwiki] Enable the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester) [20:42:46] (03CR) 10Jforrester: [C:03+2] Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester) [20:42:56] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1102.eqiad.wmnet [reason: trixie reimaging] [20:43:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1256394 (https://phabricator.wikimedia.org/T420725) (owner: 10Jforrester) [20:43:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester) [20:43:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester) [20:43:19] (03PS1) 10RLazarus: cache.mcrouter: Copy 1.3.4 to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259221 [20:43:19] (03PS1) 10RLazarus: cache.mcrouter: Add replica.remote_read option [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259222 (https://phabricator.wikimedia.org/T411807) [20:44:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740929 (10Ottomata) Approved. 
[20:44:38] (03Merged) 10jenkins-bot: [abstractwiki] Enable the Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259085 (https://phabricator.wikimedia.org/T420656) (owner: 10Jforrester) [20:44:42] (03Merged) 10jenkins-bot: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester) [20:45:09] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [20:46:22] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [20:47:19] (03Merged) 10jenkins-bot: Abstract Wikipedia: Fix API call to get page info [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1256394 (https://phabricator.wikimedia.org/T420725) (owner: 10Jforrester) [20:47:40] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1256394|Abstract Wikipedia: Fix API call to get page info (T420725)]], [[gerrit:1259085|[abstractwiki] Enable the Translate extension (T420656)]], [[gerrit:1250113|Move testwiki-only Attribution REST API definition to IS]] [20:47:46] T420725: Abstract Wikipedia allows creation of existing articles - https://phabricator.wikimedia.org/T420725 [20:47:47] T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656 [20:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:10] (03CR) 10Scott French: [C:03+2] admin: Add anniekimwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1259208 (https://phabricator.wikimedia.org/T420500) (owner: 10Scott French) [20:49:19] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@ulsfo - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:50:10] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [20:50:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [20:50:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:30] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy4002.wikimedia.org [20:50:40] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740956 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `hcaptcha-proxy4002.wikimedia.org` - hcaptcha-proxy40... [20:51:25] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1103.eqiad.wmnet with OS trixie [20:51:34] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11740957 (10ssingh) hcaptcha-proxy400[12], on the old Ganeti setup are now decommissioned. I think these were the last two VMs that had to be moved. 
[20:53:34] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1256394|Abstract Wikipedia: Fix API call to get page info (T420725)]], [[gerrit:1259085|[abstractwiki] Enable the Translate extension (T420656)]], [[gerrit:1250113|Move testwiki-only Attribution REST API definition to IS]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:53:40] T420725: Abstract Wikipedia allows creation of existing articles - https://phabricator.wikimedia.org/T420725 [20:53:40] T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656 [20:53:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11740960 (10Scott_French) 05Open→03Resolved a:03Scott_French Thanks, all! @AnnieKim_WMDE - Your [[ http... [20:54:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job mtail in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:54:31] !log jforrester@deploy2002 jforrester: Continuing with sync [20:56:50] 06SRE, 10SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11740964 (10Scott_French) [20:58:52] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1256394|Abstract Wikipedia: Fix API call to get page info (T420725)]], [[gerrit:1259085|[abstractwiki] Enable the Translate extension (T420656)]], [[gerrit:1250113|Move testwiki-only Attribution REST API definition to IS]] (duration: 11m 12s) [20:58:58] T420725: Abstract Wikipedia allows creation of existing articles - https://phabricator.wikimedia.org/T420725 [20:58:58] All done, just in time. 
[20:58:58] T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656 [21:00:05] Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2100). [21:01:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11740987 (10Scott_French) @Daria-WMDE - Great, thank you! Once the NDA comes through, I believe that should be everything we need to en... [21:03:03] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1103.eqiad.wmnet [reason: trixie reimaging] [21:03:15] maryum: You're doing the security deploy I think? Once you're done I have another patch that I forgot to do during the previous window [21:03:15] 06SRE, 10SRE-Access-Requests: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11740996 (10Scott_French) [21:04:12] (03PS2) 10Jforrester: Move GrowthExperiments REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250114 [21:04:38] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1104.eqiad.wmnet [reason: trixie reimaging] [21:05:02] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11741000 (10Jgreen) Working on frqueue1005: * disabled the "embedded" NICs * set serial port address: COM2 * set console redirection after boot: enabled * switched boot method... 
[21:05:10] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS trixie [21:05:18] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1106.eqiad.wmnet [reason: trixie reimaging] [21:05:26] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1106.eqiad.wmnet [reason: trixie reimaging] [21:05:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11741004 (10Scott_French) @Alice.moutinho - Great, thank you - I see alicem LDAP account was created. Once the NDA comes through, I believe that should be everything... [21:08:24] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11741014 (10Scott_French) [21:08:36] Roankattouw: yes getting started with security deploys now [21:08:51] RoanKattouw: are you deploying anything? [21:10:15] maryum: Yes a patch from the backport window (previous hour) that I didn't get to [21:10:32] RoanKattouw: if you want you can go ahead and do that now [21:11:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255847 (owner: 10Catrope) [21:12:33] (03Merged) 10jenkins-bot: testwiki: Add temporary groups for security testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255847 (owner: 10Catrope) [21:12:52] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1255847|testwiki: Add temporary groups for security testing]] [21:13:23] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11741045 (10Scott_French) @WMDE-leszek - Thank you! @kera_wmde - Just to confirm, from the title of this task, it sounds like you are requesting "level 1" access [[ https://wikitech.wi... 
[21:13:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11741047 (10Scott_French) [21:14:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741049 (10Scott_French) @bvibber - Just to signal boost in case it got lost in the noise: >>! In T420406#11722329, @ayounsi wrote: > @bvibber... [21:18:43] !log catrope@deploy2002 catrope: Backport for [[gerrit:1255847|testwiki: Add temporary groups for security testing]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:19:55] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:08] !log catrope@deploy2002 catrope: Continuing with sync [21:22:14] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [21:25:25] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255847|testwiki: Add temporary groups for security testing]] (duration: 12m 33s) [21:28:07] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741074 (10bvibber) and..... signed :D thx! 
[21:28:50] (03Abandoned) 10JHathaway: WIP: do not merge, test 2 [puppet] - 10https://gerrit.wikimedia.org/r/1259162 (owner: 10JHathaway) [21:29:12] preparing to run scap [21:34:14] (03PS9) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:34:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741106 (10Scott_French) [21:34:56] (03CR) 10CI reject: [V:04-1] nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:35:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [21:37:28] (03CR) 10JHathaway: "Apologies for the wait @taavi@wikimedia.org. I made an attempt at iterating on your good work to further reproduce the duplication in logi" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:39:23] (03PS10) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:40:04] (03CR) 10CI reject: [V:04-1] nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:41:32] !log Deployed security fix for T419168 [21:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:41] first of three patches deployed [21:43:07] (03PS11) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:43:09] running scap for second patch 
[21:43:48] (03CR) 10CI reject: [V:04-1] nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:44:10] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741125 (10Scott_French) @bvibber - Great, thanks! One last question: I see that the SSH public key you've provided here is different from [[ htt... [21:50:24] (03PS3) 10Scott French: admin: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi) [21:50:31] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741139 (10bvibber) @Scott_French ah I misread the instructions I think. :D Ok to provide the same key as for other wikimedia production servers,... [21:51:28] (03CR) 10Scott French: "Manual rebase to absorb changes to `analytics_privatedata_users`."
[puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi) [21:53:03] (03PS12) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:53:17] !log Deployed security fix for T419192 [21:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:27] preparing to run scap for the 3rd and final security patch [21:54:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [21:56:14] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741162 (10Scott_French) @bvibber - Thanks! Yes, exactly - you can continue to use your existing production SSH public key as usual (i.e., the on... [21:56:43] (03PS1) 10Daimona Eaytoy: Enable the CampaignEvents extension on all wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597) [21:56:58] (03PS1) 10Bking: trixie: Add component/opensearch2 [puppet] - 10https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) [21:57:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741165 (10bvibber) [21:57:27] (03CR) 10CI reject: [V:04-1] trixie: Add component/opensearch2 [puppet] - 10https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) (owner: 10Bking) [21:57:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741168 (10bvibber) @Scott_French thanks done! 
Same ol' public key ;) [21:57:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11741170 (10Jgreen) frmx1002's management interface isn't accessible, doesn't respond to ping [21:58:49] (03PS2) 10Bking: trixie: Add component/opensearch2 [puppet] - 10https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) [21:59:06] (03PS1) 10Daimona Eaytoy: [WIP] Enable CampaignEvents on all SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259233 [22:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [22:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [22:02:41] (03CR) 10RLazarus: [C:03+1] admin: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi) [22:03:07] (03Abandoned) 10Daimona Eaytoy: [WIP] Enable CampaignEvents on all SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259233 (owner: 10Daimona Eaytoy) [22:04:06] (03CR) 10Scott French: [C:03+2] admin: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi) [22:04:35] (03PS1) 10Daimona Eaytoy: Enable $wgCampaignEventsEnableEventGoals in prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149) [22:04:56] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Tuesday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597) (owner: 10Daimona Eaytoy) [22:05:32] !log Deployed security fix for T415584 [22:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:42] Security deploy is finished [22:05:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149) (owner: 10Daimona Eaytoy) [22:07:00] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS trixie [22:08:47] 10ops-eqiad, 06DC-Ops: firmware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007 (10BCornwall) 03NEW [22:19:03] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11741336 (10Scott_French) 05Open→03Resolved a:03Scott_French Alright, I think that should do it! @bvibber - The c... [22:25:52] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1104.eqiad.wmnet with OS trixie [22:28:03] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host an-worker1172.eqiad.wmnet [22:31:55] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:36:52] (03CR) 10Majavah: [C:04-1] "Unfortunately the latest PS seems to be re-introducing T351094. See, e.g. 
here: https://puppet-compiler.wmflabs.org/output/1212097/6172/cl" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [22:38:01] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:44:50] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [22:49:25] (03PS1) 10LorenMora: Transition reading list experiment to instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368) [22:51:31] !log root@apt1002:~# reprepro --noskipold --restrict vopsbot update bookworm-wikimedia [22:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260323T2300) [23:04:49] !incidents [23:04:49] 7787 (ACKED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [23:04:50] 7788 (ACKED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [23:04:50] 7786 (RESOLVED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [23:04:50] 7785 (RESOLVED) db1253 (paged)/MariaDB Replica Lag: s7 (paged) [23:04:50] 7784 (RESOLVED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [23:04:57] !resolve [23:04:58] 7787 (RESOLVED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [23:04:58] 7788 (RESOLVED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [23:05:02] \o/ [23:07:02] 06SRE, 06SRE-OnFire, 10Observability-Alerting: vopsbot !ack and !resolve without incident numbers aren't working - https://phabricator.wikimedia.org/T420982#11741518 (10RLazarus) 05Open→03Resolved [23:08:28] (03CR) 10Aude: "This looks good though think we need to wait until 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1251505 is full" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T414368) (owner: 10LorenMora) [23:18:14] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741525 (10bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/5d4298ce7a31d1650f6741e2b9051b82e9661c8a%5E%21/#F0 ` diff --git a/deployment-prep/_.yam... [23:35:59] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741537 (10bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/b304067573816dedc4548607ca96202083700afd%5E%21/#F0 ` diff --git a/deployment-prep/_.yam... [23:36:31] brett@cumin2002 reimage (PID 1146748) is awaiting input [23:39:20] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741539 (10bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/af98ae0206c2602a25a4d88414d77291788c7f0f%5E%21/#F0 ` diff --git a/deployment-prep/_.yam... [23:39:23] (03PS3) 10Andrea Denisse: grafana: Hide version number for the anonymous role [puppet] - 10https://gerrit.wikimedia.org/r/1259254 (https://phabricator.wikimedia.org/T402844) [23:46:16] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741541 (10bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/bda711470077c72c5c1d40f9b34a1f036bbd3981%5E%21/#F0 ` diff --git a/deployment-prep/_.yam... 
[23:47:27] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741542 (10bd808) [23:59:50] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11741568 (10bd808) That's most of the really active networks Beta has seen in the last 24 hours blocked. Let's see what the 15 minute load graph looks like over the next couple...