[00:00:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11519530 (10Jclark-ctr) [00:01:13] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:01:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519531 (10VRiley-WMF) [00:04:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519533 (10VRiley-WMF) [00:08:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [00:08:57] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1370.eqiad.wmnet with OS trixie [00:09:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519535 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1370.eqiad.wmnet with OS trixie [00:15:12] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [00:15:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [00:15:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1372.eqiad.wmnet with OS trixie [00:15:35] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11519540 (10RKemper) (Working with @bking ) We 
provisioned a virtual disk for missing drive via the drac web ui. Then we entered the rescue shell and commented out the had... [00:15:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519541 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1372.eqiad.wmnet with OS trixie completed: - wikikub... [00:20:03] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage [00:24:34] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage [00:29:15] PROBLEM - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:38:55] (03PS1) 10Herron: assign mwlog[12]003 insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1226396 (https://phabricator.wikimedia.org/T412229) [00:40:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1226398 [00:40:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1226398 (owner: 10TrainBranchBot) [00:40:34] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [00:40:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" 
[00:40:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1370.eqiad.wmnet with OS trixie [00:41:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519577 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1370.eqiad.wmnet with OS trixie completed: - wikikub... [00:41:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519578 (10VRiley-WMF) 05Open→03Resolved This has been completed [00:42:52] (03CR) 10Herron: [C:03+2] assign mwlog[12]003 insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1226396 (https://phabricator.wikimedia.org/T412229) (owner: 10Herron) [00:45:05] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11519583 (10herron) >>! In T412229#11513890, @Jhancock.wm wrote: > @herron two things > - do you mind if i rack this in the new expansion cag... 
[00:53:28] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1226398 (owner: 10TrainBranchBot) [01:01:05] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1226406 [01:10:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1226406 (owner: 10TrainBranchBot) [01:14:00] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 54s) [01:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:34:05] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1226406 (owner: 10TrainBranchBot) [03:49:33] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 5088 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [04:01:57] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [04:01:57] PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101) [04:02:39] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 137.85 ms [04:02:39] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 138.07 ms [04:34:50] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1226178 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [04:58:19] PROBLEM - Host cp7004 is DOWN: CRITICAL - Time to live exceeded (10.140.1.5) [04:58:21] PROBLEM - Host cp7006 is DOWN: CRITICAL - Time to live exceeded (10.140.1.6) [04:58:21] PROBLEM - Host cp7016 is DOWN: CRITICAL - Time to live exceeded (10.140.1.11) [04:58:21] PROBLEM - Host cp7008 is DOWN: CRITICAL - Time to live exceeded (10.140.1.7) [04:58:21] PROBLEM - Host durum7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.7) [04:58:25] PROBLEM - Host hcaptcha-proxy7001 is DOWN: CRITICAL - Time to live exceeded (195.200.68.102) [04:58:25] PROBLEM - Host cp7002 is DOWN: CRITICAL - Time to live exceeded (10.140.1.4) [04:58:41] PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8) [04:58:41] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [04:58:47] RECOVERY - Host durum7004 is UP: PING OK - Packet loss = 0%, RTA = 138.15 ms [04:58:47] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 138.09 ms [04:58:57] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.15 ms [04:58:57] RECOVERY - Host cp7004 is UP: PING OK - Packet loss = 0%, RTA = 137.47 ms [04:58:59] RECOVERY - Host cp7002 is UP: PING OK - Packet loss = 0%, RTA = 137.74 ms [04:58:59] RECOVERY - Host cp7006 is UP: PING OK - Packet loss = 0%, RTA = 137.51 ms [04:59:05] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.17 ms [04:59:07] RECOVERY - Host cp7016 is UP: PING OK - Packet loss = 0%, RTA = 137.64 ms [04:59:07] RECOVERY - Host cp7008 is UP: PING OK - Packet loss = 0%, RTA = 137.43 ms [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:28:40] (03PS2) 10KartikMistry: Update cxserver to 2026-01-09-231405-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225807 (https://phabricator.wikimedia.org/T414237) [05:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:40] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226477 (https://phabricator.wikimedia.org/T128546) [05:45:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226477 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [06:06:01] (03PS1) 10Marostegui: site.pp: Add note about x3 tables [puppet] - 10https://gerrit.wikimedia.org/r/1226505 [06:06:47] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1226505 (owner: 10Marostegui) [06:06:49] (03CR) 10Marostegui: [C:03+2] site.pp: Add note about x3 tables [puppet] - 10https://gerrit.wikimedia.org/r/1226505 (owner: 10Marostegui) [06:06:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1226509 (https://phabricator.wikimedia.org/T414542) [06:07:26] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2240 to 
s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1226510 (https://phabricator.wikimedia.org/T414543) [06:07:31] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1226512 (https://phabricator.wikimedia.org/T414543) [06:16:26] (03CR) 10Phuedx: [C:03+1] Add Test Kitchen maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/1226318 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T0700) [07:01:33] (03CR) 10Marostegui: "verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1225136 (owner: 10Giuseppe Lavagetto) [07:01:37] (03CR) 10Marostegui: [C:03+2] admin: add the ssh key for my backup yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1225136 (owner: 10Giuseppe Lavagetto) [07:02:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87490 and previous config saved to /var/cache/conftool/dbconfig/20260114-070230-marostegui.json [07:02:38] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [07:02:38] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [07:07:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:10:23] !log marostegui@cumin1003 START - Cookbook sre.wikireplicas.update-views [07:12:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P87491 and previous config saved to /var/cache/conftool/dbconfig/20260114-071240-marostegui.json [07:12:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:17:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [07:17:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:22:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P87492 and previous config saved to /var/cache/conftool/dbconfig/20260114-072248-marostegui.json [07:22:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:23:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:32:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87493 and 
previous config saved to /var/cache/conftool/dbconfig/20260114-073256-marostegui.json [07:33:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [07:33:02] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [07:33:06] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:33:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2248.codfw.wmnet with reason: Maintenance [07:33:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87494 and previous config saved to /var/cache/conftool/dbconfig/20260114-073321-marostegui.json [07:59:11] marostegui: OK to deploy cxserver? [08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:02:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87495 and previous config saved to /var/cache/conftool/dbconfig/20260114-080242-marostegui.json [08:02:48] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:02:49] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:03:52] kart_: yep! 
[08:12:49] Thanks [08:12:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P87496 and previous config saved to /var/cache/conftool/dbconfig/20260114-081251-marostegui.json [08:12:56] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2026-01-09-231405-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225807 (https://phabricator.wikimedia.org/T414237) (owner: 10KartikMistry) [08:14:44] (03Merged) 10jenkins-bot: Update cxserver to 2026-01-09-231405-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225807 (https://phabricator.wikimedia.org/T414237) (owner: 10KartikMistry) [08:20:34] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:22:32] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:23:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P87497 and previous config saved to /var/cache/conftool/dbconfig/20260114-082259-marostegui.json [08:27:38] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [08:29:49] PROBLEM - Host hcaptcha-proxy7001 is DOWN: CRITICAL - Time to live exceeded (195.200.68.102) [08:29:56] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:30:07] PROBLEM - Host ganeti7002 is DOWN: CRITICAL - Time to live exceeded (10.140.1.12) [08:30:07] PROBLEM - Host ganeti7004 is DOWN: CRITICAL - Time to live exceeded (10.140.1.13) [08:30:07] PROBLEM - Host durum7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.7) [08:30:07] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [08:30:07] PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8) [08:30:27] !log kartik@deploy2002 
helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:30:37] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 138.04 ms [08:30:39] RECOVERY - Host ganeti7004 is UP: PING OK - Packet loss = 0%, RTA = 137.68 ms [08:30:39] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 137.60 ms [08:30:41] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.26 ms [08:30:47] RECOVERY - Host durum7004 is UP: PING OK - Packet loss = 0%, RTA = 138.19 ms [08:30:47] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 137.83 ms [08:30:49] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:31:25] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:33:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87498 and previous config saved to /var/cache/conftool/dbconfig/20260114-083307-marostegui.json [08:33:13] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:33:14] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:33:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1260.eqiad.wmnet with reason: Maintenance [08:33:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87499 and previous config saved to /var/cache/conftool/dbconfig/20260114-083332-marostegui.json [08:39:39] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [08:39:53] PROBLEM - Host hcaptcha-proxy7001 is DOWN: CRITICAL - Time to live exceeded (195.200.68.102) [08:39:57] PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101) [08:40:09] PROBLEM - Host 
durum7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.7) [08:40:09] PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8) [08:40:09] PROBLEM - Host install7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.100) [08:40:11] PROBLEM - Host tcp-proxy7002 is DOWN: CRITICAL - Time to live exceeded (10.140.2.11) [08:40:11] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [08:40:11] PROBLEM - Host asw1-b3-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.130) [08:40:11] PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131) [08:40:19] PROBLEM - Host mr1-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.132) [08:40:25] RECOVERY - Host install7002 is UP: PING OK - Packet loss = 0%, RTA = 138.04 ms [08:40:29] RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 144.38 ms [08:40:29] RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 142.09 ms [08:40:37] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 138.02 ms [08:40:37] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 138.07 ms [08:40:41] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.27 ms [08:40:41] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.20 ms [08:40:45] RECOVERY - Host mr1-magru is UP: PING OK - Packet loss = 0%, RTA = 138.11 ms [08:40:47] RECOVERY - Host durum7004 is UP: PING OK - Packet loss = 0%, RTA = 138.14 ms [08:40:47] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 137.94 ms [08:41:07] RECOVERY - Host tcp-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 140.24 ms [08:42:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure 
[08:43:30] !log Update cxserver to 2026-01-09-231405-production (T414237, T413646, T409998) [08:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:37] T414237: Post-creation work for kaiwiki - https://phabricator.wikimedia.org/T414237 [08:43:37] T413646: Content Translation: cannot select an existing target article; section translation is published to a redirect instead of the main article (target language: Russian). - https://phabricator.wikimedia.org/T413646 [08:43:38] T409998: cxserver: en > qqq pair should not be used for requests - https://phabricator.wikimedia.org/T409998 [08:44:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:46:40] (03CR) 10Joal: "Some changes" [puppet] - 10https://gerrit.wikimedia.org/r/1226270 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [08:49:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:54:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:56:45] (03PS1) 10Brouberol: druid_exporter: Fixup metric definition [puppet] - 10https://gerrit.wikimedia.org/r/1226766 
(https://phabricator.wikimedia.org/T278056) [08:56:57] (03PS2) 10Brouberol: druid_exporter: Fixup metric definition [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) [08:57:04] (03CR) 10CI reject: [V:04-1] druid_exporter: Fixup metric definition [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [08:57:34] (03CR) 10Brouberol: [C:03+2] "Thanks for noticing this @joal@wikimedia.org! I've addressed your comments in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1226766" [puppet] - 10https://gerrit.wikimedia.org/r/1226270 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [08:59:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:03:00] (03PS3) 10Brouberol: druid_exporter: Fixup metric definition [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) [09:03:00] (03PS13) 10Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) [09:03:00] (03CR) 10Daniel Kinzler: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [09:03:18] (03CR) 10CI reject: [V:04-1] druid_exporter: Fixup metric definition [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [09:03:26] (03PS4) 10Brouberol: druid_exporter: Fixup metric definition [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) [09:07:09] (03CR) 10Ayounsi: [C:03+2] Unconditionally use dnsmasq on 
routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1226285 (https://phabricator.wikimedia.org/T396864) (owner: 10Muehlenhoff) [09:08:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:09:35] (03CR) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [09:09:48] (03PS2) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) [09:10:26] (03PS3) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) [09:12:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:12:47] (03CR) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [09:15:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:16:02] (03CR) 
10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [09:17:00] (03PS4) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) [09:17:29] (03CR) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [09:18:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:20:19] (03PS1) 10Dpogorzelski: docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768 [09:21:29] (03CR) 10Clément Goubert: "Adding @ksouckova@wikimedia.org for additional and subsequent reviews" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler) [09:23:58] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11520147 (10tappof) ssh titan1001.eqiad.wmnet -L 16902:localhost:16902 {F71524543} ssh titan1002.eqiad.wmnet -L 16903:localhost:16... 
[09:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:24:19] !log Depool titan1002; disable Puppet and enable debug log level (T411273) [09:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:23] T411273: Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273 [09:29:29] (03CR) 10Ladsgroup: [C:03+1] thumbor: reimplement SVG max size feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226286 (https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan) [09:30:42] (03PS1) 10Clément Goubert: kubernetes: Add dummy secrets for redioscope [labs/private] - 10https://gerrit.wikimedia.org/r/1226771 [09:31:14] (03CR) 10Clément Goubert: [V:03+2 C:03+2] kubernetes: Add dummy secrets for redioscope [labs/private] - 10https://gerrit.wikimedia.org/r/1226771 (owner: 10Clément Goubert) [09:32:11] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11520159 (10tappof) {F71524745} root@titan1002:~# tshark -i lo -f 'tcp port 11211' -Y 'memcache' ` 55911 750.475540100 127.0.0.1... 
[09:34:18] (03CR) 10Elukey: [C:03+1] docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768 (owner: 10Dpogorzelski) [09:34:50] (03PS2) 10Clément Goubert: Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) [09:36:15] (03CR) 10Dpogorzelski: [C:03+2] docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768 (owner: 10Dpogorzelski) [09:36:22] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768 (owner: 10Dpogorzelski) [09:36:41] (03CR) 10CI reject: [V:04-1] Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [09:37:32] (03PS3) 10Clément Goubert: Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) [09:37:33] (03CR) 10JMeybohm: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [09:37:34] (03PS1) 10Elukey: admin: add the analytics-sre uid and gid [puppet] - 10https://gerrit.wikimedia.org/r/1226774 (https://phabricator.wikimedia.org/T402512) [09:37:36] (03PS1) 10Elukey: role::puppetserver: deploy kerberos keytab for analytics-sre [puppet] - 10https://gerrit.wikimedia.org/r/1226775 (https://phabricator.wikimedia.org/T402512) [09:37:39] (03PS1) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) [09:40:03] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) 
(owner: 10Dpogorzelski) [09:40:14] (03CR) 10jenkins-bot: Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [09:40:14] (03CR) 10Elukey: "This requires the creation of the keytabs, and their copy to the private repo. We'll need one keytab for each puppetserver hostname, so it" [puppet] - 10https://gerrit.wikimedia.org/r/1226775 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [09:40:55] (03CR) 10Elukey: "Very high level WIP patch, I don't know the more specific details but we can start with some know locations and see how it goes. Lemme kno" [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [09:42:46] (03PS1) 10Clément Goubert: gateway-check: Document additional query parameter [puppet] - 10https://gerrit.wikimedia.org/r/1226780 (https://phabricator.wikimedia.org/T396807) [09:43:58] !log configure Arelion LAG on cr1-codfw - T401100 [09:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:29] jouncebot: nowandnext [09:44:30] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [09:44:30] In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1100) [09:45:49] (03CR) 10Clément Goubert: [C:03+2] Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [09:46:19] (03CR) 10Elukey: [C:03+1] "LGTM! 
Please try to deploy it during https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1100 (also remember to add th" [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [09:48:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (External: Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:55:50] (03PS1) 10Clément Goubert: site.pp: Add mc1055-72 [puppet] - 10https://gerrit.wikimedia.org/r/1226782 (https://phabricator.wikimedia.org/T412255) [09:56:00] expected ^ that's an interface being set up [09:57:20] (03CR) 10JMeybohm: [C:03+1] gateway-check: Document additional query parameter [puppet] - 10https://gerrit.wikimedia.org/r/1226780 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [10:00:13] (03CR) 10Majavah: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [10:02:59] (03CR) 10Majavah: nftables::service: Improve src/dst filter handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [10:04:25] (03CR) 10Clément Goubert: [C:03+2] gateway-check: Document additional query parameter [puppet] - 10https://gerrit.wikimedia.org/r/1226780 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [10:06:47] (03CR) 10Joal: druid_exporter: Fixup metric definition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [10:18:33] (03PS1) 10Giuseppe Lavagetto: haproxy: actually set "robot" ua_class for identified MW 
requests [puppet] - 10https://gerrit.wikimedia.org/r/1226792 [10:20:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11520422 (10Clement_Goubert) >>! In T408757#11513905, @Jhancock.wm wrote: > @Clement_Goubert the servers landed last week. Gonna start unpacking them tomorrow or wednesday. A... [10:29:02] (03PS1) 10Clément Goubert: wikikube: Add ratelimit-media namespace [puppet] - 10https://gerrit.wikimedia.org/r/1226797 (https://phabricator.wikimedia.org/T414439) [10:31:12] (03PS1) 10Clément Goubert: Add ratelimit-upload namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) [10:31:32] 10ops-eqiad, 06DC-Ops: Power Supply Redundancy alert on es1057 - https://phabricator.wikimedia.org/T414564 (10FCeratto-WMF) 03NEW [10:32:21] (03PS2) 10Clément Goubert: Add ratelimit-upload namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) [10:35:03] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:40:03] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:46:04] (03CR) 10Hnowlan: [C:03+2] thanos: set performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1226311 (owner: 10Hnowlan) [10:52:05] (03CR) 10Btullis: [C:04-1] "I don't believe that this change is required, since wikipedia25.org will not 
be sending any client events from the browser to the event pl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [10:55:44] (03CR) 10Dzahn: [C:03+1] "ok! thanks for looking!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [10:55:55] (03CR) 10Dpogorzelski: [C:03+2] docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [10:55:58] !log dpogorzelski@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet with reason: testing ml changes [10:56:16] (03Abandoned) 10Dzahn: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [10:57:29] (03CR) 10Blake: [C:03+2] datacenter: remove unused EXCLUDED_SERVICES constant. [cookbooks] - 10https://gerrit.wikimedia.org/r/1226211 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1100) [11:03:29] (03Merged) 10jenkins-bot: datacenter: remove unused EXCLUDED_SERVICES constant. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1226211 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [11:09:06] (03PS1) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) [11:10:00] (03CR) 10CI reject: [V:04-1] debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [11:11:04] (03CR) 10Fabfur: [C:03+1] haproxy: actually set "robot" ua_class for identified MW requests [puppet] - 10https://gerrit.wikimedia.org/r/1226792 (owner: 10Giuseppe Lavagetto) [11:11:32] (03PS2) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) [11:12:28] (03CR) 10Gergő Tisza: "@kharlan@wikimedia.org the task mentions `x_is_browser_likely_script` and `x_is_browser_likely_browser` but I don't see those documented a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [11:14:29] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11520607 (10Dzahn) I answered the question on Slack but for the record here: No, thats not the case. Making additional changes to the site re... 
[11:16:02] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11520613 (10Dzahn) a:05Dzahn→03ATitkov [11:20:30] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.remove-downtime for registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet [11:20:33] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet [11:23:22] (03PS1) 10Filippo Giunchedi: pontoon: complete reboot/destroy hosts with FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1226818 [11:23:22] (03PS1) 10Filippo Giunchedi: pontoon: honor command line new-stack name [puppet] - 10https://gerrit.wikimedia.org/r/1226819 [11:25:34] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: honor command line new-stack name [puppet] - 10https://gerrit.wikimedia.org/r/1226819 (owner: 10Filippo Giunchedi) [11:25:38] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: complete reboot/destroy hosts with FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1226818 (owner: 10Filippo Giunchedi) [11:35:37] (03PS11) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [11:37:27] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: actually set "robot" ua_class for identified MW requests [puppet] - 10https://gerrit.wikimedia.org/r/1226792 (owner: 10Giuseppe Lavagetto) [11:45:09] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - haproxy 2.8.18 upgrade (T414318) [11:45:14] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [11:53:17] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - haproxy 2.8.18 
upgrade (T414318) [11:53:21] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [12:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1200) [12:06:26] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520728 (10cmooney) @VRiley-WMF I'll ping you on irc but we want to go ahead and replace the DAC on //d... [12:07:48] (03PS12) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [12:08:04] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520733 (10cmooney) Hmm so I was going to see if there was any difference if I did a trace to the ceph... 
[12:09:11] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:22] (03CR) 10CI reject: [V:04-1] cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [12:10:07] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:11:51] (03PS13) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [12:19:09] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520751 (10cmooney) Also @VRiley-WMF it seems this is actually a 1G RJ45 link. So let's swap the coppe... 
[12:22:16] (03PS14) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [12:22:52] (03PS1) 10Daniel Kinzler: rest gateway: include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) [12:23:15] (03CR) 10Vgutierrez: "varnishtests are now happy against all the configurations introduced in PS11" [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [12:24:21] (03CR) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [12:28:17] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - haproxy 2.8.18 upgrade (T414318) [12:28:20] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [12:37:04] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - haproxy 2.8.18 upgrade (T414318) [12:37:08] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [12:38:35] (03CR) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [12:55:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520980 (10cmooney) Hmm so with the node un-cordoned the loss has not returned either, well one drop at the first hop but it 
seems insignific... [12:56:32] (03CR) 10Kosta Harlan: [C:03+1] debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [13:03:16] (03CR) 10Kosta Harlan: [C:03+1] debug: Add some CDN Backend API headers to Logstash (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [13:05:42] (03PS3) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) [13:05:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [13:10:47] !log elukey@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry1004.eqiad.wmnet with reason: testing [13:13:15] (03PS1) 10JMeybohm: admin/data: Add hfanwmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1226847 (https://phabricator.wikimedia.org/T414492) [13:13:15] 06SRE: Failing docker registry tests - https://phabricator.wikimedia.org/T414576 (10DPogorzelski-WMF) 03NEW [13:14:34] (03CR) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [13:18:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521085 (10cmooney) >>! In T414460#11518808, @CDanis wrote: > FIN_WAIT_1 is //not// supposed to stick around for longer than a minute or two.... 
[13:20:09] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521090 (10JMeybohm) a:05KReid-WMF→03None [13:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:26:40] (03Restored) 10Btullis: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [13:27:23] (03CR) 10Btullis: [C:03+1] "It look like I was wrong. See this comment from Mikhal: https://wikimedia.slack.com/archives/CSV483812/p1768396226801899?thread_ts=1768388" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [13:27:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521107 (10JMeybohm) [13:30:05] jouncebot: nowandnext [13:30:05] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [13:30:05] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1400) [13:30:24] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - haproxy 2.8.18 upgrade (T414318) [13:30:28] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [13:30:38] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and not P{cp5022.*} and A:cp - haproxy 2.8.18 upgrade (T414318) [13:31:50] 06SRE, 
10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521115 (10JMeybohm) a:03thcipriani @thcipriani this needs sign-off from you as the approver for the deployment group [13:31:51] (03PS3) 10Dreamy Jazz: Write new for CheckUser user agent table migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223673 (https://phabricator.wikimedia.org/T361196) [13:32:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223673 (https://phabricator.wikimedia.org/T361196) (owner: 10Dreamy Jazz) [13:32:30] (03PS3) 10Milimetric: trafficserver: Send /ins-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) [13:32:41] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-01-07-132737 to 2026-01-07-163903 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226852 (https://phabricator.wikimedia.org/T413732) [13:33:10] (03Merged) 10jenkins-bot: Write new for CheckUser user agent table migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223673 (https://phabricator.wikimedia.org/T361196) (owner: 10Dreamy Jazz) [13:34:26] (03PS1) 10Cory Massaro: wikifunctions: Upgrade orchestrator from 2026-01-07-132737 to 2026-01-07-163903. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226853 [13:34:37] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1223673|Write new for CheckUser user agent table migration on group0 (T361196)]] [13:34:41] T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196 [13:34:53] (03PS2) 10Cory Massaro: wikifunctions: Upgrade orchestrator from 2026-01-07-132737 to 2026-01-07-163903. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1226853 (https://phabricator.wikimedia.org/T413732) [13:35:12] (03PS1) 10JMeybohm: admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) [13:36:06] (03CR) 10CI reject: [V:04-1] admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [13:36:40] (03CR) 10JMeybohm: [C:04-2] "- SSH key needs off band verification" [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [13:36:50] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1223673|Write new for CheckUser user agent table migration on group0 (T361196)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:37:35] (03PS2) 10JMeybohm: admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) [13:38:30] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2002-dev.codfw.wmnet with OS trixie [13:39:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [13:39:36] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cumin2002:9100) - https://phabricator.wikimedia.org/T413743#11521134 (10tappof) a:03tappof [13:39:42] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: 
nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T413744#11521135 (10tappof) a:03tappof [13:39:46] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745#11521136 (10tappof) a:03tappof [13:39:59] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [13:40:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for HFanWMF - https://phabricator.wikimedia.org/T414492#11521137 (10JMeybohm) [13:41:11] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cumin2002:9100) - https://phabricator.wikimedia.org/T413743#11521138 (10tappof) [13:41:16] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T413744#11521151 (10tappof) [13:41:25] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745#11521153 (10tappof) [13:42:43] 07sre-alert-triage, 10Observability-Alerting: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745#11521155 (10tappof) [13:42:52] 07sre-alert-triage, 10Observability-Alerting: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T413744#11521156 (10tappof) [13:43:00] 07sre-alert-triage, 10Observability-Alerting: Alert in need of triage: 
nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cumin2002:9100) - https://phabricator.wikimedia.org/T413743#11521157 (10tappof) [13:44:02] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223673|Write new for CheckUser user agent table migration on group0 (T361196)]] (duration: 09m 24s) [13:44:06] T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196 [13:44:45] (03PS8) 10Brouberol: druid: inject flags allowing druid to access protected classes in java > 8 [puppet] - 10https://gerrit.wikimedia.org/r/1226844 (https://phabricator.wikimedia.org/T278056) [13:48:33] (03CR) 10Btullis: [C:03+1] druid: inject flags allowing druid to access protected classes in java > 8 [puppet] - 10https://gerrit.wikimedia.org/r/1226844 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [13:54:05] 10SRE-SLO, 10Observability-Alerting, 06SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579 (10tappof) 03NEW [13:55:43] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [13:58:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [13:58:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T413525)', diff saved to https://phabricator.wikimedia.org/P87502 and previous config saved to /var/cache/conftool/dbconfig/20260114-135815-marostegui.json [13:58:19] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:58:33] 10ops-eqiad, 06DC-Ops: Power Supply Redundancy alert on es1057 - https://phabricator.wikimedia.org/T414564#11521231 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable [13:59:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 
13Patch-For-Review: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521239 (10JMeybohm) [13:59:49] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [13:59:51] (03CR) 10JMeybohm: "SSH key has been verified, deployers access is pending sign off from group approver" [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1400). Please do the needful. [14:00:05] JSherman, tgr, and sfaci: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] here [14:00:14] o/ [14:00:19] o/ [14:01:09] JSherman: want to self-service? [14:01:18] happy to! 
[14:01:32] okay, go ahead :) [14:02:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [14:02:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [14:02:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [14:03:15] (03CR) 10Brouberol: [V:03+1 C:03+2] druid: inject flags allowing druid to access protected classes in java > 8 [puppet] - 10https://gerrit.wikimedia.org/r/1226844 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [14:03:23] (03Merged) 10jenkins-bot: InitialiseSettings.php: Add wmgUsePersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [14:03:26] (03Merged) 10jenkins-bot: InitialiseSettings-labs.php: Deploy PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [14:03:30] (03Merged) 10jenkins-bot: CommonSettings-labs: Load PersonalDashbard extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [14:04:01] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1217787|InitialiseSettings.php: Add wmgUsePersonalDashboard (T412528)]], [[gerrit:1217788|InitialiseSettings-labs.php: Deploy PersonalDashboard (T412528)]], [[gerrit:1217789|CommonSettings-labs: Load PersonalDashbard extension (T412528)]] [14:04:05] T412528: Deploy the PersonalDashboard extension to Beta Cluster - 
https://phabricator.wikimedia.org/T412528 [14:06:30] !log jsn@deploy2002 jsn: Backport for [[gerrit:1217787|InitialiseSettings.php: Add wmgUsePersonalDashboard (T412528)]], [[gerrit:1217788|InitialiseSettings-labs.php: Deploy PersonalDashboard (T412528)]], [[gerrit:1217789|CommonSettings-labs: Load PersonalDashbard extension (T412528)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:07:10] !log jsn@deploy2002 jsn: Continuing with sync [14:08:18] (03PS1) 10Giuseppe Lavagetto: Add db user and password for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1226857 [14:08:30] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add db user and password for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1226857 (owner: 10Giuseppe Lavagetto) [14:09:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521279 (10cmooney) >>! In T414460#11521085, @cmooney wrote: > however surely it should try to resend the FIN, and if this state persists eve... [14:10:14] (03PS1) 10Giuseppe Lavagetto: hiddenparma: add configuration for the database [puppet] - 10https://gerrit.wikimedia.org/r/1226858 [14:11:20] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217787|InitialiseSettings.php: Add wmgUsePersonalDashboard (T412528)]], [[gerrit:1217788|InitialiseSettings-labs.php: Deploy PersonalDashboard (T412528)]], [[gerrit:1217789|CommonSettings-labs: Load PersonalDashbard extension (T412528)]] (duration: 07m 19s) [14:11:24] T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528 [14:11:26] Lucas_WMDE: back to you [14:12:39] ok! tgr_ would be up next [14:12:45] (03CR) 10Fabfur: [C:03+1] "Overall looks ok to me, given the work already done on cache::text, the question of actual ratelimits values is well placed. 
IMHO we shoul" [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [14:16:48] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7892/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226858 (owner: 10Giuseppe Lavagetto) [14:17:39] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2002-dev.codfw.wmnet with OS trixie [14:18:05] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] hiddenparma: add configuration for the database [puppet] - 10https://gerrit.wikimedia.org/r/1226858 (owner: 10Giuseppe Lavagetto) [14:19:53] or sfaci, if tgr_ isn’t around at the moment [14:19:57] sfaci: want to self-service? [14:21:23] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and not P{cp5022.*} and A:cp - haproxy 2.8.18 upgrade (T414318) [14:21:27] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [14:21:50] Lucas_WMDE: I can't. I need someone to deploy for me [14:22:15] I can self-deploy quickly if you haven't started yet [14:22:29] (sorry, at an offsite so a bit unresponsive) [14:22:35] tgr_: go ahead [14:22:38] and then I can deploy for sfaci [14:22:46] (but I thought you had deployment access based on puppet, sorry) [14:22:51] It's ok tgr_ , I can wait! [14:22:55] Thanks Lucas_WMDE ! 
[14:24:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [14:24:44] thx [14:24:52] (03PS1) 10Giuseppe Lavagetto: Move status, commit status/history to database [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1226860 [14:25:12] (03Merged) 10jenkins-bot: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [14:25:43] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1226815|debug: Add some CDN Backend API headers to Logstash (T412396)]] [14:25:47] T412396: Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396 [14:25:55] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - haproxy 2.8.18 upgrade (T414318) [14:26:59] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318) [14:27:03] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [14:27:07] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318) [14:27:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11521326 (10bking) 05Resolved→03In progress a:05Jclark-ctr→03bking [14:28:13] !log tgr@deploy2002 tgr: Backport for [[gerrit:1226815|debug: Add some CDN Backend API headers to Logstash (T412396)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:30:36] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Move status, commit status/history to database [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1226860 (owner: 10Giuseppe Lavagetto) [14:30:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11521339 (10bking) Some more lines from dmesg: ` [ 1172.174064] sd 0:2:2:0: SCSI device is removed [ 1172.273429] megaraid_sa... [14:31:10] (03PS2) 10A smart kitten: CommonSettings-labs: Remove redundant code for loading/configuring Phonos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225075 [14:31:10] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "State, commit to database - oblivian@cumin1003" [14:31:13] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: State, commit to database - oblivian@cumin1003 [14:32:02] !log tgr@deploy2002 tgr: Continuing with sync [14:32:12] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: State, commit to database - oblivian@cumin1003 [14:32:13] (03CR) 10A smart kitten: "PS2 is a rebase to resolve a merge conflict from 1f1d2ae36f52" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225075 (owner: 10A smart kitten) [14:32:13] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "State, commit to database - oblivian@cumin1003" [14:32:29] 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - 
https://phabricator.wikimedia.org/T412396#11521361 (10Vgutierrez) the headers described on https://wikitech.wikimedia.org/wiki/CDN/Backe... [14:33:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521367 (10CDanis) >>! In T414460#11521085, @cmooney wrote: > The k8s host sent a FIN to the remote side but due to the packet-loss issue the... [14:36:04] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226815|debug: Add some CDN Backend API headers to Logstash (T412396)]] (duration: 10m 21s) [14:36:09] T412396: Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396 [14:37:12] Lucas_WMDE: thanks, back to you [14:37:29] thanks! [14:37:51] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:38:03] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Present in all deployed branches, so I think this is good to go:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [14:38:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [14:38:32] sfaci: fyi ^ [14:38:56] Lucas_WMDE: cool! 
[14:39:21] (03Merged) 10jenkins-bot: Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [14:39:51] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1225005|Deploy TestKitchen to testwiki (T407806)]] [14:39:55] T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806 [14:42:06] !log lucaswerkmeister-wmde@deploy2002 cjming, lucaswerkmeister-wmde: Backport for [[gerrit:1225005|Deploy TestKitchen to testwiki (T407806)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:42:15] sfaci: please test the change on mwdebug :) [14:43:38] ok [14:44:09] (03CR) 10Dpogorzelski: [C:03+2] Add vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [14:44:21] Lucas_WMDE: Checked! The extension is loaded and working [14:45:17] (03PS1) 10Jsn.sherman: Deploy PersonalDashboard to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226862 (https://phabricator.wikimedia.org/T403982) [14:45:30] alright, thanks! 
[14:45:37] !log lucaswerkmeister-wmde@deploy2002 cjming, lucaswerkmeister-wmde: Continuing with sync [14:49:40] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225005|Deploy TestKitchen to testwiki (T407806)]] (duration: 09m 49s) [14:49:41] (03CR) 10Elukey: [C:04-1] "This is still not under the /ml prefix/namespace :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [14:49:44] T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806 [14:49:58] !log UTC afternoon backport+config window done [14:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:23] Lucas_WMDE: Thank you very much! [14:55:18] (03PS1) 10Giuseppe Lavagetto: Revert "Move status, commit status/history to database" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1226867 [15:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1500) [15:03:48] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:04:08] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:04:28] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:04:58] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:05:05] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:05:41] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:06:52] (03PS1) 10Giuseppe Lavagetto: hiddenparma: use sqlite for now [puppet] - 10https://gerrit.wikimedia.org/r/1226869 [15:07:02] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] hiddenparma: use sqlite for now [puppet] - 10https://gerrit.wikimedia.org/r/1226869 
(owner: 10Giuseppe Lavagetto) [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:58] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11521553 (10elukey) It happens with all the image push tests, with Docker on bullseye and bookworm (tried both build nodes). I dumped the registry's goroutines when... [15:13:50] (03CR) 10Bking: [C:03+2] airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [15:15:39] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318) [15:15:44] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [15:17:08] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - haproxy 2.8.18 upgrade (T414318) [15:19:08] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:19:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:20:42] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318) [15:20:46] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [15:24:07] (03PS2) 10Clément Goubert: api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 
(https://phabricator.wikimedia.org/T414333) [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1500) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1530) [15:30:07] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:30:48] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp - haproxy 2.8.18 upgrade (T414318) [15:30:52] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [15:32:39] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:33:32] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87504 and previous config saved to /var/cache/conftool/dbconfig/20260114-153811-marostegui.json [15:38:18] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:38:18] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:39:26] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry 
to apus - https://phabricator.wikimedia.org/T394476#11521716 (10elukey) Meanwhile this is the stacktrace for dockerd and the relevant goroutine: ` goroutine 148 [select, 3 minutes]: net/http.(*persistConn).roundTrip... [15:40:03] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Fri 30 Jan 2026 03:40:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [15:43:27] !log cdobbins@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: switch work, T408510] [15:43:31] T408510: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510 [15:43:36] !log cdobbins@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: switch work, T408510] [15:44:39] (03PS15) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [15:45:10] (03CR) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [15:45:12] (03PS1) 10Majavah: aptrepo: Drop packages for Kubeadm/1.30 [puppet] - 10https://gerrit.wikimedia.org/r/1226878 (https://phabricator.wikimedia.org/T372697) [15:45:14] (03PS1) 10Majavah: aptrepo: Import packages for Kubeadm/1.32 [puppet] - 10https://gerrit.wikimedia.org/r/1226879 (https://phabricator.wikimedia.org/T379047) [15:46:51] andrew@cumin2002 reimage (PID 2640472) is awaiting input [15:48:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P87505 and previous config saved to /var/cache/conftool/dbconfig/20260114-154820-marostegui.json [15:49:00] !log drain eqsin-ulsfo transport [15:49:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:13] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2003-dev.codfw.wmnet with OS trixie [15:54:15] (03CR) 10Papaul: [C:03+2] Comment out temporarily the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1216677 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [15:54:55] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:54:55] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:56:06] jouncebot: nowandnext [15:56:07] For the next 0 hour(s) and 3 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1500) [15:56:07] For the next 0 hour(s) and 3 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1530) [15:56:07] In 2 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1800) [15:57:08] We want to do a deploy to fix a security issue [15:57:12] Any objection? 
[15:57:59] Dreamy_Jazz: should be fine, but please check with oncallers, _joe_ fabfur ^ [15:58:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P87506 and previous config saved to /var/cache/conftool/dbconfig/20260114-155828-marostegui.json [15:58:36] We are intending to use scap to deploy it as it's only on wmf branches and minor enough to fix publicly [15:58:56] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:58:56] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:59:01] 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521815 (10Tgr) a:03Tgr [15:59:08] Dreamy_Jazz: go ahead please but note that grafana is down [15:59:19] in case you need that, for whatever reason [15:59:56] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:00:17] That should be fine [16:00:24] We just need excimer to check that things are not slow [16:00:25] 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521825 (10Tgr) We should also update some of the dashboards (at least the login one) with so... 
[16:00:56] Dreamy_Jazz: ok +1 from me, I am not on-call, but since the people who are on-call are busy with meetings, please go ahead and I can help if things go south [16:00:57] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521827 (10cmooney) The SFP module in port 14 of lsw1-c5-eqiad has been swapped out now. So we can observe over the next... [16:01:48] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:52] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sat 31 Jan 2026 10:43:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:01:52] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sat 31 Jan 2026 10:43:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:02:54] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 6.732 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:02:54] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 6.697 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:02:56] 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521866 (10Tgr) >>! In T412396#11521361, @Vgutierrez wrote: > the headers described on https:... 
[16:04:39] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - haproxy 2.8.18 upgrade (T414318) [16:04:44] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [16:05:00] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 8 hosts with reason: loopback IPV4 change on ulsfo core router [16:05:20] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11521900 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf1deaa2-45c3-45e8-bdad-1303b0075f87) set by pt1979@cumin2002 for 2:00:00 on 8 h... [16:06:28] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [16:07:20] !log ongoing loopback ip's change on cr3/cr4-ulsfo [16:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87507 and previous config saved to /var/cache/conftool/dbconfig/20260114-160836-marostegui.json [16:08:43] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:08:44] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:09:56] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [16:14:57] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.update-views [16:15:39] (03CR) 10RLazarus: [C:03+1] admin/data: Add hfanwmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1226847 (https://phabricator.wikimedia.org/T414492) (owner: 10JMeybohm) 
[16:16:15] 10SRE-swift-storage, 10Ceph, 07Epic, 07Kubernetes, and 2 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11522011 (10JMeybohm) p:05Triage→03High [16:17:29] (03CR) 10JMeybohm: [C:03+2] admin/data: Add hfanwmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1226847 (https://phabricator.wikimedia.org/T414492) (owner: 10JMeybohm) [16:18:04] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp - haproxy 2.8.18 upgrade (T414318) [16:18:08] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [16:20:15] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for HFanWMF - https://phabricator.wikimedia.org/T414492#11522046 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I have added you to the `analytics-privatedata-users` group. If that does not grand you the requ... [16:21:20] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [16:27:39] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS trixie [16:28:10] (03PS1) 10Dreamy Jazz: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1226896 (https://phabricator.wikimedia.org/T414582) [16:28:18] (03PS1) 10Dreamy Jazz: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226897 (https://phabricator.wikimedia.org/T414582) [16:29:02] o/ Dreamy_Jazz asked earlier but I'll be doing the backport he mentioned now if that's alright? 
[16:29:41] Tran: there are no issues on our end, except that ulsfo is depooled, so yes, from SRE's side you can go ahead if you want [16:31:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226897 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz) [16:31:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1226896 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz) [16:34:36] (03Merged) 10jenkins-bot: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226897 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz) [16:35:17] (03Merged) 10jenkins-bot: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1226896 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz) [16:35:52] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1226897|Only validate IRS configs on writes; skip validations for reads (T414582)]], [[gerrit:1226896|Only validate IRS configs on writes; skip validations for reads (T414582)]] [16:35:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522109 (10cmooney) Ok currently seeing no loss (though that was the case when we were cordoned before the swap). ` cmoon... 
[16:36:11] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11522110 (10elukey) As the last times, I went afk for ~1h, got back and retried the same docker push that completed immediately without hanging. [16:38:01] !log stran@deploy2002 dreamyjazz, stran: Backport for [[gerrit:1226897|Only validate IRS configs on writes; skip validations for reads (T414582)]], [[gerrit:1226896|Only validate IRS configs on writes; skip validations for reads (T414582)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:38:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:38:53] testing my patch now [16:40:43] looks good, moving forward [16:40:46] !log stran@deploy2002 dreamyjazz, stran: Continuing with sync [16:42:39] FIRING: CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr1-codfw:9804&var-bgp_group=Confed_ulsfo&var-bgp_neighbor=cr4-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:42:44] !log restarting grafana1002 for memory increase T414604 [16:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:48] T414604: Increase Grafana VM memory - https://phabricator.wikimedia.org/T414604 [16:44:24] !log herron@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM grafana1002.eqiad.wmnet [16:44:51] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226897|Only validate IRS configs on 
writes; skip validations for reads (T414582)]], [[gerrit:1226896|Only validate IRS configs on writes; skip validations for reads (T414582)]] (duration: 08m 59s) [16:45:41] done, thanks! [16:45:51] (03CR) 10Wfan: [C:03+1] Revert "Shorten 'close' cookie wait period for enwiki banners" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [16:47:39] FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:48:50] !log brouberol@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1013.eqiad.wmnet [16:49:13] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522186 (10ops-monitoring-bot) Host dse-k8s-worker1013.eqiad.wmnet rebooted by brouberol@cumin1003 with reason: Getting a... [16:49:19] !log herron@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM grafana1002.eqiad.wmnet [16:50:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522190 (10VRiley-WMF) Happy to help with this. Let us know if there is anything else we can help with. 
[16:50:46] (03PS1) 10Dzahn: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226902 (https://phabricator.wikimedia.org/T408592) [16:50:53] !log herron@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM grafana2001.codfw.wmnet [16:51:40] (03PS1) 10Gergő Tisza: debug: Add X-Provenance header to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) [16:51:49] (03CR) 10CI reject: [V:04-1] debug: Add X-Provenance header to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [16:52:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:53:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [16:54:52] !log herron@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM grafana2001.codfw.wmnet [16:55:08] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet [16:57:26] (03CR) 10Dzahn: [C:03+2] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226902 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:57:39] FIRING: [10x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:59:38] (03Merged) 
10jenkins-bot: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226902 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:59:39] (03PS1) 10Ssingh: wikimedia/wikipedia.org: match TTLs for NS and glue records [dns] - 10https://gerrit.wikimedia.org/r/1226904 (https://phabricator.wikimedia.org/T81605) [17:02:05] (03CR) 10Btullis: [C:03+2] eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:02:39] FIRING: [15x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:03:57] FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:04:00] (03Merged) 10jenkins-bot: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:04:19] <_joe_> is this ulsfo? [17:04:33] yes, weird, we should have downtimed I guess and I think we did? [17:04:37] <_joe_> yes [17:04:39] this is ulsfo [17:04:40] <_joe_> ok np [17:04:42] !incidents [17:04:43] 7336 (UNACKED) [7x] ProbeDown sre (probes/service ulsfo) [17:04:43] looks like ulsfo yeah [17:04:45] !ack 7336 [17:04:46] 7336 (ACKED) [7x] ProbeDown sre (probes/service ulsfo) [17:04:46] <_joe_> !ack [17:04:47] no value provided for parameter incident and no default available [17:04:47] All incidents are already acked. 
[17:04:50] nothing to worry [17:04:52] still depooled [17:04:56] ack [17:04:57] <_joe_> sukhe: use "ack" without args [17:05:01] <_joe_> thanks rzl [17:05:02] ah yes thanks [17:07:14] (03CR) 10Papaul: [C:03+2] Change cr3/4-ulsfo loopback ip's in puppet before tomorrow's maintenance window [puppet] - 10https://gerrit.wikimedia.org/r/1216679 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [17:07:39] FIRING: [15x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:09:24] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11522264 (10tappof) The debug log level does not provide information about cache usage. [17:10:35] (03PS1) 10Superpes15: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) [17:12:03] !log dzahn@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:12:27] !log dzahn@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:12:45] (03PS6) 10Kevin Bazira: Add vLLM image in ML namespace [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) [17:13:16] !log dzahn@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:13:35] !log dzahn@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:13:39] !log pool titan1002 [17:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:01] !log pool titan1002 (T411273) [17:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:05] T411273: Thanos (store|query-frontend) memcached cache in bad status 
- https://phabricator.wikimedia.org/T411273 [17:14:12] !log dzahn@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:14:34] !log dzahn@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:15:30] FIRING: LibericaStaleConfig: Liberica instance lvs4009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [17:17:48] ^ well, it's depooled so I am not worried but will look after meeting [17:18:02] probably because puppet hasn't run in a while [17:18:20] (03PS1) 10Andrew Bogott: cloudbackup: move postgres data to /var/lib for all eqiad1 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1226909 [17:19:14] RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:19:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226909 (owner: 10Andrew Bogott) [17:20:16] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1148.eqiad.wmnet [17:20:30] FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [17:20:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11522297 (10ops-monitoring-bot) Host an-worker1148.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting after fixi... 
[17:21:30] (03PS1) 10Papaul: comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892) [17:22:32] (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: move postgres data to /var/lib for all eqiad1 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1226909 (owner: 10Andrew Bogott) [17:23:31] (03CR) 10Dzahn: [C:03+1] "thank you for this and reaching out to the team!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:24:24] (03PS14) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [17:24:28] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2004.codfw.wmnet with OS trixie [17:24:57] (03PS2) 10Dzahn: Revert "trafficserver: disable wikipedia25" [puppet] - 10https://gerrit.wikimedia.org/r/1224959 (https://phabricator.wikimedia.org/T408592) [17:25:02] (03CR) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [17:26:34] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4010*} and A:liberica (T408510) [17:26:39] T408510: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510 [17:26:53] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4010*} and A:liberica (T408510) [17:27:22] !log 
btullis@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [17:27:30] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:27:56] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [17:28:02] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [17:28:56] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4009*} and A:liberica (T408510) [17:29:00] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [17:29:04] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [17:29:14] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4009*} and A:liberica (T408510) [17:29:35] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4008*} and A:liberica (T408510) [17:29:47] (03CR) 10Ayounsi: [C:03+1] comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [17:29:54] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4008*} and A:liberica (T408510) [17:30:30] FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [17:30:37] that is weird for sure [17:31:02] (03CR) 10Cathal Mooney: [C:03+1] comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [17:31:03] na, that's an old alert, it cleared up [17:31:04] (03CR) 10CI reject: [V:04-1] sre.mysql.newpool: [de]pool various 
section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [17:31:05] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522388 (10akosiaris) [17:31:18] (03CR) 10Papaul: [C:03+2] comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [17:33:27] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1148.eqiad.wmnet [17:33:50] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1148.eqiad.wmnet [17:34:03] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1148.eqiad.wmnet [17:34:11] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1148.eqiad.wmnet [17:35:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522415 (10RobH) [17:35:30] RESOLVED: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [17:35:53] (03PS1) 10Elukey: profile::docker_registry: tune the s3 config for /restricted [puppet] - 10https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476) [17:36:06] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [17:36:13] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some 
dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522428 (10cmooney) Thanks @VRiley. Happy to say we aren't seeing any loss as of yet after the node was uncordoned: ` cm... [17:38:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522433 (10Dwisehaupt) [17:40:02] (03PS1) 10Superpes15: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) [17:40:29] PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100% [17:41:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11522439 (10BTullis) We went through the process of: * Deleting a foreign config for VD 02 * Deleting the preserved cache for... [17:43:12] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2004.codfw.wmnet with reason: host reimage [17:43:24] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:43:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522441 (10VRiley-WMF) For the scs console server, I believe it would be the one located in F8, is that correct? 
[17:49:54] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup2004.codfw.wmnet with reason: host reimage [17:50:51] pt1979@cumin2002 netbox (PID 2702015) is awaiting input [17:52:30] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [17:52:52] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002" [17:53:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002" [17:53:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:55:17] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1800) [18:01:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:04:30] (03PS2) 10Superpes15: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) [18:04:39] (03PS2) 10Superpes15: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) [18:05:25] !log cdobbins@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: switch work, T408510] [18:05:29] T408510: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510 [18:05:37] !log cdobbins@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: switch work, T408510] [18:06:30] (03PS1) 10Superpes15: [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267) [18:07:40] 
pt1979@cumin2002 netbox (PID 2712631) is awaiting input [18:08:12] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:08:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522495 (10cmooney) >>! In T403035#11522441, @VRiley-WMF wrote: > For the scs console server, I believe it would be the one located in F8, is that corr... [18:10:26] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.86 ms [18:14:07] <_joe_> !incidents [18:14:07] 7336 (ACKED) [7x] ProbeDown sre (probes/service ulsfo) [18:14:17] <_joe_> uhm why are services still down though [18:14:43] they should not be, I don't see any pending alerts on alertmanager, though not sure why resolves haven't come in [18:14:46] checking [18:15:10] probes look OK as well [18:15:12] <_joe_> yes [18:15:15] <_joe_> !resolve [18:15:16] 7336 (RESOLVED) [7x] ProbeDown sre (probes/service ulsfo) [18:15:21] <_joe_> rzl: <3 [18:16:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002" [18:16:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002" [18:16:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:21:55] !log cmooney@dns2005 START - running authdns-update [18:23:28] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:25:36] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.82 ms [18:28:03] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522551 (10ssingh) @cmooney: Any picks for your favourite v6 address for `ns1`? 
I was thinking of allocating `2620:0:860:ed1a::4/128` under LVS service IPs `2620:0:860:ed1a::/64`, since unfortuna... [18:29:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T413525)', diff saved to https://phabricator.wikimedia.org/P87508 and previous config saved to /var/cache/conftool/dbconfig/20260114-182942-marostegui.json [18:29:50] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:29:51] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:31:59] (03PS4) 10Bking: WIP: Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - 10https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) [18:33:04] (03PS5) 10Bking: Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - 10https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) [18:33:15] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:33:34] 10SRE-Access-Requests: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619 (10dduvall) 03NEW [18:33:39] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:35:06] (03PS1) 10Dduvall: admin: Add new yubikey-ssh-fido keys for dduvall [puppet] - 10https://gerrit.wikimedia.org/r/1226922 (https://phabricator.wikimedia.org/T414619) [18:36:02] the page https://wikipedia25.org/ shows "Domain not configured". this is bad because it's already linked in some live banners. does anyone here know anything about it? 
[18:36:09] (this was reported at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#https://wikipedia25.org/_banner ) [18:36:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87509 and previous config saved to /var/cache/conftool/dbconfig/20260114-183621-marostegui.json [18:36:30] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [18:36:31] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [18:38:01] mutante: ^ in case you're aware of what's happening re wikipedia25.org [18:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:38:06] xref T408592 [18:38:07] T408592: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592 [18:38:13] per T408592 it's not supposed to be up until tomorrow morning [18:38:40] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:39:20] cmooney@cumin1003 netbox (PID 1520061) is awaiting input [18:39:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P87511 and previous config saved to /var/cache/conftool/dbconfig/20260114-183951-marostegui.json [18:40:19] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1148.eqiad.wmnet [18:40:20] VPT says banners were accidentally up too early and have been fixed [18:41:26] (03PS1) 10Cathal Mooney: Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 
[dns] - 10https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) [18:41:34] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:41:41] ah. thanks for responding anyway :) [18:42:12] (03CR) 10CI reject: [V:04-1] Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - 10https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [18:42:32] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002" [18:42:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002" [18:42:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:36] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:46:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P87512 and previous config saved to /var/cache/conftool/dbconfig/20260114-184630-marostegui.json [18:47:13] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:48:38] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:48:49] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:49:25] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:50:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P87513 and previous config saved to /var/cache/conftool/dbconfig/20260114-185001-marostegui.json [18:51:28] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:52:04] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:52:22] me^ [18:52:40] 
RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.75 ms [18:53:03] (03PS1) 10Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) [18:55:03] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003" [18:55:29] (03PS2) 10Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) [18:55:45] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003" [18:55:45] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:56:29] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7894/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [18:56:38] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:56:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P87514 and previous config saved to /var/cache/conftool/dbconfig/20260114-185638-marostegui.json [18:56:56] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:57:06] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.72 ms [18:57:14] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:59:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup2004.codfw.wmnet with OS trixie [19:00:05] jeena and dduvall: How many deployers does it take to do MediaWiki train - Utc-7 Version 
deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1900). [19:00:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T413525)', diff saved to https://phabricator.wikimedia.org/P87515 and previous config saved to /var/cache/conftool/dbconfig/20260114-190008-marostegui.json [19:00:10] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [19:00:14] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:00:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [19:00:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T413525)', diff saved to https://phabricator.wikimedia.org/P87516 and previous config saved to /var/cache/conftool/dbconfig/20260114-190033-marostegui.json [19:00:53] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003" [19:00:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003" [19:00:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:01:45] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-ulsfo,mr1-ulsfo IPv6 with reason: loopback IPV4 change on ulsfo core router [19:02:02] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11522682 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8cc58471-31d6-4e79-ae14-124cd9a6b684) set by pt1979@cumin2002 for 1:00:00 on 2 h... 
[19:03:04] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226930 (https://phabricator.wikimedia.org/T413802) [19:03:07] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226930 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [19:04:00] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226930 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [19:04:00] (03PS2) 10Cathal Mooney: Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - 10https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) [19:04:02] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:05:01] (03PS3) 10Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) [19:06:07] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:06:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87517 and previous config saved to /var/cache/conftool/dbconfig/20260114-190647-marostegui.json [19:06:55] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:06:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:07:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1261.eqiad.wmnet with reason: Maintenance [19:07:12] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Depooling db1261 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87518 and previous config saved to /var/cache/conftool/dbconfig/20260114-190711-marostegui.json [19:08:25] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003" [19:08:27] There seem to be a lot of db connection errors https://logstash.wikimedia.org/goto/5c16b9ac7ad6b93093cdbe4984eb0f38 [19:08:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003" [19:08:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:08:33] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:10:17] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.11 refs T413802 [19:10:22] T413802: 1.46.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T413802 [19:10:23] (03CR) 10Ssingh: [C:03+1] Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - 10https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [19:11:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:11:34] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - 10https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [19:11:49] !log cmooney@dns2005 START - running authdns-update [19:12:17] (03PS8) 10CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) [19:12:17] (03PS14) 10CDanis: gerrit services: lvs_setup! but only in magru. 
[puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) [19:12:17] (03PS7) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) [19:12:17] (03PS4) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [19:12:18] (03PS1) 10CDanis: cache_text: add gerrit-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) [19:12:35] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:12:43] !log cmooney@dns2005 END - running authdns-update [19:14:25] (03CR) 10Ssingh: [V:03+1] "I was split on modifying authdns_addrs but it's pretty clear we have to do that as well since it's tightly couple on our assumption of val" [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:14:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS trixie [19:16:27] (03PS4) 10Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) [19:17:31] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7896/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:20:18] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522747 (10taavi) 05Stalled→03Open [19:20:36] PROBLEM - Host cp7016 is DOWN: CRITICAL - Time to live exceeded (10.140.1.11) [19:20:54] PROBLEM - Host cp7003 is DOWN: CRITICAL - Time to live exceeded (10.140.0.4) 
[19:20:54] PROBLEM - Host cp7005 is DOWN: CRITICAL - Time to live exceeded (10.140.0.5) [19:20:55] PROBLEM - Host cp7007 is DOWN: CRITICAL - Time to live exceeded (10.140.0.6) [19:20:55] PROBLEM - Host cp7009 is DOWN: CRITICAL - Time to live exceeded (10.140.0.7) [19:21:10] RECOVERY - Host cp7003 is UP: PING OK - Packet loss = 0%, RTA = 137.51 ms [19:21:10] RECOVERY - Host cp7005 is UP: PING OK - Packet loss = 0%, RTA = 137.45 ms [19:21:10] RECOVERY - Host cp7016 is UP: PING OK - Packet loss = 0%, RTA = 137.52 ms [19:21:10] RECOVERY - Host cp7009 is UP: PING OK - Packet loss = 0%, RTA = 137.38 ms [19:21:12] RECOVERY - Host cp7007 is UP: PING OK - Packet loss = 0%, RTA = 137.56 ms [19:21:17] come on [19:21:30] PROBLEM - SSH on cp7016 is CRITICAL: connect to address 10.140.1.11 and port 22: No route to host https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:22:27] not me :) [19:22:38] RECOVERY - SSH on cp7016 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:22:58] cdanis: yeah this is T414473 [19:22:59] T414473: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473 [19:23:19] sukhe: I suspect we're winding up with a temporary routing loop [19:23:20] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [19:23:24] PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8) [19:23:29] cdanis: yep [19:23:47] something something OSPF something BGP something confederation [19:23:56] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% [19:23:56] PROBLEM - Host doh7004 is DOWN: PING CRITICAL - Packet loss = 100% [19:24:02] RECOVERY - Host doh7003 is UP: PING WARNING - Packet loss = 33%, RTA = 347.92 ms [19:24:02] RECOVERY - Host doh7004 is UP: PING WARNING - Packet loss = 33%, RTA = 347.74 ms [19:24:05] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.15 ms 
[19:24:08] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 137.90 ms [19:24:20] PROBLEM - Host hcaptcha-proxy7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.103) [19:24:48] RECOVERY - Host hcaptcha-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 138.27 ms [19:24:56] PROBLEM - Recursive DNS on 195.200.68.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:25:50] PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131) [19:25:54] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7005 is CRITICAL: connect to address 10.140.0.5 and port 3128: No route to host https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:00] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:00] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:00] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:00] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:00] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:01] PROBLEM - 
Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:01] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:02] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:02] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:03] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:03] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:04] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:06] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:06] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7016 is CRITICAL: SSL CRITICAL - failed to connect or 
SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/HTTPS [19:26:06] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:06] PROBLEM - HTTPS non-canonical-redirect-19 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:06] PROBLEM - HTTPS non-canonical-redirect-34 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:07] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:07] PROBLEM - HTTPS non-canonical-redirect-38 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:08] PROBLEM - HTTPS non-canonical-redirect-12 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:08] PROBLEM - HTTPS non-canonical-redirect-27 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:09] PROBLEM - HTTPS non-canonical-redirect-33 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:09] PROBLEM - HTTPS non-canonical-redirect-21 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:10] PROBLEM - HTTPS non-canonical-redirect-21 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir 
[19:26:10] PROBLEM - HTTPS non-canonical-redirect-37 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:11] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:11] PROBLEM - HTTPS non-canonical-redirect-15 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:12] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:12] PROBLEM - HTTPS non-canonical-redirect-23 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:13] PROBLEM - HTTPS non-canonical-redirect-37 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:13] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:14] PROBLEM - HTTPS non-canonical-redirect-22 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir [19:26:14] RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 144.10 ms [19:26:52] RECOVERY - HTTPS non-canonical-redirect-12 on ncredir7004 is OK: SSL OK - Certificate wikiedia.org valid until 2026-02-13 13:40:57 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:26:52] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir7003 is OK: SSL OK - Certificate wikipedia.fi valid until 2026-02-22 01:44:33 +0000 (expires in 
38 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:26:52] RECOVERY - HTTPS non-canonical-redirect-27 on ncredir7003 is OK: SSL OK - Certificate wiktionary.ee valid until 2026-03-19 20:17:32 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:26:52] RECOVERY - HTTPS non-canonical-redirect-38 on ncredir7004 is OK: SSL OK - Certificate wikipublications.com valid until 2026-03-10 18:09:14 +0000 (expires in 54 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:26:52] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is OK: HTTP OK: HTTP/1.0 200 OK - 36924 bytes in 0.486 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:57] welp [19:27:00] raising the priority on that one I guess [19:27:54] FIRING: [14x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:28:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:28:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:31:22] (03PS2) 10CDanis: cache_text: add gerrit-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) [19:31:22] (03PS9) 10CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - 10https://gerrit.wikimedia.org/r/1215388 
(https://phabricator.wikimedia.org/T411895) [19:31:22] (03PS15) 10CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) [19:31:23] (03PS8) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) [19:31:24] (03PS5) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [19:31:28] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:32:39] FIRING: [14x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:32:55] looking at that^ [19:33:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:37:39] FIRING: [14x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:40:54] (03PS3) 10CDanis: cache_text: add gerrit-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) [19:40:54] (03PS10) 10CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) [19:40:54] (03PS16) 10CDanis: gerrit services: lvs_setup! but only in magru. 
[puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) [19:40:54] (03PS9) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) [19:40:55] (03PS6) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [19:41:03] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:44:24] 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11522824 (10VRiley-WMF) 05Open→03Resolved Thanks for this. I have unplugged the secondary cable for cloudcephosd1052. I have also went through the cable... [19:55:49] (03CR) 10CDanis: [V:03+1] "PCC is looking good so far, although I imagine we'll do this in exclusively magru first https://puppet-compiler.wmflabs.org/output/1226932" [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:58:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:00:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [20:00:56] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [20:00:58] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [20:02:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down 
between cr3-eqsin and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:04:42] (03CR) 10Ssingh: [C:03+1] cache_text: add gerrit-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [20:05:52] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:21] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mwlog1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:06:54] andrew@cumin2002 reimage (PID 2749329) is awaiting input [20:08:37] 06SRE, 06Traffic, 07HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#11522895 (10Izno) [20:21:17] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS trixie [20:22:16] (03CR) 10CDanis: [V:03+1 C:03+2] cache_text: add gerrit-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [20:22:46] !log 💔cdanis@cumin1003.eqiad.wmnet ~ 🕞🍵 sudo cumin 'A:cp-text' 'disable-puppet "cdanis deploy Ie99c64c48d"' [20:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:22] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS trixie [20:27:28] (03CR) 10Ssingh: [V:03+1 C:04-2] "do not merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [20:29:27] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mwlog1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:30:27] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog1003.eqiad.wmnet with OS bookworm [20:30:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11522910 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm [20:36:46] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1004.eqiad.wmnet with reason: host reimage [20:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:43:50] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1004.eqiad.wmnet with reason: host reimage [20:44:33] jclark@cumin1003 reimage (PID 1533287) is awaiting input [20:49:19] (03PS1) 10CDanis: Revert "cache_text: add gerrit-https to realservers" [puppet] - 10https://gerrit.wikimedia.org/r/1226939 [20:52:09] (03PS5) 10Chlod Alejandro: enwiki: change to Wikipedia 25 logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225127 (https://phabricator.wikimedia.org/T414271) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2100). 
[21:00:05] chlod, ZhaoFJx, ejegg, and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] o/ [21:00:14] o/ [21:00:24] I can deploy [21:01:47] hello [21:02:42] (03CR) 10Zabe: [C:03+2] enwiki: change to Wikipedia 25 logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225127 (https://phabricator.wikimedia.org/T414271) (owner: 10Chlod Alejandro) [21:02:56] (03CR) 10CDanis: [C:03+2] lvs7003: add gerrit-ssh and gerrit-https [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [21:03:19] (03CR) 10Zabe: [C:03+2] zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: 10ZhaoFJx) [21:03:23] (03CR) 10CDanis: [C:03+2] gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [21:03:38] (03Merged) 10jenkins-bot: enwiki: change to Wikipedia 25 logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225127 (https://phabricator.wikimedia.org/T414271) (owner: 10Chlod Alejandro) [21:03:40] (03CR) 10CI reject: [V:04-1] zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: 10ZhaoFJx) [21:04:25] (03PS3) 10Zabe: zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: 10ZhaoFJx) [21:04:29] merge conflict [21:04:31] love it [21:05:21] (03CR) 10Zabe: [C:03+2] zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: 10ZhaoFJx) [21:05:37] it is what it is [21:06:13] (03Merged) 
10jenkins-bot: zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: 10ZhaoFJx) [21:06:56] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225127|enwiki: change to Wikipedia 25 logo (T414271)]], [[gerrit:1225285|zhwiki: Temporary Logo Change for WP25 (T414299)]] [21:07:04] T414271: Requesting temporary logo change for en.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414271 [21:07:05] T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299 [21:09:11] !log zabe@deploy2002 chlod, zhaofjx, zabe: Backport for [[gerrit:1225127|enwiki: change to Wikipedia 25 logo (T414271)]], [[gerrit:1225285|zhwiki: Temporary Logo Change for WP25 (T414299)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:09:30] FIRING: LibericaStaleConfig: Liberica instance lvs7003 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=magru&var-instance=lvs7003 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [21:09:54] !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs7003.magru.wmnet} and A:liberica [21:09:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:09:56] checking now [21:10:12] Checking… [21:10:14] !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs7003.magru.wmnet} and A:liberica [21:10:46] 
zabe working [21:11:00] woo, i see the wp25 logo on enwiki (debug) too :) [21:11:02] Perfectly [21:11:18] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1004.eqiad.wmnet with OS trixie [21:11:43] good on enwiki as well :) [21:11:47] nice! [21:11:50] !log zabe@deploy2002 chlod, zhaofjx, zabe: Continuing with sync [21:14:10] (03CR) 10CDanis: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1215398/5616/" [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [21:14:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs7003 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=magru&var-instance=lvs7003 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [21:14:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:15:58] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225127|enwiki: change to Wikipedia 25 logo (T414271)]], [[gerrit:1225285|zhwiki: Temporary Logo Change for WP25 (T414299)]] (duration: 09m 03s) [21:16:08] T414271: Requesting temporary logo change for en.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414271 [21:16:09] T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299 [21:16:12] ejegg: you want to self-serve? 
[21:16:28] zabe thanks a lot [21:16:32] yw [21:16:41] let me see if I have my credentials in order zabe [21:16:42] all those files need purging I guess [21:17:02] uh why do I see no logo on enwiki atm [21:17:18] (03PS4) 10Zabe: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) (owner: 10Superpes15) [21:17:21] !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs7001.magru.wmnet} and A:liberica [21:17:29] !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs7001.magru.wmnet} and A:liberica [21:18:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11523067 (10HMonroy) @JMeybohm Hi! I'm trying a query wmf.mediawiki_history in superset. I'm getting: `mysql error: SELECT command denied to user 'research'@'1... [21:19:41] hmm, i don't have the ssh config locally to get in to deploy1002 [21:19:43] getting the same as taavi: for some reason it's 404ing? [21:20:07] I think the 404s got cached in the CDN just before they were synced out everywhere [21:20:12] zabe: you purging them already or should I? [21:20:22] (03PS1) 10Andrew Bogott: Revert "cloudbackup: flip all backups from cloudbackup1004 to 1003" [puppet] - 10https://gerrit.wikimedia.org/r/1226942 [21:20:39] I was currently doing it [21:21:17] ejegg: I can sync yours [21:21:24] (03CR) 10Zabe: [C:03+2] Revert "Shorten 'close' cookie wait period for enwiki banners" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [21:21:28] thanks zabe [21:21:33] ejegg: deploy1002 has not been a thing in years? [21:21:53] oh hah, still on the deployments docs page! 
[21:22:10] !log manually purge 17 URLs for enwiki and zhwiki 25 year anniversary logos [21:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:22] which doc? [21:22:26] (03Merged) 10jenkins-bot: Revert "Shorten 'close' cookie wait period for enwiki banners" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [21:22:40] https://wikitech.wikimedia.org/wiki/Backport_windows#Doing_the_deploy [21:22:49] The process: step 4 [21:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:24:12] jeena: regarding train blocker. I can deploy this at 3pm PST (90m from now) unless you want to deploy it now? [21:24:26] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226275|Revert "Shorten 'close' cookie wait period for enwiki banners" (T411800)]] [21:24:28] chlod: zabe: I purged all of the new logo files, and it's fixed at least for me [21:24:29] (03PS1) 10JHathaway: firewall: add cloud services, remove includes [puppet] - 10https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) [21:24:32] T411800: CentralNotice config changes to show a banner to a reader with the 'waitdate: close' status - https://phabricator.wikimedia.org/T411800 [21:24:42] also fixed for me!
:D [21:24:50] thank you both, zabe and taavi [21:25:00] (03PS1) 10Jdlrobson: Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) [21:25:08] (03PS4) 10Zabe: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) (owner: 10Superpes15) [21:25:26] Jdlrobson: if you prefer I can do it after these backports are run [21:26:06] (03PS3) 10Zabe: [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267) (owner: 10Superpes15) [21:26:36] !log zabe@deploy2002 zabe, ejegg: Backport for [[gerrit:1226275|Revert "Shorten 'close' cookie wait period for enwiki banners" (T411800)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:26:37] (03CR) 10CI reject: [V:04-1] firewall: add cloud services, remove includes [puppet] - 10https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [21:27:18] (03CR) 10JHathaway: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [21:27:30] filed T414634 for the outdated docs as they're outdated enough to not be trivially fixable on the spot [21:27:31] T414634: Fix outdated deployment instructions on Wikitech - https://phabricator.wikimedia.org/T414634 [21:27:48] thanks zabe [21:28:52] the setting seems to have the correct value [21:28:58] Nice [21:29:01] !log zabe@deploy2002 zabe, ejegg: Continuing with sync [21:29:44] (and I have successfully logged in to deploy2002.codfw.wmnet so I can try to self-service next time) [21:30:42] (03CR) 10Zabe: [C:03+2] [itwiki] Add a temporary logo for Wikipedia 25 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) (owner: Superpes15) [21:30:43] (CR) Zabe: [C:+2] [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267) (owner: Superpes15) [21:30:44] (CR) Zabe: [C:+2] [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) (owner: Superpes15) [21:30:54] ejegg we are using spiderpig to do backports now (https://spiderpig.wikimedia.org/mediawiki/backport) which uses the scap backport command (what you would use if logged into the deployment server) https://wikitech.wikimedia.org/wiki/Scap#scap_backport [21:31:54] (Merged) jenkins-bot: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) (owner: Superpes15) [21:31:58] (Merged) jenkins-bot: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) (owner: Superpes15) [21:32:02] (Merged) jenkins-bot: [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267) (owner: Superpes15) [21:32:04] whoa, spiderpig is web based? 
[21:32:11] yeah :D [21:32:23] nice [21:32:38] hmm, login fail, lemme check my pw vault [21:33:07] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226275|Revert "Shorten 'close' cookie wait period for enwiki banners" (T411800)]] (duration: 08m 40s) [21:33:12] T411800: CentralNotice config changes to show a banner to a reader with the 'waitdate: close' status - https://phabricator.wikimedia.org/T411800 [21:33:22] oh i see, I'm just not authorized to use SpiderPig [21:33:22] you might need to be in a special LDAP group https://wikitech.wikimedia.org/wiki/Scap/SpiderPig [21:33:38] I think you just need to make a request for it [21:33:45] thanks, I'll do that now! [21:33:52] you're welcome! [21:33:56] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226908|[itwiki] Add a temporary logo for Wikipedia 25 (T414320)]], [[gerrit:1226915|[slwiki] Add a temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226918|[kkwiki] Add a temporary logo for Wikipedia 25 (T414267)]] [21:34:04] T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320 [21:34:04] T414265: Requesting temporary logo change for sl.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414265 [21:34:05] T414267: Requesting temporary logo change for kk.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414267 [21:36:06] !log zabe@deploy2002 zabe, superpes: Backport for [[gerrit:1226908|[itwiki] Add a temporary logo for Wikipedia 25 (T414320)]], [[gerrit:1226915|[slwiki] Add a temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226918|[kkwiki] Add a temporary logo for Wikipedia 25 (T414267)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:36:30] !log zabe@deploy2002 zabe, superpes: Continuing with sync [21:36:43] !log cdanis@cumin1003 conftool action : set/weight=1; selector: service=gerrit [21:36:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:39:25] jeena: if you can do it that would be great. [21:39:33] sure [21:40:32] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226908|[itwiki] Add a temporary logo for Wikipedia 25 (T414320)]], [[gerrit:1226915|[slwiki] Add a temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226918|[kkwiki] Add a temporary logo for Wikipedia 25 (T414267)]] (duration: 06m 36s) [21:40:39] T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320 [21:40:39] T414265: Requesting temporary logo change for sl.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414265 [21:40:40] T414267: Requesting temporary logo change for kk.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414267 [21:41:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb6_29418 has 1 unhealthy realservers pooled on lvs7001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [21:41:49] !log cdanis@cumin1003 conftool action : set/pooled=yes; selector: service=gerrit,dc=magru [21:42:07] (PS1) Zabe: Start writing to il_target_id on non-large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1226948 (https://phabricator.wikimedia.org/T413526) [21:42:29] (PS2) JHathaway: firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) [21:43:41] (CR) Zabe: [C:+2] Start writing to il_target_id on 
non-large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1226948 (https://phabricator.wikimedia.org/T413526) (owner: Zabe) [21:44:30] (Merged) jenkins-bot: Start writing to il_target_id on non-large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1226948 (https://phabricator.wikimedia.org/T413526) (owner: Zabe) [21:44:53] (CR) CI reject: [V:-1] firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway) [21:45:03] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226948|Start writing to il_target_id on non-large wikis (T413526)]] [21:45:07] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [21:47:12] !log zabe@deploy2002 zabe: Backport for [[gerrit:1226948|Start writing to il_target_id on non-large wikis (T413526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:47:33] !log zabe@deploy2002 zabe: Continuing with sync [21:49:25] (PS3) JHathaway: firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) [21:51:30] (PS2) Jforrester: Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson) [21:51:33] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226948|Start writing to il_target_id on non-large wikis (T413526)]] (duration: 06m 30s) [21:51:38] (CR) CI reject: [V:-1] firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway) [21:51:40] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [21:52:02] (CR) Jforrester: [C:+1] Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson) [21:52:12] jeena: feel free to take over [21:53:10] thank you zabe [21:53:50] (PS4) JHathaway: firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) [21:54:50] (CR) TrainBranchBot: [C:+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson) [21:56:14] (Merged) jenkins-bot: Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson) [21:56:30] (CR) CI reject: [V:-1] firewall: add cloud services, remove includes [puppet] - 
https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway) [21:56:47] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1226945|Revert "Do not use deprecated menu" (T414630)]] [21:56:52] T414630: [regression, 1.46.0-wmf.11] ProofreadPage navigation tabs on Page pages are missing in Vector, Monobook, CologneBlue and Modern skins - https://phabricator.wikimedia.org/T414630 [21:58:59] !log jhuneidi@deploy2002 jhuneidi, jdlrobson: Backport for [[gerrit:1226945|Revert "Do not use deprecated menu" (T414630)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2200) [22:01:58] !log jhuneidi@deploy2002 jhuneidi, jdlrobson: Continuing with sync [22:04:36] (CR) Andrew Bogott: [C:+2] Revert "cloudbackup: flip all backups from cloudbackup1004 to 1003" [puppet] - https://gerrit.wikimedia.org/r/1226942 (owner: Andrew Bogott) [22:05:16] (PS5) JHathaway: firewall: add cloud services [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) [22:06:03] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226945|Revert "Do not use deprecated menu" (T414630)]] (duration: 09m 16s) [22:06:09] T414630: [regression, 1.46.0-wmf.11] ProofreadPage navigation tabs on Page pages are missing in Vector, Monobook, CologneBlue and Modern skins - https://phabricator.wikimedia.org/T414630 [22:10:14] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway) [22:10:50] (PS7) CDanis: gerrit/Liberica: expand to drmrs [puppet] - https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [22:10:51] (PS1) CDanis: tcp-proxy: allow lb healthchecks 
[puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895) [22:11:03] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895) (owner: CDanis) [22:12:10] !log Updating development images on contint primary for T412259 [22:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:15] T412259: Update PatchDemo to Node 20 - https://phabricator.wikimedia.org/T412259 [22:13:16] (PS1) Andrew Bogott: Revert "wmcs cinder backups: move all backups to 2003 so 2004 can be reimaged" [puppet] - https://gerrit.wikimedia.org/r/1226952 [22:14:07] (PS2) CDanis: tcp-proxy: allow lb healthchecks [puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895) [22:15:26] (Abandoned) CDanis: Revert "cache_text: add gerrit-https to realservers" [puppet] - https://gerrit.wikimedia.org/r/1226939 (owner: CDanis) [22:15:59] (CR) CDanis: [C:+2] tcp-proxy: allow lb healthchecks [puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895) (owner: CDanis) [22:17:14] (CR) Ryan Kemper: [C:+1] Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) (owner: Bking) [22:17:37] (CR) Bking: [C:+2] Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) (owner: Bking) [22:18:02] (CR) Scott French: [C:+1] "I could easily imagine these defaults (motivated by performance characteristics of actual S3) being entirely inappropriate for our environ" [puppet] - https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [22:22:57] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop 
analytics cluster [22:26:19] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [22:26:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:30:04] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:37:16] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:42:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:52:54] (CR) RLazarus: [C:+2] Add Test Kitchen maintenance script [puppet] - https://gerrit.wikimedia.org/r/1226318 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming) [22:57:54] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2300) [23:01:36] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:02:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:21] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [23:03:22] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [23:03:32] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [23:04:11] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:05:07] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:54] (PS1) Zabe: Removed dropped special page from disabled query pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1226962 (https://phabricator.wikimedia.org/T414202) [23:25:00] !log zabe@deploy2002:~$ mwscript migrateLinksTable.php testwiki --table imagelinks # T413668 [23:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:05] T413668: Run the data migration of imagelinks - 
https://phabricator.wikimedia.org/T413668 [23:25:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T413525)', diff saved to https://phabricator.wikimedia.org/P87520 and previous config saved to /var/cache/conftool/dbconfig/20260114-232541-marostegui.json [23:25:45] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:35:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P87521 and previous config saved to /var/cache/conftool/dbconfig/20260114-233549-marostegui.json [23:45:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P87522 and previous config saved to /var/cache/conftool/dbconfig/20260114-234557-marostegui.json [23:56:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T413525)', diff saved to https://phabricator.wikimedia.org/P87523 and previous config saved to /var/cache/conftool/dbconfig/20260114-235606-marostegui.json [23:56:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:56:11] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:56:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T413525)', diff saved to https://phabricator.wikimedia.org/P87524 and previous config saved to /var/cache/conftool/dbconfig/20260114-235619-marostegui.json [23:58:07] (PS1) Zabe: Start reading from il_target_id on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1226965 (https://phabricator.wikimedia.org/T413669) [23:59:38] jouncebot: nowandnext [23:59:38] For the next 0 hour(s) and 0 minute(s): Web Team deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2300) [23:59:38] In 7 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T0700) [23:59:39] In 7 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T0700) [23:59:49] (CR) Zabe: [C:+2] Start reading from il_target_id on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1226965 (https://phabricator.wikimedia.org/T413669) (owner: Zabe)