[00:00:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11519530 (Jclark-ctr)
[00:01:13] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:01:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519531 (VRiley-WMF)
[00:04:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519533 (VRiley-WMF)
[00:08:29] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:08:57] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1370.eqiad.wmnet with OS trixie
[00:09:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519535 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1370.eqiad.wmnet with OS trixie
[00:15:12] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:15:31] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:15:32] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1372.eqiad.wmnet with OS trixie
[00:15:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11519540 (RKemper) (Working with @bking ) We provisioned a virtual disk for missing drive via the drac web ui. Then we entered the rescue shell and commented out the had...
[00:15:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519541 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1372.eqiad.wmnet with OS trixie completed: - wikikub...
[00:20:03] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage
[00:24:34] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage
[00:29:15] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:38:55] <wikibugs>	 (PS1) Herron: assign mwlog[12]003 insetup role [puppet] - https://gerrit.wikimedia.org/r/1226396 (https://phabricator.wikimedia.org/T412229)
[00:40:16] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1226398
[00:40:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1226398 (owner: TrainBranchBot)
[00:40:34] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:40:54] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[00:40:55] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1370.eqiad.wmnet with OS trixie
[00:41:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519577 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1370.eqiad.wmnet with OS trixie completed: - wikikub...
[00:41:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519578 (VRiley-WMF) Open→Resolved This has been completed
[00:42:52] <wikibugs>	 (CR) Herron: [C:+2] assign mwlog[12]003 insetup role [puppet] - https://gerrit.wikimedia.org/r/1226396 (https://phabricator.wikimedia.org/T412229) (owner: Herron)
[00:45:05] <wikibugs>	 ops-codfw, SRE, DC-Ops, Patch-For-Review, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11519583 (herron) >>! In T412229#11513890, @Jhancock.wm wrote: > @herron two things > - do you mind if i rack this in the new expansion cag...
[00:53:28] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1226398 (owner: TrainBranchBot)
[01:01:05] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:10] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1226406
[01:10:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1226406 (owner: TrainBranchBot)
[01:14:00] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 54s)
[01:24:11] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:34:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1226406 (owner: TrainBranchBot)
[03:49:33] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 5088 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[04:01:57] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98)
[04:01:57] <icinga-wm>	 PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101)
[04:02:39] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 137.85 ms
[04:02:39] <icinga-wm>	 RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 138.07 ms
[04:34:50] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1226178 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[04:58:19] <icinga-wm>	 PROBLEM - Host cp7004 is DOWN: CRITICAL - Time to live exceeded (10.140.1.5)
[04:58:21] <icinga-wm>	 PROBLEM - Host cp7006 is DOWN: CRITICAL - Time to live exceeded (10.140.1.6)
[04:58:21] <icinga-wm>	 PROBLEM - Host cp7016 is DOWN: CRITICAL - Time to live exceeded (10.140.1.11)
[04:58:21] <icinga-wm>	 PROBLEM - Host cp7008 is DOWN: CRITICAL - Time to live exceeded (10.140.1.7)
[04:58:21] <icinga-wm>	 PROBLEM - Host durum7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.7)
[04:58:25] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: CRITICAL - Time to live exceeded (195.200.68.102)
[04:58:25] <icinga-wm>	 PROBLEM - Host cp7002 is DOWN: CRITICAL - Time to live exceeded (10.140.1.4)
[04:58:41] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8)
[04:58:41] <icinga-wm>	 PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3)
[04:58:47] <icinga-wm>	 RECOVERY - Host durum7004 is UP: PING OK - Packet loss = 0%, RTA = 138.15 ms
[04:58:47] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 138.09 ms
[04:58:57] <icinga-wm>	 RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.15 ms
[04:58:57] <icinga-wm>	 RECOVERY - Host cp7004 is UP: PING OK - Packet loss = 0%, RTA = 137.47 ms
[04:58:59] <icinga-wm>	 RECOVERY - Host cp7002 is UP: PING OK - Packet loss = 0%, RTA = 137.74 ms
[04:58:59] <icinga-wm>	 RECOVERY - Host cp7006 is UP: PING OK - Packet loss = 0%, RTA = 137.51 ms
[04:59:05] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.17 ms
[04:59:07] <icinga-wm>	 RECOVERY - Host cp7016 is UP: PING OK - Packet loss = 0%, RTA = 137.64 ms
[04:59:07] <icinga-wm>	 RECOVERY - Host cp7008 is UP: PING OK - Packet loss = 0%, RTA = 137.43 ms
[05:09:11] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:11] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:28:40] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2026-01-09-231405-production [deployment-charts] - https://gerrit.wikimedia.org/r/1225807 (https://phabricator.wikimedia.org/T414237)
[05:34:11] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:40] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1226477 (https://phabricator.wikimedia.org/T128546)
[05:45:14] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226477 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[06:06:01] <wikibugs>	 (PS1) Marostegui: site.pp: Add note about x3 tables [puppet] - https://gerrit.wikimedia.org/r/1226505
[06:06:47] <wikibugs>	 (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1226505 (owner: Marostegui)
[06:06:49] <wikibugs>	 (CR) Marostegui: [C:+2] site.pp: Add note about x3 tables [puppet] - https://gerrit.wikimedia.org/r/1226505 (owner: Marostegui)
[06:06:53] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1226509 (https://phabricator.wikimedia.org/T414542)
[06:07:26] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2240 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1226510 (https://phabricator.wikimedia.org/T414543)
[06:07:31] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1226512 (https://phabricator.wikimedia.org/T414543)
[06:16:26] <wikibugs>	 (CR) Phuedx: [C:+1] Add Test Kitchen maintenance script [puppet] - https://gerrit.wikimedia.org/r/1226318 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T0700)
[07:01:33] <wikibugs>	 (CR) Marostegui: "verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1225136 (owner: Giuseppe Lavagetto)
[07:01:37] <wikibugs>	 (CR) Marostegui: [C:+2] admin: add the ssh key for my backup yubikey [puppet] - https://gerrit.wikimedia.org/r/1225136 (owner: Giuseppe Lavagetto)
[07:02:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87490 and previous config saved to /var/cache/conftool/dbconfig/20260114-070230-marostegui.json
[07:02:38] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[07:02:38] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[07:07:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:10:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.wikireplicas.update-views
[07:12:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P87491 and previous config saved to /var/cache/conftool/dbconfig/20260114-071240-marostegui.json
[07:12:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:17:24] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[07:17:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:22:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P87492 and previous config saved to /var/cache/conftool/dbconfig/20260114-072248-marostegui.json
[07:22:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:23:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:32:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87493 and previous config saved to /var/cache/conftool/dbconfig/20260114-073256-marostegui.json
[07:33:02] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[07:33:02] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[07:33:06] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:33:13] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2248.codfw.wmnet with reason: Maintenance
[07:33:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87494 and previous config saved to /var/cache/conftool/dbconfig/20260114-073321-marostegui.json
[07:59:11] <kart_>	 marostegui: OK to deploy cxserver?
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:02:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87495 and previous config saved to /var/cache/conftool/dbconfig/20260114-080242-marostegui.json
[08:02:48] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:02:49] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:03:52] <marostegui>	 kart_: yep!
[08:12:49] <kart_>	 Thanks
[08:12:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P87496 and previous config saved to /var/cache/conftool/dbconfig/20260114-081251-marostegui.json
[08:12:56] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2026-01-09-231405-production [deployment-charts] - https://gerrit.wikimedia.org/r/1225807 (https://phabricator.wikimedia.org/T414237) (owner: KartikMistry)
[08:14:44] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2026-01-09-231405-production [deployment-charts] - https://gerrit.wikimedia.org/r/1225807 (https://phabricator.wikimedia.org/T414237) (owner: KartikMistry)
[08:20:34] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[08:22:32] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[08:23:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P87497 and previous config saved to /var/cache/conftool/dbconfig/20260114-082259-marostegui.json
[08:27:38] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: Dpogorzelski)
[08:29:49] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: CRITICAL - Time to live exceeded (195.200.68.102)
[08:29:56] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[08:30:07] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: CRITICAL - Time to live exceeded (10.140.1.12)
[08:30:07] <icinga-wm>	 PROBLEM - Host ganeti7004 is DOWN: CRITICAL - Time to live exceeded (10.140.1.13)
[08:30:07] <icinga-wm>	 PROBLEM - Host durum7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.7)
[08:30:07] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98)
[08:30:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8)
[08:30:27] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[08:30:37] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 138.04 ms
[08:30:39] <icinga-wm>	 RECOVERY - Host ganeti7004 is UP: PING OK - Packet loss = 0%, RTA = 137.68 ms
[08:30:39] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 137.60 ms
[08:30:41] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.26 ms
[08:30:47] <icinga-wm>	 RECOVERY - Host durum7004 is UP: PING OK - Packet loss = 0%, RTA = 138.19 ms
[08:30:47] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 137.83 ms
[08:30:49] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[08:31:25] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[08:33:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87498 and previous config saved to /var/cache/conftool/dbconfig/20260114-083307-marostegui.json
[08:33:13] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:33:14] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:33:24] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1260.eqiad.wmnet with reason: Maintenance
[08:33:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87499 and previous config saved to /var/cache/conftool/dbconfig/20260114-083332-marostegui.json
[08:39:39] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98)
[08:39:53] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: CRITICAL - Time to live exceeded (195.200.68.102)
[08:39:57] <icinga-wm>	 PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101)
[08:40:09] <icinga-wm>	 PROBLEM - Host durum7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.7)
[08:40:09] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8)
[08:40:09] <icinga-wm>	 PROBLEM - Host install7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.100)
[08:40:11] <icinga-wm>	 PROBLEM - Host tcp-proxy7002 is DOWN: CRITICAL - Time to live exceeded (10.140.2.11)
[08:40:11] <icinga-wm>	 PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3)
[08:40:11] <icinga-wm>	 PROBLEM - Host asw1-b3-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.130)
[08:40:11] <icinga-wm>	 PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131)
[08:40:19] <icinga-wm>	 PROBLEM - Host mr1-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.132)
[08:40:25] <icinga-wm>	 RECOVERY - Host install7002 is UP: PING OK - Packet loss = 0%, RTA = 138.04 ms
[08:40:29] <icinga-wm>	 RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 144.38 ms
[08:40:29] <icinga-wm>	 RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 142.09 ms
[08:40:37] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 138.02 ms
[08:40:37] <icinga-wm>	 RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 138.07 ms
[08:40:41] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.27 ms
[08:40:41] <icinga-wm>	 RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.20 ms
[08:40:45] <icinga-wm>	 RECOVERY - Host mr1-magru is UP: PING OK - Packet loss = 0%, RTA = 138.11 ms
[08:40:47] <icinga-wm>	 RECOVERY - Host durum7004 is UP: PING OK - Packet loss = 0%, RTA = 138.14 ms
[08:40:47] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 137.94 ms
[08:41:07] <icinga-wm>	 RECOVERY - Host tcp-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 140.24 ms
[08:42:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:43:30] <kart_>	 !log Update cxserver to 2026-01-09-231405-production (T414237, T413646, T409998)
[08:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:37] <stashbot>	 T414237: Post-creation work for kaiwiki - https://phabricator.wikimedia.org/T414237
[08:43:37] <stashbot>	 T413646: Content Translation: cannot select an existing target article; section translation is published to a redirect instead of the main article (target language: Russian). - https://phabricator.wikimedia.org/T413646
[08:43:38] <stashbot>	 T409998: cxserver: en > qqq pair should not be used for requests - https://phabricator.wikimedia.org/T409998
[08:44:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:46:40] <wikibugs>	 (CR) Joal: "Some changes" [puppet] - https://gerrit.wikimedia.org/r/1226270 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol)
[08:49:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:54:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:56:45] <wikibugs>	 (PS1) Brouberol: druid_exporter: Fixup metric definition [puppet] - https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056)
[08:56:57] <wikibugs>	 (PS2) Brouberol: druid_exporter: Fixup metric definition [puppet] - https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056)
[08:57:04] <wikibugs>	 (CR) CI reject: [V:-1] druid_exporter: Fixup metric definition [puppet] - https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol)
[08:57:34] <wikibugs>	 (CR) Brouberol: [C:+2] "Thanks for noticing this @joal@wikimedia.org! I've addressed your comments in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1226766" [puppet] - https://gerrit.wikimedia.org/r/1226270 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol)
[08:59:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:03:00] <wikibugs>	 (PS3) Brouberol: druid_exporter: Fixup metric definition [puppet] - https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056)
[09:03:00] <wikibugs>	 (PS13) Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999)
[09:03:00] <wikibugs>	 (CR) Daniel Kinzler: "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: Daniel Kinzler)
[09:03:18] <wikibugs>	 (CR) CI reject: [V:-1] druid_exporter: Fixup metric definition [puppet] - https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol)
[09:03:26] <wikibugs>	 (PS4) Brouberol: druid_exporter: Fixup metric definition [puppet] - https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056)
[09:07:09] <wikibugs>	 (CR) Ayounsi: [C:+2] Unconditionally use dnsmasq on routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1226285 (https://phabricator.wikimedia.org/T396864) (owner: Muehlenhoff)
[09:08:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:09:35] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: Daniel Kinzler)
[09:09:48] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636)
[09:10:26] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636)
[09:12:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:12:47] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: Daniel Kinzler)
[09:15:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:16:02] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: Daniel Kinzler)
[09:17:00] <wikibugs>	 (PS4) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636)
[09:17:29] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: Daniel Kinzler)
[09:18:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:20:19] <wikibugs>	 (03PS1) 10Dpogorzelski: docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768
[09:21:29] <wikibugs>	 (03CR) 10Clément Goubert: "Adding @ksouckova@wikimedia.org for additional and subsequent reviews" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) (owner: 10Daniel Kinzler)
[09:23:58] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11520147 (10tappof) ssh titan1001.eqiad.wmnet -L 16902:localhost:16902 {F71524543}  ssh titan1002.eqiad.wmnet -L 16903:localhost:16...
[09:24:11] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:24:19] <tappof>	 !log Depool titan1002; disable Puppet and enable debug log level (T411273)
[09:24:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:23] <stashbot>	 T411273: Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273
[09:29:29] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] thumbor: reimplement SVG max size feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226286 (https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan)
[09:30:42] <wikibugs>	 (03PS1) 10Clément Goubert: kubernetes: Add dummy secrets for redioscope [labs/private] - 10https://gerrit.wikimedia.org/r/1226771
[09:31:14] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+2 C:03+2] kubernetes: Add dummy secrets for redioscope [labs/private] - 10https://gerrit.wikimedia.org/r/1226771 (owner: 10Clément Goubert)
[09:32:11] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11520159 (10tappof) {F71524745} root@titan1002:~# tshark -i lo -f 'tcp port 11211' -Y 'memcache' ` 55911 750.475540100    127.0.0.1...
[09:34:18] <wikibugs>	 (03CR) 10Elukey: [C:03+1] docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768 (owner: 10Dpogorzelski)
[09:34:50] <wikibugs>	 (03PS2) 10Clément Goubert: Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807)
[09:36:15] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768 (owner: 10Dpogorzelski)
[09:36:22] <wikibugs>	 (03CR) 10Dpogorzelski: [V:03+2 C:03+2] docker-registry: add ml user pwd [labs/private] - 10https://gerrit.wikimedia.org/r/1226768 (owner: 10Dpogorzelski)
[09:36:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert)
[09:37:32] <wikibugs>	 (03PS3) 10Clément Goubert: Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807)
[09:37:33] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert)
[09:37:34] <wikibugs>	 (03PS1) 10Elukey: admin: add the analytics-sre uid and gid [puppet] - 10https://gerrit.wikimedia.org/r/1226774 (https://phabricator.wikimedia.org/T402512)
[09:37:36] <wikibugs>	 (03PS1) 10Elukey: role::puppetserver: deploy kerberos keytab for analytics-sre [puppet] - 10https://gerrit.wikimedia.org/r/1226775 (https://phabricator.wikimedia.org/T402512)
[09:37:39] <wikibugs>	 (03PS1) 10Elukey: WIP: profile::puppetserver::volatile: add hdfs rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512)
[09:40:03] <wikibugs>	 (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[09:40:14] <wikibugs>	 (03CR) 10jenkins-bot: Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert)
[09:40:14] <wikibugs>	 (03CR) 10Elukey: "This requires the creation of the keytabs, and their copy to the private repo. We'll need one keytab for each puppetserver hostname, so it" [puppet] - 10https://gerrit.wikimedia.org/r/1226775 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey)
[09:40:55] <wikibugs>	 (03CR) 10Elukey: "Very high level WIP patch, I don't know the more specific details but we can start with some known locations and see how it goes. Lemme kno" [puppet] - 10https://gerrit.wikimedia.org/r/1226776 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey)
[09:42:46] <wikibugs>	 (03PS1) 10Clément Goubert: gateway-check: Document additional query parameter [puppet] - 10https://gerrit.wikimedia.org/r/1226780 (https://phabricator.wikimedia.org/T396807)
[09:43:58] <XioNoX>	 !log configure Arelion LAG on cr1-codfw - T401100
[09:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:29] <claime>	 jouncebot: nowandnext
[09:44:30] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 15 minute(s)
[09:44:30] <jouncebot>	 In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1100)
[09:45:49] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert)
[09:46:19] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "LGTM! Please try to deploy it during https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1100 (also remember to add th" [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[09:48:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (External: Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:55:50] <wikibugs>	 (03PS1) 10Clément Goubert: site.pp: Add mc1055-72 [puppet] - 10https://gerrit.wikimedia.org/r/1226782 (https://phabricator.wikimedia.org/T412255)
[09:56:00] <XioNoX>	 expected ^ that's an interface being set up
[09:57:20] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] gateway-check: Document additional query parameter [puppet] - 10https://gerrit.wikimedia.org/r/1226780 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert)
[10:00:13] <wikibugs>	 (03CR) 10Majavah: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah)
[10:02:59] <wikibugs>	 (03CR) 10Majavah: nftables::service: Improve src/dst filter handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[10:04:25] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] gateway-check: Document additional query parameter [puppet] - 10https://gerrit.wikimedia.org/r/1226780 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert)
[10:06:47] <wikibugs>	 (03CR) 10Joal: druid_exporter: Fixup metric definition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol)
[10:18:33] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: haproxy: actually set "robot" ua_class for identified MW requests [puppet] - 10https://gerrit.wikimedia.org/r/1226792
[10:20:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11520422 (10Clement_Goubert) >>! In T408757#11513905, @Jhancock.wm wrote: > @Clement_Goubert the servers landed last week. Gonna start unpacking them tomorrow or wednesday. A...
[10:29:02] <wikibugs>	 (03PS1) 10Clément Goubert: wikikube: Add ratelimit-media namespace [puppet] - 10https://gerrit.wikimedia.org/r/1226797 (https://phabricator.wikimedia.org/T414439)
[10:31:12] <wikibugs>	 (03PS1) 10Clément Goubert: Add ratelimit-upload namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439)
[10:31:32] <wikibugs>	 10ops-eqiad, 06DC-Ops: Power Supply Redundancy alert on es1057 - https://phabricator.wikimedia.org/T414564 (10FCeratto-WMF) 03NEW
[10:32:21] <wikibugs>	 (03PS2) 10Clément Goubert: Add ratelimit-upload namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439)
[10:35:03] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[10:40:03] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[10:46:04] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] thanos: set performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1226311 (owner: 10Hnowlan)
[10:52:05] <wikibugs>	 (03CR) 10Btullis: [C:04-1] "I don't believe that this change is required, since wikipedia25.org will not be sending any client events from the browser to the event pl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[10:55:44] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "ok! thanks for looking!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[10:55:55] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[10:55:58] <logmsgbot>	 !log dpogorzelski@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet with reason: testing ml changes
[10:56:16] <wikibugs>	 (03Abandoned) 10Dzahn: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[10:57:29] <wikibugs>	 (03CR) 10Blake: [C:03+2] datacenter: remove unused EXCLUDED_SERVICES constant. [cookbooks] - 10https://gerrit.wikimedia.org/r/1226211 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1100)
[11:03:29] <wikibugs>	 (03Merged) 10jenkins-bot: datacenter: remove unused EXCLUDED_SERVICES constant. [cookbooks] - 10https://gerrit.wikimedia.org/r/1226211 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake)
[11:09:06] <wikibugs>	 (03PS1) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396)
[11:10:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[11:11:04] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] haproxy: actually set "robot" ua_class for identified MW requests [puppet] - 10https://gerrit.wikimedia.org/r/1226792 (owner: 10Giuseppe Lavagetto)
[11:11:32] <wikibugs>	 (03PS2) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396)
[11:12:28] <wikibugs>	 (03CR) 10Gergő Tisza: "@kharlan@wikimedia.org the task mentions `x_is_browser_likely_script` and `x_is_browser_likely_browser` but I don't see those documented a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[11:14:29] <wikibugs>	 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11520607 (10Dzahn) I answered the question on Slack but for the record here: No, that's not the case. Making additional changes to the site re...
[11:16:02] <wikibugs>	 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11520613 (10Dzahn) a:05Dzahn→03ATitkov
[11:20:30] <logmsgbot>	 !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.remove-downtime for registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet
[11:20:33] <logmsgbot>	 !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for registry[2004-2005].codfw.wmnet,registry[1004-1005].eqiad.wmnet
[11:23:22] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: complete reboot/destroy hosts with FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1226818
[11:23:22] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: honor command line new-stack name [puppet] - 10https://gerrit.wikimedia.org/r/1226819
[11:25:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: honor command line new-stack name [puppet] - 10https://gerrit.wikimedia.org/r/1226819 (owner: 10Filippo Giunchedi)
[11:25:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: complete reboot/destroy hosts with FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1226818 (owner: 10Filippo Giunchedi)
[11:35:37] <wikibugs>	 (03PS11) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[11:37:27] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: actually set "robot" ua_class for identified MW requests [puppet] - 10https://gerrit.wikimedia.org/r/1226792 (owner: 10Giuseppe Lavagetto)
[11:45:09] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - haproxy 2.8.18 upgrade (T414318)
[11:45:14] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[11:53:17] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - haproxy 2.8.18 upgrade (T414318)
[11:53:21] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[12:00:05] <jouncebot>	 mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1200)
[12:06:26] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520728 (10cmooney) @VRiley-WMF I'll ping you on irc but we want to go ahead and replace the DAC on //d...
[12:07:48] <wikibugs>	 (03PS12) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[12:08:04] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520733 (10cmooney) Hmm so I was going to see if there was any difference if I did a trace to the ceph...
[12:09:11] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:09:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[12:10:07] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:11:51] <wikibugs>	 (03PS13) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[12:19:09] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520751 (10cmooney) Also @VRiley-WMF it seems this is actually a 1G RJ45 link.  So let's swap the coppe...
[12:22:16] <wikibugs>	 (03PS14) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[12:22:52] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest gateway: include a meaningful body with 429 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636)
[12:23:15] <wikibugs>	 (03CR) 10Vgutierrez: "varnishtests are now happy against all the configurations introduced in PS11" [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[12:24:21] <wikibugs>	 (03CR) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[12:28:17] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - haproxy 2.8.18 upgrade (T414318)
[12:28:20] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[12:37:04] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - haproxy 2.8.18 upgrade (T414318)
[12:37:08] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[12:38:35] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler)
[12:55:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520980 (10cmooney) Hmm so with the node un-cordoned the loss has not returned either, well one drop at the first hop but it seems insignific...
[12:56:32] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[13:03:16] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] debug: Add some CDN Backend API headers to Logstash (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[13:05:42] <wikibugs>	 (03PS3) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396)
[13:05:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[13:10:47] <logmsgbot>	 !log elukey@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry1004.eqiad.wmnet with reason: testing
[13:13:15] <wikibugs>	 (03PS1) 10JMeybohm: admin/data: Add hfanwmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1226847 (https://phabricator.wikimedia.org/T414492)
[13:13:15] <wikibugs>	 06SRE: Failing docker registry tests - https://phabricator.wikimedia.org/T414576 (10DPogorzelski-WMF) 03NEW
[13:14:34] <wikibugs>	 (03CR) 10Gergő Tisza: debug: Add some CDN Backend API headers to Logstash (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[13:18:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521085 (10cmooney) >>! In T414460#11518808, @CDanis wrote: > FIN_WAIT_1 is //not// supposed to stick around for longer than a minute or two....
[13:20:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521090 (10JMeybohm) a:05KReid-WMF→03None
[13:24:11] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:26:40] <wikibugs>	 (03Restored) 10Btullis: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[13:27:23] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "It looks like I was wrong. See this comment from Mikhal: https://wikimedia.slack.com/archives/CSV483812/p1768396226801899?thread_ts=1768388" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[13:27:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521107 (10JMeybohm)
[13:30:05] <Dreamy_Jazz>	 jouncebot: nowandnext
[13:30:05] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[13:30:05] <jouncebot>	 In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1400)
[13:30:24] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - haproxy 2.8.18 upgrade (T414318)
[13:30:28] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[13:30:38] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and not P{cp5022.*} and A:cp - haproxy 2.8.18 upgrade (T414318)
[13:31:50] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521115 (10JMeybohm) a:03thcipriani @thcipriani this needs sign-off from you as the approver for the deployment group
[13:31:51] <wikibugs>	 (03PS3) 10Dreamy Jazz: Write new for CheckUser user agent table migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223673 (https://phabricator.wikimedia.org/T361196)
[13:32:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223673 (https://phabricator.wikimedia.org/T361196) (owner: 10Dreamy Jazz)
[13:32:30] <wikibugs>	 (03PS3) 10Milimetric: trafficserver: Send /ins-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863)
[13:32:41] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-01-07-132737 to 2026-01-07-163903 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226852 (https://phabricator.wikimedia.org/T413732)
[13:33:10] <wikibugs>	 (03Merged) 10jenkins-bot: Write new for CheckUser user agent table migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223673 (https://phabricator.wikimedia.org/T361196) (owner: 10Dreamy Jazz)
[13:34:26] <wikibugs>	 (03PS1) 10Cory Massaro: wikifunctions: Upgrade orchestrator from 2026-01-07-132737 to 2026-01-07-163903. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226853
[13:34:37] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1223673|Write new for CheckUser user agent table migration on group0 (T361196)]]
[13:34:41] <stashbot>	 T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196
[13:34:53] <wikibugs>	 (03PS2) 10Cory Massaro: wikifunctions: Upgrade orchestrator from 2026-01-07-132737 to 2026-01-07-163903. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226853 (https://phabricator.wikimedia.org/T413732)
[13:35:12] <wikibugs>	 (03PS1) 10JMeybohm: admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364)
[13:36:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm)
[13:36:40] <wikibugs>	 (03CR) 10JMeybohm: [C:04-2] "- SSH key needs off band verification" [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm)
[13:36:50] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1223673|Write new for CheckUser user agent table migration on group0 (T361196)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:37:35] <wikibugs>	 (03PS2) 10JMeybohm: admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364)
[13:38:30] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2002-dev.codfw.wmnet with OS trixie
[13:39:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[13:39:36] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cumin2002:9100) - https://phabricator.wikimedia.org/T413743#11521134 (10tappof) a:03tappof
[13:39:42] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T413744#11521135 (10tappof) a:03tappof
[13:39:46] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745#11521136 (10tappof) a:03tappof
[13:39:59] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[13:40:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for HFanWMF - https://phabricator.wikimedia.org/T414492#11521137 (10JMeybohm)
[13:41:11] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cumin2002:9100) - https://phabricator.wikimedia.org/T413743#11521138 (10tappof)
[13:41:16] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T413744#11521151 (10tappof)
[13:41:25] <wikibugs>	 07sre-alert-triage, 06SRE Observability: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745#11521153 (10tappof)
[13:42:43] <wikibugs>	 07sre-alert-triage, 10Observability-Alerting: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudcumin2001:9100) - https://phabricator.wikimedia.org/T413745#11521155 (10tappof)
[13:42:52] <wikibugs>	 07sre-alert-triage, 10Observability-Alerting: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T413744#11521156 (10tappof)
[13:43:00] <wikibugs>	 07sre-alert-triage, 10Observability-Alerting: Alert in need of triage: nrpe_Check_whether_ferm_is_active_by_checking_the_default_input_chain (instance cumin2002:9100) - https://phabricator.wikimedia.org/T413743#11521157 (10tappof)
[13:44:02] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223673|Write new for CheckUser user agent table migration on group0 (T361196)]] (duration: 09m 24s)
[13:44:06] <stashbot>	 T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196
[13:44:45] <wikibugs>	 (03PS8) 10Brouberol: druid: inject flags allowing druid to access protected classes in java > 8 [puppet] - 10https://gerrit.wikimedia.org/r/1226844 (https://phabricator.wikimedia.org/T278056)
[13:48:33] <wikibugs>	 (03CR) 10Btullis: [C:03+1] druid: inject flags allowing druid to access protected classes in java > 8 [puppet] - 10https://gerrit.wikimedia.org/r/1226844 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol)
[13:54:05] <wikibugs>	 10SRE-SLO, 10Observability-Alerting, 06SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579 (10tappof) 03NEW
[13:55:43] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage
[13:58:07] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[13:58:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T413525)', diff saved to https://phabricator.wikimedia.org/P87502 and previous config saved to /var/cache/conftool/dbconfig/20260114-135815-marostegui.json
[13:58:19] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[13:58:33] <wikibugs>	 10ops-eqiad, 06DC-Ops: Power Supply Redundancy alert on es1057 - https://phabricator.wikimedia.org/T414564#11521231 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable
[13:59:40] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11521239 (10JMeybohm)
[13:59:49] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage
[13:59:51] <wikibugs>	 (03CR) 10JMeybohm: "SSH key has been verified, deployers access is pending sign off from group approver" [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1400). Please do the needful.
[14:00:05] <jouncebot>	 JSherman, tgr, and sfaci: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:11] <JSherman>	 here
[14:00:14] <sfaci>	 o/
[14:00:19] <Lucas_WMDE>	 o/
[14:01:09] <Lucas_WMDE>	 JSherman: want to self-service?
[14:01:18] <JSherman>	 happy to!
[14:01:32] <Lucas_WMDE>	 okay, go ahead :)
[14:02:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[14:02:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[14:02:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[14:03:15] <wikibugs>	 (03CR) 10Brouberol: [V:03+1 C:03+2] druid: inject flags allowing druid to access protected classes in java > 8 [puppet] - 10https://gerrit.wikimedia.org/r/1226844 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol)
[14:03:23] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings.php: Add wmgUsePersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[14:03:26] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings-labs.php: Deploy PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[14:03:30] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings-labs: Load PersonalDashbard extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[14:04:01] <logmsgbot>	 !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1217787|InitialiseSettings.php: Add wmgUsePersonalDashboard (T412528)]], [[gerrit:1217788|InitialiseSettings-labs.php: Deploy PersonalDashboard (T412528)]], [[gerrit:1217789|CommonSettings-labs: Load PersonalDashbard extension (T412528)]]
[14:04:05] <stashbot>	 T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528
[14:06:30] <logmsgbot>	 !log jsn@deploy2002 jsn: Backport for [[gerrit:1217787|InitialiseSettings.php: Add wmgUsePersonalDashboard (T412528)]], [[gerrit:1217788|InitialiseSettings-labs.php: Deploy PersonalDashboard (T412528)]], [[gerrit:1217789|CommonSettings-labs: Load PersonalDashbard extension (T412528)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:07:10] <logmsgbot>	 !log jsn@deploy2002 jsn: Continuing with sync
[14:08:18] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add db user and password for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1226857
[14:08:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add db user and password for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1226857 (owner: 10Giuseppe Lavagetto)
[14:09:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521279 (10cmooney) >>! In T414460#11521085, @cmooney wrote: > however surely it should try to resend the FIN, and if this state persists eve...
[14:10:14] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: hiddenparma: add configuration for the database [puppet] - 10https://gerrit.wikimedia.org/r/1226858
[14:11:20] <logmsgbot>	 !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217787|InitialiseSettings.php: Add wmgUsePersonalDashboard (T412528)]], [[gerrit:1217788|InitialiseSettings-labs.php: Deploy PersonalDashboard (T412528)]], [[gerrit:1217789|CommonSettings-labs: Load PersonalDashbard extension (T412528)]] (duration: 07m 19s)
[14:11:24] <stashbot>	 T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528
[14:11:26] <JSherman>	 Lucas_WMDE: back to you
[14:12:39] <Lucas_WMDE>	 ok! tgr_ would be up next
[14:12:45] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "Overall looks ok to me, given the work already done on cache::text, the question of actual ratelimits values is well placed. IMHO we shoul" [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[14:16:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7892/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226858 (owner: 10Giuseppe Lavagetto)
[14:17:39] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2002-dev.codfw.wmnet with OS trixie
[14:18:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] hiddenparma: add configuration for the database [puppet] - 10https://gerrit.wikimedia.org/r/1226858 (owner: 10Giuseppe Lavagetto)
[14:19:53] <Lucas_WMDE>	 or sfaci, if tgr_ isn’t around at the moment
[14:19:57] <Lucas_WMDE>	 sfaci: want to self-service?
[14:21:23] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and not P{cp5022.*} and A:cp - haproxy 2.8.18 upgrade (T414318)
[14:21:27] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[14:21:50] <sfaci>	 Lucas_WMDE: I can't. I need someone to deploy for me
[14:22:15] <tgr_>	 I can self-deploy quickly if you haven't started yet
[14:22:29] <tgr_>	 (sorry, at an offsite so a bit unresponsive)
[14:22:35] <Lucas_WMDE>	 tgr_: go ahead
[14:22:38] <Lucas_WMDE>	 and then I can deploy for sfaci 
[14:22:46] <Lucas_WMDE>	 (but I thought you had deployment access based on puppet, sorry)
[14:22:51] <sfaci>	 It's ok tgr_ , I can wait!
[14:22:55] <sfaci>	 Thanks Lucas_WMDE !
[14:24:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[14:24:44] <tgr_>	 thx
[14:24:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Move status, commit status/history to database [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1226860
[14:25:12] <wikibugs>	 (03Merged) 10jenkins-bot: debug: Add some CDN Backend API headers to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226815 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[14:25:43] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1226815|debug: Add some CDN Backend API headers to Logstash (T412396)]]
[14:25:47] <stashbot>	 T412396: Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396
[14:25:55] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - haproxy 2.8.18 upgrade (T414318)
[14:26:59] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318)
[14:27:03] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[14:27:07] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318)
[14:27:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11521326 (10bking) 05Resolved→03In progress a:05Jclark-ctr→03bking
[14:28:13] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1226815|debug: Add some CDN Backend API headers to Logstash (T412396)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:30:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Move status, commit status/history to database [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1226860 (owner: 10Giuseppe Lavagetto)
[14:30:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11521339 (10bking) Some more lines from dmesg:  ` [ 1172.174064] sd 0:2:2:0: SCSI device is removed [ 1172.273429] megaraid_sa...
[14:31:10] <wikibugs>	 (03PS2) 10A smart kitten: CommonSettings-labs: Remove redundant code for loading/configuring Phonos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225075
[14:31:10] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "State, commit to database - oblivian@cumin1003"
[14:31:13] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: State, commit to database - oblivian@cumin1003
[14:32:02] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[14:32:12] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: State, commit to database - oblivian@cumin1003
[14:32:13] <wikibugs>	 (03CR) 10A smart kitten: "PS2 is a rebase to resolve a merge conflict from 1f1d2ae36f52" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225075 (owner: 10A smart kitten)
[14:32:13] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "State, commit to database - oblivian@cumin1003"
[14:32:29] <wikibugs>	 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521361 (10Vgutierrez) the headers described on https://wikitech.wikimedia.org/wiki/CDN/Backe...
[14:33:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521367 (10CDanis) >>! In T414460#11521085, @cmooney wrote: > The k8s host sent a FIN to the remote side but due to the packet-loss issue the...
[14:36:04] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226815|debug: Add some CDN Backend API headers to Logstash (T412396)]] (duration: 10m 21s)
[14:36:09] <stashbot>	 T412396: Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396
[14:37:12] <tgr_>	 Lucas_WMDE: thanks, back to you
[14:37:29] <Lucas_WMDE>	 thanks!
[14:37:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:38:03] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Present in all deployed branches, so I think this is good to go:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[14:38:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[14:38:32] <Lucas_WMDE>	 sfaci: fyi ^
[14:38:56] <sfaci>	 Lucas_WMDE: cool!
[14:39:21] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[14:39:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1225005|Deploy TestKitchen to testwiki (T407806)]]
[14:39:55] <stashbot>	 T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806
[14:42:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 cjming, lucaswerkmeister-wmde: Backport for [[gerrit:1225005|Deploy TestKitchen to testwiki (T407806)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:42:15] <Lucas_WMDE>	 sfaci: please test the change on mwdebug :)
[14:43:38] <sfaci>	 ok
[14:44:09] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] Add vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira)
[14:44:21] <sfaci>	 Lucas_WMDE: Checked! The extension is loaded and working
[14:45:17] <wikibugs>	 (03PS1) 10Jsn.sherman: Deploy PersonalDashboard to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226862 (https://phabricator.wikimedia.org/T403982)
[14:45:30] <Lucas_WMDE>	 alright, thanks!
[14:45:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 cjming, lucaswerkmeister-wmde: Continuing with sync
[14:49:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225005|Deploy TestKitchen to testwiki (T407806)]] (duration: 09m 49s)
[14:49:41] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "This is still not under the /ml prefix/namespace :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira)
[14:49:44] <stashbot>	 T407806: Rename Metrics Platform Extension to Test Kitchen - https://phabricator.wikimedia.org/T407806
[14:49:58] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:23] <sfaci>	 Lucas_WMDE: Thank you very much!
[14:55:18] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert "Move status, commit status/history to database" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1226867
[15:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1500)
[15:03:48] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:04:08] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:04:28] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:04:58] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:05:05] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:05:41] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:06:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: hiddenparma: use sqlite for now [puppet] - 10https://gerrit.wikimedia.org/r/1226869
[15:07:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] hiddenparma: use sqlite for now [puppet] - 10https://gerrit.wikimedia.org/r/1226869 (owner: 10Giuseppe Lavagetto)
[15:09:11] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:58] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11521553 (10elukey) It happens with all the image push tests, with Docker on bullseye and bookworm (tried both build nodes). I dumped the registry's goroutines when...
[15:13:50] <wikibugs>	 (03CR) 10Bking: [C:03+2] airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse)
[15:15:39] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318)
[15:15:44] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[15:17:08] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - haproxy 2.8.18 upgrade (T414318)
[15:19:08] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[15:19:10] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[15:20:42] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - haproxy 2.8.18 upgrade (T414318)
[15:20:46] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[15:24:07] <wikibugs>	 (03PS2) 10Clément Goubert: api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333)
[15:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1500)
[15:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1530)
[15:30:07] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:30:48] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp - haproxy 2.8.18 upgrade (T414318)
[15:30:52] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[15:32:39] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[15:33:32] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[15:34:11] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:38:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87504 and previous config saved to /var/cache/conftool/dbconfig/20260114-153811-marostegui.json
[15:38:18] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[15:38:18] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[15:39:26] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11521716 (10elukey) Meanwhile this is the stacktrace for dockerd and the relevant goroutine:  ` goroutine 148 [select, 3 minutes]: net/http.(*persistConn).roundTrip...
[15:40:03] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Fri 30 Jan 2026 03:40:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[15:43:27] <logmsgbot>	 !log cdobbins@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: switch work, T408510]
[15:43:31] <stashbot>	 T408510: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510
[15:43:36] <logmsgbot>	 !log cdobbins@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: switch work, T408510]
[15:44:39] <wikibugs>	 (03PS15) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[15:45:10] <wikibugs>	 (03CR) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto)
[15:45:12] <wikibugs>	 (03PS1) 10Majavah: aptrepo: Drop packages for Kubeadm/1.30 [puppet] - 10https://gerrit.wikimedia.org/r/1226878 (https://phabricator.wikimedia.org/T372697)
[15:45:14] <wikibugs>	 (03PS1) 10Majavah: aptrepo: Import packages for Kubeadm/1.32 [puppet] - 10https://gerrit.wikimedia.org/r/1226879 (https://phabricator.wikimedia.org/T379047)
[15:46:51] <logmsgbot>	 andrew@cumin2002 reimage (PID 2640472) is awaiting input
[15:48:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P87505 and previous config saved to /var/cache/conftool/dbconfig/20260114-154820-marostegui.json
[15:49:00] <XioNoX>	 !log drain eqsin-ulsfo transport
[15:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:13] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2003-dev.codfw.wmnet with OS trixie
[15:54:15] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Comment out temporarily the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1216677 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul)
[15:54:55] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:54:55] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:56:06] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:56:07] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1500)
[15:56:07] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1530)
[15:56:07] <jouncebot>	 In 2 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1800)
[15:57:08] <Dreamy_Jazz>	 We want to do a deploy to fix a security issue
[15:57:12] <Dreamy_Jazz>	 Any objection?
[15:57:59] <sukhe>	 Dreamy_Jazz: should be fine, but please check with oncallers, _joe_ fabfur ^
[15:58:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P87506 and previous config saved to /var/cache/conftool/dbconfig/20260114-155828-marostegui.json
[15:58:36] <Dreamy_Jazz>	 We are intending to use scap to deploy it as it's only on wmf branches and minor enough to fix publicly
[15:58:56] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:58:56] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:59:01] <wikibugs>	 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521815 (10Tgr) a:03Tgr
[15:59:08] <sukhe>	 Dreamy_Jazz: go ahead please but note that grafana is down
[15:59:19] <sukhe>	 in case you need that, for whatever reason
[15:59:56] <icinga-wm>	 PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:00:17] <Dreamy_Jazz>	 That should be fine
[16:00:24] <Dreamy_Jazz>	 We just need excimer to check that things are not slow
[16:00:25] <wikibugs>	 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521825 (10Tgr) We should also update some of the dashboards (at least the login one) with so...
[16:00:56] <sukhe>	 Dreamy_Jazz: ok +1 from me, I am not on-call, but since the people who are on-call are busy with meetings, please go ahead and I can help if things go south
[16:00:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521827 (10cmooney) The SFP module in port 14 of lsw1-c5-eqiad has been swapped out now.  So we can observe over the next...
[16:01:48] <icinga-wm>	 RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:01:52] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sat 31 Jan 2026 10:43:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[16:01:52] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sat 31 Jan 2026 10:43:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[16:02:54] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 6.732 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[16:02:54] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 6.697 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[16:02:56] <wikibugs>	 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521866 (10Tgr) >>! In T412396#11521361, @Vgutierrez wrote: > the headers described on https:...
[16:04:39] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - haproxy 2.8.18 upgrade (T414318)
[16:04:44] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[16:05:00] <logmsgbot>	 !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 8 hosts with reason: loopback IPV4 change on ulsfo core router
[16:05:20] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11521900 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf1deaa2-45c3-45e8-bdad-1303b0075f87) set by pt1979@cumin2002 for 2:00:00 on 8 h...
[16:06:28] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage
[16:07:20] <papaul>	 !log ongoing loopback ip's change on cr3/cr4-ulsfo
[16:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87507 and previous config saved to /var/cache/conftool/dbconfig/20260114-160836-marostegui.json
[16:08:43] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[16:08:44] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[16:09:56] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage
[16:14:57] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.update-views
[16:15:39] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] admin/data: Add hfanwmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1226847 (https://phabricator.wikimedia.org/T414492) (owner: 10JMeybohm)
[16:16:15] <wikibugs>	 10SRE-swift-storage, 10Ceph, 07Epic, 07Kubernetes, and 2 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11522011 (10JMeybohm) p:05Triage→03High
[16:17:29] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin/data: Add hfanwmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1226847 (https://phabricator.wikimedia.org/T414492) (owner: 10JMeybohm)
[16:18:04] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp - haproxy 2.8.18 upgrade (T414318)
[16:18:08] <stashbot>	 T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318
[16:20:15] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for HFanWMF - https://phabricator.wikimedia.org/T414492#11522046 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I have added you to the `analytics-privatedata-users` group. If that does not grant you the requ...
[16:21:20] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[16:27:39] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS trixie
[16:28:10] <wikibugs>	 (03PS1) 10Dreamy Jazz: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1226896 (https://phabricator.wikimedia.org/T414582)
[16:28:18] <wikibugs>	 (03PS1) 10Dreamy Jazz: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226897 (https://phabricator.wikimedia.org/T414582)
[16:29:02] <Tran>	 o/ Dreamy_Jazz asked earlier but I'll be doing the backport he mentioned now if that's alright?
[16:29:41] <sukhe>	 Tran: there are no issues on our end, except that ulsfo is depooled, so yes, from SRE's side you can go ahead if you want
[16:31:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226897 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz)
[16:31:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1226896 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz)
[16:34:36] <wikibugs>	 (03Merged) 10jenkins-bot: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1226897 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz)
[16:35:17] <wikibugs>	 (03Merged) 10jenkins-bot: Only validate IRS configs on writes; skip validations for reads [extensions/ReportIncident] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1226896 (https://phabricator.wikimedia.org/T414582) (owner: 10Dreamy Jazz)
[16:35:52] <logmsgbot>	 !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1226897|Only validate IRS configs on writes; skip validations for reads (T414582)]], [[gerrit:1226896|Only validate IRS configs on writes; skip validations for reads (T414582)]]
[16:35:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522109 (10cmooney) Ok currently seeing no loss (though that was the case when we were cordoned before the swap). ` cmoon...
[16:36:11] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11522110 (10elukey) As the last times, I went afk for ~1h, got back and retried the same docker push that completed immediately without hanging.
[16:38:01] <logmsgbot>	 !log stran@deploy2002 dreamyjazz, stran: Backport for [[gerrit:1226897|Only validate IRS configs on writes; skip validations for reads (T414582)]], [[gerrit:1226896|Only validate IRS configs on writes; skip validations for reads (T414582)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:38:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:38:53] <Tran>	 testing my patch now
[16:40:43] <Tran>	 looks good, moving forward
[16:40:46] <logmsgbot>	 !log stran@deploy2002 dreamyjazz, stran: Continuing with sync
[16:42:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr1-codfw:9804&var-bgp_group=Confed_ulsfo&var-bgp_neighbor=cr4-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:42:44] <herron>	 !log restarting grafana1002 for memory increase T414604 
[16:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:48] <stashbot>	 T414604: Increase Grafana VM memory - https://phabricator.wikimedia.org/T414604
[16:44:24] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM grafana1002.eqiad.wmnet
[16:44:51] <logmsgbot>	 !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226897|Only validate IRS configs on writes; skip validations for reads (T414582)]], [[gerrit:1226896|Only validate IRS configs on writes; skip validations for reads (T414582)]] (duration: 08m 59s)
[16:45:41] <Tran>	 done, thanks!
[16:45:51] <wikibugs>	 (03CR) 10Wfan: [C:03+1] Revert "Shorten 'close' cookie wait period for enwiki banners" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg)
[16:47:39] <jinxer-wm>	 FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:48:50] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1013.eqiad.wmnet
[16:49:13] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522186 (10ops-monitoring-bot) Host dse-k8s-worker1013.eqiad.wmnet rebooted by brouberol@cumin1003 with reason: Getting a...
[16:49:19] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM grafana1002.eqiad.wmnet
[16:50:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522190 (10VRiley-WMF) Happy to help with this. Let us know if there is anything else we can help with.
[16:50:46] <wikibugs>	 (03PS1) 10Dzahn: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226902 (https://phabricator.wikimedia.org/T408592)
[16:50:53] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM grafana2001.codfw.wmnet
[16:51:40] <wikibugs>	 (03PS1) 10Gergő Tisza: debug: Add X-Provenance header to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396)
[16:51:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] debug: Add X-Provenance header to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza)
[16:52:39] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:53:26] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg)
[16:54:52] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM grafana2001.codfw.wmnet
[16:55:08] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet
[16:57:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226902 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[16:57:39] <jinxer-wm>	 FIRING: [10x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:59:38] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226902 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[16:59:39] <wikibugs>	 (03PS1) 10Ssingh: wikimedia/wikipedia.org: match TTLs for NS and glue records [dns] - 10https://gerrit.wikimedia.org/r/1226904 (https://phabricator.wikimedia.org/T81605)
[17:02:05] <wikibugs>	 (03CR) 10Btullis: [C:03+2] eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[17:02:39] <jinxer-wm>	 FIRING: [15x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:03:57] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:04:00] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[17:04:19] <_joe_>	 is this ulsfo?
[17:04:33] <sukhe>	 yes, weird, we should have downtimed I guess and I think we did?
[17:04:37] <_joe_>	 yes
[17:04:39] <sukhe>	 this is ulsfo
[17:04:40] <_joe_>	 ok np
[17:04:42] <sukhe>	 !incidents
[17:04:43] <sirenbot>	 7336 (UNACKED)  [7x] ProbeDown sre (probes/service ulsfo)
[17:04:43] <claime>	 looks like ulsfo yeah
[17:04:45] <sukhe>	 !ack 7336
[17:04:46] <sirenbot>	 7336 (ACKED)  [7x] ProbeDown sre (probes/service ulsfo)
[17:04:46] <_joe_>	 !ack
[17:04:47] <sirenbot>	 no value provided for parameter incident and no default available
[17:04:47] <sirenbot>	 All incidents are already acked.
[17:04:50] <sukhe>	 nothing to worry about
[17:04:52] <sukhe>	 still depooled
[17:04:56] <claime>	 ack
[17:04:57] <_joe_>	 sukhe: use "ack" without args
[17:05:01] <_joe_>	 thanks rzl 
[17:05:02] <sukhe>	 ah yes thanks
[17:07:14] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Change cr3/4-ulsfo loopback ip's in puppet before tomorrow's maintenance window [puppet] - 10https://gerrit.wikimedia.org/r/1216679 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul)
[17:07:39] <jinxer-wm>	 FIRING: [15x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:09:24] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11522264 (10tappof) The debug log level does not provide information about cache usage.
[17:10:35] <wikibugs>	 (03PS1) 10Superpes15: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320)
[17:12:03] <logmsgbot>	 !log dzahn@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[17:12:27] <logmsgbot>	 !log dzahn@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[17:12:45] <wikibugs>	 (03PS6) 10Kevin Bazira: Add vLLM image in ML namespace [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173)
[17:13:16] <logmsgbot>	 !log dzahn@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[17:13:35] <logmsgbot>	 !log dzahn@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[17:13:39] <tappof>	 !log pool titan1002
[17:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:01] <tappof>	 !log pool titan1002 (T411273)
[17:14:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:05] <stashbot>	 T411273: Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273
[17:14:12] <logmsgbot>	 !log dzahn@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[17:14:34] <logmsgbot>	 !log dzahn@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[17:15:30] <jinxer-wm>	 FIRING: LibericaStaleConfig: Liberica instance lvs4009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[17:17:48] <sukhe>	 ^ well, it's depooled so I am not worried but will look after meeting
[17:18:02] <sukhe>	 probably because puppet hasn't run in a while
[17:18:20] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudbackup: move postgres data to /var/lib for all eqiad1 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1226909
[17:19:14] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:19:57] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226909 (owner: 10Andrew Bogott)
[17:20:16] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1148.eqiad.wmnet
[17:20:30] <jinxer-wm>	 FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[17:20:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11522297 (10ops-monitoring-bot) Host an-worker1148.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting after fixi...
[17:21:30] <wikibugs>	 (03PS1) 10Papaul: comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892)
[17:22:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: move postgres data to /var/lib for all eqiad1 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1226909 (owner: 10Andrew Bogott)
[17:23:31] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "thank you for this and reaching out to the team!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[17:24:11] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:24:24] <wikibugs>	 (03PS14) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[17:24:28] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2004.codfw.wmnet with OS trixie
[17:24:57] <wikibugs>	 (03PS2) 10Dzahn: Revert "trafficserver: disable wikipedia25" [puppet] - 10https://gerrit.wikimedia.org/r/1224959 (https://phabricator.wikimedia.org/T408592)
[17:25:02] <wikibugs>	 (03CR) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[17:26:34] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4010*} and A:liberica (T408510)
[17:26:39] <stashbot>	 T408510: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510
[17:26:53] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4010*} and A:liberica (T408510)
[17:27:22] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[17:27:30] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[17:27:56] <logmsgbot>	 !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[17:28:02] <logmsgbot>	 !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[17:28:56] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4009*} and A:liberica (T408510)
[17:29:00] <logmsgbot>	 !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[17:29:04] <logmsgbot>	 !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[17:29:14] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4009*} and A:liberica (T408510)
[17:29:35] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4008*} and A:liberica (T408510)
[17:29:47] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul)
[17:29:54] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4008*} and A:liberica (T408510)
[17:30:30] <jinxer-wm>	 FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[17:30:37] <sukhe>	 that is weird for sure
[17:31:02] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul)
[17:31:03] <sukhe>	 na, that's an old alert, it cleared up
[17:31:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[17:31:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522388 (10akosiaris)
[17:31:18] <wikibugs>	 (03CR) 10Papaul: [C:03+2] comment back the anycast ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1226911 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul)
[17:33:27] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1148.eqiad.wmnet
[17:33:50] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1148.eqiad.wmnet
[17:34:03] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1148.eqiad.wmnet
[17:34:11] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1148.eqiad.wmnet
[17:35:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522415 (10RobH)
[17:35:30] <jinxer-wm>	 RESOLVED: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[17:35:53] <wikibugs>	 (03PS1) 10Elukey: profile::docker_registry: tune the s3 config for /restricted [puppet] - 10https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476)
[17:36:06] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey)
[17:36:13] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522428 (10cmooney) Thanks @VRiley.  Happy to say we aren't seeing any loss as of yet after the node was uncordoned: ` cm...
[17:38:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522433 (10Dwisehaupt)
[17:40:02] <wikibugs>	 (03PS1) 10Superpes15: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265)
[17:40:29] <icinga-wm>	 PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11522439 (10BTullis) We went through the process of:  * Deleting a foreign config for VD 02 * Deleting the preserved cache for...
[17:43:12] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2004.codfw.wmnet with reason: host reimage
[17:43:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:43:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522441 (10VRiley-WMF) For the scs console server, I believe it would be the one located in F8, is that correct?
[17:49:54] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup2004.codfw.wmnet with reason: host reimage
[17:50:51] <logmsgbot>	 pt1979@cumin2002 netbox (PID 2702015) is awaiting input
[17:52:30] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[17:52:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002"
[17:53:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002"
[17:53:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:55:17] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1800)
[18:01:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:04:30] <wikibugs>	 (03PS2) 10Superpes15: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265)
[18:04:39] <wikibugs>	 (03PS2) 10Superpes15: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320)
[18:05:25] <logmsgbot>	 !log cdobbins@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: switch work, T408510]
[18:05:29] <stashbot>	 T408510: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510
[18:05:37] <logmsgbot>	 !log cdobbins@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: switch work, T408510]
[18:06:30] <wikibugs>	 (03PS1) 10Superpes15: [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267)
[18:07:40] <logmsgbot>	 pt1979@cumin2002 netbox (PID 2712631) is awaiting input
[18:08:12] <icinga-wm>	 PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11522495 (10cmooney) >>! In T403035#11522441, @VRiley-WMF wrote: > For the scs console server, I believe it would be the one located in F8, is that corr...
[18:10:26] <icinga-wm>	 RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.86 ms
[18:14:07] <_joe_>	 !incidents
[18:14:07] <sirenbot>	 7336 (ACKED)  [7x] ProbeDown sre (probes/service ulsfo)
[18:14:17] <_joe_>	 uhm why are services still down though
[18:14:43] <sukhe>	 they should not be, I don't see any pending alerts on alertmanager, though not sure why resolves haven't come in
[18:14:46] <sukhe>	 checking
[18:15:10] <sukhe>	 probes look OK as well
[18:15:12] <_joe_>	 yes
[18:15:15] <_joe_>	 !resolve
[18:15:16] <sirenbot>	 7336 (RESOLVED)  [7x] ProbeDown sre (probes/service ulsfo)
[18:15:21] <_joe_>	 rzl: <3
[18:16:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002"
[18:16:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002"
[18:16:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:21:55] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[18:23:28] <icinga-wm>	 PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:36] <icinga-wm>	 RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.82 ms
[18:28:03] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522551 (10ssingh) @cmooney: Any picks for your favourite v6 address for `ns1`? I was thinking of allocating `2620:0:860:ed1a::4/128` under LVS service IPs `2620:0:860:ed1a::/64`, since unfortuna...
[18:29:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T413525)', diff saved to https://phabricator.wikimedia.org/P87508 and previous config saved to /var/cache/conftool/dbconfig/20260114-182942-marostegui.json
[18:29:50] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:29:51] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:31:59] <wikibugs>	 (03PS4) 10Bking: WIP: Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - 10https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447)
[18:33:04] <wikibugs>	 (03PS5) 10Bking: Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - 10https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447)
[18:33:15] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:33:34] <wikibugs>	 10SRE-Access-Requests: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619 (10dduvall) 03NEW
[18:33:39] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:35:06] <wikibugs>	 (03PS1) 10Dduvall: admin: Add new yubikey-ssh-fido keys for dduvall [puppet] - 10https://gerrit.wikimedia.org/r/1226922 (https://phabricator.wikimedia.org/T414619)
[18:36:02] <MatmaRex>	 the page https://wikipedia25.org/ shows "Domain not configured". this is bad because it's already linked in some live banners. does anyone here know anything about it?
[18:36:09] <MatmaRex>	 (this was reported at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#https://wikipedia25.org/_banner )
[18:36:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87509 and previous config saved to /var/cache/conftool/dbconfig/20260114-183621-marostegui.json
[18:36:30] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[18:36:31] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[18:38:01] <A_smart_kitten>	 mutante: ^ in case you're aware of what's happening re wikipedia25.org
[18:38:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:38:06] <A_smart_kitten>	 xref T408592
[18:38:07] <stashbot>	 T408592: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592
[18:38:13] <taavi>	 per T408592 it's not supposed to be up until tomorrow morning
[18:38:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:39:20] <logmsgbot>	 cmooney@cumin1003 netbox (PID 1520061) is awaiting input
[18:39:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P87511 and previous config saved to /var/cache/conftool/dbconfig/20260114-183951-marostegui.json
[18:40:19] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1148.eqiad.wmnet
[18:40:20] <taavi>	 VPT says banners were accidentally up too early and have been fixed
[18:41:26] <wikibugs>	 (PS1) Cathal Mooney: Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892)
[18:41:34] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:41:41] <MatmaRex>	 ah. thanks for responding anyway :)
[18:42:12] <wikibugs>	 (CR) CI reject: [V:-1] Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) (owner: Cathal Mooney)
[18:42:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002"
[18:42:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cr3-cr4-ulsfo loopback - pt1979@cumin2002"
[18:42:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:44:36] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:46:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P87512 and previous config saved to /var/cache/conftool/dbconfig/20260114-184630-marostegui.json
[18:47:13] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:48:38] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:48:49] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[18:49:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:50:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P87513 and previous config saved to /var/cache/conftool/dbconfig/20260114-185001-marostegui.json
[18:51:28] <icinga-wm>	 PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[18:52:04] <icinga-wm>	 PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:52:22] <papaul>	 me^
[18:52:40] <icinga-wm>	 RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.75 ms
[18:53:03] <wikibugs>	 (PS1) Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605)
[18:55:03] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003"
[18:55:29] <wikibugs>	 (PS2) Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605)
[18:55:45] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003"
[18:55:45] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:56:29] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7894/co" [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[18:56:38] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:56:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P87514 and previous config saved to /var/cache/conftool/dbconfig/20260114-185638-marostegui.json
[18:56:56] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[18:57:06] <icinga-wm>	 RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.72 ms
[18:57:14] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:59:35] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup2004.codfw.wmnet with OS trixie
[19:00:05] <jouncebot>	 jeena and dduvall: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T1900).
[19:00:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T413525)', diff saved to https://phabricator.wikimedia.org/P87515 and previous config saved to /var/cache/conftool/dbconfig/20260114-190008-marostegui.json
[19:00:10] <icinga-wm>	 PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[19:00:14] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:00:25] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[19:00:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T413525)', diff saved to https://phabricator.wikimedia.org/P87516 and previous config saved to /var/cache/conftool/dbconfig/20260114-190033-marostegui.json
[19:00:53] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003"
[19:00:58] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003"
[19:00:58] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:01:45] <logmsgbot>	 !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-ulsfo,mr1-ulsfo IPv6 with reason: loopback IPV4 change on ulsfo core router
[19:02:02] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11522682 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8cc58471-31d6-4e79-ae14-124cd9a6b684) set by pt1979@cumin2002 for 1:00:00 on 2 h...
[19:03:04] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.46.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226930 (https://phabricator.wikimedia.org/T413802)
[19:03:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226930 (https://phabricator.wikimedia.org/T413802) (owner: TrainBranchBot)
[19:04:00] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.46.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226930 (https://phabricator.wikimedia.org/T413802) (owner: TrainBranchBot)
[19:04:00] <wikibugs>	 (PS2) Cathal Mooney: Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892)
[19:04:02] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:05:01] <wikibugs>	 (PS3) Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605)
[19:06:07] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[19:06:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87517 and previous config saved to /var/cache/conftool/dbconfig/20260114-190647-marostegui.json
[19:06:55] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[19:06:55] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[19:07:04] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1261.eqiad.wmnet with reason: Maintenance
[19:07:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87518 and previous config saved to /var/cache/conftool/dbconfig/20260114-190711-marostegui.json
[19:08:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003"
[19:08:27] <jeena>	 There seem to be a lot of db connection errors https://logstash.wikimedia.org/goto/5c16b9ac7ad6b93093cdbe4984eb0f38
[19:08:29] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update netbox entries beofre dns patch - cmooney@cumin1003"
[19:08:29] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:08:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[19:10:17] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.11  refs T413802
[19:10:22] <stashbot>	 T413802: 1.46.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T413802
[19:10:23] <wikibugs>	 (CR) Ssingh: [C:+1] Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) (owner: Cathal Mooney)
[19:11:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:11:34] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Add INCLUDE statement to cover new netbox snippet for 198.35.26.128/27 [dns] - https://gerrit.wikimedia.org/r/1226923 (https://phabricator.wikimedia.org/T408892) (owner: Cathal Mooney)
[19:11:49] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[19:12:17] <wikibugs>	 (PS8) CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895)
[19:12:17] <wikibugs>	 (PS14) CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895)
[19:12:17] <wikibugs>	 (PS7) CDanis: lvs7001: add gerrit services [puppet] - https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895)
[19:12:17] <wikibugs>	 (PS4) CDanis: gerrit/Liberica: expand to drmrs [puppet] - https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895)
[19:12:18] <wikibugs>	 (PS1) CDanis: cache_text: add gerrit-https to realservers [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895)
[19:12:35] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[19:12:43] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[19:14:25] <wikibugs>	 (CR) Ssingh: [V:+1] "I was split on modifying authdns_addrs but it's pretty clear we have to do that as well since it's tightly couple on our assumption of val" [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[19:14:43] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS trixie
[19:16:27] <wikibugs>	 (PS4) Ssingh: dnsbox: codfw: advertise ns1 IPv6 [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605)
[19:17:31] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7896/co" [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[19:20:18] <wikibugs>	 SRE, Traffic, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522747 (taavi) Stalled→Open
[19:20:36] <icinga-wm>	 PROBLEM - Host cp7016 is DOWN: CRITICAL - Time to live exceeded (10.140.1.11)
[19:20:54] <icinga-wm>	 PROBLEM - Host cp7003 is DOWN: CRITICAL - Time to live exceeded (10.140.0.4)
[19:20:54] <icinga-wm>	 PROBLEM - Host cp7005 is DOWN: CRITICAL - Time to live exceeded (10.140.0.5)
[19:20:55] <icinga-wm>	 PROBLEM - Host cp7007 is DOWN: CRITICAL - Time to live exceeded (10.140.0.6)
[19:20:55] <icinga-wm>	 PROBLEM - Host cp7009 is DOWN: CRITICAL - Time to live exceeded (10.140.0.7)
[19:21:10] <icinga-wm>	 RECOVERY - Host cp7003 is UP: PING OK - Packet loss = 0%, RTA = 137.51 ms
[19:21:10] <icinga-wm>	 RECOVERY - Host cp7005 is UP: PING OK - Packet loss = 0%, RTA = 137.45 ms
[19:21:10] <icinga-wm>	 RECOVERY - Host cp7016 is UP: PING OK - Packet loss = 0%, RTA = 137.52 ms
[19:21:10] <icinga-wm>	 RECOVERY - Host cp7009 is UP: PING OK - Packet loss = 0%, RTA = 137.38 ms
[19:21:12] <icinga-wm>	 RECOVERY - Host cp7007 is UP: PING OK - Packet loss = 0%, RTA = 137.56 ms
[19:21:17] <sukhe>	 come on
[19:21:30] <icinga-wm>	 PROBLEM - SSH on cp7016 is CRITICAL: connect to address 10.140.1.11 and port 22: No route to host https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:22:27] <cdanis>	 not me :)
[19:22:38] <icinga-wm>	 RECOVERY - SSH on cp7016 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:22:58] <sukhe>	 cdanis: yeah this is T414473
[19:22:59] <stashbot>	 T414473: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473
[19:23:19] <cdanis>	 sukhe: I suspect we're winding up with a temporary routing loop
[19:23:20] <icinga-wm>	 PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3)
[19:23:24] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8)
[19:23:29] <sukhe>	 cdanis: yep
[19:23:47] <cdanis>	 something something OSPF something BGP something confederation
[19:23:56] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:23:56] <icinga-wm>	 PROBLEM - Host doh7004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:24:02] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING WARNING - Packet loss = 33%, RTA = 347.92 ms
[19:24:02] <icinga-wm>	 RECOVERY - Host doh7004 is UP: PING WARNING - Packet loss = 33%, RTA = 347.74 ms
[19:24:05] <icinga-wm>	 RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.15 ms
[19:24:08] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 137.90 ms
[19:24:20] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.103)
[19:24:48] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 138.27 ms
[19:24:56] <icinga-wm>	 PROBLEM - Recursive DNS on 195.200.68.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[19:25:50] <icinga-wm>	 PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131)
[19:25:54] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7005 is CRITICAL: connect to address 10.140.0.5 and port 3128: No route to host https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:00] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:00] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:00] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:00] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:00] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:01] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:01] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:02] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:02] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:03] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:03] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:04] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:06] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:06] <icinga-wm>	 PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/HTTPS
[19:26:06] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:06] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-19 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:06] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-34 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:07] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:07] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-38 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:08] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-12 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:08] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-27 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:09] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-33 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:09] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-21 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-21 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-37 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:11] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:11] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-15 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-23 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:13] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-37 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:13] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-22 on ncredir7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:14] <icinga-wm>	 RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 144.10 ms
[19:26:52] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-12 on ncredir7004 is OK: SSL OK - Certificate wikiedia.org valid until 2026-02-13 13:40:57 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:52] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir7003 is OK: SSL OK - Certificate wikipedia.fi valid until 2026-02-22 01:44:33 +0000 (expires in 38 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:52] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-27 on ncredir7003 is OK: SSL OK - Certificate wiktionary.ee valid until 2026-03-19 20:17:32 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:52] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-38 on ncredir7004 is OK: SSL OK - Certificate wikipublications.com valid until 2026-03-10 18:09:14 +0000 (expires in 54 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:26:52] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is OK: HTTP OK: HTTP/1.0 200 OK - 36924 bytes in 0.486 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:57] <sukhe>	 welp
[19:27:00] <sukhe>	 raising the priority on that one I guess
[19:27:54] <jinxer-wm>	 FIRING: [14x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:28:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:28:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:31:22] <wikibugs>	 (PS2) CDanis: cache_text: add gerrit-https to realservers [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895)
[19:31:22] <wikibugs>	 (PS9) CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895)
[19:31:22] <wikibugs>	 (PS15) CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895)
[19:31:23] <wikibugs>	 (PS8) CDanis: lvs7001: add gerrit services [puppet] - https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895)
[19:31:24] <wikibugs>	 (PS5) CDanis: gerrit/Liberica: expand to drmrs [puppet] - https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895)
[19:31:28] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[19:32:39] <jinxer-wm>	 FIRING: [14x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:32:55] <papaul>	 looking at that^
[19:33:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:37:39] <jinxer-wm>	 FIRING: [14x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:40:54] <wikibugs>	 (PS3) CDanis: cache_text: add gerrit-https to realservers [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895)
[19:40:54] <wikibugs>	 (PS10) CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895)
[19:40:54] <wikibugs>	 (PS16) CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895)
[19:40:54] <wikibugs>	 (PS9) CDanis: lvs7001: add gerrit services [puppet] - https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895)
[19:40:55] <wikibugs>	 (PS6) CDanis: gerrit/Liberica: expand to drmrs [puppet] - https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895)
[19:41:03] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[19:44:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11522824 (VRiley-WMF) Open→Resolved Thanks for this. I have unplugged the secondary cable for cloudcephosd1052. I have also went through the cable...
[19:55:49] <wikibugs>	 (CR) CDanis: [V:+1] "PCC is looking good so far, although I imagine we'll do this in exclusively magru first https://puppet-compiler.wmflabs.org/output/1226932" [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[19:58:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[20:00:53] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[20:00:56] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[20:00:58] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[20:02:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr3-eqsin and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[20:04:42] <wikibugs>	 (CR) Ssingh: [C:+1] cache_text: add gerrit-https to realservers [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[20:05:52] <icinga-wm>	 PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[20:06:21] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mwlog1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:06:54] <logmsgbot>	 andrew@cumin2002 reimage (PID 2749329) is awaiting input
[20:08:37] <wikibugs>	 SRE, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#11522895 (Izno)
[20:21:17] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS trixie
[20:22:16] <wikibugs>	 (CR) CDanis: [V:+1 C:+2] cache_text: add gerrit-https to realservers [puppet] - https://gerrit.wikimedia.org/r/1226932 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[20:22:46] <cdanis>	 !log 💔cdanis@cumin1003.eqiad.wmnet ~ 🕞🍵 sudo cumin 'A:cp-text' 'disable-puppet "cdanis deploy Ie99c64c48d"'
[20:22:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:22] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS trixie
[20:27:28] <wikibugs>	 (CR) Ssingh: [V:+1 C:-2] "do not merge." [puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[20:29:27] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mwlog1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:30:27] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog1003.eqiad.wmnet with OS bookworm
[20:30:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11522910 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm
[20:36:46] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1004.eqiad.wmnet with reason: host reimage
[20:38:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:43:50] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1004.eqiad.wmnet with reason: host reimage
[20:44:33] <logmsgbot>	 jclark@cumin1003 reimage (PID 1533287) is awaiting input
[20:49:19] <wikibugs>	 (PS1) CDanis: Revert "cache_text: add gerrit-https to realservers" [puppet] - https://gerrit.wikimedia.org/r/1226939
[20:52:09] <wikibugs>	 (PS5) Chlod Alejandro: enwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1225127 (https://phabricator.wikimedia.org/T414271)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2100).
[21:00:05] <jouncebot>	 chlod, ZhaoFJx, ejegg, and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:09] <ZhaoFJx>	 o/
[21:00:14] <chlod>	 o/
[21:00:24] <zabe>	 I can deploy
[21:01:47] <ejegg>	 hello
[21:02:42] <wikibugs>	 (CR) Zabe: [C:+2] enwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1225127 (https://phabricator.wikimedia.org/T414271) (owner: Chlod Alejandro)
[21:02:56] <wikibugs>	 (CR) CDanis: [C:+2] lvs7003: add gerrit-ssh and gerrit-https [puppet] - https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[21:03:19] <wikibugs>	 (CR) Zabe: [C:+2] zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: ZhaoFJx)
[21:03:23] <wikibugs>	 (CR) CDanis: [C:+2] gerrit services: lvs_setup! but only in magru. [puppet] - https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[21:03:38] <wikibugs>	 (Merged) jenkins-bot: enwiki: change to Wikipedia 25 logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1225127 (https://phabricator.wikimedia.org/T414271) (owner: Chlod Alejandro)
[21:03:40] <wikibugs>	 (CR) CI reject: [V:-1] zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: ZhaoFJx)
[21:04:25] <wikibugs>	 (PS3) Zabe: zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: ZhaoFJx)
[21:04:29] <zabe>	 merge conflict 
[21:04:31] <zabe>	 love it
[21:05:21] <wikibugs>	 (CR) Zabe: [C:+2] zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: ZhaoFJx)
[21:05:37] <chlod>	 it is what it is
[21:06:13] <wikibugs>	 (Merged) jenkins-bot: zhwiki: Temporary Logo Change for WP25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1225285 (https://phabricator.wikimedia.org/T414299) (owner: ZhaoFJx)
[21:06:56] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225127|enwiki: change to Wikipedia 25 logo (T414271)]], [[gerrit:1225285|zhwiki: Temporary Logo Change for WP25 (T414299)]]
[21:07:04] <stashbot>	 T414271: Requesting temporary logo change for en.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414271
[21:07:05] <stashbot>	 T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299
[21:09:11] <logmsgbot>	 !log zabe@deploy2002 chlod, zhaofjx, zabe: Backport for [[gerrit:1225127|enwiki: change to Wikipedia 25 logo (T414271)]], [[gerrit:1225285|zhwiki: Temporary Logo Change for WP25 (T414299)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:09:30] <jinxer-wm>	 FIRING: LibericaStaleConfig: Liberica instance lvs7003 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=magru&var-instance=lvs7003 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[21:09:54] <logmsgbot>	 !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs7003.magru.wmnet} and A:liberica
[21:09:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:09:56] <chlod>	 checking now
[21:10:12] <ZhaoFJx>	 Checking…
[21:10:14] <logmsgbot>	 !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs7003.magru.wmnet} and A:liberica
[21:10:46] <ZhaoFJx>	 zabe working
[21:11:00] <ejegg>	 woo, i see the wp25 logo on enwiki (debug) too :)
[21:11:02] <ZhaoFJx>	 Perfectly
[21:11:18] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1004.eqiad.wmnet with OS trixie
[21:11:43] <chlod>	 good on enwiki as well :)
[21:11:47] <zabe>	 nice!
[21:11:50] <logmsgbot>	 !log zabe@deploy2002 chlod, zhaofjx, zabe: Continuing with sync
[21:14:10] <wikibugs>	 (03CR) 10CDanis: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1215398/5616/" [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis)
[21:14:30] <jinxer-wm>	 RESOLVED: LibericaStaleConfig: Liberica instance lvs7003 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=magru&var-instance=lvs7003 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[21:14:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:15:58] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225127|enwiki: change to Wikipedia 25 logo (T414271)]], [[gerrit:1225285|zhwiki: Temporary Logo Change for WP25 (T414299)]] (duration: 09m 03s)
[21:16:08] <stashbot>	 T414271: Requesting temporary logo change for en.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414271
[21:16:09] <stashbot>	 T414299: Requesting temporary logo change for zh.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414299
[21:16:12] <zabe>	 ejegg: you want to self-serve?
[21:16:28] <ZhaoFJx>	 zabe thanks a lot
[21:16:32] <zabe>	 yw
[21:16:41] <ejegg>	 let me see if I have my credentials in order zabe 
[21:16:42] <zabe>	 all those files need purging I guess
[21:17:02] <taavi>	 uh why do I see no logo on enwiki atm
[21:17:18] <wikibugs>	 (PS4) Zabe: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) (owner: Superpes15)
[21:17:21] <logmsgbot>	 !log cdanis@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs7001.magru.wmnet} and A:liberica
[21:17:29] <logmsgbot>	 !log cdanis@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs7001.magru.wmnet} and A:liberica
[21:18:37] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11523067 (HMonroy) @JMeybohm Hi! I'm trying a query wmf.mediawiki_history in superset. I'm getting: `mysql error: SELECT command denied to user 'research'@'1...
[21:19:41] <ejegg>	 hmm, i don't have the ssh config locally to get in to deploy1002
[21:19:43] <chlod>	 getting the same as taavi: for some reason it's 404ing?
[21:20:07] <taavi>	 I think the 404s got cached in the CDN just before they were synced out everywhere
[21:20:12] <taavi>	 zabe: you purging them already or should I?
[21:20:22] <wikibugs>	 (PS1) Andrew Bogott: Revert "cloudbackup: flip all backups from cloudbackup1004 to 1003" [puppet] - https://gerrit.wikimedia.org/r/1226942
[21:20:39] <zabe>	 I was currently doing it 
[21:21:17] <zabe>	 ejegg: I can sync yours
[21:21:24] <wikibugs>	 (CR) Zabe: [C:+2] Revert "Shorten 'close' cookie wait period for enwiki banners" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: Ejegg)
[21:21:28] <ejegg>	 thanks zabe
[21:21:33] <taavi>	 ejegg: deploy1002 has not been a thing in years?
[21:21:53] <ejegg>	 oh hah, still on the deployments docs page!
[21:22:10] <taavi>	 !log manually purge 17 URLs for enwiki and zhwiki 25 year anniversary logos
[21:22:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:22] <taavi>	 which doc?
[21:22:26] <wikibugs>	 (Merged) jenkins-bot: Revert "Shorten 'close' cookie wait period for enwiki banners" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) (owner: Ejegg)
[21:22:40] <ejegg>	 https://wikitech.wikimedia.org/wiki/Backport_windows#Doing_the_deploy
[21:22:49] <ejegg>	 The process: step 4
[21:24:11] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:24:12] <Jdlrobson>	 jeena: regarding train blocker. I can deploy this at 3pm PST (90m from now) unless you want to deploy it now?
[21:24:26] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226275|Revert "Shorten 'close' cookie wait period for enwiki banners" (T411800)]]
[21:24:28] <taavi>	 chlod: zabe: I purged all of the new logo files, and it's fixed at least for me
[21:24:29] <wikibugs>	 (PS1) JHathaway: firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089)
[21:24:32] <stashbot>	 T411800: CentralNotice config changes to show a banner to a reader with the 'waitdate: close' status - https://phabricator.wikimedia.org/T411800
[21:24:42] <chlod>	 also fixed for me! :D
[21:24:50] <chlod>	 thank you both, zabe and taavi
[21:25:00] <wikibugs>	 (PS1) Jdlrobson: Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630)
[21:25:08] <wikibugs>	 (PS4) Zabe: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) (owner: Superpes15)
[21:25:26] <jeena>	 Jdlrobson: if you prefer I can do it after these backports are run
[21:26:06] <wikibugs>	 (PS3) Zabe: [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267) (owner: Superpes15)
[21:26:36] <logmsgbot>	 !log zabe@deploy2002 zabe, ejegg: Backport for [[gerrit:1226275|Revert "Shorten 'close' cookie wait period for enwiki banners" (T411800)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:26:37] <wikibugs>	 (CR) CI reject: [V:-1] firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[21:27:18] <wikibugs>	 (CR) JHathaway: firewall: Declare resources for both providers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: Majavah)
[21:27:30] <taavi>	 filed T414634 for the outdated docs as they're outdated enough to not be trivially fixable on the spot
[21:27:31] <stashbot>	 T414634: Fix outdated deployment instructions on Wikitech - https://phabricator.wikimedia.org/T414634
[21:27:48] <ejegg>	 thanks zabe
[21:28:52] <ejegg>	 the setting seems to have the correct value
[21:28:58] <zabe>	 Nice
[21:29:01] <logmsgbot>	 !log zabe@deploy2002 zabe, ejegg: Continuing with sync
[21:29:44] <ejegg>	 (and I have successfully logged in to deploy2002.codfw.wmnet so I can try to self-service next time)
[21:30:42] <wikibugs>	 (CR) Zabe: [C:+2] [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) (owner: Superpes15)
[21:30:43] <wikibugs>	 (CR) Zabe: [C:+2] [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267) (owner: Superpes15)
[21:30:44] <wikibugs>	 (CR) Zabe: [C:+2] [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) (owner: Superpes15)
[21:30:54] <jeena>	 ejegg we are using spiderpig to do backports now (https://spiderpig.wikimedia.org/mediawiki/backport) which uses the scap backport command (what you would use if logged into the deployment server) https://wikitech.wikimedia.org/wiki/Scap#scap_backport
[21:31:54] <wikibugs>	 (Merged) jenkins-bot: [itwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226908 (https://phabricator.wikimedia.org/T414320) (owner: Superpes15)
[21:31:58] <wikibugs>	 (Merged) jenkins-bot: [slwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226915 (https://phabricator.wikimedia.org/T414265) (owner: Superpes15)
[21:32:02] <wikibugs>	 (Merged) jenkins-bot: [kkwiki] Add a temporary logo for Wikipedia 25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1226918 (https://phabricator.wikimedia.org/T414267) (owner: Superpes15)
[21:32:04] <ejegg>	 whoa, spiderpig is web based?
[21:32:11] <jeena>	 yeah :D
[21:32:23] <ejegg>	 nice
[21:32:38] <ejegg>	 hmm, login fail, lemme check my pw vault
[21:33:07] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226275|Revert "Shorten 'close' cookie wait period for enwiki banners" (T411800)]] (duration: 08m 40s)
[21:33:12] <stashbot>	 T411800: CentralNotice config changes to show a banner to a reader with the 'waitdate: close' status - https://phabricator.wikimedia.org/T411800
[21:33:22] <ejegg>	 oh i see, I'm just not authorized to use SpiderPig
[21:33:22] <jeena>	 you might need to be in a special LDAP group https://wikitech.wikimedia.org/wiki/Scap/SpiderPig
[21:33:38] <jeena>	 I think you just need to make a request for it
[21:33:45] <ejegg>	 thanks, I'll do that now!
[21:33:52] <jeena>	 you're welcome!
[21:33:56] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226908|[itwiki] Add a temporary logo for Wikipedia 25 (T414320)]], [[gerrit:1226915|[slwiki] Add a temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226918|[kkwiki] Add a temporary logo for Wikipedia 25 (T414267)]]
[21:34:04] <stashbot>	 T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320
[21:34:04] <stashbot>	 T414265: Requesting temporary logo change for sl.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414265
[21:34:05] <stashbot>	 T414267: Requesting temporary logo change for kk.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414267
[21:36:06] <logmsgbot>	 !log zabe@deploy2002 zabe, superpes: Backport for [[gerrit:1226908|[itwiki] Add a temporary logo for Wikipedia 25 (T414320)]], [[gerrit:1226915|[slwiki] Add a temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226918|[kkwiki] Add a temporary logo for Wikipedia 25 (T414267)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:36:30] <logmsgbot>	 !log zabe@deploy2002 zabe, superpes: Continuing with sync
[21:36:43] <logmsgbot>	 !log cdanis@cumin1003 conftool action : set/weight=1; selector: service=gerrit
[21:36:48] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:39:25] <Jdlrobson>	 jeena: if you can do it that would be great.
[21:39:33] <jeena>	 sure
[21:40:32] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226908|[itwiki] Add a temporary logo for Wikipedia 25 (T414320)]], [[gerrit:1226915|[slwiki] Add a temporary logo for Wikipedia 25 (T414265)]], [[gerrit:1226918|[kkwiki] Add a temporary logo for Wikipedia 25 (T414267)]] (duration: 06m 36s)
[21:40:39] <stashbot>	 T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320
[21:40:39] <stashbot>	 T414265: Requesting temporary logo change for sl.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414265
[21:40:40] <stashbot>	 T414267: Requesting temporary logo change for kk.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414267
[21:41:30] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb6_29418 has 1 unhealthy realservers pooled on lvs7001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[21:41:49] <logmsgbot>	 !log cdanis@cumin1003 conftool action : set/pooled=yes; selector: service=gerrit,dc=magru
[21:42:07] <wikibugs>	 (PS1) Zabe: Start writing to il_target_id on non-large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1226948 (https://phabricator.wikimedia.org/T413526)
[21:42:29] <wikibugs>	 (PS2) JHathaway: firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089)
[21:43:41] <wikibugs>	 (CR) Zabe: [C:+2] Start writing to il_target_id on non-large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1226948 (https://phabricator.wikimedia.org/T413526) (owner: Zabe)
[21:44:30] <wikibugs>	 (Merged) jenkins-bot: Start writing to il_target_id on non-large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1226948 (https://phabricator.wikimedia.org/T413526) (owner: Zabe)
[21:44:53] <wikibugs>	 (CR) CI reject: [V:-1] firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[21:45:03] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226948|Start writing to il_target_id on non-large wikis (T413526)]]
[21:45:07] <stashbot>	 T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526
[21:47:12] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1226948|Start writing to il_target_id on non-large wikis (T413526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:47:33] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[21:49:25] <wikibugs>	 (PS3) JHathaway: firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089)
[21:51:30] <wikibugs>	 (PS2) Jforrester: Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson)
[21:51:33] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226948|Start writing to il_target_id on non-large wikis (T413526)]] (duration: 06m 30s)
[21:51:38] <wikibugs>	 (CR) CI reject: [V:-1] firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[21:51:40] <stashbot>	 T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526
[21:52:02] <wikibugs>	 (CR) Jforrester: [C:+1] Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson)
[21:52:12] <zabe>	 jeena: feel free to take over
[21:53:10] <jeena>	 thank you zabe
[21:53:50] <wikibugs>	 (PS4) JHathaway: firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089)
[21:54:50] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson)
[21:56:14] <wikibugs>	 (Merged) jenkins-bot: Revert "Do not use deprecated menu" [extensions/ProofreadPage] (wmf/1.46.0-wmf.11) - https://gerrit.wikimedia.org/r/1226945 (https://phabricator.wikimedia.org/T414630) (owner: Jdlrobson)
[21:56:30] <wikibugs>	 (CR) CI reject: [V:-1] firewall: add cloud services, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[21:56:47] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1226945|Revert "Do not use deprecated menu" (T414630)]]
[21:56:52] <stashbot>	 T414630: [regression, 1.46.0-wmf.11] ProofreadPage navigation tabs on Page pages are missing in Vector, Monobook, CologneBlue and Modern skins - https://phabricator.wikimedia.org/T414630
[21:58:59] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi, jdlrobson: Backport for [[gerrit:1226945|Revert "Do not use deprecated menu" (T414630)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2200)
[22:01:58] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi, jdlrobson: Continuing with sync
[22:04:36] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "cloudbackup: flip all backups from cloudbackup1004 to 1003" [puppet] - https://gerrit.wikimedia.org/r/1226942 (owner: Andrew Bogott)
[22:05:16] <wikibugs>	 (PS5) JHathaway: firewall: add cloud services [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089)
[22:06:03] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226945|Revert "Do not use deprecated menu" (T414630)]] (duration: 09m 16s)
[22:06:09] <stashbot>	 T414630: [regression, 1.46.0-wmf.11] ProofreadPage navigation tabs on Page pages are missing in Vector, Monobook, CologneBlue and Modern skins - https://phabricator.wikimedia.org/T414630
[22:10:14] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226944 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[22:10:50] <wikibugs>	 (PS7) CDanis: gerrit/Liberica: expand to drmrs [puppet] - https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895)
[22:10:51] <wikibugs>	 (PS1) CDanis: tcp-proxy: allow lb healthchecks [puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895)
[22:11:03] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[22:12:10] <jeena>	 !log Updating development images on contint primary for T412259
[22:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:15] <stashbot>	 T412259: Update PatchDemo to Node 20 - https://phabricator.wikimedia.org/T412259
[22:13:16] <wikibugs>	 (PS1) Andrew Bogott: Revert "wmcs cinder backups: move all backups to 2003 so 2004 can be reimaged" [puppet] - https://gerrit.wikimedia.org/r/1226952
[22:14:07] <wikibugs>	 (PS2) CDanis: tcp-proxy: allow lb healthchecks [puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895)
[22:15:26] <wikibugs>	 (Abandoned) CDanis: Revert "cache_text: add gerrit-https to realservers" [puppet] - https://gerrit.wikimedia.org/r/1226939 (owner: CDanis)
[22:15:59] <wikibugs>	 (CR) CDanis: [C:+2] tcp-proxy: allow lb healthchecks [puppet] - https://gerrit.wikimedia.org/r/1226951 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[22:17:14] <wikibugs>	 (CR) Ryan Kemper: [C:+1] Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) (owner: Bking)
[22:17:37] <wikibugs>	 (CR) Bking: [C:+2] Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) (owner: Bking)
[22:18:02] <wikibugs>	 (CR) Scott French: [C:+1] "I could easily imagine these defaults (motivated by performance characteristics of actual S3) being entirely inappropriate for our environ" [puppet] - https://gerrit.wikimedia.org/r/1226914 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[22:22:57] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[22:26:19] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[22:26:48] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:30:04] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:37:16] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:38:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:42:16] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:52:54] <wikibugs>	 (CR) RLazarus: [C:+2] Add Test Kitchen maintenance script [puppet] - https://gerrit.wikimedia.org/r/1226318 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[22:57:54] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2300)
[23:01:36] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[23:02:06] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:02:06] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:02:21] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[23:03:22] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[23:03:32] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[23:04:11] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:05:07] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:20:54] <wikibugs>	 (PS1) Zabe: Removed dropped special page from disabled query pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1226962 (https://phabricator.wikimedia.org/T414202)
[23:25:00] <zabe>	 !log zabe@deploy2002:~$ mwscript migrateLinksTable.php testwiki --table imagelinks # T413668
[23:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:05] <stashbot>	 T413668: Run the data migration of imagelinks - https://phabricator.wikimedia.org/T413668
[23:25:41] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T413525)', diff saved to https://phabricator.wikimedia.org/P87520 and previous config saved to /var/cache/conftool/dbconfig/20260114-232541-marostegui.json
[23:25:45] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[23:35:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P87521 and previous config saved to /var/cache/conftool/dbconfig/20260114-233549-marostegui.json
[23:45:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P87522 and previous config saved to /var/cache/conftool/dbconfig/20260114-234557-marostegui.json
[23:56:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T413525)', diff saved to https://phabricator.wikimedia.org/P87523 and previous config saved to /var/cache/conftool/dbconfig/20260114-235606-marostegui.json
[23:56:11] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[23:56:11] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[23:56:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T413525)', diff saved to https://phabricator.wikimedia.org/P87524 and previous config saved to /var/cache/conftool/dbconfig/20260114-235619-marostegui.json
[23:58:07] <wikibugs>	 (PS1) Zabe: Start reading from il_target_id on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1226965 (https://phabricator.wikimedia.org/T413669)
[23:59:38] <zabe>	 jouncebot: nowandnext
[23:59:38] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T2300)
[23:59:38] <jouncebot>	 In 7 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T0700)
[23:59:39] <jouncebot>	 In 7 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260115T0700)
[23:59:49] <wikibugs>	 (CR) Zabe: [C:+2] Start reading from il_target_id on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1226965 (https://phabricator.wikimedia.org/T413669) (owner: Zabe)