[00:01:44] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T410589)', diff saved to https://phabricator.wikimedia.org/P85477 and previous config saved to /var/cache/conftool/dbconfig/20251124-000144-ladsgroup.json
[00:01:49] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[00:02:00] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[00:39:36] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210171
[00:40:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210171 (owner: TrainBranchBot)
[00:52:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210171 (owner: TrainBranchBot)
[01:00:40] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:09:36] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:10:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210179
[01:10:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210179 (owner: TrainBranchBot)
[01:27:21] <icinga-wm>	 PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:01] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210179 (owner: TrainBranchBot)
[01:32:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:55:49] <icinga-wm>	 RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.65 ms
[02:13:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:23:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:28:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:29:55] <icinga-wm>	 PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[02:42:45] <wikibugs>	 (PS2) Tim Starling: Revert "Authorize self for Google Search Console" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175850
[02:48:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:50:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175850 (owner: Tim Starling)
[02:51:01] <wikibugs>	 (Merged) jenkins-bot: Revert "Authorize self for Google Search Console" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175850 (owner: Tim Starling)
[02:51:42] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1175850|Revert "Authorize self for Google Search Console"]]
[02:54:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:54:50] <icinga-wm>	 RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms
[03:06:54] <wikibugs>	 (PS1) Tim Starling: admin: Remove my non-FIDO keys [puppet] - https://gerrit.wikimedia.org/r/1210224
[03:17:48] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1175850|Revert "Authorize self for Google Search Console"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:18:25] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[03:31:58] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1175850|Revert "Authorize self for Google Search Console"]] (duration: 40m 16s)
[04:08:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:18:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:23:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:26:29] <icinga-wm>	 PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[04:33:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:39:36] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:43:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:53:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[05:08:58] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:36] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:25:49] <icinga-wm>	 RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms
[05:26:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:27:31] <icinga-wm>	 PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:28:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:29:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (2a02:ec80:700:fe0b::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:30:19] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:33:31] <icinga-wm>	 RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:58] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (2a02:ec80:700:fe0b::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:36:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:09] <wikibugs>	 (CR) Marostegui: [C:+2] data.yaml: Add FIDO key for marostegui [puppet] - https://gerrit.wikimedia.org/r/1207863 (owner: Marostegui)
[06:23:09] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11399296 (Marostegui) Open→Resolved a:Marostegui Closing this for now - we will see how long it takes for the DIMM to crash again. Thanks @Jhancock.wm!
[06:23:19] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11399299 (Marostegui) a:Marostegui→Jhancock.wm
[06:26:22] <icinga-wm>	 PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[06:28:01] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11399303 (Marostegui) Thanks Rob, I think the confusion was whether we ordered the right HW or not. Doing 1G is fine for this host, 10G would be ideal, but we are not expecting...
[06:37:37] <marostegui>	 !log Deploy schema change on s6 on the master with replication T410531
[06:37:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:42] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[06:38:32] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Schema change
[06:38:33] <stashbot>	 marostegui@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[06:38:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[06:48:27] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:48:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[06:50:42] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:54:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:54:52] <icinga-wm>	 RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.61 ms
[06:59:46] <wikibugs>	 (CR) Arnaudb: [C:+2] apt: add an alert on reprepro errors [alerts] - https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: Arnaudb)
[07:00:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:00:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85478 and previous config saved to /var/cache/conftool/dbconfig/20251124-070050-marostegui.json
[07:00:55] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[07:01:28] <wikibugs>	 (Merged) jenkins-bot: apt: add an alert on reprepro errors [alerts] - https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: Arnaudb)
[07:05:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85479 and previous config saved to /var/cache/conftool/dbconfig/20251124-070539-marostegui.json
[07:14:33] <wikibugs>	 (PS1) Giuseppe Lavagetto: admin: add FIDO ssh key for oblivian [puppet] - https://gerrit.wikimedia.org/r/1210368
[07:20:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P85480 and previous config saved to /var/cache/conftool/dbconfig/20251124-072047-marostegui.json
[07:30:02] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1210224 (owner: Tim Starling)
[07:35:44] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[07:35:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P85481 and previous config saved to /var/cache/conftool/dbconfig/20251124-073555-marostegui.json
[07:36:13] <wikibugs>	 (CR) Marostegui: [C:+1] admin: add FIDO ssh key for oblivian [puppet] - https://gerrit.wikimedia.org/r/1210368 (owner: Giuseppe Lavagetto)
[07:37:21] <wikibugs>	 (PS4) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907)
[07:38:38] <wikibugs>	 (PS5) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907)
[07:40:08] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: add dry run rsync [cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:40:14] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: add a local backup cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:40:59] <wikibugs>	 (PS6) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907)
[07:44:39] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] admin: add FIDO ssh key for oblivian [puppet] - https://gerrit.wikimedia.org/r/1210368 (owner: Giuseppe Lavagetto)
[07:46:37] <wikibugs>	 (Merged) jenkins-bot: gerrit: add dry run rsync [cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:46:58] <wikibugs>	 (Merged) jenkins-bot: gerrit: add a local backup cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:51:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85482 and previous config saved to /var/cache/conftool/dbconfig/20251124-075103-marostegui.json
[07:51:08] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[07:51:20] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[07:51:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T410531)', diff saved to https://phabricator.wikimedia.org/P85483 and previous config saved to /var/cache/conftool/dbconfig/20251124-075126-marostegui.json
[07:56:18] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch cloudcumin2001 to nftables [puppet] - https://gerrit.wikimedia.org/r/1204369 (owner: Muehlenhoff)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T0800).
[08:00:05] <jouncebot>	 hubaishan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:05:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410531)', diff saved to https://phabricator.wikimedia.org/P85484 and previous config saved to /var/cache/conftool/dbconfig/20251124-080519-marostegui.json
[08:05:25] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[08:07:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet
[08:08:50] <wikibugs>	 (CR) Slyngshede: [C:+1] "Very nice." [puppet] - https://gerrit.wikimedia.org/r/1208362 (owner: Muehlenhoff)
[08:11:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet
[08:15:33] <wikibugs>	 (PS1) Muehlenhoff: Switch the cluster::cloud_management role to nftables [puppet] - https://gerrit.wikimedia.org/r/1210395
[08:18:10] <wikibugs>	 (PS2) Arnaudb: gerrit: remove localbackup logic from failover [cookbooks] - https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833)
[08:18:10] <wikibugs>	 (CR) Arnaudb: "after merging 1193590 this patch removes the redundant logic in the failover cookbook" [cookbooks] - https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:20:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P85485 and previous config saved to /var/cache/conftool/dbconfig/20251124-082027-marostegui.json
[08:27:23] <icinga-wm>	 PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:54] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1210395 (owner: Muehlenhoff)
[08:35:35] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P85487 and previous config saved to /var/cache/conftool/dbconfig/20251124-083535-marostegui.json
[08:39:36] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:43:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:44:31] <wikibugs>	 SRE, Traffic, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Platform-SRE (2025.11.07 - 2025.11.28), Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11399416 (Ge...
[08:44:31] <moritzm>	 !log installing jinja2 security updates
[08:44:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:35] <wikibugs>	 (CR) AikoChou: [C:+1] "We can wait for the patch extending paragraph extraction code (https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-service" [deployment-charts] - https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[08:50:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410531)', diff saved to https://phabricator.wikimedia.org/P85488 and previous config saved to /var/cache/conftool/dbconfig/20251124-085042-marostegui.json
[08:50:47] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[08:50:58] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[08:51:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T410531)', diff saved to https://phabricator.wikimedia.org/P85489 and previous config saved to /var/cache/conftool/dbconfig/20251124-085104-marostegui.json
[08:53:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:54:14] <wikibugs>	 (CR) Bartosz Wójtowicz: "Good idea, let's do it this way :) I'll start with reviewing the paragraph extraction patch." [deployment-charts] - https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[08:54:16] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1208451 (owner: RLazarus)
[08:55:52] <icinga-wm>	 RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.57 ms
[08:55:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410531)', diff saved to https://phabricator.wikimedia.org/P85490 and previous config saved to /var/cache/conftool/dbconfig/20251124-085554-marostegui.json
[08:55:59] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[08:58:08] <wikibugs>	 (PS1) Volans: admin: add user chandra-wmde [puppet] - https://gerrit.wikimedia.org/r/1210496 (https://phabricator.wikimedia.org/T409707)
[08:58:15] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Switch the cluster::cloud_management role to nftables [puppet] - https://gerrit.wikimedia.org/r/1210395 (owner: Muehlenhoff)
[08:58:36] <wikibugs>	 (CR) Volans: [C:-1] "Pending approval on task." [puppet] - https://gerrit.wikimedia.org/r/1210496 (https://phabricator.wikimedia.org/T409707) (owner: Volans)
[09:03:55] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[09:03:56] <logmsgbot>	 !log gehel@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[09:05:50] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[09:06:00] <wikibugs>	 (PS1) AikoChou: changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538)
[09:09:36] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:09:50] <taavi>	 !log taavi@puppetserver1001 ~ $ sudo puppet node deactivate cloudidp2001-dev.wikimedia.org # leftover from move to private addresses T410294
[09:09:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:55] <stashbot>	 T410294: Site: codfw   1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294
[09:11:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P85491 and previous config saved to /var/cache/conftool/dbconfig/20251124-091102-marostegui.json
[09:11:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:16:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:19:13] <wikibugs>	 (PS2) Clément Goubert: trafficserver: action api to rest-gateway enwiki 100% [puppet] - https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223)
[09:19:31] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1208426 (owner: Ayounsi)
[09:26:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P85492 and previous config saved to /var/cache/conftool/dbconfig/20251124-092609-marostegui.json
[09:26:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:30:11] <wikibugs>	 (CR) Clément Goubert: [C:+2] trafficserver: action api to rest-gateway enwiki 100% [puppet] - https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[09:31:32] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:32:17] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: Hubaishan)
[09:34:46] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T410858 (LSobanski) NEW
[09:34:56] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T410858#11399550 (LSobanski) Also eqiad-staging and codfw-staging.
[09:38:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:38:12] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:38:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:39:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:40:16] <wikibugs>	 (CR) Btullis: growthbook: add the kerberos token renewer sidecar to support kerberized connections (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:40:49] <wikibugs>	 (CR) Federico Ceratto: [C:+2] sre.mysql.clone: Refactor, Pool in source host ASAP [cookbooks] - https://gerrit.wikimedia.org/r/1202673 (https://phabricator.wikimedia.org/T410376) (owner: Federico Ceratto)
[09:40:55] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/1207786 (owner: Muehlenhoff)
[09:40:56] <wikibugs>	 (CR) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:40:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:41:05] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[09:41:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410531)', diff saved to https://phabricator.wikimedia.org/P85494 and previous config saved to /var/cache/conftool/dbconfig/20251124-094117-marostegui.json
[09:41:22] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[09:41:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:41:34] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[09:41:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T410531)', diff saved to https://phabricator.wikimedia.org/P85495 and previous config saved to /var/cache/conftool/dbconfig/20251124-094141-marostegui.json
[09:41:52] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:42:06] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[09:42:37] <wikibugs>	 (PS7) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907)
[09:42:38] <wikibugs>	 (CR) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:42:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:43:35] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:43:53] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T410858#11399655 (Clement_Goubert) Open→Resolved a:Clement_Goubert
[09:44:57] <wikibugs>	 (CR) Brouberol: [C:+2] growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:46:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T410531)', diff saved to https://phabricator.wikimedia.org/P85496 and previous config saved to /var/cache/conftool/dbconfig/20251124-094632-marostegui.json
[09:46:37] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[09:49:15] <wikibugs>	 (PS1) Brouberol: growthbook: add the general values to the list of environment values to inject to the subcharts [deployment-charts] - https://gerrit.wikimedia.org/r/1210516 (https://phabricator.wikimedia.org/T408907)
[09:49:35] <wikibugs>	 (CR) Ayounsi: [C:+2] ayounsi: Add new yubikey key [puppet] - https://gerrit.wikimedia.org/r/1208426 (owner: Ayounsi)
[09:51:08] <wikibugs>	 (CR) Brouberol: [C:+2] growthbook: add the general values to the list of environment values to inject to the subcharts [deployment-charts] - https://gerrit.wikimedia.org/r/1210516 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:51:16] <wikibugs>	 (CR) Tchanders: "Looks good from the perspective of aligning with temporary accounts policy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: Dragoniez)
[09:53:03] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] "Looks good from the Product Safety & Integrity team's point of view" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: Dragoniez)
[09:53:43] <wikibugs>	 (CR) JMeybohm: [C:+2] fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014) (owner: JMeybohm)
[09:55:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[09:55:47] <logmsgbot>	 !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster
[09:56:04] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[09:57:17] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] "Noting that `wgRemoveGroups` was not updated, so only the `sysop` group can remove the `temporary-account-viewer` group. However, I assume" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: Dragoniez)
[09:58:07] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[09:58:30] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[09:59:02] <logmsgbot>	 !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster
[10:00:30] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[10:01:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P85497 and previous config saved to /var/cache/conftool/dbconfig/20251124-100139-marostegui.json
[10:03:03] <wikibugs>	 (PS1) Marostegui: db1153: Remove note [puppet] - https://gerrit.wikimedia.org/r/1210518
[10:07:13] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] metamonitoring/icinga: suppress script-managed notifications and pages [puppet] - https://gerrit.wikimedia.org/r/1206884 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[10:07:22] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] metamonitoring/icinga: add smtp settings to config.yaml [puppet] - https://gerrit.wikimedia.org/r/1206885 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[10:07:35] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] metamonitoring/icinga: generate contacts list [puppet] - https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[10:07:55] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] metamonitoring/icinga: trigger pages only for the active instance [puppet] - https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[10:12:02] <wikibugs>	 (CR) Brouberol: [C:+1] Add a new deploy-spark-support clusterrole [deployment-charts] - https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[10:14:01] <logmsgbot>	 !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:14:26] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:16:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P85498 and previous config saved to /var/cache/conftool/dbconfig/20251124-101647-marostegui.json
[10:17:36] <wikibugs>	 (CR) Marostegui: [C:+2] db1153: Remove note [puppet] - https://gerrit.wikimedia.org/r/1210518 (owner: Marostegui)
[10:18:23] <icinga-wm>	 ACKNOWLEDGEMENT - snapshot of s5 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s5 at eqiad (db1216) taken on 2025-11-23 20:35:02 is 395 GiB, but the previous one was 517 GiB, a change of -23.7 % Jcrespo expected by DBAs https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:19:17] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:22:09] <wikibugs>	 SRE, Traffic, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Platform-SRE (2025.11.07 - 2025.11.28), Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11399911 (Ge...
[10:22:51] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:23:28] <wikibugs>	 (CR) Btullis: [C:+2] Add a new deploy-spark-support clusterrole [deployment-charts] - https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[10:24:17] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:25:43] <logmsgbot>	 !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster
[10:25:52] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[10:26:56] <logmsgbot>	 !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster
[10:27:10] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[10:27:47] <wikibugs>	 (PS2) AikoChou: changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538)
[10:29:58] <wikibugs>	 (PS12) Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[10:30:18] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:31:26] <wikibugs>	 (Merged) jenkins-bot: Add a new deploy-spark-support clusterrole [deployment-charts] - https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[10:31:54] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:31:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T410531)', diff saved to https://phabricator.wikimedia.org/P85499 and previous config saved to /var/cache/conftool/dbconfig/20251124-103155-marostegui.json
[10:32:00] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[10:32:07] <wikibugs>	 (PS3) Federico Ceratto: Support both hostname and FQDN [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581)
[10:32:11] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[10:32:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T410531)', diff saved to https://phabricator.wikimedia.org/P85500 and previous config saved to /var/cache/conftool/dbconfig/20251124-103218-marostegui.json
[10:33:18] <wikibugs>	 (CR) Federico Ceratto: "Updated to use a more strict hostname check based the discussion with Manuel on IRC" [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: Federico Ceratto)
[10:33:27] <wikibugs>	 (PS1) Tiziano Fogli: metamonitoring/icinga: convert last_check to timestamp [puppet] - https://gerrit.wikimedia.org/r/1210523 (https://phabricator.wikimedia.org/T393625)
[10:34:15] <wikibugs>	 (PS13) Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[10:34:16] <wikibugs>	 (CR) Dragoniez: "@thalia.e.chan@googlemail.com @dreamyjazzwikipedia@gmail.com Thanks for the reviews! About `wgRemoveGroups`, I think I'll leave it as is s" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: Dragoniez)
[10:34:21] <logmsgbot>	 !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:36:18] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:36:29] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:36:41] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] "I'm self-merging since this is just a time-format conversion fix for an already deployed patch." [puppet] - https://gerrit.wikimedia.org/r/1210523 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[10:36:54] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:37:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T410531)', diff saved to https://phabricator.wikimedia.org/P85501 and previous config saved to /var/cache/conftool/dbconfig/20251124-103708-marostegui.json
[10:37:12] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[10:37:13] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[10:37:24] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[10:38:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:39:02] <wikibugs>	 (CR) CI reject: [V:-1] Support both hostname and FQDN [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: Federico Ceratto)
[10:39:52] <logmsgbot>	 !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:40:18] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:40:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:43:13] <wikibugs>	 (PS1) Sergio Gimeno: [beta] GrowthExperiments: increase to log level to debug [mediawiki-config] - https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177)
[10:43:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[10:44:00] <wikibugs>	 (PS2) Sergio Gimeno: [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177)
[10:44:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11399970 (KOfori) Hi, approving this on behalf of @Kappakayala as her delegate while OOO.
[10:46:08] <claime>	 !log Deploying envoy 1.32 to api-gateway
[10:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:15] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:47:21] <wikibugs>	 (PS5) Daniel Kinzler: rest-gateway: allow rate limits per time unit [deployment-charts] - https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132)
[10:47:22] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[10:48:43] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[10:48:51] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[10:51:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[10:51:33] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[10:51:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[10:52:16] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P85502 and previous config saved to /var/cache/conftool/dbconfig/20251124-105216-marostegui.json
[10:52:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[10:52:24] <wikibugs>	 (PS1) Tiziano Fogli: metamonitoring/icinga: convert now variable to timestamp [puppet] - https://gerrit.wikimedia.org/r/1210529 (https://phabricator.wikimedia.org/T393625)
[10:53:47] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] "I'm self-merging since this is just a time-conversion fix for an already deployed patch." [puppet] - https://gerrit.wikimedia.org/r/1210529 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[10:54:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:55:33] <wikibugs>	 (PS14) Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[10:55:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:56:02] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:56:19] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[10:56:45] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[10:56:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:56:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:56:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:57:08] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:59:35] <wikibugs>	 (CR) Clément Goubert: [C:+2] rest-gateway: allow rate limits per time unit [deployment-charts] - https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: Daniel Kinzler)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1100)
[11:00:32] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: implement per-route rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044)
[11:01:57] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: allow rate limits per time unit [deployment-charts] - https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: Daniel Kinzler)
[11:02:39] <wikibugs>	 (CR) Michael Große: [C:+1] [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[11:05:27] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273)
[11:07:24] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P85503 and previous config saved to /var/cache/conftool/dbconfig/20251124-110723-marostegui.json
[11:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:16:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:21:04] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[11:21:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85504 and previous config saved to /var/cache/conftool/dbconfig/20251124-112111-marostegui.json
[11:21:16] <stashbot>	 T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441
[11:22:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T410531)', diff saved to https://phabricator.wikimedia.org/P85505 and previous config saved to /var/cache/conftool/dbconfig/20251124-112231-marostegui.json
[11:22:36] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[11:22:49] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[11:22:59] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 6 hosts with reason: Maintenance
[11:23:07] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T410531)', diff saved to https://phabricator.wikimedia.org/P85506 and previous config saved to /var/cache/conftool/dbconfig/20251124-112306-marostegui.json
[11:23:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - repool after schema change test
[11:24:38] <logmsgbot>	 !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[11:25:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:25:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:26:05] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1187 gradually with 4 steps - repool after schema change test
[11:26:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:28:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T410531)', diff saved to https://phabricator.wikimedia.org/P85508 and previous config saved to /var/cache/conftool/dbconfig/20251124-112819-marostegui.json
[11:28:24] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[11:28:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[11:28:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85509 and previous config saved to /var/cache/conftool/dbconfig/20251124-112850-marostegui.json
[11:28:55] <stashbot>	 T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441
[11:31:34] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - repool after schema change test
[11:32:21] <wikibugs>	 (PS4) Federico Ceratto: Support both hostname and FQDN [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581)
[11:39:01] <wikibugs>	 (CR) CI reject: [V:-1] Support both hostname and FQDN [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: Federico Ceratto)
[11:40:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org
[11:41:07] <wikibugs>	 (PS1) Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833)
[11:41:56] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: add a layer of CNAME to ease switch overs [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: Hashar)
[11:43:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P85511 and previous config saved to /var/cache/conftool/dbconfig/20251124-114326-marostegui.json
[11:44:47] <wikibugs>	 (PS2) Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833)
[11:44:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org
[11:46:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:46:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:52:01] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:53:15] <wikibugs>	 (CR) Muehlenhoff: [C:-1] "Blocked by https://phabricator.wikimedia.org/T410879" [puppet] - https://gerrit.wikimedia.org/r/1208362 (owner: Muehlenhoff)
[11:53:39] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:54:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org
[11:56:24] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:56:54] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:58:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org
[11:58:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P85513 and previous config saved to /var/cache/conftool/dbconfig/20251124-115834-marostegui.json
[11:58:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2005.wikimedia.org
[11:59:39] <wikibugs>	 (PS1) Muehlenhoff: Failover idp.w.o [dns] - https://gerrit.wikimedia.org/r/1210566
[12:01:42] <moritzm>	 !log installing Squid security updates
[12:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2005.wikimedia.org
[12:05:46] <wikibugs>	 (PS2) Bartosz Wójtowicz: ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538)
[12:13:13] <wikibugs>	 (CR) Klausman: [C:+1] changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[12:13:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T410531)', diff saved to https://phabricator.wikimedia.org/P85515 and previous config saved to /var/cache/conftool/dbconfig/20251124-121341-marostegui.json
[12:13:47] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[12:13:58] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[12:15:25] <wikibugs>	 (CR) AikoChou: [C:+2] ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:17:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1187 gradually with 4 steps - repool after schema change test
[12:17:21] <wikibugs>	 (Merged) jenkins-bot: ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:18:50] <wikibugs>	 (PS1) Btullis: Attempt to fix the OIDC authentication for growthbook [puppet] - https://gerrit.wikimedia.org/r/1210570 (https://phabricator.wikimedia.org/T409183)
[12:19:33] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7686/co" [puppet] - https://gerrit.wikimedia.org/r/1210570 (https://phabricator.wikimedia.org/T409183) (owner: Btullis)
[12:23:09] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:24:34] <wikibugs>	 (CR) Hnowlan: [C:+1] Add blake to ops, remove blake from ops-limited. [puppet] - https://gerrit.wikimedia.org/r/1207824 (https://phabricator.wikimedia.org/T410612) (owner: Blake)
[12:26:52] <wikibugs>	 (CR) Clément Goubert: [C:+2] Add blake to ops, remove blake from ops-limited. [puppet] - https://gerrit.wikimedia.org/r/1207824 (https://phabricator.wikimedia.org/T410612) (owner: Blake)
[12:32:17] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[12:34:00] <wikibugs>	 (PS1) Volans: wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313)
[12:36:04] <wikibugs>	 (CR) CI reject: [V:-1] wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: Volans)
[12:39:37] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:39:47] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538)
[12:40:45] <wikibugs>	 (PS2) Volans: wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313)
[12:41:31] <wikibugs>	 (PS5) Federico Ceratto: Support both hostname and FQDN [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581)
[12:41:32] <wikibugs>	 (PS2) Muehlenhoff: EFI-enabled Partman recipe (WIP) [puppet] - https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400)
[12:41:59] <wikibugs>	 (CR) AikoChou: [C:+1] ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:42:00] <wikibugs>	 (CR) Klausman: [C:+1] ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:42:06] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2025.codfw.wmnet
[12:42:44] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+2] ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:42:52] <wikibugs>	 (PS3) Muehlenhoff: Test EFI-enabled Partman recipe on db1169 [puppet] - https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400)
[12:43:24] <wikibugs>	 (CR) CI reject: [V:-1] wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: Volans)
[12:44:25] <wikibugs>	 (Merged) jenkins-bot: ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:45:17] <logmsgbot>	 !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:45:20] <wikibugs>	 (PS3) Volans: wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313)
[12:47:36] <wikibugs>	 (CR) CI reject: [V:-1] wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: Volans)
[12:47:45] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[12:48:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208328 (https://phabricator.wikimedia.org/T410731) (owner: D3r1ck01)
[12:48:53] <wikibugs>	 (CR) CI reject: [V:-1] Support both hostname and FQDN [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: Federico Ceratto)
[12:49:02] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2025.codfw.wmnet
[12:51:45] <wikibugs>	 (CR) Marostegui: [C:+1] Test EFI-enabled Partman recipe on db1169 [puppet] - https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400) (owner: Muehlenhoff)
[12:54:12] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[12:54:12] <logmsgbot>	 !log gehel@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[12:54:36] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[12:54:43] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Test EFI-enabled Partman recipe on db1169 [puppet] - https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400) (owner: Muehlenhoff)
[12:55:01] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[12:55:28] <logmsgbot>	 !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[12:57:44] <wikibugs>	 (PS1) Kosta Harlan: MonologChannels: Add WikiEditor [mediawiki-config] - https://gerrit.wikimedia.org/r/1210586 (https://phabricator.wikimedia.org/T410877)
[12:58:23] <wikibugs>	 (PS1) Muehlenhoff: Enable imports on maps-test2001 [puppet] - https://gerrit.wikimedia.org/r/1210587 (https://phabricator.wikimedia.org/T409528)
[13:00:43] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[13:05:44] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[13:06:30] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1210587 (https://phabricator.wikimedia.org/T409528) (owner: Muehlenhoff)
[13:07:59] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[13:08:28] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[13:09:36] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:14:46] <wikibugs>	 (CR) Jgiannelos: [C:+1] Turn paging on for kartotherian [puppet] - https://gerrit.wikimedia.org/r/1203835 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[13:15:44] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[13:16:11] <logmsgbot>	 !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[13:17:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11400340 (Clement_Goubert) Open→Resolved a:Clement_Goubert
[13:20:24] <wikibugs>	 (PS4) Volans: wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313)
[13:21:13] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] MonologChannels: Add WikiEditor [mediawiki-config] - https://gerrit.wikimedia.org/r/1210586 (https://phabricator.wikimedia.org/T410877) (owner: Kosta Harlan)
[13:27:02] <wikibugs>	 (PS1) Volans: labs: add infra-tracing-nfs account [labs/private] - https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313)
[13:28:01] <Amir1>	 !log cleaning up watchlist of deceased User:JarrahTree in enwiki and commonswiki
[13:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:18] <wikibugs>	 (CR) Clément Goubert: rest-gateway: assign ratelimit class by network range (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler)
[13:32:14] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+1] changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[13:33:20] <wikibugs>	 (CR) AikoChou: [C:+2] changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[13:33:21] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie
[13:35:04] <wikibugs>	 SRE-Access-Requests, Data-Platform-SRE (2025.11.07 - 2025.11.28): Yubikey-SSH-FIDO for Guillaume (gehel) - https://phabricator.wikimedia.org/T410888 (Gehel) NEW
[13:35:10] <wikibugs>	 (Merged) jenkins-bot: changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[13:35:42] <wikibugs>	 (PS1) Gehel: ssh: FIDO key for Guillaume Lederrey [puppet] - https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888)
[13:36:25] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie
[13:40:10] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie
[13:40:16] <Amir1>	 jouncebot: nowandnext
[13:40:16] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 19 minute(s)
[13:40:16] <jouncebot>	 In 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400)
[13:40:42] <wikibugs>	 (CR) Ladsgroup: [C:+2] Fix db config for offline maint scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) (owner: Ladsgroup)
[13:41:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) (owner: Ladsgroup)
[13:41:06] <wikibugs>	 (PS5) Volans: wmcs k8s nfs: add NFS tracing script [puppet] - https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313)
[13:41:32] <wikibugs>	 (Merged) jenkins-bot: Fix db config for offline maint scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) (owner: Ladsgroup)
[13:41:54] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]]
[13:42:00] <stashbot>	 T410738: pretrain failing when calling mergeMessageFileList.php - https://phabricator.wikimedia.org/T410738
[13:42:00] <stashbot>	 T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087
[13:42:00] <logmsgbot>	 !log ladsgroup@deploy2002 sync-world failed: <CalledProcessError> Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.1aRzXHW4OP']' returned
[13:42:00] <logmsgbot>	 non-zero exit status 255. (scap version: 4.228.0) (duration: 00m 07s)
[13:43:47] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]]
[13:43:53] <logmsgbot>	 !log ladsgroup@deploy2002 sync-world failed: <CalledProcessError> Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.Seyz9S1dDd']' returned
[13:43:54] <logmsgbot>	 non-zero exit status 255. (scap version: 4.228.0) (duration: 00m 06s)
[13:46:48] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] cloudcephosd: move row C hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207739 (https://phabricator.wikimedia.org/T399180) (owner: Filippo Giunchedi)
[13:46:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400460 (Jclark-ctr) @bking  @RKemper  I’m having issues imaging these servers. Since they’re UEFI, shouldn’t the preseed file be -efi?
[13:47:27] <wikibugs>	 SRE, Cassandra, Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11400462 (elukey)
[13:51:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400466 (Jclark-ctr) a:Jclark-ctr→bking {F70586111} Also when trying to image with Trixie i did notice output   <Puppet 7 auto-selected on >= Bookworm>
[13:52:42] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good. I have also requested that the user send me the same key over Slack, and it matches." [puppet] - https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) (owner: Gehel)
[13:53:18] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[13:53:51] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[13:54:11] <wikibugs>	 (PS1) Ladsgroup: Fix fix db config for offline maint scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/1210593 (https://phabricator.wikimedia.org/T410738)
[13:54:16] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[13:54:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210593 (https://phabricator.wikimedia.org/T410738) (owner: Ladsgroup)
[13:54:45] <wikibugs>	 (CR) Btullis: [C:+2] Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[13:54:51] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[13:54:59] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) (owner: Gehel)
[13:55:24] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[13:55:34] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[13:55:35] <wikibugs>	 (Merged) jenkins-bot: Fix fix db config for offline maint scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/1210593 (https://phabricator.wikimedia.org/T410738) (owner: Ladsgroup)
[13:55:54] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]], [[gerrit:1210593|Fix fix db config for offline maint scripts (T410738 T405087)]]
[13:56:00] <stashbot>	 T410738: pretrain failing when calling mergeMessageFileList.php - https://phabricator.wikimedia.org/T410738
[13:56:00] <stashbot>	 T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087
[13:56:11] <wikibugs>	 (PS2) Gehel: ssh: FIDO key for Guillaume Lederrey [puppet] - https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888)
[13:56:34] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] cloudcephosd: move row D hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207740 (https://phabricator.wikimedia.org/T399180) (owner: Filippo Giunchedi)
[13:57:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400477 (MoritzMuehlenhoff) You can simply confirm and continue, Puppet 7 is already enabled for wdqs1031 via the insetup::data_platform_ferm role in site.pp
[13:57:20] <wikibugs>	 (CR) Gehel: [C:+2] ssh: FIDO key for Guillaume Lederrey [puppet] - https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) (owner: Gehel)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400).
[14:00:05] <jouncebot>	 anzx, edsanders, Dragoniez, hubaishan, Sergi0, and xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:20] <Dragoniez>	 o/
[14:00:20] <Lucas_WMDE>	 I can’t deploy, sorry (maybe in half an hour)
[14:00:27] <Amir1>	 o/ my deployment is taking a bit longer but I can do the deployments for a bit
[14:00:32] <sergi0>	 o/
[14:00:32] <Amir1>	 until Lucas would take over
[14:00:36] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]], [[gerrit:1210593|Fix fix db config for offline maint scripts (T410738 T405087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:00:41] <xSavitar>	 o/
[14:00:59] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[14:01:45] <anzx>	 o/
[14:02:00] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11400490 (MoritzMuehlenhoff) Resolved→Open This broke Puppet runs on the puppetservers:   ` Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation...
[14:02:34] <wikibugs>	 (Merged) jenkins-bot: Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:02:45] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11400493 (EMill-WMF) >>! In T408592#11390152, @ATitkov wrote: >> Who will be responsible for security review, when this is sharing important top level domains ?...
[14:02:52] <wikibugs>	 (CR) Ladsgroup: [C:+2] Revert "tcywikisource: throttle exception" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: Anzx)
[14:03:45] <wikibugs>	 (Merged) jenkins-bot: Revert "tcywikisource: throttle exception" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: Anzx)
[14:05:01] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]], [[gerrit:1210593|Fix fix db config for offline maint scripts (T410738 T405087)]] (duration: 09m 07s)
[14:05:07] <stashbot>	 T410738: pretrain failing when calling mergeMessageFileList.php - https://phabricator.wikimedia.org/T410738
[14:05:08] <stashbot>	 T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087
[14:05:28] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208292|Revert "tcywikisource: throttle exception" (T410507)]]
[14:05:33] <stashbot>	 T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507
[14:05:56] <wikibugs>	 (CR) Ladsgroup: [C:+2] [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[14:06:37] <wikibugs>	 (Merged) jenkins-bot: [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[14:07:41] <Amir1>	 sergi0: yours is beta cluster only, I merged and rebased it, it'll be live in ten minutes automatically
[14:07:59] <sergi0>	 @Amir1 <3, ty!
[14:08:08] <wikibugs>	 (CR) Ladsgroup: [C:+2] Enable DiscussionTools visual enhancements on ruwiki & svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1208320 (https://phabricator.wikimedia.org/T379264) (owner: Esanders)
[14:09:02] <wikibugs>	 (Merged) jenkins-bot: Enable DiscussionTools visual enhancements on ruwiki & svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1208320 (https://phabricator.wikimedia.org/T379264) (owner: Esanders)
[14:09:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11400527 (BTullis) I have drained dse-k8s-worker10[11,13,19] prior to this afternoon's maintenance. ` root@deploy2002:~#...
[14:10:31] <logmsgbot>	 !log ladsgroup@deploy2002 anzx, ladsgroup: Backport for [[gerrit:1208292|Revert "tcywikisource: throttle exception" (T410507)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:10:36] <stashbot>	 T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507
[14:11:12] <logmsgbot>	 !log ladsgroup@deploy2002 anzx, ladsgroup: Continuing with sync
[14:11:22] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:12:52] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:13:37] <wikibugs>	 (CR) Arnaudb: "for this, I think we should also swap around PTR records in" [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: Hashar)
[14:13:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[14:15:12] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208292|Revert "tcywikisource: throttle exception" (T410507)]] (duration: 09m 44s)
[14:15:26] <anzx>	 Amir1: thanks for deploying 
[14:15:38] <wikibugs>	 (CR) Bking: [C:+1] aptrepo: add component/opensearch27 [puppet] - https://gerrit.wikimedia.org/r/1208499 (https://phabricator.wikimedia.org/T410795) (owner: Cwhite)
[14:15:52] <Amir1>	 ^_^
[14:15:57] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208320|Enable DiscussionTools visual enhancements on ruwiki & svwiki (T379264)]]
[14:16:02] <stashbot>	 T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264
[14:16:25] <wikibugs>	 (CR) Bking: [C:+1] opensearch: add $apt_component parameter [puppet] - https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) (owner: Cwhite)
[14:16:44] <wikibugs>	 (PS1) Elukey: Add a staging-specific stream for Maps tiles change [mediawiki-config] - https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528)
[14:18:28] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] cloudcephosd: move rack E4 hosts to single NIC [puppet] - https://gerrit.wikimedia.org/r/1207741 (https://phabricator.wikimedia.org/T399180) (owner: Filippo Giunchedi)
[14:21:11] <logmsgbot>	 !log ladsgroup@deploy2002 esanders, ladsgroup: Backport for [[gerrit:1208320|Enable DiscussionTools visual enhancements on ruwiki & svwiki (T379264)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:21:17] <stashbot>	 T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264
[14:21:54] <wikibugs>	 (PS1) Elukey: profile::thanos::swift: add tegola account for staging [puppet] - https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528)
[14:21:55] <Amir1>	 edsanders: live in mwdebug
[14:22:11] <Amir1>	 let me know once it's good to go
[14:22:14] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance
[14:22:19] <tappof>	 !log Remove unused md2 and add its devices to vg0 on titan1002 T410152
[14:22:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T410531)', diff saved to https://phabricator.wikimedia.org/P85520 and previous config saved to /var/cache/conftool/dbconfig/20251124-142221-marostegui.json
[14:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:24] <stashbot>	 T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[14:22:29] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[14:22:47] <Lucas_WMDE>	 o/
[14:23:58] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[14:25:39] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:26:25] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:27:12] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[14:27:33] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[14:28:10] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[14:28:12] <wikibugs>	 (PS1) Slyngshede: C:varnish [puppet] - https://gerrit.wikimedia.org/r/1210600
[14:28:26] <wikibugs>	 SRE, Traffic, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Platform-SRE (2025.11.07 - 2025.11.28), Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11400587 (Ge...
[14:28:26] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[14:28:28] <wikibugs>	 (CR) Hashar: "Indeed! :-) thanks" [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: Hashar)
[14:28:34] <wikibugs>	 (PS3) Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833)
[14:28:46] <wikibugs>	 (CR) CI reject: [V:-1] C:varnish [puppet] - https://gerrit.wikimedia.org/r/1210600 (owner: Slyngshede)
[14:29:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:30:26] <wikibugs>	 (PS2) Slyngshede: C:varnish [puppet] - https://gerrit.wikimedia.org/r/1210600
[14:30:46] <Amir1>	 Lucas_WMDE: I'm waiting for edsanders :D)
[14:30:50] <wikibugs>	 (CR) Btullis: [C:+2] Add an analytics namespace to both dse-k8s clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:30:53] <Lucas_WMDE>	 ack
[14:31:16] <logmsgbot>	 !log ladsgroup@deploy2002 esanders, ladsgroup: Continuing with sync
[14:31:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:31:42] <Lucas_WMDE>	 Amir1: not waiting anymore?
[14:31:57] <Amir1>	 yeah, I decided that it's straightforward and can move forward
[14:32:56] <wikibugs>	 (CR) Michael Große: "This is now ready for review (and deployment, if approved). The data from the machine learning team is now available for testwiki!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: Michael Große)
[14:33:58] <wikibugs>	 (PS1) Gehel: Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019)
[14:34:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bookworm
[14:34:56] <wikibugs>	 SRE, observability, Traffic, Data-Engineering (Q2 FY25/26 October 1st - December 31th), and 3 others: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11400611 (Gehel)
[14:35:16] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208320|Enable DiscussionTools visual enhancements on ruwiki & svwiki (T379264)]] (duration: 19m 18s)
[14:35:20] <stashbot>	 T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264
[14:35:31] <Amir1>	 Lucas_WMDE: wanna take over?
[14:35:32] <wikibugs>	 (CR) CI reject: [V:-1] Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: Gehel)
[14:35:49] <wikibugs>	 SRE, observability, Traffic, Data-Engineering (Q2 FY25/26 October 1st - December 31th), and 3 others: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11400632 (Gehel) As webrequest is critical for operational support,...
[14:36:45] <wikibugs>	 (CR) Michael Große: "At time of writing, this search string gives us 49 results on testwiki: https://test.wikipedia.org/w/index.php?search=hasrecommendation%3A" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: Michael Große)
[14:37:23] <wikibugs>	 (Abandoned) Btullis: Attempt to fix the OIDC authentication for growthbook [puppet] - https://gerrit.wikimedia.org/r/1210570 (https://phabricator.wikimedia.org/T409183) (owner: Btullis)
[14:37:26] <wikibugs>	 (CR) Kosta Harlan: [C:+1] hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: Ssingh)
[14:37:50] <edsanders>	 Amir1: I'm here
[14:37:55] <Lucas_WMDE>	 Amir1: sure
[14:37:58] <Amir1>	 edsanders: already deployed :P
[14:38:03] <edsanders>	 thanks
[14:38:03] <Lucas_WMDE>	 (sorry, got distracted for a moment reading https://techblog.wikimedia.org/2025/11/21/unifying-mobile-and-desktop-domains/ ^^)
[14:38:05] <sukhe>	 !log sudo cumin "A:cp" "disable-puppet 'merging CR 1207978'": T409780
[14:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:19] <Lucas_WMDE>	 so, up next is Dragoniez_?
[14:38:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400638 (10bking) @Jclark-ctr good catch. I didn't know about [[ https://phabricator.wikimedia.org/T409286 | the Nokia bugs that prevent legacy BIOS reimage in eqiad rows C...
[14:38:31] <wikibugs>	 (03Merged) 10jenkins-bot: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis)
[14:38:42] <wikibugs>	 (03Abandoned) 10Btullis: Use our PKI generated certificate for the opensearch http interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) (owner: 10Btullis)
[14:39:02] <Dragoniez_>	 I assume so
[14:39:16] <Dreamy_Jazz>	 jouncebot: nowandnext
[14:39:16] <jouncebot>	 For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400)
[14:39:16] <jouncebot>	 In 0 hour(s) and 50 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1530)
[14:39:17] <wikibugs>	 (03PS2) 10Ssingh: hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780)
[14:39:23] <wikibugs>	 (03CR) 10Ssingh: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[14:39:45] * Lucas_WMDE tries to follow the on-wiki discussion
[14:39:57] * Lucas_WMDE chuckles at firefox translation yielding “Permission granted by confidence only to the vieweric rat”
[14:40:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy)
[14:40:06] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[14:41:22] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "If Firefox Translations is representing the community discussion semi-accurately, then this appears to be intentional (proposal 3 in the l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez)
[14:41:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: move rack F4 hosts to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1207742 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi)
[14:42:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T410531)', diff saved to https://phabricator.wikimedia.org/P85521 and previous config saved to /var/cache/conftool/dbconfig/20251124-144218-marostegui.json
[14:42:23] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:42:24] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[14:42:28] <tappof>	 !log Remove unused md2 and add its devices to vg0 on titan2002 T410152
[14:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:36] <stashbot>	 T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[14:43:01] <wikibugs>	 (03PS1) 10Ssingh: Revert "hiera: trafficserver: switch hcaptcha backend to anycast" [puppet] - 10https://gerrit.wikimedia.org/r/1210603
[14:43:09] <wikibugs>	 (03CR) 10Ssingh: "do not merge, emergency revert only" [puppet] - 10https://gerrit.wikimedia.org/r/1210603 (owner: 10Ssingh)
[14:43:23] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:44:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400667 (10MoritzMuehlenhoff) >>! In T410406#11400638, @bking wrote: > I'll grab it back and update the partman recipes. Keep in mind that these are very old Dells as oppos...
[14:44:32] <Lucas_WMDE>	 I think I want to deploy these separately tbh
[14:44:36] <Lucas_WMDE>	 I’m feeling unsure about the rowiki change
[14:44:42] <Lucas_WMDE>	 let’s start with jawiki
[14:44:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez)
[14:46:07] <wikibugs>	 (03Merged) 10jenkins-bot: jawiki: Disallow sysops from granting temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez)
[14:46:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1208334|jawiki: Disallow sysops from granting temporary-account-viewer (T409687)]]
[14:46:32] <stashbot>	 T409687: jawiki: Disallow sysops to grant temporary-account-viewer - https://phabricator.wikimedia.org/T409687
[14:47:41] <wikibugs>	 (03PS6) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313)
[14:49:03] <wikibugs>	 (03PS1) 10Tchanders: Assign 'ignore-restricted-groups' to steward group on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717)
[14:49:12] <Dragoniez_>	 The rowiki patch is surely complex. I believe it's good cuz I've checked it several times though
[14:49:36] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] "Done in I51f7458e735f11ddaaa880fcf1c8ddfbad2be76b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy)
[14:51:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 dragoniez, lucaswerkmeister-wmde: Backport for [[gerrit:1208334|jawiki: Disallow sysops from granting temporary-account-viewer (T409687)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:51:21] <Dragoniez_>	 Checking
[14:51:23] <wikibugs>	 (03CR) 10Btullis: Report integrity metric from wikidata dump scripts (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze)
[14:51:34] <wikibugs>	 (03PS1) 10Bking: wdqs: provision temporary hosts via UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1210606 (https://phabricator.wikimedia.org/T410406)
[14:52:24] <Dragoniez_>	 Looking good
[14:52:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 dragoniez, lucaswerkmeister-wmde: Continuing with sync
[14:52:48] <Lucas_WMDE>	 thanks!
[14:53:31] <wikibugs>	 (03PS1) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627)
[14:53:50] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey)
[14:54:32] <wikibugs>	 (03PS2) 10Volans: labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313)
[14:54:37] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:54:38] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+1] "No objections" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große)
[14:54:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400736 (10bking) Thanks @MoritzMuehlenhoff ! Do I need to run the provisioning cookbook or make any other changes to put the host in UEFI mode? I know Cathal had to do som...
[14:55:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey)
[14:56:35] <wikibugs>	 (03CR) 10Volans: "The script has been tested in toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans)
[14:56:38] <wikibugs>	 (03PS2) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627)
[14:57:01] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "I think the changes in here look correct. The one part I’m still not sure about is `abusefilter-view-private` and `abusefilter-log-private" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez)
[14:57:06] <wikibugs>	 (03CR) 10Volans: "Required by the related change:" [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans)
[14:57:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P85523 and previous config saved to /var/cache/conftool/dbconfig/20251124-145726-marostegui.json
[14:58:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208334|jawiki: Disallow sysops from granting temporary-account-viewer (T409687)]] (duration: 11m 33s)
[14:58:05] <stashbot>	 T409687: jawiki: Disallow sysops to grant temporary-account-viewer - https://phabricator.wikimedia.org/T409687
[14:58:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400749 (10MoritzMuehlenhoff) The SuperMicro hosts are somewhat special, for the Dells the following cookbook should handle the reprovision to UEFI mode:   ` cookbook sre.h...
[14:59:21] <Lucas_WMDE>	 jouncebot: nowandnext
[14:59:21] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400)
[14:59:21] <jouncebot>	 In 0 hour(s) and 30 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1530)
[14:59:45] <Lucas_WMDE>	 Dragoniez_: do you still have time? if yes, I think I’d deploy the rowiki change in the break between windows now
[14:59:47] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey)
[15:00:58] <Dragoniez_>	 Lucas_WMDE: Yep!
[15:01:39] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[15:01:44] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "…but in the interest of getting the main part (removing access from anons) deployed, I’ll deploy this anyway. If the rowiki community want" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez)
[15:01:47] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T410589)', diff saved to https://phabricator.wikimedia.org/P85524 and previous config saved to /var/cache/conftool/dbconfig/20251124-150146-ladsgroup.json
[15:01:52] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[15:01:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez)
[15:03:12] <wikibugs>	 (03Merged) 10jenkins-bot: rowiki: Redefine AbuseFilter permission model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez)
[15:03:25] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1210566 (owner: 10Muehlenhoff)
[15:03:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1208329|rowiki: Redefine AbuseFilter permission model (T407978)]]
[15:03:36] <stashbot>	 T407978: Restrict abusefilter-log-detail to sysops on rowiki - https://phabricator.wikimedia.org/T407978
[15:04:48] <wikibugs>	 (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1210606 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking)
[15:05:35] <wikibugs>	 (03CR) 10Bking: [C:03+2] wdqs: provision temporary hosts via UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1210606 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking)
[15:05:54] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11400772 (10Clement_Goubert) >>! In T410612#11400490, @MoritzMuehlenhoff wrote: > This broke Puppet runs on the puppetservers: >  >  > ` > Error: Could not retrieve catalog from remote server: Error 500...
[15:08:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dragoniez: Backport for [[gerrit:1208329|rowiki: Redefine AbuseFilter permission model (T407978)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:08:15] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Authorize blake for icinga tasks [puppet] - 10https://gerrit.wikimedia.org/r/1206858 (https://phabricator.wikimedia.org/T410390) (owner: 10Blake)
[15:08:26] <wikibugs>	 (03CR) 10Blake: [C:03+2] Authorize blake for icinga tasks [puppet] - 10https://gerrit.wikimedia.org/r/1206858 (https://phabricator.wikimedia.org/T410390) (owner: 10Blake)
[15:08:58] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:09] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: Testing latency
[15:10:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: 10Hubaishan)
[15:10:30] <Amir1>	 !log cumin2024@db2191.codfw.wmnet[wikishared]> drop table if exists wikimedia_editor_tasks_counts; drop table if exists wikimedia_editor_tasks_edit_streak; drop table if exists wikimedia_editor_tasks_keys; (T410692)
[15:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:35] <stashbot>	 T410692: Drop the WikimediaEditorTasks extension's tables from Wikimedia production - https://phabricator.wikimedia.org/T410692
[15:10:39] <Lucas_WMDE>	 Dragoniez_: please test!
[15:10:40] <Dragoniez_>	 The rowiki thing does look good to me. I'll include your comment on the patch in the task when I close it
[15:10:56] <wikibugs>	 (03PS2) 10Jforrester: tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954)
[15:10:57] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester)
[15:10:59] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester)
[15:11:30] <Lucas_WMDE>	 ok, just checking the diff of permissions myself
[15:12:30] <Lucas_WMDE>	 ok I think it’s correct
[15:12:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dragoniez: Continuing with sync
[15:12:35] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P85526 and previous config saved to /var/cache/conftool/dbconfig/20251124-151233-marostegui.json
[15:12:37] <Lucas_WMDE>	 thank you!
[15:12:53] <Dragoniez_>	 Thank YOU :)
[15:13:17] <wikibugs>	 (03PS2) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019)
[15:13:18] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:13:32] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.91 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:13:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: admin: remove non-fido keys for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1210609
[15:15:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208329|rowiki: Redefine AbuseFilter permission model (T407978)]] (duration: 13m 02s)
[15:16:38] <stashbot>	 T407978: Restrict abusefilter-log-detail to sysops on rowiki - https://phabricator.wikimedia.org/T407978
[15:17:42] <Lucas_WMDE>	 sorry there wasn’t time for your change hubaishan
[15:18:03] <hubaishan>	 OK
[15:18:15] <Lucas_WMDE>	 xSavitar: should we try to deploy your change? I *think* scap will skip the actual deployment anyway because it only touches tests
[15:18:29] <xSavitar>	 Lucas_WMDE, sure
[15:18:37] <xSavitar>	 No testing needed for mine actually
[15:19:19] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] admin: remove non-fido keys for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1210609 (owner: 10Giuseppe Lavagetto)
[15:19:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208328 (https://phabricator.wikimedia.org/T410731) (owner: 10D3r1ck01)
[15:19:45] <Lucas_WMDE>	 let’s find out
[15:19:53] <Amir1>	 !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists blocker; drop database if exists defoundation; drop database if exists oai; drop database if exists steward; (T297297)
[15:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:58] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[15:20:39] <wikibugs>	 (03Merged) 10jenkins-bot: tests: Make data providers static methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208328 (https://phabricator.wikimedia.org/T410731) (owner: 10D3r1ck01)
[15:20:40] <Lucas_WMDE>	 hm, it might do a full deploy after all, because tests/ isn’t part of the beta_only_config_files: https://gerrit.wikimedia.org/g/operations/puppet/+/9a31426114/modules/scap/templates/scap.cfg.erb#122
[15:21:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1208328|tests: Make data providers static methods (T410731)]]
[15:21:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel)
[15:21:03] <Lucas_WMDE>	 yup, it sure does. oh well
[15:21:05] <stashbot>	 T410731: Make production extensions PHPUnit tests data providers real providers (and use static methods) - https://phabricator.wikimedia.org/T410731
[15:21:19] * xSavitar nods
[15:21:45] <wikibugs>	 (03CR) 10Btullis: Webrequests: alert when webrequest_sampled isn't consumed. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel)
[15:22:29] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add k8s tokens for the analytics namespace [puppet] - 10https://gerrit.wikimedia.org/r/1208321 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis)
[15:22:30] <urbanecm>	 Lucas_WMDE: i see you're deploying. that's fallback from the backport window? can you let me know once done?
[15:22:43] <Lucas_WMDE>	 urbanecm: yes and yes
[15:22:47] <urbanecm>	 thank you!
[15:22:55] <Lucas_WMDE>	 the current change is a no-op but scap is rolling it out anyway
[15:23:11] <urbanecm>	 iirc only beta-only changes are auto-excluded. 
[15:23:15] <Lucas_WMDE>	 yeah, exactly
[15:24:42] <Lucas_WMDE>	 “66% (ok: 8; fail: 0; left: 4)”
[15:24:44] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[15:24:46] <Lucas_WMDE>	 isn’t that 67% 🤔
[15:25:18] <MichaelG_WMF>	 probably a rounding down so that one is not at a 100% before actually done
[15:25:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] admin: remove non-fido keys for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1210609 (owner: 10Giuseppe Lavagetto)
[15:25:26] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Replace 'let' with arithmetic expansion [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[15:25:34] <Lucas_WMDE>	 ah, fair point
[15:25:37] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Assign 'ignore-restricted-groups' to steward group on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717) (owner: 10Tchanders)
[15:25:49] <wikibugs>	 (03Merged) 10jenkins-bot: Replace 'let' with arithmetic expansion [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[15:25:54] <wikibugs>	 (03Merged) 10jenkins-bot: Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[15:25:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Backport for [[gerrit:1208328|tests: Make data providers static methods (T410731)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:25:59] <Lucas_WMDE>	 yup, explicit math.floor() in the python code
[15:26:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Continuing with sync
[15:26:27] <Lucas_WMDE>	 MichaelG_WMF: you’re *exactly* right :) https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/155683
[15:27:05] <MichaelG_WMF>	 yay 😊
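A minimal sketch of the rounding discussed above, for readers following the "66% vs 67%" exchange. It is not scap's actual code: the function name `progress_percent` and its signature are hypothetical, and only the use of `math.floor()` comes from the conversation. The point is that rounding down keeps the display below 100% until every target has really finished.

```python
import math


def progress_percent(done: int, total: int) -> int:
    """Return the completed share as a whole percent, rounded down.

    Hypothetical helper for illustration; not taken from scap.
    """
    if total <= 0:
        return 0
    # Flooring means 100 is only reached when done == total.
    return math.floor(100 * done / total)


# The case from the log: ok: 8, left: 4, so 8 of 12 targets are done.
print(progress_percent(8, 12))     # floor of 66.67
print(round(100 * 8 / 12))         # ordinary rounding, for comparison
print(progress_percent(199, 200))  # still below 100 with one target left
```

With ordinary rounding, 199 of 200 targets would already display as 100%; flooring avoids that at the cost of showing 66% where 67% would be the nearest value.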
[15:27:29] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:27:30] <logmsgbot>	 jmm@cumin2002 reimage (PID 3961086) is awaiting input
[15:27:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:27:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T410531)', diff saved to https://phabricator.wikimedia.org/P85527 and previous config saved to /var/cache/conftool/dbconfig/20251124-152741-marostegui.json
[15:27:46] <wikibugs>	 (03PS1) 10Vgutierrez: thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611
[15:27:49] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[15:27:58] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:28:01] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:28:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T410531)', diff saved to https://phabricator.wikimedia.org/P85528 and previous config saved to /var/cache/conftool/dbconfig/20251124-152805-marostegui.json
[15:28:17] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:28:24] <wikibugs>	 (03Merged) 10jenkins-bot: Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:28:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:28:38] <wikibugs>	 (03Merged) 10jenkins-bot: Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:28:47] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:28:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:28:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[15:29:27] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Drop $wgCampaignEventsCountrySchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932)
[15:29:50] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611 (owner: 10Vgutierrez)
[15:29:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932) (owner: 10Daimona Eaytoy)
[15:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1530)
[15:30:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208328|tests: Make data providers static methods (T410731)]] (duration: 09m 15s)
[15:30:20] <stashbot>	 T410731: Make production extensions PHPUnit tests data providers real providers (and use static methods) - https://phabricator.wikimedia.org/T410731
[15:30:26] <wikibugs>	 (03PS1) 10Gehel: SSH: remove non FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210612 (https://phabricator.wikimedia.org/T410888)
[15:30:28] <Lucas_WMDE>	 urbanecm: over to you
[15:30:29] <Lucas_WMDE>	 well
[15:30:32] <Lucas_WMDE>	 except for xLab
[15:30:40] <urbanecm>	 does that actually do something...
[15:31:17] * urbanecm is going to be bold
[15:31:30] <wikibugs>	 (03PS3) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019)
[15:31:44] <wikibugs>	 (03CR) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel)
[15:33:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] SSH: remove non FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210612 (https://phabricator.wikimedia.org/T410888) (owner: 10Gehel)
[15:33:12] <wikibugs>	 (03CR) 10Gehel: [C:03+2] SSH: remove non FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210612 (https://phabricator.wikimedia.org/T410888) (owner: 10Gehel)
[15:33:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große)
[15:33:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel)
[15:33:58] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:06] * MichaelG_WMF is here and ready to test
[15:34:14] <logmsgbot>	 bking@cumin2002 provision (PID 3989894) is awaiting input
[15:34:24] <wikibugs>	 (03Merged) 10jenkins-bot: testwiki: enable ReviseTone experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große)
[15:34:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1029.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:34:44] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1207886|testwiki: enable ReviseTone experiment (T407029)]]
[15:34:49] <stashbot>	 T407029: Revise Tone: Release on Test Wikipedia integrated with Production DataGateway - https://phabricator.wikimedia.org/T407029
[15:34:53] <urbanecm>	 MichaelG_WMF: thank you, very helpful!
[15:35:13] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Remove maps from SKIP_V6_DNS_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 (owner: 10Ayounsi)
[15:35:32] <wikibugs>	 (03PS11) 10Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220)
[15:35:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1029.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:35:44] <wikibugs>	 (03PS4) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019)
[15:35:55] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[15:36:12] <xSavitar>	 Lucas_WMDE, thanks for deploying 🙏🏽
[15:36:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:36:53] <wikibugs>	 (03PS1) 10Kosta Harlan: Hooks: Log the status message when responseUnknown occurs [extensions/WikiEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210614 (https://phabricator.wikimedia.org/T410877)
[15:37:02] <Lucas_WMDE>	 np :)
[15:37:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:38:18] <wikibugs>	 (03CR) 10Gehel: [C:03+2] Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel)
[15:38:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:39:13] <wikibugs>	 06SRE, 10observability, 06Traffic, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th), and 3 others: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11401023 (10Gehel) 05Open→03Resolved
[15:39:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:39:20] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 13Patch-For-Review: Yubikey-SSH-FIDO for Guillaume (gehel) - https://phabricator.wikimedia.org/T410888#11401026 (10Gehel) 05Open→03Resolved a:03Gehel
[15:39:24] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, migr: Backport for [[gerrit:1207886|testwiki: enable ReviseTone experiment (T407029)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:39:48] <urbanecm>	 MichaelG_WMF: available on debug!
[15:39:52] <urbanecm>	 (I'm also testing)
[15:40:06] <MichaelG_WMF>	 thanks, testing!
[15:40:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:40:35] <wikibugs>	 (03PS1) 10Brouberol: dse-k8s: delete the stat-> PG on k8s ingress firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/1210616 (https://phabricator.wikimedia.org/T409591)
[15:42:39] <MichaelG_WMF>	 @urbanecm looks good to me. What about you?
[15:42:49] <urbanecm>	 MichaelG_WMF: works for me!
[15:42:56] <MichaelG_WMF>	 🙌
[15:43:02] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, migr: Continuing with sync
[15:43:12] <urbanecm>	 proceeding
[15:43:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: move codfw hosts to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1207743 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi)
[15:44:33] <urbanecm>	 MichaelG_WMF: just noticed, mwdebug logs say `Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 1): [connect to db2191 (wikishared)]`. did that...change?
[15:44:48] <urbanecm>	 or did we create master connection before?
[15:44:56] * urbanecm is trying to identify whether this is coming from revise tone work
[15:45:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans)
[15:45:18] <MichaelG_WMF>	 🤔
[15:45:30] <wikibugs>	 (03CR) 10Bking: [C:03+1] dse-k8s: delete the stat-> PG on k8s ingress firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/1210616 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol)
[15:45:41] <MichaelG_WMF>	 We might have
[15:45:44] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] dse-k8s: delete the stat-> PG on k8s ingress firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/1210616 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol)
[15:46:18] <wikibugs>	 (03PS2) 10Silvan Heintze: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482)
[15:46:22] <urbanecm>	 MichaelG_WMF: can you file a task to investigate that (prior to larger deployment)?
[15:46:41] * urbanecm will proceed with the rest of the deployment in the meantime
[15:46:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:47:03] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207886|testwiki: enable ReviseTone experiment (T407029)]] (duration: 12m 19s)
[15:47:07] <stashbot>	 T407029: Revise Tone: Release on Test Wikipedia integrated with Production DataGateway - https://phabricator.wikimedia.org/T407029
[15:47:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206948 (https://phabricator.wikimedia.org/T407818) (owner: 10Urbanecm)
[15:47:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410531)', diff saved to https://phabricator.wikimedia.org/P85529 and previous config saved to /var/cache/conftool/dbconfig/20251124-154758-marostegui.json
[15:48:03] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[15:48:07] <MichaelG_WMF>	 urbanecm: In https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1202224 to check for a race-condition. Though I would assume that to be fine, but maybe it isn't. Or maybe we have to move the check to later
[15:48:18] <wikibugs>	 (03PS2) 10STran: Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512)
[15:48:26] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Enable Add Link task pool generation for 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206948 (https://phabricator.wikimedia.org/T407818) (owner: 10Urbanecm)
[15:48:34] <wikibugs>	 (03CR) 10Gehel: "DNS is now configured and propagated:" [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: 10Stevemunene)
[15:48:47] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1206948|[Growth] Enable Add Link task pool generation for 3 wikis (T407818)]]
[15:48:52] <stashbot>	 T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818
[15:48:58] <MichaelG_WMF>	 Also, I don't think that this should be an `ActionEntryPoint`, shouldn't that be index.php  with the homepage? 🤔
[15:49:04] <wikibugs>	 (03CR) 10STran: Enable v2 non-emergency workflow by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran)
[15:49:28] <urbanecm>	 MichaelG_WMF: we definitely shouldn't deploy a feature that triggers a warning.
[15:49:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie
[15:50:01] <urbanecm>	 so if it is indeed new, we need to fix that (move later/use replica/silence warning/etc) before deployment
[15:50:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[15:50:05] <MichaelG_WMF>	 that for sure. I'm just unsure if we triggered this warning. (I am in the process of creating the task)
[15:50:18] <urbanecm>	 ah, i thought what you said means "it is us". sorry!
[15:50:31] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] druid: switch to using the druid-public-coordinator url [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: 10Stevemunene)
[15:50:34] <wikibugs>	 (03CR) 10Silvan Heintze: "Thanks for the review" [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze)
[15:50:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie
[15:51:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie
[15:51:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie
[15:52:18] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384#11401096 (10LSobanski) p:05Triage→03Medium
[15:53:10] <MichaelG_WMF>	 @urbanecm: https://phabricator.wikimedia.org/T410907 here is a simple task
[15:53:15] <urbanecm>	 ty
[15:53:54] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1206948|[Growth] Enable Add Link task pool generation for 3 wikis (T407818)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:53:59] <stashbot>	 T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818
[15:54:27] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210618 (https://phabricator.wikimedia.org/T128546)
[15:54:36] <wikibugs>	 (03CR) 10Aude: [Legal Footer] Create config for adding legal footer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora)
[15:56:54] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[16:00:51] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206948|[Growth] Enable Add Link task pool generation for 3 wikis (T407818)]] (duration: 12m 04s)
[16:00:52] <MichaelG_WMF>	 @urbanecm Early discovery: It was probably not us. The three events that I can find in logstash all have ReadingList in their stacktrace and not GrowthExperiments
[16:00:57] <stashbot>	 T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818
[16:01:05] <urbanecm>	 MichaelG_WMF: sounds promising!
[16:01:11] <urbanecm>	 thanks for investigating
[16:02:55] <urbanecm>	 MichaelG_WMF: on second thought, that makes a lot of sense. We don't use wikishared at all, so
[16:03:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P85530 and previous config saved to /var/cache/conftool/dbconfig/20251124-160305-marostegui.json
[16:03:14] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384#11401180 (10cmooney) For a little bit more background we most regularly encounter PXEboot failures due to a firmware version on hosts with Broadcom BCM57...
[16:05:16] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717) (owner: 10Tchanders)
[16:05:48] <wikibugs>	 (03CR) 10Btullis: [C:04-1] "Unfortunately, the LVS service is still not yet in production." [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: 10Stevemunene)
[16:06:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 depool for testing', diff saved to https://phabricator.wikimedia.org/P85531 and previous config saved to /var/cache/conftool/dbconfig/20251124-160601-marostegui.json
[16:06:12] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Testing latency
[16:06:38] <wikibugs>	 (03PS3) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627)
[16:06:58] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey)
[16:07:23] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Update the definition of @dse_kubepods_networks [puppet] - 10https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis)
[16:08:26] <hnowlan>	 jouncebot: nowandnext
[16:08:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 21 minute(s)
[16:08:26] <jouncebot>	 In 0 hour(s) and 21 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1630)
[16:08:57] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add k8s tokens for the analytics namespace [puppet] - 10https://gerrit.wikimedia.org/r/1208321 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis)
[16:09:12] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra)
[16:09:27] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611 (owner: 10Vgutierrez)
[16:10:08] <wikibugs>	 (03PS4) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627)
[16:10:16] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey)
[16:11:44] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611 (owner: 10Vgutierrez)
[16:11:46] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Adjust addurl config for zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354)
[16:13:45] <wikibugs>	 (03PS5) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627)
[16:14:18] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey)
[16:14:24] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[16:14:37] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[16:14:43] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[16:15:28] <urbanecm>	 is it possible to _stop_ a mw-cron job? would deleting the pod be the expected thing to do in that scenario?
[16:15:54] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[16:15:55] <urbanecm>	 (or deleting the job itself?)
[16:16:09] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[16:16:11] <urbanecm>	 https://wikitech.wikimedia.org/wiki/Mw-cron_jobs#Manually_deleting_a_failed_Job talks about deleting failed jobs, but not about something running
[16:17:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910 (10cmooney) 03NEW p:05Triage→03Medium
[16:17:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11401242 (10cmooney)
[16:18:13] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P85532 and previous config saved to /var/cache/conftool/dbconfig/20251124-161813-marostegui.json
[16:18:19] <hnowlan>	 urbanecm: deleting a running job will also stop it
[16:18:47] <urbanecm>	 good to know. and hopefully wouldn't generate alerting (on k8s level, at least).
[16:19:52] <hnowlan>	 it might notify teams of a failed job, if that's configured. but it won't p.age anyone 
[16:21:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11401253 (10Jclark-ctr) @elukey We now have additional smartctl options for pulling drive information for Supermicro repairs. Because the Servers use software RAID, the drives are not visi...
[16:21:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1210566 (owner: 10Muehlenhoff)
[16:22:03] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[16:23:06] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[16:23:18] <urbanecm>	 !log Delete job/growthexperiments-refreshlinkrecommendations-s2-29399967 and job/growthexperiments-refreshlinkrecommendations-s3-29399607 (T407818)
[16:23:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:28] <stashbot>	 T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818
[16:24:08] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586)
[16:24:38] <urbanecm>	 👀
[16:25:11] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: reduce queue time to 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210624
[16:25:11] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: drop queue timeout to 2s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210625
[16:25:30] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[16:26:06] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Adjust addurl config for zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[16:27:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11401277 (10cmooney)
[16:28:26] <wikibugs>	 06SRE, 06collaboration-services, 10MW-on-K8s, 06serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11401282 (10LSobanski) p:05Medium→03Low
[16:28:35] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 52s)
[16:29:03] <wikibugs>	 (03PS4) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565)
[16:29:23] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Define list of valid SiteKeys for createaccount trigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657)
[16:30:05] <jouncebot>	 jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1630).
[16:30:18] <jan_drewniak>	 ^ starting portal banner deploy
[16:30:25] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210618 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:30:32] <moritzm>	 !log installing usb.ids updates from Bookworm point release
[16:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:02] <wikibugs>	 (03PS2) 10Kosta Harlan: (WIP) hCaptcha: Define list of valid SiteKeys for createaccount trigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657)
[16:31:06] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210618 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:32:48] <logmsgbot>	 !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dse-k8s-worker[1011,1013,1019].eqiad.wmnet with reason: Prepping for switch swap
[16:32:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401326 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=77fc5d5e-4014-4521-90fb-3e67d8114900) set by...
[16:33:15] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS bookworm
[16:33:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410531)', diff saved to https://phabricator.wikimedia.org/P85533 and previous config saved to /var/cache/conftool/dbconfig/20251124-163320-marostegui.json
[16:33:27] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[16:33:38] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance
[16:33:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T410531)', diff saved to https://phabricator.wikimedia.org/P85534 and previous config saved to /var/cache/conftool/dbconfig/20251124-163345-marostegui.json
[16:34:04] <logmsgbot>	 !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-test-master1002.eqiad.wmnet with reason: Prepping for switch swap
[16:34:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401333 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7d21afc7-5634-452f-ae59-c9787b2c0108) set by...
[16:34:43] <logmsgbot>	 !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on stat1011.eqiad.wmnet with reason: Prepping for switch swap
[16:34:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401338 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2ceb8409-0adc-48e2-b350-9299f0cfd430) set by...
[16:35:47] <wikibugs>	 (03PS3) 10Clément Goubert: trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223)
[16:35:49] <wikibugs>	 07Puppet, 06SRE, 06Infrastructure-Foundations, 06serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667#11401343 (10LSobanski)
[16:36:00] <logmsgbot>	 !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-master1004.eqiad.wmnet,an-redacteddb1001.eqiad.wmnet,an-test-coord1001.eqiad.wmnet with reason: Prepping for switch swap
[16:36:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401346 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a41ee425-7380-4cb9-8254-04c2c38218ab) set by...
[16:38:45] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223)
[16:39:37] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:41:13] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1210618| Bumping portals to master (T128546)]] (duration: 08m 44s)
[16:41:18] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:43:13] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1210618| Bumping portals to master (T128546)]] (duration: 01m 59s)
[16:44:07] <logmsgbot>	 bking@cumin2002 reimage (PID 3998088) is awaiting input
[16:44:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11401417 (10MoritzMuehlenhoff)
[16:47:12] <wikibugs>	 (03PS1) 10Fabfur: admin: add fido key for fabfur [puppet] - 10https://gerrit.wikimedia.org/r/1210629
[16:48:17] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove the new unused tilerator_pass [puppet] - 10https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:48:21] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:52:14] <wikibugs>	 (03PS2) 10Aaron Schulz: Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216)
[16:53:10] <icinga-wm>	 PROBLEM - Host conf1009 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:06] <swfrench-wmf>	 ^ what
[16:54:47] <moritzm>	 the C/D switch migration I suppose?
[16:55:11] <swfrench-wmf>	 that's not supposed to happen for another 1.25h
[16:56:00] <wikibugs>	 (03PS1) 10Aaron Schulz: Cleanup redundant lint-related rest gateway routing config [puppet] - 10https://gerrit.wikimedia.org/r/1210631
[16:56:41] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1013.eqiad.wmnet
[16:58:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-debug releases routed via next (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-debug&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:59:10] <icinga-wm>	 RECOVERY - Host conf1009 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[16:59:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410531)', diff saved to https://phabricator.wikimedia.org/P85535 and previous config saved to /var/cache/conftool/dbconfig/20251124-165910-marostegui.json
[16:59:16] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[16:59:46] <wikibugs>	 (03PS3) 10Aaron Schulz: Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216)
[17:00:31] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1031.eqiad.wmnet with OS trixie
[17:01:07] <wikibugs>	 (03PS1) 10DCausse: dumps: Update cirrus index dumps path to point to new dumps [puppet] - 10https://gerrit.wikimedia.org/r/1210636
[17:01:16] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Adjust addurl logic for 100% passive mode [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957)
[17:01:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie
[17:01:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11401589 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie
[17:01:50] <wikibugs>	 (03PS2) 10Aaron Schulz: Cleanup redundant lint-related rest gateway routing config [puppet] - 10https://gerrit.wikimedia.org/r/1210631
[17:02:49] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet
[17:02:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1210629 (owner: 10Fabfur)
[17:03:10] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1019.eqiad.wmnet
[17:03:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-debug releases routed via next (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-debug&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:03:20] <swfrench-wmf>	 oncalls, FYI - page maybe incoming
[17:04:02] <wikibugs>	 (CR) Fabfur: [C:+2] admin: add fido key for fabfur [puppet] - https://gerrit.wikimedia.org/r/1210629 (owner: Fabfur)
[17:05:25] <swfrench-wmf>	 topranks: claime: you're about to get paged, FYI
[17:05:33] <claime>	 lol
[17:05:36] <swfrench-wmf>	 etcd-mirror is down in codfw
[17:05:37] <claime>	 preemptive strike
[17:05:41] <jinxer-wm>	 FIRING: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[17:05:43] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11401603 (cmooney)
[17:05:48] <claime>	 Do we have to do something
[17:05:50] <claime>	 ?
[17:05:52] <claime>	 or is it expected
[17:06:15] <swfrench-wmf>	 I'm trying to sort it out in #wikimedia-dcops 
[17:06:19] <claime>	 ok
[17:06:22] <swfrench-wmf>	 there's some sort of network cable failure
[17:06:23] <topranks>	 etcd replication down?
[17:06:26] <swfrench-wmf>	 yup
[17:06:35] <topranks>	 conf2005
[17:08:01] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11401619 (MoritzMuehlenhoff)
[17:08:12] <topranks>	 swfrench-wmf: fwiw the link is up to the switch
[17:08:13] <swfrench-wmf>	 so, this is not going to be easy to restore - I'm reading through the log on the process and it might not be possible to simply restart it
[17:08:18] <topranks>	 https://www.irccloud.com/pastebin/mP1kBSxx/
[17:08:23] <claime>	 Crap
[17:08:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: etcdmirror--eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:08:31] <swfrench-wmf>	 topranks: yeah, there was a transient disruption there before
[17:08:58] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:09:13] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1019.eqiad.wmnet
[17:09:20] <wikibugs>	 SRE: Authorize blake for Icinga tasks - https://phabricator.wikimedia.org/T410390#11401624 (Blake) Open→Resolved Submitted and merged.
[17:09:33] <swfrench-wmf>	 trying to figure out if I can perform some surgury
[17:09:37] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:09:39] <swfrench-wmf>	 *surgery
[17:09:50] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1028.eqiad.wmnet with OS trixie
[17:09:58] <claime>	 swfrench-wmf: tell us if we need us for anything
[17:10:03] <claime>	 s/we/you/
[17:10:10] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Wed 10 Dec 2025 05:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[17:10:11] <topranks>	 +1
[17:10:15] <swfrench-wmf>	 ack, I may need to use the --reload script, which will be rather disruptive
[17:10:23] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie
[17:10:23] <swfrench-wmf>	 I'll give you a heads-up if that's the case
[17:10:56] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1030.eqiad.wmnet with OS trixie
[17:11:53] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS trixie
[17:13:52] <swfrench-wmf>	 topranks: claime: restored
[17:14:00] <claime>	 swfrench-wmf: <3 good job
[17:14:21] * swfrench-wmf needs a drink
[17:14:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P85536 and previous config saved to /var/cache/conftool/dbconfig/20251124-171418-marostegui.json
[17:14:27] <swfrench-wmf>	 now to figure out what the hell happened
[17:14:37] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:14:40] <claime>	 swfrench-wmf: It's 5 o'clock somewhere right :P
[17:15:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401676 (Jclark-ctr) an-test-master1002 dse-k8s-worker1011 dse-k8s-worker1013 dse-k8s-worker1019 stat1011 an-redacteddb...
[17:15:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:15:28] <swfrench-wmf>	 claime: what makes it extra-fun is that it's the read-only cluster, so you can't use etcdctl to mutate the keyspace. you have to sling API ops w/ curl.
[17:15:35] <claime>	 awesome
[17:15:41] <jinxer-wm>	 RESOLVED: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[17:16:13] <rzl>	 swfrench-wmf: damn, nice job
[17:16:33] <swfrench-wmf>	 rzl: fortunately, we've been to a similar rodeo before :)
[17:16:49] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update llm model-server image [deployment-charts] - https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906)
[17:16:52] <claime>	 Would probably be worth documenting how to recover that
[17:17:01] <claime>	 Especially since we can't use etcdctl
[17:18:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: etcdmirror--eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:18:46] <swfrench-wmf>	 claime: yeah, the blunt option is documented (i.e., the --reload script), but this kind of surgery isn't, which maybe we should rethink
[17:21:01] <_joe_>	 swfrench-wmf: what happened exactly?
[17:21:35] <_joe_>	 and yes, etcdctl is not a great tool in general to interact with etcd, amazingly
[17:21:38] <moritzm>	 broken cable clip
[17:21:52] <_joe_>	 yeah ok, why did recovery need "surgery" is my question
[17:23:20] <logmsgbot>	 !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[17:23:33] <swfrench-wmf>	 _joe_: so, what happened is that _somehow_ related to the connectivity blip toward conf1009, we either lost a mirrored write _or_ doubly applied a delete on the conf2005 side.
[17:23:35] <_joe_>	 ah I see
[17:23:42] <swfrench-wmf>	 that left the replication index out of sync
[17:23:55] <_joe_>	 swfrench-wmf: no I think the index was not updated after the write of the delete
[17:24:08] <_joe_>	 the failure happened *exactly* between the two
[17:24:10] <logmsgbot>	 !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[17:24:13] <swfrench-wmf>	 right, that's what I mean - on restart, that would doubly apply
[17:24:21] <swfrench-wmf>	 exactly, yeah
[17:24:26] <_joe_>	 I'm looking at the logs and sigh that's an interesting amount of bad luck 
[17:24:32] <swfrench-wmf>	 this is the torn-write scenario we've talked about
[17:24:46] <_joe_>	 so yes in that case the two solutions are either moving the replica index by hand
[17:24:48] <swfrench-wmf>	 exactly, yeah :)
[17:24:50] <_joe_>	 which I guess you did
[17:24:57] <_joe_>	 or reload everything
[17:25:21] <swfrench-wmf>	 exactly, yeah
[17:25:42] <urbanecm>	 fwiw, `helmfile.d/services/rest-gateway/values-staging.yaml` seems to have uncommitted changes at `deploy2002:/srv/deployment-charts`. that...doesn't seem to be expected?
[17:26:00] <claime>	 urbanecm: yeah that's my bad
[17:26:06] <claime>	 Leftover from morning tests
[17:26:12] <claime>	 Is it blocking anything?
[17:26:17] <claime>	 I can reset it if needed
[17:26:23] <urbanecm>	 no, i just noticed that while doing an unrelated deployment
[17:26:37] <urbanecm>	 just wanted to flag it as it seemed unusual
[17:26:37] <claime>	 ack, yeah I'll reset so as not to cause any more confusion then
[17:26:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[17:27:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[17:27:28] <claime>	 {{done}}
[17:27:32] <urbanecm>	 thanks!
[17:27:58] <wikibugs>	 (PS2) FNegri: toolsdb: increase innodb_log_file_size to 512M [puppet] - https://gerrit.wikimedia.org/r/1204472 (https://phabricator.wikimedia.org/T409922)
[17:29:30] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P85537 and previous config saved to /var/cache/conftool/dbconfig/20251124-172929-marostegui.json
[17:33:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773) (owner: Aaron Schulz)
[17:44:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410531)', diff saved to https://phabricator.wikimedia.org/P85538 and previous config saved to /var/cache/conftool/dbconfig/20251124-174437-marostegui.json
[17:44:42] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[17:44:54] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance
[17:45:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T410531)', diff saved to https://phabricator.wikimedia.org/P85539 and previous config saved to /var/cache/conftool/dbconfig/20251124-174501-marostegui.json
[17:46:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11401912 (bking) Note to selves:  - All 5 hosts failed to reimage to UEFI, even after I ran the `sre.hosts.provision` cookbook with the arguments listed above. - @Jclark-c...
[17:50:20] <wikibugs>	 (CR) Ebernhardson: [C:+1] dumps: Update cirrus index dumps path to point to new dumps [puppet] - https://gerrit.wikimedia.org/r/1210636 (owner: DCausse)
[17:51:43] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS bookworm
[17:52:36] <wikibugs>	 (PS2) MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613)
[17:52:49] <wikibugs>	 (CR) MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: MusikAnimal)
[17:53:47] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11401955 (akosiaris) Turnilo for the Telegram Logo (first hit in what @Ladsgroup ) says: Google Proxy as the ISP, in an staggering 85% o...
[17:55:50] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie
[17:57:30] <wikibugs>	 (CR) Dzahn: [C:+2] admin: deprecate the releasers-blubber group [puppet] - https://gerrit.wikimedia.org/r/1207313 (owner: Dzahn)
[17:58:11] <wikibugs>	 (PS3) DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696)
[17:58:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) (owner: DDesouza)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1800)
[18:00:05] <jouncebot>	 ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1800).
[18:01:19] <wikibugs>	 (CR) Dzahn: [C:+2] "thanks! related ticket mostly https://phabricator.wikimedia.org/T410418  because this started by asking "who is still uploading releases i" [puppet] - https://gerrit.wikimedia.org/r/1207313 (owner: Dzahn)
[18:02:57] <swfrench-wmf>	 FYI, please do not begin any MediaWiki deployments during this window. I'll be taking the scap lock for a brief period during an upcoming etcd maintenance.
[18:05:02] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie
[18:05:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410531)', diff saved to https://phabricator.wikimedia.org/P85540 and previous config saved to /var/cache/conftool/dbconfig/20251124-180503-marostegui.json
[18:05:10] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[18:05:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402039 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1031.eqiad.wmnet with OS trixie
[18:09:17] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS bookworm
[18:09:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS bookworm
[18:09:55] <wikibugs>	 (CR) BCornwall: [C:+1] hieradata: lvs: Store VLAN tags as numbers [puppet] - https://gerrit.wikimedia.org/r/1208299 (owner: Majavah)
[18:10:15] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: lvs: Store VLAN tags as numbers [puppet] - https://gerrit.wikimedia.org/r/1208299 (owner: Majavah)
[18:10:58] <wikibugs>	 (PS1) Dzahn: admin/releases: deprecate the releasers-wikibase shell user group [puppet] - https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418)
[18:11:53] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[18:12:54] <wikibugs>	 (CR) CI reject: [V:-1] admin/releases: deprecate the releasers-wikibase shell user group [puppet] - https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418) (owner: Dzahn)
[18:13:08] <wikibugs>	 (CR) BCornwall: [C:+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207294 (owner: Ncmonitor)
[18:13:13] <wikibugs>	 (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1207295 (owner: Ncmonitor)
[18:14:14] <wikibugs>	 (PS4) Majavah: interface::tagged: Add strict typing [puppet] - https://gerrit.wikimedia.org/r/1208293
[18:14:15] <wikibugs>	 (PS2) Majavah: P:openstack: neutron: Cleanup legacy_vlan_naming hiera key [puppet] - https://gerrit.wikimedia.org/r/1208306
[18:14:15] <wikibugs>	 (PS2) Majavah: interface::tagged: Remove legacy_vlan_naming option [puppet] - https://gerrit.wikimedia.org/r/1208307
[18:15:18] <wikibugs>	 (PS2) Dzahn: admin/releases: deprecate the releasers-wikibase shell user group [puppet] - https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418)
[18:16:04] <wikibugs>	 (CR) CI reject: [V:-1] interface::tagged: Remove legacy_vlan_naming option [puppet] - https://gerrit.wikimedia.org/r/1208307 (owner: Majavah)
[18:16:37] <swfrench-wmf>	 !log silenced EtcdReplicationDown. f75c71c9-62d3-449f-860a-9b5e4570717a - T405950
[18:16:38] <wikibugs>	 (CR) Majavah: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1208307 (owner: Majavah)
[18:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:41] <stashbot>	 T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950
[18:17:02] <wikibugs>	 (PS1) DDesouza: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696)
[18:17:58] <wikibugs>	 (PS1) Dzahn: releases: change group ownership of blubber releases to root [puppet] - https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418)
[18:19:11] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[18:20:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P85541 and previous config saved to /var/cache/conftool/dbconfig/20251124-182011-marostegui.json
[18:20:54] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[18:21:08] <swfrench-wmf>	 !log manually transferred etcd-mirror replication source to conf1008 - T405950
[18:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:43] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS trixie
[18:21:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402120 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie executed with errors: - wdqs1032...
[18:23:25] <wikibugs>	 (CR) Dzahn: "I am not sure I would mess with this in the light of these IPs probably soon pointing to the CDN. Then the public IPs will permanently poi" [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: Hashar)
[18:23:32] <logmsgbot>	 !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during etcd ToR switch migration - T405950
[18:23:36] <stashbot>	 T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950
[18:24:02] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[18:24:36] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[18:24:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402141 (Jclark-ctr) {F70616591} {F70616621}. They still seem to be failing for Raid configuration files.
[18:25:05] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on conf1009.eqiad.wmnet with reason: C/D Migration
[18:25:50] <wikibugs>	 (PS4) DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696)
[18:26:38] <wikibugs>	 SRE, SRE Observability: Add Druid as a Private Grafana Datasource - https://phabricator.wikimedia.org/T410933 (herron) NEW
[18:27:33] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 26 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7691" [puppet] - https://gerrit.wikimedia.org/r/1208293 (owner: Majavah)
[18:27:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696) (owner: DDesouza)
[18:28:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402180 (RobH) conf1009 migrated,  @brouberol: Please provide feedback on migration of wikikube-ctrl1003 and kafka-main1008 as these are the last #serviceops hosts to migrate...
[18:31:39] <swfrench-wmf>	 !log manually transferred etcd-mirror replication source back to conf1009 - T405950
[18:31:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:44] <stashbot>	 T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950
[18:32:15] <logmsgbot>	 !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during etcd ToR switch migration - T405950 (duration: 08m 43s)
[18:34:34] <swfrench-wmf>	 !log begin restarts of eqiad-associated confds, navtiming, requestctl - T405950
[18:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P85542 and previous config saved to /var/cache/conftool/dbconfig/20251124-183518-marostegui.json
[18:36:14] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[18:36:21] <swfrench-wmf>	 !log deleted EtcdReplicationDown silence. f75c71c9-62d3-449f-860a-9b5e4570717a - T405950
[18:36:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:27] <wikibugs>	 (PS1) Volans: labs: enable infra-tracing-nfs tracing [labs/private] - https://gerrit.wikimedia.org/r/1210664 (https://phabricator.wikimedia.org/T399313)
[18:39:18] <logmsgbot>	 jclark@cumin1003 reimage (PID 1589693) is awaiting input
[18:41:17] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[18:41:20] <wikibugs>	 (PS4) CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240)
[18:42:10] <wikibugs>	 (PS5) CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240)
[18:43:31] <wikibugs>	 (CR) Ssingh: [C:+1] "Looks good!" [cookbooks] - https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[18:44:21] <logmsgbot>	 jclark@cumin1003 reimage (PID 1591045) is awaiting input
[18:45:10] <wikibugs>	 (CR) CDobbins: sre.loadbalancer: patch to fix reboot action (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[18:47:57] <wikibugs>	 (CR) CDobbins: [C:+2] sre.loadbalancer: patch to fix reboot action [cookbooks] - https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[18:49:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402364 (RobH) >>! In T405950#11402180, @RobH wrote: > conf1009 migrated, >  > @brouberol: Please provide feedback on migration of wikikube-ctrl1003 and kafka-main1008 as the...
[18:50:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410531)', diff saved to https://phabricator.wikimedia.org/P85543 and previous config saved to /var/cache/conftool/dbconfig/20251124-185026-marostegui.json
[18:50:31] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[18:50:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance
[18:50:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85544 and previous config saved to /var/cache/conftool/dbconfig/20251124-185050-marostegui.json
[18:50:59] <wikibugs>	 (Abandoned) Ssingh: Revert "hiera: trafficserver: switch hcaptcha backend to anycast" [puppet] - https://gerrit.wikimedia.org/r/1210603 (owner: Ssingh)
[18:52:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402389 (RobH) IRC Echo Update (chatting with Scott in irc about this just echoing to task for history):  * We want to get feedback from @brouberol on migration of kafka-main...
[18:53:40] <wikibugs>	 (PS1) Bking: wdqs: use correct regex in preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406)
[18:54:22] <wikibugs>	 (Merged) jenkins-bot: sre.loadbalancer: patch to fix reboot action [cookbooks] - https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[18:54:37] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:54:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402399 (bking)
[18:54:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402404 (bking)
[18:55:38] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: use correct regex in preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406) (owner: Bking)
[18:57:30] <wikibugs>	 (PS2) Bking: wdqs: use correct regex in preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406)
[18:57:46] <wikibugs>	 (PS1) Bvibber: Show "no data" message when tooltip does not contain to show [extensions/Chart] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990)
[18:58:22] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003*} and A:liberica
[18:58:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11402451 (RobH) Day 9 Update: * 9 hosts moved, 10 remain - 300 hosts total at start of migration * John worked with Ben directly to migrate the (8) Data Pla...
[18:58:27] <wikibugs>	 (CR) BCornwall: [C:+1] "It looks good." [dns] - https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) (owner: Slyngshede)
[19:02:49] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Chart] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990) (owner: Bvibber)
[19:04:01] <wikibugs>	 (CR) Xcollazo: [C:+1] Rename targetDir to targetDirDefault [dumps] - https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: Itamar Givon)
[19:09:44] <wikibugs>	 (CR) Dzahn: [C:+2] "please see https://phabricator.wikimedia.org/T410729 for a related discussion" [puppet] - https://gerrit.wikimedia.org/r/1024336 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[19:12:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85545 and previous config saved to /var/cache/conftool/dbconfig/20251124-191200-marostegui.json
[19:12:06] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[19:17:59] <wikibugs>	 (CR) BCornwall: switch wikipedia25.org from ncredir-lb to dyna (4 comments) [dns] - https://gerrit.wikimedia.org/r/1207288 (owner: Dzahn)
[19:18:15] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11402572 (Dzahn) While other things are still being discussed here.. for now I would like to add that we have settled on the URL/domain:  > The url http://wikipe...
[19:19:04] <wikibugs>	 (CR) Dzahn: "The URL has been approved now for use with the new micro site." [dns] - https://gerrit.wikimedia.org/r/1207288 (owner: Dzahn)
[19:20:23] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1210678
[19:23:04] <wikibugs>	 (CR) Bking: [C:+2] wdqs: use correct regex in preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406) (owner: Bking)
[19:24:40] <wikibugs>	 (PS2) Dzahn: switch wikipedia25.org from ncredir-lb to dyna [dns] - https://gerrit.wikimedia.org/r/1207288
[19:24:44] <wikibugs>	 (CR) Dzahn: switch wikipedia25.org from ncredir-lb to dyna (3 comments) [dns] - https://gerrit.wikimedia.org/r/1207288 (owner: Dzahn)
[19:25:11] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003*} and A:liberica
[19:25:28] <wikibugs>	 (03CR) 10Dzahn: switch wikipedia25.org from ncredir-lb to dyna (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn)
[19:26:53] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "ship it" [puppet] - 10https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833) (owner: 10Arnaudb)
[19:27:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie
[19:27:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P85547 and previous config saved to /var/cache/conftool/dbconfig/20251124-192707-marostegui.json
[19:28:40] <wikibugs>	 (03CR) 10Dzahn: "I just don't have the context for this to say anything." [cookbooks] - 10https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[19:29:33] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1028.eqiad.wmnet with OS bookworm
[19:29:45] <wikibugs>	 (03CR) 10Dzahn: "maybe Moritz or Simon would be best reviewers for this.. since it's about actual failure modes of reprepro" [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: 10Arnaudb)
[19:29:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS bookworm executed with errors:...
[19:30:55] <wikibugs>	 (03PS1) 10Neriah: trwikisource: Create rollbacker user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931)
[19:31:20] <wikibugs>	 (03CR) 10WMDE-leszek: [C:03+1] "I confirm that WMDE no longer intends to publish Wikibase release files to releases.wikimedia.org. Thank you for deprecating the user grou" [puppet] - 10https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn)
[19:32:47] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] httpbb: move os-reports test file for services on miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:33:17] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] httpbb: delete tests on legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208399 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:33:26] <wikibugs>	 (03CR) 10Xcollazo: Report integrity metric from Wikidata dump scripts (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze)
[19:33:41] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:34:20] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] prometheus: drop class config for role::miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1208401 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:35:34] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] site: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208402 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:38:09] <logmsgbot>	 bking@cumin2002 reimage (PID 4108620) is awaiting input
[19:42:01] <wikibugs>	 (03CR) 10ToprakM: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah)
[19:42:16] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P85548 and previous config saved to /var/cache/conftool/dbconfig/20251124-194215-marostegui.json
[19:44:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11402647 (10RLazarus) 05Open→03In progress Followed up with @DSmit-WMF and confirmed level 1 is what we're doing. Implementation to follow.
[19:45:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11402650 (10RLazarus)
[19:45:30] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:45:46] <wikibugs>	 (03PS3) 10Dzahn: installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080)
[19:48:13] <wikibugs>	 06SRE: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944 (10CDobbins) 03NEW
[19:49:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: 10Huji)
[19:50:16] <wikibugs>	 (03Merged) 10jenkins-bot: Increase AbuseFilter's emergency disable threshold for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: 10Huji)
[19:50:35] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:763982|Increase AbuseFilter's emergency disable threshold for fawiki (T302227)]]
[19:50:40] <stashbot>	 T302227: Increase AbuseFilter's emergency disable threshold for fawiki - https://phabricator.wikimedia.org/T302227
[19:52:24] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:52:50] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] prometheus: drop class config for role::miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1208401 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:53:08] <wikibugs>	 (03CR) 10Neriah: [C:03+1] labswiki: Enable sitenotice on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208478 (https://phabricator.wikimedia.org/T410702) (owner: 10BryanDavis)
[19:54:37] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] httpbb: move os-reports test file for services on miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[19:55:03] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210687
[19:55:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402695 (10RobH) I've chatted with @brouberol via IRC:  > 11:50  <brouberol> kafka hosts can be shut down / disconnected from the network, but not more than one at a time, to b...
[19:55:46] <logmsgbot>	 !log urbanecm@deploy2002 huji, urbanecm: Backport for [[gerrit:763982|Increase AbuseFilter's emergency disable threshold for fawiki (T302227)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:55:51] <stashbot>	 T302227: Increase AbuseFilter's emergency disable threshold for fawiki - https://phabricator.wikimedia.org/T302227
[19:56:04] <logmsgbot>	 !log urbanecm@deploy2002 huji, urbanecm: Continuing with sync
[19:56:54] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran)
[19:57:24] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85549 and previous config saved to /var/cache/conftool/dbconfig/20251124-195723-marostegui.json
[19:57:29] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[19:57:40] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance
[19:57:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T410531)', diff saved to https://phabricator.wikimedia.org/P85550 and previous config saved to /var/cache/conftool/dbconfig/20251124-195747-marostegui.json
[20:00:07] <wikibugs>	 (03CR) 10BCornwall: switch wikipedia25.org from ncredir-lb to dyna (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn)
[20:00:18] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:763982|Increase AbuseFilter's emergency disable threshold for fawiki (T302227)]] (duration: 09m 43s)
[20:02:49] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11402722 (10RLazarus) 05In progress→03Resolved a:03Volans Optimistically resolving. :) @Arian_Bozorg please let us know if you have any troubl...
[20:06:48] <wikibugs>	 (03CR) 10Dzahn: [C:04-2] switch wikipedia25.org from ncredir-lb to dyna (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn)
[20:07:34] <wikibugs>	 (03CR) 10Dzahn: "waiting for input first if these tests should just move to a new target" [puppet] - 10https://gerrit.wikimedia.org/r/1208399 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[20:13:47] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1210656" [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn)
[20:13:51] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] releases: change group ownership of blubber releases to root [puppet] - 10https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn)
[20:13:57] <wikibugs>	 (03PS2) 10Dzahn: releases: change group ownership of blubber releases to root [puppet] - 10https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418)
[20:14:37] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1030.eqiad.wmnet with OS trixie
[20:15:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie
[20:15:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402775 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1030.eqiad.wmnet with OS trixie
[20:17:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410531)', diff saved to https://phabricator.wikimedia.org/P85551 and previous config saved to /var/cache/conftool/dbconfig/20251124-201739-marostegui.json
[20:17:45] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[20:20:22] <wikibugs>	 (03PS1) 10RLazarus: admin: Add daphnesmit to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1210695 (https://phabricator.wikimedia.org/T410426)
[20:21:17] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:23:14] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] releases: change group ownership of blubber releases to root [puppet] - 10https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn)
[20:25:59] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "all tests on miscweb-k8s fail with:" [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[20:26:31] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "unless I am doing the test wrong - have you ever done it against miscweb-k8s?" [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[20:32:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[20:32:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P85552 and previous config saved to /var/cache/conftool/dbconfig/20251124-203247-marostegui.json
[20:36:20] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 20011.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:38:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[20:39:37] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:39:45] <swfrench-wmf>	 FYI, in a couple of minutes I'm going to be updating the local PHP CLI installation on the deployment hosts from PHP 8.1 to 8.3. No impact expected, but wanted to mention.
[20:40:07] <wikibugs>	 (03CR) 10Scott French: [C:03+2] deployment_server: switch deployment hosts to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208006 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[20:46:59] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384#11402862 (10bking) Hey Moritz and Cathal,  Just wanted to add my .02 as someone who's been bitten a few times by the firmware stuff, including writing [[...
[20:47:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P85553 and previous config saved to /var/cache/conftool/dbconfig/20251124-204754-marostegui.json
[20:50:52] <wikibugs>	 (03CR) 10Dzahn: [C:04-2] "in that case I will just abandon" [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn)
[20:50:55] <wikibugs>	 (03Abandoned) 10Dzahn: switch wikipedia25.org from ncredir-lb to dyna [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn)
[20:51:11] <swfrench-wmf>	 !log updated local PHP CLI installation on deploy1003 to 8.3 - T405955
[20:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:16] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[20:55:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1030.eqiad.wmnet with OS trixie
[20:55:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1030.eqiad.wmnet with OS trixie completed: - wdqs1030 (*...
[20:56:01] <swfrench-wmf>	 !log updated local PHP CLI installation on deploy2002 to 8.3 - T405955
[20:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:48] <swfrench-wmf>	 FYI, all done with the above-mentioned PHP upgrades on deployment hosts.
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T2100). nyaa~
[21:00:05] <jouncebot>	 hubaishan, arlolra, AaronSchulz, danisztls, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:02] <arlolra>	 o/
[21:02:15] <arlolra>	 I can get the party started
[21:02:35] <AaronSchulz>	 arlolra: my patch can be rolled with other stuff again
[21:02:56] <arlolra>	 I'll add it to mine
[21:03:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410531)', diff saved to https://phabricator.wikimedia.org/P85554 and previous config saved to /var/cache/conftool/dbconfig/20251124-210302-marostegui.json
[21:03:08] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[21:03:19] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2227.codfw.wmnet with reason: Maintenance
[21:03:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85555 and previous config saved to /var/cache/conftool/dbconfig/20251124-210326-marostegui.json
[21:03:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra)
[21:03:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773) (owner: 10Aaron Schulz)
[21:05:00] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 18 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra)
[21:05:02] <wikibugs>	 (03Merged) 10jenkins-bot: Mark non-wikimedia.org math APIs as deprecated in the sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773) (owner: 10Aaron Schulz)
[21:05:20] <logmsgbot>	 !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1207276|Deploy Parsoid Read Views to 18 wikis (T410564)]], [[gerrit:1206466|Mark non-wikimedia.org math APIs as deprecated in the sandbox (T409773)]]
[21:05:26] <stashbot>	 T410564: Parsoid Read Views to deploy ~2025-11-24 - https://phabricator.wikimedia.org/T410564
[21:05:26] <stashbot>	 T409773: Mark /math/ APIs outside of "wikimedia.org/api/rest_v1" as deprecated - https://phabricator.wikimedia.org/T409773
[21:05:57] <danisztls>	 o/
[21:08:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:09:37] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:10:26] <logmsgbot>	 !log arlolra@deploy2002 arlolra, aaron: Backport for [[gerrit:1207276|Deploy Parsoid Read Views to 18 wikis (T410564)]], [[gerrit:1206466|Mark non-wikimedia.org math APIs as deprecated in the sandbox (T409773)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:10:32] <stashbot>	 T410564: Parsoid Read Views to deploy ~2025-11-24 - https://phabricator.wikimedia.org/T410564
[21:10:32] <stashbot>	 T409773: Mark /math/ APIs outside of "wikimedia.org/api/rest_v1" as deprecated - https://phabricator.wikimedia.org/T409773
[21:12:54] <logmsgbot>	 !log arlolra@deploy2002 arlolra, aaron: Continuing with sync
[21:13:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:14:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie
[21:14:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403020 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie
[21:16:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie
[21:17:08] <logmsgbot>	 !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207276|Deploy Parsoid Read Views to 18 wikis (T410564)]], [[gerrit:1206466|Mark non-wikimedia.org math APIs as deprecated in the sandbox (T409773)]] (duration: 11m 49s)
[21:17:15] <stashbot>	 T410564: Parsoid Read Views to deploy ~2025-11-24 - https://phabricator.wikimedia.org/T410564
[21:17:15] <stashbot>	 T409773: Mark /math/ APIs outside of "wikimedia.org/api/rest_v1" as deprecated - https://phabricator.wikimedia.org/T409773
[21:17:33] <arlolra>	 who's next
[21:17:35] <bvibber>	 o/ sorry was late to my window :D
[21:17:42] <bvibber>	 my patch may update localization files -- do it last
[21:17:55] <bvibber>	 (adds one string to english)
[21:18:35] <AaronSchulz>	 arlolra: thanks
[21:19:52] <arlolra>	 hubaishan: do you want me to deploy for you?
[21:20:01] <hubaishan>	 yes
[21:20:07] <arlolra>	 alrighty
[21:20:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: 10Hubaishan)
[21:21:21] <wikibugs>	 (03Merged) 10jenkins-bot: arwiktionary: make Cite button in main VE bar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: 10Hubaishan)
[21:21:37] <logmsgbot>	 !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1209791|arwiktionary: make Cite button in main VE bar (T410840)]]
[21:21:42] <stashbot>	 T410840: [config] arwiktionary: make Cite button in main VE bar - https://phabricator.wikimedia.org/T410840
[21:25:08] <danisztls>	 bvibber: I can add yours to my batch
[21:25:19] <bvibber>	 \o/ tx
[21:26:05] <logmsgbot>	 !log arlolra@deploy2002 arlolra, hubaishan: Backport for [[gerrit:1209791|arwiktionary: make Cite button in main VE bar (T410840)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:26:14] <hubaishan>	 OK in debug server
[21:26:31] <arlolra>	 great, thanks
[21:26:36] <logmsgbot>	 !log arlolra@deploy2002 arlolra, hubaishan: Continuing with sync
[21:26:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85556 and previous config saved to /var/cache/conftool/dbconfig/20251124-212643-marostegui.json
[21:26:49] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[21:30:32] <logmsgbot>	 !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1209791|arwiktionary: make Cite button in main VE bar (T410840)]] (duration: 08m 54s)
[21:30:37] <stashbot>	 T410840: [config] arwiktionary: make Cite button in main VE bar - https://phabricator.wikimedia.org/T410840
[21:31:25] <arlolra>	 danisztls: all yours
[21:31:42] <danisztls>	 arlolra: thanks
[21:32:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza)
[21:32:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza)
[21:32:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [extensions/Chart] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990) (owner: 10Bvibber)
[21:32:12] <bvibber>	 whee
[21:32:31] <wikibugs>	 (03PS1) 10Scott French: admin: Move swfrench non-FIDO ssh key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1210705
[21:33:07] <wikibugs>	 (03Merged) 10jenkins-bot: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza)
[21:33:10] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza)
[21:33:17] <wikibugs>	 (03Merged) 10jenkins-bot: Show "no data" message when tooltip does not contain to show [extensions/Chart] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990) (owner: 10Bvibber)
[21:33:38] <logmsgbot>	 !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1208408|Pre-deploy 2025 Global Readers Survey (T410696)]], [[gerrit:1210655|Deploy experiment for 2025 Global Readers Survey (T410696)]], [[gerrit:1210669|Show "no data" message when tooltip does not contain to show (T401990)]]
[21:33:44] <stashbot>	 T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696
[21:33:45] <stashbot>	 T401990: Chart displays NaN for entries with no data - https://phabricator.wikimedia.org/T401990
[21:34:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[21:37:55] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] admin: Move swfrench non-FIDO ssh key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1210705 (owner: 10Scott French)
[21:38:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[21:39:36] <danisztls>	 bvibber: is there any problem in deploying your patch via spiderpig?
[21:39:52] <wikibugs>	 (03CR) 10Scott French: "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1210705 (owner: 10Scott French)
[21:40:06] <wikibugs>	 (03CR) 10Scott French: [C:03+2] admin: Move swfrench non-FIDO ssh key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1210705 (owner: 10Scott French)
[21:40:38] <bvibber>	 should work but it's gonna be regenerating the localization cache ;_;
[21:41:26] <danisztls>	 bvibber: ok
[21:41:45] <bvibber>	 really lighting a fire under my ass on my project to reduce the localization cache size by a factor of 10 (i'm up to a factor of 6 and i think i'm going to reach my goal with the next refactor)
[21:41:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P85557 and previous config saved to /var/cache/conftool/dbconfig/20251124-214151-marostegui.json
[21:42:00] <danisztls>	 bvibber: I'm seeing 40 MediaWiki errors in the log
[21:42:00] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11403185 (10RobH) >>! In T407897#11399303, @Marostegui wrote: > Thanks Rob, I think the confusion was whether we ordered the right HW or not. Doing 1G is fine for this host, 10G w...
[21:42:22] <bvibber>	 hmm it should be JS only changes and a new message
[21:44:00] <danisztls>	 bvibber: maybe they aren't related to your patch, but they are there
[21:44:29] <bvibber>	 got a linky to em in logstash?
[21:44:53] <danisztls>	 bvibber: yep, but I don't have logstash perms
[21:45:00] <bvibber>	 heh
[21:45:43] <danisztls>	 https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors
[21:48:08] <bvibber>	 nothing particularly suspicious in there i'd expect to have been affected by the message update
[21:49:21] <danisztls>	 bvibber: yeah, just to make sure, anyway it's still building the images and that log is from production, right?
[21:49:29] <bvibber>	 right
[21:50:04] <danisztls>	 bvibber: thanks
[21:54:09] <bd808>	 danisztls: > I don't have logstash perms -- you appear to be in the "wmf" LDAP group. That should give you logstash access.
[21:56:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1031.eqiad.wmnet with OS trixie
[21:56:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P85558 and previous config saved to /var/cache/conftool/dbconfig/20251124-215659-marostegui.json
[21:58:38] <danisztls>	 bd808: I get service denied due to missing privileges when I try.
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T2200).
[22:00:28] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableContributionTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939)
[22:00:32] <danisztls>	 bvibber: it's finally on test servers
[22:00:54] <sbassett>	 Hey all - is the late backport still happening?
[22:01:04] <bd808>	 danisztls: hmmm... and you authenticated with your https://ldap.toolforge.org/user/dani account?
[22:01:07] <logmsgbot>	 !log dani@deploy2002 dani, bvibber: Backport for [[gerrit:1208408|Pre-deploy 2025 Global Readers Survey (T410696)]], [[gerrit:1210655|Deploy experiment for 2025 Global Readers Survey (T410696)]], [[gerrit:1210669|Show "no data" message when tooltip does not contain to show (T401990)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:01:08] <bvibber>	 whee
[22:01:09] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableContributionTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939)
[22:01:13] <stashbot>	 T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696
[22:01:14] <stashbot>	 T401990: Chart displays NaN for entries with no data - https://phabricator.wikimedia.org/T401990
[22:01:26] <bd808>	 sbassett: yeah. they just got to the staging servers. l10n update slowness.
[22:01:29] <bvibber>	 danisztls: confirmed works
[22:01:42] <sbassett>	 Ok.  Have one sec patch to get out but I can wait a bit.
[22:02:08] <danisztls>	 bvibber: I'm getting MediaWiki internal error.
[22:02:15] <bvibber>	 bd808: i think i'm going to push to finish this l10n cache shrinkage fix way before the may hackathon ;)
[22:02:40] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:02:42] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:03:08] <bvibber>	 mysterious
[22:03:20] <bd808>	 "TypeError: QuickSurveys\SurveyQuestion::__construct(): Argument #1 ($questionDefinition) must be of type array, string given, called in /srv/mediawiki/php-1.46.0-wmf.3/extensions/QuickSurveys/includes/SurveyFactory.php on line"
[22:03:22] <bvibber>	 i was literally looking at a test server page on commons and it rendered my page with updated js
[22:03:33] <bvibber>	 aha
[22:03:42] <danisztls>	 my fault then
[22:03:57] <bvibber>	 ;_;
[22:04:12] <bvibber>	 if you break it and fix it, you get the t-shirt ;)
[22:05:24] <icinga-wm>	 PROBLEM - MD RAID on ms-fe2014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:05:26] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-fe2014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410959 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:05:32] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:05:32] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:05:35] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959 (ops-monitoring-bot) NEW
[22:05:44] <bd808>	 danisztls: you are going to need to "exit scap" to roll back and then fix the config.
[22:06:01] <danisztls>	 bd808: thanks
[22:06:40] <danisztls>	 bd808: now I do a patch to fix and a new deploy?
[22:07:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11403258 (Andrew)
[22:08:54] <bd808>	 danisztls: you will need to revert the config changes and try https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1208408/5/wmf-config/InitialiseSettings.php again after you fix the syntax problems.
[22:09:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie
[22:09:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11403268 (Andrew)
[22:10:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[22:10:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11403282 (Andrew) Assigning to myself pending a decision about hostnames
[22:10:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1029.eqiad.wmnet with OS trixie
[22:10:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11403283 (BTullis) I have failed over the active namenode, so an-master1003 is now ready for the network cable move. ` b...
[22:12:07] <wikibugs>	 (PS1) DDesouza: Revert "Deploy experiment for 2025 Global Readers Survey" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210722
[22:12:07] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85559 and previous config saved to /var/cache/conftool/dbconfig/20251124-221207-marostegui.json
[22:12:13] <stashbot>	 T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[22:12:17] <wikibugs>	 (PS1) DDesouza: Revert "Pre-deploy 2025 Global Readers Survey" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210723
[22:12:23] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance
[22:15:47] <bd808>	 danisztls: do you know how to revert those backports, or do you need help?
[22:16:47] <danisztls>	 bd808: I didn't but I think I figured it out, I reverted on Gerrit and I need to deploy the reverts like a patch, right?
[22:17:34] <bd808>	 danisztls: yeah. that should work. There is a cli tool to do that, but spiderpig doesn't have a gui for it yet.
[22:17:47] <bd808>	 it == rollback in gerrit and merge
[22:18:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210722 (owner: DDesouza)
[22:18:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210723 (owner: DDesouza)
[22:18:35] <danisztls>	 bd808: thanks
[22:19:13] <bd808>	 The cli way to do it is `scap backport --revert [change_numbers ...]`
[22:19:27] <wikibugs>	 (Merged) jenkins-bot: Revert "Deploy experiment for 2025 Global Readers Survey" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210722 (owner: DDesouza)
[22:19:28] <wikibugs>	 (Merged) jenkins-bot: Revert "Pre-deploy 2025 Global Readers Survey" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210723 (owner: DDesouza)
[22:19:36] <wikibugs>	 (PS1) DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696)
[22:19:47] <logmsgbot>	 !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1210722|Revert "Deploy experiment for 2025 Global Readers Survey"]], [[gerrit:1210723|Revert "Pre-deploy 2025 Global Readers Survey"]]
[22:20:47] <bd808>	 bvibber's config change is still in there. Let's see how horrible the build time is, but I'd expect another 20 minutes.
[22:20:52] <wikibugs>	 (PS1) DDesouza: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1210729 (https://phabricator.wikimedia.org/T410696)
[22:20:54] <bvibber>	 hehe
[22:21:51] <wikibugs>	 (PS2) DDesouza: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1210729 (https://phabricator.wikimedia.org/T410696)
[22:21:52] <bd808>	 nope. it was fast
[22:22:11] <bd808>	 "Finished build-and-push-container-images (duration: 01m 35s)"
[22:23:30] <bd808>	 Oh, we didn't use the prior container but it had been built and pushed so really there was no l10n rebuild. That is sort of confusing but it makes sense.
[22:23:41] <wikibugs>	 (PS2) DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696)
[22:24:37] <jinxer-wm>	 FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:25:45] <danisztls>	 sorry about the event and thanks for the help bd808 
[22:26:16] <logmsgbot>	 !log dani@deploy2002 dani: Backport for [[gerrit:1210722|Revert "Deploy experiment for 2025 Global Readers Survey"]], [[gerrit:1210723|Revert "Pre-deploy 2025 Global Readers Survey"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:26:49] <bd808>	 you're doing fine danisztls :) 
[22:27:26] <bd808>	 bvibber: you should probably double check your change on the debug servers
[22:27:54] <bvibber>	 bd808: confirmed good on debug!
[22:28:09] <logmsgbot>	 !log dani@deploy2002 dani: Continuing with sync
[22:32:12] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11403354 (Ladsgroup) ` spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and y...
[22:34:56] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS trixie
[22:35:07] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2001-dev.codfw.wmnet
[22:35:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403357 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie executed with errors: -...
[22:36:09] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet
[22:38:32] <wikibugs>	 (PS2) Cwhite: opensearch: add $apt_component parameter [puppet] - https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795)
[22:39:10] <wikibugs>	 (CR) CI reject: [V:-1] opensearch: add $apt_component parameter [puppet] - https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) (owner: Cwhite)
[22:40:12] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet
[22:40:29] <logmsgbot>	 !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210722|Revert "Deploy experiment for 2025 Global Readers Survey"]], [[gerrit:1210723|Revert "Pre-deploy 2025 Global Readers Survey"]] (duration: 20m 42s)
[22:41:36] <wikibugs>	 (PS3) Cwhite: opensearch: add $apt_component parameter [puppet] - https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795)
[22:41:43] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet
[22:41:59] <sbassett>	 backport window changes looking stable now?
[22:42:36] <wikibugs>	 (CR) Cwhite: [C:+2] aptrepo: add component/opensearch27 [puppet] - https://gerrit.wikimedia.org/r/1208499 (https://phabricator.wikimedia.org/T410795) (owner: Cwhite)
[22:42:53] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet
[22:44:57] <sbassett>	 Eh, looks like the patch I wanted to deploy went out with the scap prep from that last revert deploy.  So I guess we’re good on that :)
[22:45:37] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[22:45:41] <stashbot>	 T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[22:46:45] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet
[22:46:50] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.codfw.wmnet
[22:51:48] <bd808>	 danisztls: Are you going to try another deployment, or can sbassett take over for his security backport window?
[22:53:48] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11403412 (Ladsgroup) The query was wrong, the like should have an extra % at the end. Let me try again.
[22:54:28] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696) (owner: DDesouza)
[22:54:35] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet
[22:54:37] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:54:37] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210729 (https://phabricator.wikimedia.org/T410696) (owner: DDesouza)
[22:54:40] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet
[22:55:24] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11403413 (Ladsgroup) ` spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and y...
[22:55:30] <danisztls>	 bd808: he can take over, I will deploy tomorrow since it's 1 hour past the window
[22:55:59] <bd808>	 :+1: You have the con sbassett 
[22:56:11] <bvibber>	 🫡
[22:56:25] <wikibugs>	 (CR) Cwhite: [C:+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1208500/7696/" [puppet] - https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) (owner: Cwhite)
[22:57:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:57:54] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:58:20] <bd808>	 danisztls: I chatted with thcipriani and he pointed out that logstash-access is a separate right these days. You can apply for it at https://idm.wikimedia.org/permissions/. You should get it to go along with your spiderpig deployment rights.
[22:59:14] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[22:59:20] <stashbot>	 T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[22:59:34] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[23:00:54] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:01:46] <danisztls>	 bd808: thanks! just requested
[23:02:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:03:52] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2002-dev.codfw.wmnet
[23:03:57] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2006-dev.codfw.wmnet
[23:13:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet
[23:13:33] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
[23:15:54] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:16:09] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:19:01] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Allow providing a set of valid keys for site verify per action [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657)
[23:19:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657) (owner: Kosta Harlan)
[23:19:54] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:21:09] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2003-dev (172.20.5.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:22:41] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
[23:22:46] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2010-dev.codfw.wmnet
[23:24:58] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957) (owner: Kosta Harlan)
[23:25:28] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: Kosta Harlan)
[23:25:41] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[23:25:49] <wikibugs>	 (PS3) Kosta Harlan: hCaptcha: Define valid SiteKeys for account creation and edit triggers [mediawiki-config] - https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657)
[23:25:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: Kosta Harlan)
[23:29:49] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet
[23:29:54] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudnet2005-dev.codfw.wmnet
[23:29:56] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1028.eqiad.wmnet with OS trixie
[23:30:39] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie
[23:30:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1029.eqiad.wmnet with OS trixie executed with errors: -...
[23:35:33] <wikibugs>	 (CR) Andrew Bogott: [C:+1] P:openstack: neutron: Cleanup legacy_vlan_naming hiera key [puppet] - https://gerrit.wikimedia.org/r/1208306 (owner: Majavah)
[23:37:45] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2005-dev.codfw.wmnet
[23:37:50] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2002-dev.codfw.wmnet
[23:39:36] <wikibugs>	 (PS1) Joal: Bump Hadoop max container size to 128Gb [puppet] - https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966)
[23:44:47] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet
[23:44:52] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudnet2006-dev.codfw.wmnet
[23:45:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:50:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:52:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2006-dev.codfw.wmnet
[23:52:32] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2003-dev.codfw.wmnet
[23:59:14] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet
[23:59:18] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet