[00:01:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T410589)', diff saved to https://phabricator.wikimedia.org/P85477 and previous config saved to /var/cache/conftool/dbconfig/20251124-000144-ladsgroup.json
[00:01:49] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[00:02:00] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[00:39:36] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210171
[00:40:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210171 (owner: TrainBranchBot)
[00:52:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1210171 (owner: TrainBranchBot)
[01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:09:36] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:10:01] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210179
[01:10:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210179 (owner: TrainBranchBot)
[01:27:21] PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[01:32:01] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1210179 (owner: TrainBranchBot)
[01:32:55] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:55:49] RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.65 ms
[02:13:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:23:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:28:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:29:55] PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[02:42:45] (PS2) Tim Starling: Revert "Authorize self for Google Search Console" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175850
[02:48:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[02:50:11] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175850 (owner: Tim Starling)
[02:51:01] (Merged) jenkins-bot: Revert "Authorize self for Google Search Console" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175850 (owner: Tim Starling)
[02:51:42] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1175850|Revert "Authorize self for Google Search Console"]]
[02:54:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:54:50] RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms
[03:06:54] (PS1) Tim Starling: admin: Remove my non-FIDO keys [puppet] - https://gerrit.wikimedia.org/r/1210224
[03:17:48] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1175850|Revert "Authorize self for Google Search Console"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:18:25] !log tstarling@deploy2002 tstarling: Continuing with sync
[03:31:58] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1175850|Revert "Authorize self for Google Search Console"]] (duration: 40m 16s)
[04:08:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:18:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:23:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:26:29] PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[04:33:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:39:36] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:43:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[04:53:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[05:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:36] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:25:49] RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms
[05:26:19] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:27:31] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:28:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:29:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (2a02:ec80:700:fe0b::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:30:19] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:33:31] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (2a02:ec80:700:fe0b::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:09] (CR) Marostegui: [C:+2] data.yaml: Add FIDO key for marostegui [puppet] - https://gerrit.wikimedia.org/r/1207863 (owner: Marostegui)
[06:23:09] ops-codfw, SRE, DBA, DC-Ops: db2144 (ms2) memory error - 
https://phabricator.wikimedia.org/T410480#11399296 (Marostegui) Open→Resolved a:Marostegui Closing this for now - we will see how long it takes for the DIMM to crash again. Thanks @Jhancock.wm!
[06:23:19] ops-codfw, SRE, DBA, DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11399299 (Marostegui) a:Marostegui→Jhancock.wm
[06:26:22] PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[06:28:01] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11399303 (Marostegui) Thanks Rob, I think the confusion was whether we ordered the right HW or not. Doing 1G is fine for this host, 10G would be ideal, but we are not expecting...
[06:37:37] !log Deploy schema change on s6 on the master with replication T410531
[06:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:42] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[06:38:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Schema change
[06:38:33] marostegui@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[06:38:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[06:48:27] (CR) Giuseppe Lavagetto: [C:+2] cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:48:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[06:50:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:54:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:54:52] RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.61 ms
[06:59:46] (CR) Arnaudb: [C:+2] apt: add an alert on reprepro errors [alerts] - https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: Arnaudb)
[07:00:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:00:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85478 and previous config saved to /var/cache/conftool/dbconfig/20251124-070050-marostegui.json
[07:00:55] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[07:01:28] (Merged) 
jenkins-bot: apt: add an alert on reprepro errors [alerts] - https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: Arnaudb)
[07:05:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85479 and previous config saved to /var/cache/conftool/dbconfig/20251124-070539-marostegui.json
[07:14:33] (PS1) Giuseppe Lavagetto: admin: add FIDO ssh key for oblivian [puppet] - https://gerrit.wikimedia.org/r/1210368
[07:20:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P85480 and previous config saved to /var/cache/conftool/dbconfig/20251124-072047-marostegui.json
[07:30:02] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1210224 (owner: Tim Starling)
[07:35:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[07:35:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P85481 and previous config saved to /var/cache/conftool/dbconfig/20251124-073555-marostegui.json
[07:36:13] (CR) Marostegui: [C:+1] admin: add FIDO ssh key for oblivian [puppet] - https://gerrit.wikimedia.org/r/1210368 (owner: Giuseppe Lavagetto)
[07:37:21] (PS4) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907)
[07:38:38] (PS5) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907)
[07:40:08] (CR) Arnaudb: [C:+2] gerrit: add dry run rsync 
[cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:40:14] (CR) Arnaudb: [C:+2] gerrit: add a local backup cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:40:59] (PS6) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907)
[07:44:39] (CR) Giuseppe Lavagetto: [C:+2] admin: add FIDO ssh key for oblivian [puppet] - https://gerrit.wikimedia.org/r/1210368 (owner: Giuseppe Lavagetto)
[07:46:37] (Merged) jenkins-bot: gerrit: add dry run rsync [cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:46:58] (Merged) jenkins-bot: gerrit: add a local backup cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:51:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T410531)', diff saved to https://phabricator.wikimedia.org/P85482 and previous config saved to /var/cache/conftool/dbconfig/20251124-075103-marostegui.json
[07:51:08] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[07:51:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[07:51:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T410531)', diff saved to https://phabricator.wikimedia.org/P85483 and previous config saved to /var/cache/conftool/dbconfig/20251124-075126-marostegui.json
[07:56:18] (CR) Muehlenhoff: [C:+2] Switch cloudcumin2001 to nftables [puppet] - https://gerrit.wikimedia.org/r/1204369 (owner: Muehlenhoff)
[08:00:05] Amir1, 
Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T0800).
[08:00:05] hubaishan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:05:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410531)', diff saved to https://phabricator.wikimedia.org/P85484 and previous config saved to /var/cache/conftool/dbconfig/20251124-080519-marostegui.json
[08:05:25] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[08:07:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet
[08:08:50] (CR) Slyngshede: [C:+1] "Very nice." [puppet] - https://gerrit.wikimedia.org/r/1208362 (owner: Muehlenhoff)
[08:11:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet
[08:15:33] (PS1) Muehlenhoff: Switch the cluster::cloud_management role to nftables [puppet] - https://gerrit.wikimedia.org/r/1210395
[08:18:10] (PS2) Arnaudb: gerrit: remove localbackup logic from failover [cookbooks] - https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833)
[08:18:10] (CR) Arnaudb: "after merging 1193590 this patch removes the redundant logic in the failover cookbook" [cookbooks] - https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:20:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P85485 and previous config saved to /var/cache/conftool/dbconfig/20251124-082027-marostegui.json
[08:27:23] PROBLEM - Host cloudidp2001-dev is DOWN: PING 
CRITICAL - Packet loss = 100%
[08:31:54] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1210395 (owner: Muehlenhoff)
[08:35:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P85487 and previous config saved to /var/cache/conftool/dbconfig/20251124-083535-marostegui.json
[08:39:36] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:43:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:44:31] SRE, Traffic, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Platform-SRE (2025.11.07 - 2025.11.28), Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11399416 (Ge...
[08:44:31] !log installing jinja2 security updates
[08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:35] (CR) AikoChou: [C:+1] "We can wait for the patch extending paragraph extraction code (https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-service" [deployment-charts] - https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[08:50:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410531)', diff saved to https://phabricator.wikimedia.org/P85488 and previous config saved to /var/cache/conftool/dbconfig/20251124-085042-marostegui.json
[08:50:47] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[08:50:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[08:51:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T410531)', diff saved to https://phabricator.wikimedia.org/P85489 and previous config saved to /var/cache/conftool/dbconfig/20251124-085104-marostegui.json
[08:53:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[08:54:14] (CR) Bartosz Wójtowicz: "Good idea, let's do it this way :) I'll start with reviewing the paragraph extraction patch." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[08:54:16] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1208451 (owner: RLazarus)
[08:55:52] RECOVERY - Host cloudidp2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.57 ms
[08:55:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410531)', diff saved to https://phabricator.wikimedia.org/P85490 and previous config saved to /var/cache/conftool/dbconfig/20251124-085554-marostegui.json
[08:55:59] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[08:58:08] (PS1) Volans: admin: add user chandra-wmde [puppet] - https://gerrit.wikimedia.org/r/1210496 (https://phabricator.wikimedia.org/T409707)
[08:58:15] (CR) Filippo Giunchedi: [C:+1] Switch the cluster::cloud_management role to nftables [puppet] - https://gerrit.wikimedia.org/r/1210395 (owner: Muehlenhoff)
[08:58:36] (CR) Volans: [C:-1] "Pending approval on task." 
[puppet] - https://gerrit.wikimedia.org/r/1210496 (https://phabricator.wikimedia.org/T409707) (owner: Volans)
[09:03:55] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[09:03:56] !log gehel@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[09:05:50] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[09:06:00] (PS1) AikoChou: changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538)
[09:09:36] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:09:50] !log taavi@puppetserver1001 ~ $ sudo puppet node deactivate cloudidp2001-dev.wikimedia.org # leftover from move to private addresses T410294
[09:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:55] T410294: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294
[09:11:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P85491 and previous config saved to /var/cache/conftool/dbconfig/20251124-091102-marostegui.json
[09:11:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:16:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:19:13] (PS2) Clément Goubert: trafficserver: action api to rest-gateway enwiki 100% [puppet] - https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223)
[09:19:31] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1208426 (owner: Ayounsi)
[09:26:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P85492 and previous config saved to /var/cache/conftool/dbconfig/20251124-092609-marostegui.json
[09:26:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:30:11] (CR) Clément Goubert: [C:+2] trafficserver: action api to rest-gateway enwiki 100% [puppet] - https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223) (owner: Clément Goubert)
[09:31:32] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:32:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: Hubaishan)
[09:34:46] sre-alert-triage, serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T410858 (LSobanski) NEW
[09:34:56] sre-alert-triage, serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T410858#11399550 (LSobanski) Also eqiad-staging and codfw-staging.
[09:38:05] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:38:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:38:55] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:39:14] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:40:16] (CR) Btullis: growthbook: add the kerberos token renewer sidecar to support kerberized connections (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:40:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:30] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:40:49] (CR) Federico Ceratto: [C:+2] sre.mysql.clone: Refactor, Pool in source host ASAP [cookbooks] - https://gerrit.wikimedia.org/r/1202673 (https://phabricator.wikimedia.org/T410376) (owner: Federico Ceratto)
[09:40:55] (CR) Muehlenhoff: [C:+2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/1207786 (owner: Muehlenhoff)
[09:40:56] (CR) Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: Brouberol)
[09:40:59] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:41:05] !log jmm@dns1004 START - running authdns-update
[09:41:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410531)', diff saved to https://phabricator.wikimedia.org/P85494 and previous config saved to /var/cache/conftool/dbconfig/20251124-094117-marostegui.json
[09:41:22] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531
[09:41:31] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:41:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[09:41:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T410531)', diff saved to https://phabricator.wikimedia.org/P85495 and previous config saved to /var/cache/conftool/dbconfig/20251124-094141-marostegui.json
[09:41:52] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:42:06] !log jmm@dns1004 END - running authdns-update [09:42:37] (03PS7) 10Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) [09:42:38] (03CR) 10Brouberol: growthbook: add the kerberos token renewer sidecar to support kerberized connections (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: 10Brouberol) [09:42:44] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:43:35] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: 10Brouberol) [09:43:53] 07sre-alert-triage, 06serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T410858#11399655 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert [09:44:57] (03CR) 10Brouberol: [C:03+2] growthbook: add the kerberos token renewer sidecar to support kerberized connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206187 (https://phabricator.wikimedia.org/T408907) (owner: 10Brouberol) [09:46:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T410531)', diff saved to https://phabricator.wikimedia.org/P85496 and previous config saved to /var/cache/conftool/dbconfig/20251124-094632-marostegui.json [09:46:37] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:49:15] (03PS1) 10Brouberol: growthbook: add the general values to the list of environment values to inject to the subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210516 (https://phabricator.wikimedia.org/T408907) [09:49:35] (03CR) 10Ayounsi: [C:03+2] ayounsi: Add new yubikey key [puppet] - 
10https://gerrit.wikimedia.org/r/1208426 (owner: 10Ayounsi) [09:51:08] (03CR) 10Brouberol: [C:03+2] growthbook: add the general values to the list of environment values to inject to the subcharts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210516 (https://phabricator.wikimedia.org/T408907) (owner: 10Brouberol) [09:51:16] (03CR) 10Tchanders: "Looks good from the perspective of aligning with temporary accounts policy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [09:53:03] (03CR) 10Dreamy Jazz: [C:03+1] "Looks good from the Product Safety & Integrity team's point of view" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [09:53:43] (03CR) 10JMeybohm: [C:03+2] fetch_external_clouds_vendors_nets.py: ipblock-source support [puppet] - 10https://gerrit.wikimedia.org/r/1207848 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [09:55:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:55:47] !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster [09:56:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:57:17] (03CR) 10Dreamy Jazz: [C:03+1] "Noting that `wgRemoveGroups` was not updated, so only the `sysop` group can remove the `temporary-account-viewer` group. 
However, I assume" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [09:58:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:58:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:59:02] !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster [10:00:30] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:01:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P85497 and previous config saved to /var/cache/conftool/dbconfig/20251124-100139-marostegui.json [10:03:03] (03PS1) 10Marostegui: db1153: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210518 [10:07:13] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga: suppress script-managed notifications and pages [puppet] - 10https://gerrit.wikimedia.org/r/1206884 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:07:22] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga: add smtp settings to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1206885 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:07:35] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga: generate contacts list [puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:07:55] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga: trigger pages only for the active instance [puppet] - 10https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:12:02] (03CR) 10Brouberol: [C:03+1] Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [10:14:01] !log 
gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:14:26] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:16:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P85498 and previous config saved to /var/cache/conftool/dbconfig/20251124-101647-marostegui.json [10:17:36] (03CR) 10Marostegui: [C:03+2] db1153: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210518 (owner: 10Marostegui) [10:18:23] ACKNOWLEDGEMENT - snapshot of s5 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s5 at eqiad (db1216) taken on 2025-11-23 20:35:02 is 395 GiB, but the previous one was 517 GiB, a change of -23.7 % Jcrespo expected by DBAs https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:19:17] FIRING: ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:09] 06SRE, 06Traffic, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th), 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11399911 (10Ge... 
[10:22:51] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:23:28] (03CR) 10Btullis: [C:03+2] Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [10:24:17] RESOLVED: ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:43] !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster [10:25:52] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:26:56] !log gehel@cumin1003 START - Cookbook sre.hosts.reboot-cluster [10:27:10] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster [10:27:47] (03PS2) 10AikoChou: changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) [10:29:58] (03PS12) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [10:30:18] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:31:26] (03Merged) 10jenkins-bot: Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [10:31:54] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:31:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T410531)', diff saved to 
https://phabricator.wikimedia.org/P85499 and previous config saved to /var/cache/conftool/dbconfig/20251124-103155-marostegui.json [10:32:00] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:32:07] (03PS3) 10Federico Ceratto: Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) [10:32:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [10:32:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T410531)', diff saved to https://phabricator.wikimedia.org/P85500 and previous config saved to /var/cache/conftool/dbconfig/20251124-103218-marostegui.json [10:33:18] (03CR) 10Federico Ceratto: "Updated to use a more strict hostname check based the discussion with Manuel on IRC" [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [10:33:27] (03PS1) 10Tiziano Fogli: metamonitoring/icinga: convert last_check to timestamp [puppet] - 10https://gerrit.wikimedia.org/r/1210523 (https://phabricator.wikimedia.org/T393625) [10:34:15] (03PS13) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [10:34:16] (03CR) 10Dragoniez: "@thalia.e.chan@googlemail.com @dreamyjazzwikipedia@gmail.com Thanks for the reviews! About `wgRemoveGroups`, I think I'll leave it as is s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [10:34:21] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:36:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[10:36:29] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:36:41] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since this is just a time-format conversion fix for an already deployed patch." [puppet] - 10https://gerrit.wikimedia.org/r/1210523 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:36:54] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:37:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T410531)', diff saved to https://phabricator.wikimedia.org/P85501 and previous config saved to /var/cache/conftool/dbconfig/20251124-103708-marostegui.json [10:37:12] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [10:37:13] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:37:24] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[10:38:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:39:02] (03CR) 10CI reject: [V:04-1] Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [10:39:52] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:40:18] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:40:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:43:13] (03PS1) 10Sergio Gimeno: [beta] GrowthExperiments: increase to log level to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) [10:43:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [10:44:00] (03PS2) 10Sergio Gimeno: [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) [10:44:45] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11399970 (10KOfori) Hi, approving this on behalf of @Kappakayala as her delegate while OOO. 
[10:46:08] !log Deploying envoy 1.32 to api-gateway [10:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:15] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:47:21] (03PS5) 10Daniel Kinzler: rest-gateway: allow rate limits per time unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) [10:47:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:48:43] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:48:51] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:51:14] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:51:33] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:51:55] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:52:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P85502 and previous config saved to /var/cache/conftool/dbconfig/20251124-105216-marostegui.json [10:52:18] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:52:24] (03PS1) 10Tiziano Fogli: metamonitoring/icinga: convert now variable to timestamp [puppet] - 10https://gerrit.wikimedia.org/r/1210529 (https://phabricator.wikimedia.org/T393625) [10:53:47] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since this is just a time-conversion fix for an already deployed patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1210529 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:54:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:33] (03PS14) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [10:55:47] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:56:02] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:56:19] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:56:45] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:56:48] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:56:51] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:56:55] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:57:08] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:59:35] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: allow rate limits per time unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1100) [11:00:32] (03PS3) 10Daniel Kinzler: rest-gateway: implement per-route rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) [11:01:57] (03Merged) 10jenkins-bot: rest-gateway: 
allow rate limits per time unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler) [11:02:39] (03CR) 10Michael Große: [C:03+1] [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [11:05:27] (03PS3) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [11:07:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P85503 and previous config saved to /var/cache/conftool/dbconfig/20251124-110723-marostegui.json [11:15:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:16:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:21:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [11:21:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85504 and previous config saved to /var/cache/conftool/dbconfig/20251124-112111-marostegui.json [11:21:16] T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441 [11:22:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 
(T410531)', diff saved to https://phabricator.wikimedia.org/P85505 and previous config saved to /var/cache/conftool/dbconfig/20251124-112231-marostegui.json [11:22:36] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:22:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance [11:22:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 6 hosts with reason: Maintenance [11:23:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T410531)', diff saved to https://phabricator.wikimedia.org/P85506 and previous config saved to /var/cache/conftool/dbconfig/20251124-112306-marostegui.json [11:23:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - repool after schema change test [11:24:38] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [11:25:15] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:25:35] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:26:05] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1187 gradually with 4 steps - repool after schema change test [11:26:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:28:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T410531)', diff saved to https://phabricator.wikimedia.org/P85508 and previous config saved to /var/cache/conftool/dbconfig/20251124-112819-marostegui.json [11:28:24] T410531: Drop rc_type from recentchanges in wmf production - 
https://phabricator.wikimedia.org/T410531 [11:28:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [11:28:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85509 and previous config saved to /var/cache/conftool/dbconfig/20251124-112850-marostegui.json [11:28:55] T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441 [11:31:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - repool after schema change test [11:32:21] (03PS4) 10Federico Ceratto: Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) [11:39:01] (03CR) 10CI reject: [V:04-1] Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [11:40:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org [11:41:07] (03PS1) 10Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) [11:41:56] (03CR) 10CI reject: [V:04-1] gerrit: add a layer of CNAME to ease switch overs [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [11:43:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P85511 and previous config saved to /var/cache/conftool/dbconfig/20251124-114326-marostegui.json [11:44:47] (03PS2) 10Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) [11:44:48] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org [11:46:06] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:46:18] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:52:01] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:53:15] (03CR) 10Muehlenhoff: [C:04-1] "Blocked by https://phabricator.wikimedia.org/T410879" [puppet] - 10https://gerrit.wikimedia.org/r/1208362 (owner: 10Muehlenhoff) [11:53:39] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:54:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org [11:56:24] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:56:54] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:58:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org [11:58:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P85513 and previous config saved to /var/cache/conftool/dbconfig/20251124-115834-marostegui.json [11:58:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2005.wikimedia.org [11:59:39] (03PS1) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1210566 [12:01:42] !log installing Squid security updates [12:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2005.wikimedia.org [12:05:46] (03PS2) 10Bartosz Wójtowicz: ml-services: Update the image for revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) [12:13:13] (03CR) 10Klausman: [C:03+1] changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [12:13:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T410531)', diff saved to https://phabricator.wikimedia.org/P85515 and previous config saved to /var/cache/conftool/dbconfig/20251124-121341-marostegui.json [12:13:47] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:13:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [12:15:25] (03CR) 10AikoChou: [C:03+2] ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:17:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1187 gradually with 4 steps - repool after schema change test [12:17:21] (03Merged) 10jenkins-bot: ml-services: Update the image for revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:18:50] (03PS1) 10Btullis: Attempt to fix the OIDC authentication for growthbook [puppet] - 10https://gerrit.wikimedia.org/r/1210570 (https://phabricator.wikimedia.org/T409183) [12:19:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7686/co" [puppet] - 10https://gerrit.wikimedia.org/r/1210570 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [12:23:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:24:34] (03CR) 10Hnowlan: [C:03+1] Add blake to ops, remove blake from ops-limited. [puppet] - 10https://gerrit.wikimedia.org/r/1207824 (https://phabricator.wikimedia.org/T410612) (owner: 10Blake) [12:26:52] (03CR) 10Clément Goubert: [C:03+2] Add blake to ops, remove blake from ops-limited. [puppet] - 10https://gerrit.wikimedia.org/r/1207824 (https://phabricator.wikimedia.org/T410612) (owner: 10Blake) [12:32:17] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[12:34:00] (03PS1) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [12:36:04] (03CR) 10CI reject: [V:04-1] wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [12:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:39:47] (03PS1) 10Bartosz Wójtowicz: ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) [12:40:45] (03PS2) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [12:41:31] (03PS5) 10Federico Ceratto: Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) [12:41:32] (03PS2) 10Muehlenhoff: EFI-enabled Partman recipe (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400) [12:41:59] (03CR) 10AikoChou: [C:03+1] ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:42:00] (03CR) 10Klausman: [C:03+1] ml-services: Remove experimental revise-tone-task-generator deployment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:42:06] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2025.codfw.wmnet [12:42:44] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:42:52] (03PS3) 10Muehlenhoff: Test EFI-enabled Partman recipe on db1169 [puppet] - 10https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400) [12:43:24] (03CR) 10CI reject: [V:04-1] wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [12:44:25] (03Merged) 10jenkins-bot: ml-services: Remove experimental revise-tone-task-generator deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210583 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:45:17] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:45:20] (03PS3) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [12:47:36] (03CR) 10CI reject: [V:04-1] wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [12:47:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1223.eqiad.wmnet with reason: Maintenance [12:48:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208328 (https://phabricator.wikimedia.org/T410731) (owner: 10D3r1ck01) [12:48:53] (03CR) 10CI reject: [V:04-1] Support both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [12:49:02] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2025.codfw.wmnet [12:51:45] (03CR) 10Marostegui: [C:03+1] Test EFI-enabled Partman recipe on db1169 [puppet] - 10https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400) (owner: 10Muehlenhoff) [12:54:12] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster [12:54:12] !log gehel@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [12:54:36] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [12:54:43] (03CR) 10Muehlenhoff: [C:03+2] Test EFI-enabled Partman recipe on db1169 [puppet] - 10https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400) (owner: 10Muehlenhoff) [12:55:01] !log 
gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster [12:55:28] !log gehel@cumin2002 START - Cookbook sre.hosts.reboot-cluster [12:57:44] (03PS1) 10Kosta Harlan: MonologChannels: Add WikiEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210586 (https://phabricator.wikimedia.org/T410877) [12:58:23] (03PS1) 10Muehlenhoff: Enable imports on maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1210587 (https://phabricator.wikimedia.org/T409528) [13:00:43] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [13:05:44] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [13:06:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210587 (https://phabricator.wikimedia.org/T409528) (owner: 10Muehlenhoff) [13:07:59] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:08:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:09:36] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:14:46] (03CR) 10Jgiannelos: [C:03+1] Turn paging on for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/1203835 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [13:15:44] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:16:11] !log gehel@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:17:39] 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11400340 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert 
[13:20:24] (03PS4) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [13:21:13] (03CR) 10Dreamy Jazz: [C:03+1] MonologChannels: Add WikiEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210586 (https://phabricator.wikimedia.org/T410877) (owner: 10Kosta Harlan) [13:27:02] (03PS1) 10Volans: labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) [13:28:01] !log cleaning up watchlist of deceased User:JarrahTree in enwiki and commonswiki [13:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:18] (03CR) 10Clément Goubert: rest-gateway: assign ratelimit class by network range (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [13:32:14] (03CR) 10Bartosz Wójtowicz: [C:03+1] changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:33:20] (03CR) 10AikoChou: [C:03+2] changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:33:21] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie [13:35:04] 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Yubikey-SSH-FIDO for Guillaume (gehel) - https://phabricator.wikimedia.org/T410888 (10Gehel) 03NEW [13:35:10] (03Merged) 10jenkins-bot: changeprop: add LiftWing revise-tone-task-generator to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210505 (https://phabricator.wikimedia.org/T408538) (owner: 10AikoChou) [13:35:42] (03PS1) 10Gehel: ssh: FIDO key for 
Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) [13:36:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie [13:40:10] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [13:40:16] jouncebot: nowandnext [13:40:16] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [13:40:16] In 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400) [13:40:42] (03CR) 10Ladsgroup: [C:03+2] Fix db config for offline maint scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) (owner: 10Ladsgroup) [13:41:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) (owner: 10Ladsgroup) [13:41:06] (03PS5) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [13:41:32] (03Merged) 10jenkins-bot: Fix db config for offline maint scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) (owner: 10Ladsgroup) [13:41:54] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]] [13:42:00] T410738: pretrain failing when calling mergeMessageFileList.php - https://phabricator.wikimedia.org/T410738 [13:42:00] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [13:42:00] !log ladsgroup@deploy2002 sync-world failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', 
'/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.1aRzXHW4OP']' returned [13:42:00] non-zero exit status 255. (scap version: 4.228.0) (duration: 00m 07s) [13:43:47] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]] [13:43:53] !log ladsgroup@deploy2002 sync-world failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.Seyz9S1dDd']' returned [13:43:54] non-zero exit status 255. (scap version: 4.228.0) (duration: 00m 06s) [13:46:48] (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: move row C hosts to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1207739 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [13:46:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400460 (10Jclark-ctr) @bking @RKemper I’m having issues imaging these servers. Since they’re UEFI, shouldn’t the preseed file be -efi? [13:47:27] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11400462 (10elukey) [13:51:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400466 (10Jclark-ctr) a:05Jclark-ctr→03bking {F70586111} Also when trying to image with Trixie i did notice output = Bookworm> [13:52:42] (03CR) 10Btullis: [C:03+1] "Looks good. 
I have also requested that the user send me the same key over Slack, and it matches." [puppet] - 10https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) (owner: 10Gehel) [13:53:18] !log dpogorzelski@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [13:53:51] !log dpogorzelski@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [13:54:11] (03PS1) 10Ladsgroup: Fix fix db config for offline maint scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210593 (https://phabricator.wikimedia.org/T410738) [13:54:16] !log dpogorzelski@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [13:54:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210593 (https://phabricator.wikimedia.org/T410738) (owner: 10Ladsgroup) [13:54:45] (03CR) 10Btullis: [C:03+2] Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [13:54:51] !log dpogorzelski@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [13:54:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) (owner: 10Gehel) [13:55:24] !log dpogorzelski@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [13:55:34] !log dpogorzelski@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [13:55:35] (03Merged) 10jenkins-bot: Fix fix db config for offline maint scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210593 (https://phabricator.wikimedia.org/T410738) (owner: 10Ladsgroup) [13:55:54] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 
T405087)]], [[gerrit:1210593|Fix fix db config for offline maint scripts (T410738 T405087)]] [13:56:00] T410738: pretrain failing when calling mergeMessageFileList.php - https://phabricator.wikimedia.org/T410738 [13:56:00] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [13:56:11] (03PS2) 10Gehel: ssh: FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) [13:56:34] (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: move row D hosts to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1207740 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [13:57:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400477 (10MoritzMuehlenhoff) You can simply confirm and continue, Puppet 7 is already enabled for wdqs1031 via the insetup::data_platform_ferm role in site.pp [13:57:20] (03CR) 10Gehel: [C:03+2] ssh: FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210592 (https://phabricator.wikimedia.org/T410888) (owner: 10Gehel) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400). [14:00:05] anzx, edsanders, Dragoniez, hubaishan, Sergi0, and xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:20] o/ [14:00:20] I can’t deploy, sorry (maybe in half an hour) [14:00:27] o/ my deployment is taking a bit longer but I can do the deployments a bit [14:00:32] o/ [14:00:32] until Lucas would take over [14:00:36] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]], [[gerrit:1210593|Fix fix db config for offline maint scripts (T410738 T405087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:00:41] o/ [14:00:59] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [14:01:45] o/ [14:02:00] 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11400490 (10MoritzMuehlenhoff) 05Resolved→03Open This broke Puppet runs on the puppetservers: ` Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation... [14:02:34] (03Merged) 10jenkins-bot: Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:02:45] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11400493 (10EMill-WMF) >>! In T408592#11390152, @ATitkov wrote: >> Who will be responsible for security review, when this is sharing important top level domains ?... 
[14:02:52] (03CR) 10Ladsgroup: [C:03+2] Revert "tcywikisource: throttle exception" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:03:45] (03Merged) 10jenkins-bot: Revert "tcywikisource: throttle exception" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:05:01] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208439|Fix db config for offline maint scripts (T410738 T405087)]], [[gerrit:1210593|Fix fix db config for offline maint scripts (T410738 T405087)]] (duration: 09m 07s) [14:05:07] T410738: pretrain failing when calling mergeMessageFileList.php - https://phabricator.wikimedia.org/T410738 [14:05:08] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [14:05:28] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208292|Revert "tcywikisource: throttle exception" (T410507)]] [14:05:33] T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507 [14:05:56] (03CR) 10Ladsgroup: [C:03+2] [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [14:06:37] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: increase log level to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210526 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [14:07:41] sergi0: yours is beta cluster only, I merged and rebased it, it'll be live in ten minutes automatically [14:07:59] @Amir1 <3, ty! 
[14:08:08] (03CR) 10Ladsgroup: [C:03+2] Enable DiscussionTools visual enhancements on ruwiki & svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208320 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [14:09:02] (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements on ruwiki & svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208320 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [14:09:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11400527 (10BTullis) I have drained dse-k8s-worker10[11,13,19] prior to this afternoon's maintenance. ` root@deploy2002:~#... [14:10:31] !log ladsgroup@deploy2002 anzx, ladsgroup: Backport for [[gerrit:1208292|Revert "tcywikisource: throttle exception" (T410507)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:36] T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507 [14:11:12] !log ladsgroup@deploy2002 anzx, ladsgroup: Continuing with sync [14:11:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:12:52] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:13:37] (03CR) 10Arnaudb: "for this, I think we should also swap around PTR records in" [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [14:13:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:15:12] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208292|Revert "tcywikisource: throttle exception" (T410507)]] (duration: 09m 44s) [14:15:26] Amir1: thanks for deploying [14:15:38] (03CR) 10Bking: [C:03+1] aptrepo: add component/opensearch27 [puppet] - 10https://gerrit.wikimedia.org/r/1208499 (https://phabricator.wikimedia.org/T410795) (owner: 10Cwhite) [14:15:52] ^_^ [14:15:57] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1208320|Enable DiscussionTools visual enhancements on ruwiki & svwiki (T379264)]] [14:16:02] T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264 [14:16:25] (03CR) 10Bking: [C:03+1] opensearch: add $apt_component parameter [puppet] - 10https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) (owner: 10Cwhite) [14:16:44] (03PS1) 10Elukey: Add a staging-specific stream for Maps tiles change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210598 (https://phabricator.wikimedia.org/T409528) [14:18:28] (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: move rack E4 hosts to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1207741 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [14:21:11] !log ladsgroup@deploy2002 esanders, ladsgroup: Backport for [[gerrit:1208320|Enable DiscussionTools visual enhancements on ruwiki & svwiki (T379264)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:21:17] T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264 [14:21:54] (03PS1) 10Elukey: profile::thanos::swift: add tegola account for staging [puppet] - 10https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) [14:21:55] edsanders: live in mwdebug [14:22:11] let me know once it's good to go [14:22:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [14:22:19] !log Remove unused md2 and add its devices to vg0 on titan1002 T410152 [14:22:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T410531)', diff saved to https://phabricator.wikimedia.org/P85520 and previous config saved to /var/cache/conftool/dbconfig/20251124-142221-marostegui.json [14:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:24] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [14:22:29] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:22:47] o/ [14:23:58] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:25:39] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:26:25] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:27:12] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:27:33] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[14:28:10] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:28:12] (03PS1) 10Slyngshede: C:varnish [puppet] - 10https://gerrit.wikimedia.org/r/1210600 [14:28:26] 06SRE, 06Traffic, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th), 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Sustainability (Incident Followup): alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11400587 (10Ge... [14:28:26] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [14:28:28] (03CR) 10Hashar: "Indeed! :-) thanks" [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [14:28:34] (03PS3) 10Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) [14:28:46] (03CR) 10CI reject: [V:04-1] C:varnish [puppet] - 10https://gerrit.wikimedia.org/r/1210600 (owner: 10Slyngshede) [14:29:30] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:30:26] (03PS2) 10Slyngshede: C:varnish [puppet] - 10https://gerrit.wikimedia.org/r/1210600 [14:30:46] Lucas_WMDE: I'm waiting for edsanders :D) [14:30:50] (03CR) 10Btullis: [C:03+2] Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:30:53] ack [14:31:16] !log ladsgroup@deploy2002 esanders, ladsgroup: Continuing with sync [14:31:30] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:31:42] Amir1: not waiting anymore? 
[14:31:57] yeah, I decided that it's straightforward and can move forward [14:32:56] (03CR) 10Michael Große: "This is now ready for review (and deployment, if approved). The data from the machine learning team is now available for testwiki!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große) [14:33:58] (03PS1) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) [14:34:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bookworm [14:34:56] 06SRE, 10observability, 06Traffic, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th), and 3 others: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11400611 (10Gehel) [14:35:16] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208320|Enable DiscussionTools visual enhancements on ruwiki & svwiki (T379264)]] (duration: 19m 18s) [14:35:20] T379264: Phase 5: Offer Usability Improvements as default-on feature at remaining large wikis - https://phabricator.wikimedia.org/T379264 [14:35:31] Lucas_WMDE: wanna take over? [14:35:32] (03CR) 10CI reject: [V:04-1] Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel) [14:35:49] 06SRE, 10observability, 06Traffic, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th), and 3 others: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11400632 (10Gehel) As webrequest is critical for operational support,... 
[14:36:45] (03CR) 10Michael Große: "At time of writing, this search string gives us 49 results on testwiki: https://test.wikipedia.org/w/index.php?search=hasrecommendation%3A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große) [14:37:23] (03Abandoned) 10Btullis: Attempt to fix the OIDC authentication for growthbook [puppet] - 10https://gerrit.wikimedia.org/r/1210570 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [14:37:26] (03CR) 10Kosta Harlan: [C:03+1] hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [14:37:50] Amir1: I'm here [14:37:55] Amir1: sure [14:37:58] edsanders: already deployed :P [14:38:03] thanks [14:38:03] (sorry, got distracted for a moment reading https://techblog.wikimedia.org/2025/11/21/unifying-mobile-and-desktop-domains/ ^^) [14:38:05] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1207978'": T409780 [14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:19] so, up next is Dragoniez_? [14:38:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400638 (10bking) @Jclark-ctr good catch. I didn't know about [[ https://phabricator.wikimedia.org/T409286 | the Nokia bugs that prevent legacy BIOS reimage in eqiad rows C... 
[14:38:31] (03Merged) 10jenkins-bot: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:38:42] (03Abandoned) 10Btullis: Use our PKI generated certificate for the opensearch http interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) (owner: 10Btullis) [14:39:02] I assume so [14:39:16] jouncebot: nowandnext [14:39:16] For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400) [14:39:16] In 0 hour(s) and 50 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1530) [14:39:17] (03PS2) 10Ssingh: hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) [14:39:23] (03CR) 10Ssingh: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [14:39:45] * Lucas_WMDE tries to follow the on-wiki discussion [14:39:57] * Lucas_WMDE chuckles at firefox translation yielding “Permission granted by confidence only to the vieweric rat” [14:40:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy) [14:40:06] (03CR) 10Ssingh: [V:03+2 C:03+2] hiera: trafficserver: switch hcaptcha backend to anycast [puppet] - 10https://gerrit.wikimedia.org/r/1207978 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [14:41:22] (03CR) 10Lucas Werkmeister (WMDE): "If Firefox Translations is representing the community discussion semi-accurately, then this 
appears to be intentional (proposal 3 in the l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [14:41:43] (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: move rack F4 hosts to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1207742 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [14:42:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T410531)', diff saved to https://phabricator.wikimedia.org/P85521 and previous config saved to /var/cache/conftool/dbconfig/20251124-144218-marostegui.json [14:42:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:42:24] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:42:28] !log Remove unused md2 and add its devices to vg0 on titan2002 T410152 [14:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:36] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [14:43:01] (03PS1) 10Ssingh: Revert "hiera: trafficserver: switch hcaptcha backend to anycast" [puppet] - 10https://gerrit.wikimedia.org/r/1210603 [14:43:09] (03CR) 10Ssingh: "do not merge, emergency revert only" [puppet] - 10https://gerrit.wikimedia.org/r/1210603 (owner: 10Ssingh) [14:43:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:44:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400667 (10MoritzMuehlenhoff) >>! In T410406#11400638, @bking wrote: > I'll grab it back and update the partman recipes. Keep in mind that these are very old Dells as oppos... 
[14:44:32] I think I want to deploy these separately tbh [14:44:36] I’m feeling unsure about the rowiki change [14:44:42] let’s start with jawiki [14:44:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [14:46:07] (03Merged) 10jenkins-bot: jawiki: Disallow sysops from granting temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [14:46:27] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1208334|jawiki: Disallow sysops from granting temporary-account-viewer (T409687)]] [14:46:32] T409687: jawiki: Disallow sysops to grant temporary-account-viewer - https://phabricator.wikimedia.org/T409687 [14:47:41] (03PS6) 10Volans: wmcs k8s nfs: add NFS tracing script [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) [14:49:03] (03PS1) 10Tchanders: Assign 'ignore-restricted-groups' to steward group on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717) [14:49:12] The rowiki patch is surely complex. I believe it's good cuz I've checked it several times though [14:49:36] (03CR) 10Tchanders: [C:03+1] "Done in I51f7458e735f11ddaaa880fcf1c8ddfbad2be76b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy) [14:51:14] !log lucaswerkmeister-wmde@deploy2002 dragoniez, lucaswerkmeister-wmde: Backport for [[gerrit:1208334|jawiki: Disallow sysops from granting temporary-account-viewer (T409687)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:51:21] Checking [14:51:23] (03CR) 10Btullis: Report integrity metric from wikidata dump scripts (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [14:51:34] (03PS1) 10Bking: wdqs: provision temporary hosts via UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1210606 (https://phabricator.wikimedia.org/T410406) [14:52:24] Looking good [14:52:47] !log lucaswerkmeister-wmde@deploy2002 dragoniez, lucaswerkmeister-wmde: Continuing with sync [14:52:48] thanks! [14:53:31] (03PS1) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) [14:53:50] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [14:54:32] (03PS2) 10Volans: labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) [14:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:38] (03CR) 10Sergio Gimeno: [C:03+1] "No objections" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große) [14:54:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400736 (10bking) Thanks @MoritzMuehlenhoff ! Do I need to run the provisioning cookbook or make any other changes to put the host in UEFI mode? I know Cathal had to do som... 
[14:55:37] (03CR) 10CI reject: [V:04-1] profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [14:56:35] (03CR) 10Volans: "The script has been tested in toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [14:56:38] (03PS2) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) [14:57:01] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "I think the changes in here look correct. The one part I’m still not sure about is `abusefilter-view-private` and `abusefilter-log-private" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez) [14:57:06] (03CR) 10Volans: "Required by the related change:" [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [14:57:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P85523 and previous config saved to /var/cache/conftool/dbconfig/20251124-145726-marostegui.json [14:58:00] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208334|jawiki: Disallow sysops from granting temporary-account-viewer (T409687)]] (duration: 11m 33s) [14:58:05] T409687: jawiki: Disallow sysops to grant temporary-account-viewer - https://phabricator.wikimedia.org/T409687 [14:58:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11400749 (10MoritzMuehlenhoff) The SuperMicro hosts are somewhat special, for the Dells the following cookbook should handle the reprovision to UEFI mode: ` cookbook sre.h... 
[14:59:21] jouncebot: nowandnext [14:59:21] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1400) [14:59:21] In 0 hour(s) and 30 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1530) [14:59:45] Dragoniez_: do you still have time? if yes, I think I’d deploy the rowiki change in the break between windows now [14:59:47] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [15:00:58] Lucas_WMDE: Yep! [15:01:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:01:44] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "…but in the interest of getting the main part (removing access from anons) deployed, I’ll deploy this anyway. If the rowiki community want" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez) [15:01:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T410589)', diff saved to https://phabricator.wikimedia.org/P85524 and previous config saved to /var/cache/conftool/dbconfig/20251124-150146-ladsgroup.json [15:01:52] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:01:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez) [15:03:12] (03Merged) 10jenkins-bot: rowiki: Redefine AbuseFilter permission model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez) [15:03:25] (03CR) 10Slyngshede: [C:03+1] Failover idp.w.o [dns] - 
10https://gerrit.wikimedia.org/r/1210566 (owner: 10Muehlenhoff) [15:03:31] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1208329|rowiki: Redefine AbuseFilter permission model (T407978)]] [15:03:36] T407978: Restrict abusefilter-log-detail to sysops on rowiki - https://phabricator.wikimedia.org/T407978 [15:04:48] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1210606 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:05:35] (03CR) 10Bking: [C:03+2] wdqs: provision temporary hosts via UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1210606 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:05:54] 06SRE, 10SRE-Access-Requests: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11400772 (10Clement_Goubert) >>! In T410612#11400490, @MoritzMuehlenhoff wrote: > This broke Puppet runs on the puppetservers: > > > ` > Error: Could not retrieve catalog from remote server: Error 500... [15:08:08] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dragoniez: Backport for [[gerrit:1208329|rowiki: Redefine AbuseFilter permission model (T407978)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:08:15] (03CR) 10Clément Goubert: [C:03+1] Authorize blake for icinga tasks [puppet] - 10https://gerrit.wikimedia.org/r/1206858 (https://phabricator.wikimedia.org/T410390) (owner: 10Blake) [15:08:26] (03CR) 10Blake: [C:03+2] Authorize blake for icinga tasks [puppet] - 10https://gerrit.wikimedia.org/r/1206858 (https://phabricator.wikimedia.org/T410390) (owner: 10Blake) [15:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: Testing latency [15:10:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: 10Hubaishan) [15:10:30] !log cumin2024@db2191.codfw.wmnet[wikishared]> drop table if exists wikimedia_editor_tasks_counts; drop table if exists wikimedia_editor_tasks_edit_streak; drop table if exists wikimedia_editor_tasks_keys; (T410692) [15:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:35] T410692: Drop the WikimediaEditorTasks extension's tables from Wikimedia production - https://phabricator.wikimedia.org/T410692 [15:10:39] Dragoniez_: please test! [15:10:40] The rowiki thing does look good to me. 
I'll include your comment on the patch in the task when I close it [15:10:56] (03PS2) 10Jforrester: tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) [15:10:57] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester) [15:10:59] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Drop WikimediaEditorTasks tables [puppet] - 10https://gerrit.wikimedia.org/r/1208014 (https://phabricator.wikimedia.org/T376954) (owner: 10Jforrester) [15:11:30] ok, just checking the diff of permissions myself [15:12:30] ok I think it’s correct [15:12:34] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dragoniez: Continuing with sync [15:12:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P85526 and previous config saved to /var/cache/conftool/dbconfig/20251124-151233-marostegui.json [15:12:37] thank you! [15:12:53] Thank YOU :) [15:13:17] (03PS2) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. 
[alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) [15:13:18] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:13:32] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.91 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:13:54] (03PS1) 10Giuseppe Lavagetto: admin: remove non-fido keys for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1210609 [15:15:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:33] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208329|rowiki: Redefine AbuseFilter permission model (T407978)]] (duration: 13m 02s) [15:16:38] T407978: Restrict abusefilter-log-detail to sysops on rowiki - https://phabricator.wikimedia.org/T407978 [15:17:42] sorry there wasn’t time for your change hubaishan [15:18:03] OK [15:18:15] xSavitar: should we try to deploy your change? 
I *think* scap will skip the actual deployment anyway because it only touches tests [15:18:29] Lucas_WMDE, sure [15:18:37] No testing needed for mine actually [15:19:19] (03CR) 10Slyngshede: [C:03+1] admin: remove non-fido keys for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1210609 (owner: 10Giuseppe Lavagetto) [15:19:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208328 (https://phabricator.wikimedia.org/T410731) (owner: 10D3r1ck01) [15:19:45] let’s find out [15:19:53] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists blocker; drop database if exists defoundation; drop database if exists oai; drop database if exists steward; (T297297) [15:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:58] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [15:20:39] (03Merged) 10jenkins-bot: tests: Make data providers static methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208328 (https://phabricator.wikimedia.org/T410731) (owner: 10D3r1ck01) [15:20:40] hm, it might do a full deploy after all, because tests/ isn’t part of the beta_only_config_files: https://gerrit.wikimedia.org/g/operations/puppet/+/9a31426114/modules/scap/templates/scap.cfg.erb#122 [15:21:00] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1208328|tests: Make data providers static methods (T410731)]] [15:21:03] (03CR) 10Brouberol: [C:03+1] Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel) [15:21:03] yup, it sure does. 
oh well [15:21:05] T410731: Make production extensions PHPUnit tests data providers real providers (and use static methods) - https://phabricator.wikimedia.org/T410731 [15:21:19] * xSavitar nods [15:21:45] (03CR) 10Btullis: Webrequests: alert when webrequest_sampled isn't consumed. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel) [15:22:29] (03CR) 10Brouberol: [C:03+1] Add k8s tokens for the analytics namespace [puppet] - 10https://gerrit.wikimedia.org/r/1208321 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [15:22:30] Lucas_WMDE: i see you're deploying. that's fallback from the backport window? can you let me know once done? [15:22:43] urbanecm: yes and yes [15:22:47] thank you! [15:22:55] the current change is a no-op but scap is rolling it out anyway [15:23:11] iirc only beta-only changes are auto-excluded. [15:23:15] yeah, exactly [15:24:42] “66% (ok: 8; fail: 0; left: 4)” [15:24:44] (03CR) 10Btullis: [C:03+2] Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [15:24:46] isn’t that 67% 🤔 [15:25:18] probably a rounding down so that one is not at a 100% before actually done [15:25:23] (03CR) 10Giuseppe Lavagetto: [C:03+2] admin: remove non-fido keys for oblivian [puppet] - 10https://gerrit.wikimedia.org/r/1210609 (owner: 10Giuseppe Lavagetto) [15:25:26] (03CR) 10Btullis: [C:03+2] Replace 'let' with arithmetic expansion [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [15:25:34] ah, fair point [15:25:37] (03CR) 10Mszwarc: [C:03+1] Assign 'ignore-restricted-groups' to steward group on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717) (owner: 10Tchanders) [15:25:49] (03Merged) 10jenkins-bot: Replace 'let' with arithmetic 
expansion [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [15:25:54] (03Merged) 10jenkins-bot: Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [15:25:57] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Backport for [[gerrit:1208328|tests: Make data providers static methods (T410731)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:25:59] yup, explicit math.floor() in the python code [15:26:18] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, d3r1ck01: Continuing with sync [15:26:27] MichaelG_WMF: you’re *exactly* right :) https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/155683 [15:27:05] yay 😊 [15:27:29] (03CR) 10Btullis: [C:03+2] Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:27:30] jmm@cumin2002 reimage (PID 3961086) is awaiting input [15:27:41] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:27:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T410531)', diff saved to https://phabricator.wikimedia.org/P85527 and previous config saved to /var/cache/conftool/dbconfig/20251124-152741-marostegui.json [15:27:46] (03PS1) 10Vgutierrez: thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611 [15:27:49] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [15:27:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2156.codfw.wmnet with reason: Maintenance [15:28:01] (03CR) 10Btullis: [C:03+2] Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:28:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T410531)', diff saved to https://phabricator.wikimedia.org/P85528 and previous config saved to /var/cache/conftool/dbconfig/20251124-152805-marostegui.json [15:28:17] (03CR) 10Btullis: [C:03+2] Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:28:24] (03Merged) 10jenkins-bot: Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:28:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:28:38] (03Merged) 10jenkins-bot: Add makeTargetDir function to create target directory [dumps] - 10https://gerrit.wikimedia.org/r/1204593 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:28:47] (03CR) 10Btullis: [C:03+2] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:28:54] (03CR) 10CI reject: [V:04-1] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:28:54] (03CR) 10CI reject: [V:04-1] Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [15:29:27] (03PS2) 10Daimona Eaytoy: Drop $wgCampaignEventsCountrySchemaMigrationStage 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932) [15:29:50] (03CR) 10Hnowlan: [C:03+1] thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611 (owner: 10Vgutierrez) [15:29:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932) (owner: 10Daimona Eaytoy) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1530) [15:30:15] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208328|tests: Make data providers static methods (T410731)]] (duration: 09m 15s) [15:30:20] T410731: Make production extensions PHPUnit tests data providers real providers (and use static methods) - https://phabricator.wikimedia.org/T410731 [15:30:26] (03PS1) 10Gehel: SSH: remove non FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210612 (https://phabricator.wikimedia.org/T410888) [15:30:28] urbanecm: over to you [15:30:29] well [15:30:32] except for xLab [15:30:40] does that actually do something... [15:31:17] * urbanecm is going to be bold [15:31:30] (03PS3) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) [15:31:44] (03CR) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. 
(031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel) [15:33:02] (03CR) 10Brouberol: [C:03+1] SSH: remove non FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210612 (https://phabricator.wikimedia.org/T410888) (owner: 10Gehel) [15:33:12] (03CR) 10Gehel: [C:03+2] SSH: remove non FIDO key for Guillaume Lederrey [puppet] - 10https://gerrit.wikimedia.org/r/1210612 (https://phabricator.wikimedia.org/T410888) (owner: 10Gehel) [15:33:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große) [15:33:31] (03CR) 10CI reject: [V:04-1] Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel) [15:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:06] * MichaelG_WMF is here and ready to test [15:34:14] bking@cumin2002 provision (PID 3989894) is awaiting input [15:34:24] (03Merged) 10jenkins-bot: testwiki: enable ReviseTone experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große) [15:34:39] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1029.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:34:44] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1207886|testwiki: enable ReviseTone experiment (T407029)]] [15:34:49] T407029: Revise Tone: Release on Test Wikipedia integrated with Production 
DataGateway - https://phabricator.wikimedia.org/T407029 [15:34:53] MichaelG_WMF: thank you, very helpful! [15:35:13] (03CR) 10Cathal Mooney: [C:03+1] Remove maps from SKIP_V6_DNS_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 (owner: 10Ayounsi) [15:35:32] (03PS11) 10Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) [15:35:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1029.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:35:44] (03PS4) 10Gehel: Webrequests: alert when webrequest_sampled isn't consumed. [alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) [15:35:55] (03CR) 10Krinkle: [C:03+1] deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [15:36:12] Lucas_WMDE, thanks for deploying 🙏🏽 [15:36:44] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:36:53] (03PS1) 10Kosta Harlan: Hooks: Log the status message when responseUnknown occurs [extensions/WikiEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210614 (https://phabricator.wikimedia.org/T410877) [15:37:02] np :) [15:37:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1030.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:38:18] (03CR) 10Gehel: [C:03+2] Webrequests: alert when webrequest_sampled isn't consumed. 
[alerts] - 10https://gerrit.wikimedia.org/r/1210601 (https://phabricator.wikimedia.org/T410019) (owner: 10Gehel) [15:38:28] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:39:13] 06SRE, 10observability, 06Traffic, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th), and 3 others: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic - https://phabricator.wikimedia.org/T410019#11401023 (10Gehel) 05Open→03Resolved [15:39:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1031.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:39:20] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 13Patch-For-Review: Yubikey-SSH-FIDO for Guillaume (gehel) - https://phabricator.wikimedia.org/T410888#11401026 (10Gehel) 05Open→03Resolved a:03Gehel [15:39:24] !log urbanecm@deploy2002 urbanecm, migr: Backport for [[gerrit:1207886|testwiki: enable ReviseTone experiment (T407029)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:39:48] MichaelG_WMF: available on debug! [15:39:52] (I'm also testing) [15:40:06] thanks, testing! [15:40:29] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:40:35] (03PS1) 10Brouberol: dse-k8s: delete the stat-> PG on k8s ingress firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/1210616 (https://phabricator.wikimedia.org/T409591) [15:42:39] @urbanecm looks good to me. What about you? [15:42:49] MichaelG_WMF: works for me! 
[15:42:56] 🙌 [15:43:02] !log urbanecm@deploy2002 urbanecm, migr: Continuing with sync [15:43:12] proceeding [15:43:46] (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: move codfw hosts to single NIC [puppet] - 10https://gerrit.wikimedia.org/r/1207743 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [15:44:33] MichaelG_WMF: just noticed, the mwdebug logs say `Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 1): [connect to db2191 (wikishared)]`. did that...change? [15:44:48] or did we create a master connection before? [15:44:56] * urbanecm is trying to identify whether this is coming from revise tone work [15:45:16] (03CR) 10Alexandros Kosiaris: [C:03+1] labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [15:45:18] 🤔 [15:45:30] (03CR) 10Bking: [C:03+1] dse-k8s: delete the stat-> PG on k8s ingress firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/1210616 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol) [15:45:41] We might have [15:45:44] (03CR) 10Brouberol: [C:03+2] dse-k8s: delete the stat-> PG on k8s ingress firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/1210616 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol) [15:46:18] (03PS2) 10Silvan Heintze: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) [15:46:22] MichaelG_WMF: can you file a task to investigate that (prior to a larger deployment)? 
[15:46:41] * urbanecm will proceed with the rest of the deployment in the meantime [15:46:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:47:03] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207886|testwiki: enable ReviseTone experiment (T407029)]] (duration: 12m 19s) [15:47:07] T407029: Revise Tone: Release on Test Wikipedia integrated with Production DataGateway - https://phabricator.wikimedia.org/T407029 [15:47:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206948 (https://phabricator.wikimedia.org/T407818) (owner: 10Urbanecm) [15:47:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410531)', diff saved to https://phabricator.wikimedia.org/P85529 and previous config saved to /var/cache/conftool/dbconfig/20251124-154758-marostegui.json [15:48:03] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [15:48:07] urbanecm: In https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1202224 to check for a race-condition. Though I would assume that to be fine, but maybe it isn't. 
Or maybe we have to move the check to later [15:48:18] (03PS2) 10STran: Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) [15:48:26] (03Merged) 10jenkins-bot: [Growth] Enable Add Link task pool generation for 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206948 (https://phabricator.wikimedia.org/T407818) (owner: 10Urbanecm) [15:48:34] (03CR) 10Gehel: "DNS is now configured and propagated:" [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: 10Stevemunene) [15:48:47] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1206948|[Growth] Enable Add Link task pool generation for 3 wikis (T407818)]] [15:48:52] T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818 [15:48:58] Also, I don't think that this should be an `ActionEntryPoint`, shouldn't that be index.php with the homepage? 🤔 [15:49:04] (03CR) 10STran: Enable v2 non-emergency workflow by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [15:49:28] MichaelG_WMF: we definitely shouldn't deploy a feature that triggers a warning. [15:49:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie [15:50:01] so if it is indeed new, we need to fix that (move later/use replica/silence warning/etc) before deployment [15:50:02] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [15:50:05] that for sure. I'm just unsure if we triggered this warning. (I am in the process of creating the task) [15:50:18] ah, i thought what you said means "it is us". sorry! 
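Editor's note on the 15:24 to 15:26 exchange about the sync readout "66% (ok: 8; fail: 0; left: 4)": the explanation given in the channel was that the percentage is rounded down so that it never shows 100% before the sync has actually finished. The sketch below illustrates that arithmetic only. It is not scap's code; the function name `progress_percent` and the assumption that finished hosts are counted as ok + fail are hypothetical.

```python
import math


def progress_percent(ok: int, fail: int, left: int) -> int:
    """Percent of hosts finished, rounded down (hypothetical helper, not scap's code)."""
    total = ok + fail + left
    if total == 0:
        return 100  # assumption: an empty run counts as finished
    done = ok + fail  # assumption: failed hosts also count as finished
    # floor() keeps the readout below 100 until every host has reported
    return math.floor(100 * done / total)


print(progress_percent(8, 0, 4))    # -> 66, the readout quoted in the log
print(round(100 * 8 / 12))          # -> 67, what nearest-integer rounding would show
print(progress_percent(199, 0, 1))  # -> 99, one host still pending
print(round(100 * 199 / 200))       # -> 100, nearest-integer rounding would claim completion early
```

With rounding to the nearest integer, a run with one host out of 200 still pending would already display 100%, which is the situation the rounding-down choice avoids.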
[15:50:31] (03CR) 10Brouberol: [C:03+1] druid: switch to using the druid-public-coordinator url [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: 10Stevemunene) [15:50:34] (03CR) 10Silvan Heintze: "Thanks for the review" [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [15:50:34] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie [15:51:06] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie [15:51:32] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [15:52:18] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384#11401096 (10LSobanski) p:05Triage→03Medium [15:53:10] @urbanecm: https://phabricator.wikimedia.org/T410907 here is a simple task [15:53:15] ty [15:53:54] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1206948|[Growth] Enable Add Link task pool generation for 3 wikis (T407818)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:53:59] T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818 [15:54:27] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210618 (https://phabricator.wikimedia.org/T128546) [15:54:36] (03CR) 10Aude: [Legal Footer] Create config for adding legal footer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: 10LorenMora) [15:56:54] !log urbanecm@deploy2002 urbanecm: Continuing with sync [16:00:51] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206948|[Growth] Enable Add Link task pool generation for 3 wikis (T407818)]] (duration: 12m 04s) [16:00:52] @urbanecm Early discovery: It was probably not us. The three events that I can find in logstash all have ReadingList in their stacktrace and not GrowthExperiments [16:00:57] T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818 [16:01:05] MichaelG_WMF: sounds promising! [16:01:11] thanks for investigating [16:02:55] MichaelG_WMF: on second thought, that makes a lot of sense. We don't use wikishared at all, so [16:03:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P85530 and previous config saved to /var/cache/conftool/dbconfig/20251124-160305-marostegui.json [16:03:14] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384#11401180 (10cmooney) For a little bit more background we most regularly encounter PXEboot failures due to a firmware version on hosts with Broadcom BCM57... 
[16:05:16] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210605 (https://phabricator.wikimedia.org/T409717) (owner: 10Tchanders) [16:05:48] (03CR) 10Btullis: [C:04-1] "Unfortunately, the LVS service is still not yet in production." [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: 10Stevemunene) [16:06:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1184 depool for testing', diff saved to https://phabricator.wikimedia.org/P85531 and previous config saved to /var/cache/conftool/dbconfig/20251124-160601-marostegui.json [16:06:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Testing latency [16:06:38] (03PS3) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) [16:06:58] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [16:07:23] (03CR) 10Btullis: [V:03+1 C:03+2] Update the definition of @dse_kubepods_networks [puppet] - 10https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis) [16:08:26] jouncebot: nowandnext [16:08:26] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [16:08:26] In 0 hour(s) and 21 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1630) [16:08:57] (03CR) 10Btullis: [V:03+1 C:03+2] Add k8s tokens for the analytics namespace [puppet] - 10https://gerrit.wikimedia.org/r/1208321 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [16:09:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra) [16:09:27] (03CR) 10Hnowlan: [C:03+2] thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611 (owner: 10Vgutierrez) [16:10:08] (03PS4) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) [16:10:16] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [16:11:44] (03Merged) 10jenkins-bot: thumbor: reduce HAProxy queue timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210611 (owner: 10Vgutierrez) [16:11:46] (03PS1) 10Kosta Harlan: hCaptcha: Adjust addurl config for zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) [16:13:45] (03PS5) 10Elukey: profile::pyrra::fs::slos::editing: fix citoid's success ratio SLO [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) [16:14:18] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1210608 (https://phabricator.wikimedia.org/T345627) (owner: 10Elukey) [16:14:24] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:14:37] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:14:43] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [16:15:28] is it possible to _stop_ a mw-cron job? would deleting the pod be the expected thing to do in that scenario? [16:15:54] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:15:55] (or deleting the job itself?) 
[16:16:09] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:16:11] https://wikitech.wikimedia.org/wiki/Mw-cron_jobs#Manually_deleting_a_failed_Job talks about deleting failed jobs, but not about something running [16:17:30] 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910 (10cmooney) 03NEW p:05Triage→03Medium [16:17:41] 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11401242 (10cmooney) [16:18:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P85532 and previous config saved to /var/cache/conftool/dbconfig/20251124-161813-marostegui.json [16:18:19] urbanecm: deleting a running job will also stop it [16:18:47] good to know. and hopefully wouldn't generate alerting (on k8s level, at least). [16:19:52] it might notify teams of a failed job, if that's configured. but it won't p.age anyone [16:21:15] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11401253 (10Jclark-ctr) @elukey We now have additional smartctl options for pulling drive information for Supermicro repairs. Because the Servers use software RAID, the drives are not visi... 
[16:21:51] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1210566 (owner: 10Muehlenhoff) [16:22:03] !log jmm@dns1004 START - running authdns-update [16:23:06] !log jmm@dns1004 END - running authdns-update [16:23:18] !log Delete job/growthexperiments-refreshlinkrecommendations-s2-29399967 and job/growthexperiments-refreshlinkrecommendations-s3-29399607 (T407818) [16:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:28] T407818: Add a Link: Rollout "Add a Link" Structured Task to Chinese, Japanese, & Urdu Wikipedias - https://phabricator.wikimedia.org/T407818 [16:24:08] (03PS1) 10Kosta Harlan: hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) [16:24:38] 👀 [16:25:11] (03PS1) 10Hnowlan: thumbor: reduce queue time to 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210624 [16:25:11] (03PS1) 10Hnowlan: thumbor: drop queue timeout to 2s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210625 [16:25:30] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [16:26:06] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Adjust addurl config for zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [16:27:09] 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11401277 (10cmooney) [16:28:26] 06SRE, 06collaboration-services, 10MW-on-K8s, 06serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11401282 (10LSobanski) p:05Medium→03Low [16:28:35] !log mwpresync@deploy2002 Finished scap 
build-images: Publishing wmf/next image (duration: 13m 52s) [16:29:03] (03PS4) 10Muehlenhoff: Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) [16:29:23] (03PS1) 10Kosta Harlan: hCaptcha: Define list of valid SiteKeys for createaccount trigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) [16:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1630). [16:30:18] ^ starting portal banner deploy [16:30:25] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210618 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:30:32] !log installing usb.ids updates from Bookworm point release [16:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:02] (03PS2) 10Kosta Harlan: (WIP) hCaptcha: Define list of valid SiteKeys for createaccount trigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) [16:31:06] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210618 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:32:48] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dse-k8s-worker[1011,1013,1019].eqiad.wmnet with reason: Prepping for switch swap [16:32:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401326 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=77fc5d5e-4014-4521-90fb-3e67d8114900) set by... 
[16:33:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS bookworm [16:33:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410531)', diff saved to https://phabricator.wikimedia.org/P85533 and previous config saved to /var/cache/conftool/dbconfig/20251124-163320-marostegui.json [16:33:27] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [16:33:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:33:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T410531)', diff saved to https://phabricator.wikimedia.org/P85534 and previous config saved to /var/cache/conftool/dbconfig/20251124-163345-marostegui.json [16:34:04] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-test-master1002.eqiad.wmnet with reason: Prepping for switch swap [16:34:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401333 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7d21afc7-5634-452f-ae59-c9787b2c0108) set by... [16:34:43] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on stat1011.eqiad.wmnet with reason: Prepping for switch swap [16:34:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401338 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2ceb8409-0adc-48e2-b350-9299f0cfd430) set by... 
[16:35:47] (03PS3) 10Clément Goubert: trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) [16:35:49] 07Puppet, 06SRE, 06Infrastructure-Foundations, 06serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667#11401343 (10LSobanski) [16:36:00] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-master1004.eqiad.wmnet,an-redacteddb1001.eqiad.wmnet,an-test-coord1001.eqiad.wmnet with reason: Prepping for switch swap [16:36:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401346 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a41ee425-7380-4cb9-8254-04c2c38218ab) set by... [16:38:45] (03PS4) 10Clément Goubert: trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) [16:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:41:13] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1210618| Bumping portals to master (T128546)]] (duration: 08m 44s) [16:41:18] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:43:13] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1210618| Bumping portals to master (T128546)]] (duration: 01m 59s) [16:44:07] bking@cumin2002 reimage (PID 3998088) is awaiting input [16:44:47] 06SRE, 06Infrastructure-Foundations: 
Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11401417 (10MoritzMuehlenhoff) [16:47:12] (03PS1) 10Fabfur: admin: add fido key for fabfur [puppet] - 10https://gerrit.wikimedia.org/r/1210629 [16:48:17] (03CR) 10Elukey: [C:03+1] Remove the new unused tilerator_pass [puppet] - 10https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:48:21] (03CR) 10Elukey: [C:03+1] Properly rename tilerator_pass variable [puppet] - 10https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:52:14] (03PS2) 10Aaron Schulz: Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216) [16:53:10] PROBLEM - Host conf1009 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:06] ^ what [16:54:47] the C/D switch migration I suppose? [16:55:11] that's not supposed to happen for another 1.25h [16:56:00] (03PS1) 10Aaron Schulz: Cleanup redundant lint-related rest gateway routing config [puppet] - 10https://gerrit.wikimedia.org/r/1210631 [16:56:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1013.eqiad.wmnet [16:58:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-debug releases routed via next (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-debug&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:59:10] RECOVERY - Host conf1009 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:59:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410531)', diff saved to https://phabricator.wikimedia.org/P85535 and previous config saved to 
/var/cache/conftool/dbconfig/20251124-165910-marostegui.json [16:59:16] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [16:59:46] (03PS3) 10Aaron Schulz: Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216) [17:00:31] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1031.eqiad.wmnet with OS trixie [17:01:07] (03PS1) 10DCausse: dumps: Update cirrus index dumps path to point to new dumps [puppet] - 10https://gerrit.wikimedia.org/r/1210636 [17:01:16] (03PS1) 10Kosta Harlan: hCaptcha: Adjust addurl logic for 100% passive mode [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957) [17:01:27] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [17:01:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11401589 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie [17:01:50] (03PS2) 10Aaron Schulz: Cleanup redundant lint-related rest gateway routing config [puppet] - 10https://gerrit.wikimedia.org/r/1210631 [17:02:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet [17:02:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1210629 (owner: 10Fabfur) [17:03:10] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1019.eqiad.wmnet [17:03:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-debug releases routed via next (k8s) 1.75s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-debug&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:03:20] oncalls, FYI - page maybe incoming [17:04:02] (03CR) 10Fabfur: [C:03+2] admin: add fido key for fabfur [puppet] - 10https://gerrit.wikimedia.org/r/1210629 (owner: 10Fabfur) [17:05:25] topranks: claime: you're about to get paged, FYI [17:05:33] lol [17:05:36] etcd-mirror is down in codfw [17:05:37] preemptive strike [17:05:41] FIRING: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [17:05:43] 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D servers need to boot/reimage in UEFI mode - https://phabricator.wikimedia.org/T410910#11401603 (10cmooney) [17:05:48] Do we have to something [17:05:50] ? [17:05:52] or is expected [17:06:15] I'm trying to sort it out in #wikimedia-dcops [17:06:19] ok [17:06:22] here's some sort of network cable failure [17:06:23] etcd replication down? 
[17:06:26] yup [17:06:35] conf2005 [17:08:01] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11401619 (10MoritzMuehlenhoff) [17:08:12] swfrench-wmf: fwiw the link is up to the switch [17:08:13] so, this is not going to be easy to restore - I'm reading through the log on the process and it might not be possible to simply restart it [17:08:18] https://www.irccloud.com/pastebin/mP1kBSxx/ [17:08:23] Crap [17:08:25] FIRING: SystemdUnitFailed: etcdmirror--eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:31] topranks: yeah, there was a transient disruption there before [17:08:58] FIRING: JobUnavailable: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1019.eqiad.wmnet [17:09:20] 06SRE: Authorize blake for Icinga tasks - https://phabricator.wikimedia.org/T410390#11401624 (10Blake) 05Open→03Resolved Submitted and merged. 
[17:09:33] trying to figure out if I can perform some surgury [17:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:39] *surgery [17:09:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1028.eqiad.wmnet with OS trixie [17:09:58] swfrench-wmf: tell us if we need us for anything [17:10:03] s/we/you/ [17:10:10] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Wed 10 Dec 2025 05:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [17:10:11] +1 [17:10:15] ack, I may need to use the --reload script, which will be rather disruptive [17:10:23] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie [17:10:23] I'll give you a heads-up if that's the case [17:10:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1030.eqiad.wmnet with OS trixie [17:11:53] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS trixie [17:13:52] topranks: claime: restored [17:14:00] swfrench-wmf: <3 good job [17:14:21] * swfrench-wmf needs a drink [17:14:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P85536 and previous config saved to /var/cache/conftool/dbconfig/20251124-171418-marostegui.json [17:14:27] now to figure out what the hell happened [17:14:37] RESOLVED: JobUnavailable: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:14:40] swfrench-wmf: It's 5 o'clock somewhere right :P [17:15:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11401676 (10Jclark-ctr) an-test-master1002 dse-k8s-worker1011 dse-k8s-worker1013 dse-k8s-worker1019 stat1011 an-redacteddb... [17:15:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:28] claime: what makes it extra-fun is that it's the read-only cluster, so you can't use etcdctl to mutate the keyspace. you have to sling API ops w/ curl. [17:15:35] awesome [17:15:41] RESOLVED: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [17:16:13] swfrench-wmf: damn, nice job [17:16:33] rzl: fortunately, we've been to a similar rodeo before :) [17:16:49] (03PS1) 10Kevin Bazira: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906) [17:16:52] Would probably be worth documenting how to recover that [17:17:01] Especially since we can't use etcdctl [17:18:25] RESOLVED: SystemdUnitFailed: etcdmirror--eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:18:46] claime: yeah, the blunt option is (i.e., the --reload script), but this kind of surgery isn't, which maybe we should rethink [17:21:01] <_joe_> swfrench-wmf: what 
happened exactly? [17:21:35] <_joe_> and yes, etcdctl is not a great tool in general to interact with etcd, amazingly [17:21:38] broken cable clip [17:21:52] <_joe_> yeah ok, why did recovery need "surgery" is my question [17:23:20] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [17:23:33] _joe_: so, what happened is that _somehow_ related to the connectivity blip toward conf1009, we either lost a mirrored write _or_ doubly applied a delete on the conf2005 side. [17:23:35] <_joe_> ah I see [17:23:42] that left the replication index out of sync [17:23:55] <_joe_> swfrench-wmf: no I think the index was not updated after the write of the delete [17:24:08] <_joe_> the failure happened *exactly* between the two [17:24:10] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [17:24:13] right, that's what I mean - on restart, that would doubly apply [17:24:21] exactly, yeah [17:24:26] <_joe_> I'm looking at the logs and sigh that's an interesting amount of bad luck [17:24:32] this is the torn-write scenario we've talked about [17:24:46] <_joe_> so yes in that case the two solutions are either moving the replica index by hand [17:24:48] exactly, yeah :) [17:24:50] <_joe_> which I guess you did [17:24:57] <_joe_> or reload everything [17:25:21] exactly, yeah [17:25:42] fwiw, `helmfile.d/services/rest-gateway/values-staging.yaml` seems to have uncommited changes at `deploy2002:/srv/deployment-charts`. that...doesn't seem to be expected? [17:26:00] urbanecm: yeah that's my bad [17:26:06] Leftover from morning tests [17:26:12] Is it blocking anything? 
[17:26:17] I can reset it if needed [17:26:23] no, i just noticed that while doing an unrelated deployment [17:26:37] just wanted to flag it as it seemed unusual [17:26:37] ack, yeah I'll reset so as not to cause any more confusion then [17:26:57] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:27:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:27:28] {{done}} [17:27:32] thanks! [17:27:58] (03PS2) 10FNegri: toolsdb: increase innodb_log_file_size to 512M [puppet] - 10https://gerrit.wikimedia.org/r/1204472 (https://phabricator.wikimedia.org/T409922) [17:29:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P85537 and previous config saved to /var/cache/conftool/dbconfig/20251124-172929-marostegui.json [17:33:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773) (owner: 10Aaron Schulz) [17:44:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410531)', diff saved to https://phabricator.wikimedia.org/P85538 and previous config saved to /var/cache/conftool/dbconfig/20251124-174437-marostegui.json [17:44:42] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [17:44:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance [17:45:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T410531)', diff saved to https://phabricator.wikimedia.org/P85539 and previous config saved to /var/cache/conftool/dbconfig/20251124-174501-marostegui.json [17:46:37] 10ops-eqiad, 06SRE, 06DC-Ops, 
10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11401912 (10bking) Note to selves: - All 5 hosts failed to reimage to UEFI, even after I ran the `sre.hosts.provision` cookbook with the arguments listed above. - @Jclark-c... [17:50:20] (03CR) 10Ebernhardson: [C:03+1] dumps: Update cirrus index dumps path to point to new dumps [puppet] - 10https://gerrit.wikimedia.org/r/1210636 (owner: 10DCausse) [17:51:43] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS bookworm [17:52:36] (03PS2) 10MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) [17:52:49] (03CR) 10MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: 10MusikAnimal) [17:53:47] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11401955 (10akosiaris) Turnilo for the Telegram Logo (first hit in what @Ladsgroup ) says: Google Proxy as the ISP, in an staggering 85% o... 
[17:55:50] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie [17:57:30] (03CR) 10Dzahn: [C:03+2] admin: deprecate the releasers-blubber group [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn) [17:58:11] (03PS3) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) [17:58:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1800) [18:00:05] ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T1800). [18:01:19] (03CR) 10Dzahn: [C:03+2] "thanks! related ticket mostly https://phabricator.wikimedia.org/T410418 because this started by asking "who is still uploading releases i" [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn) [18:02:57] FYI, please do not begin any MediaWiki deployments during this window. I'll be taking the scap lock for a brief period during an upcoming etcd maintenance. 
[18:05:02] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie [18:05:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410531)', diff saved to https://phabricator.wikimedia.org/P85540 and previous config saved to /var/cache/conftool/dbconfig/20251124-180503-marostegui.json [18:05:10] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [18:05:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1031.eqiad.wmnet with OS trixie [18:09:17] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS bookworm [18:09:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS bookworm [18:09:55] (03CR) 10BCornwall: [C:03+1] hieradata: lvs: Store VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208299 (owner: 10Majavah) [18:10:15] (03CR) 10Majavah: [C:03+2] hieradata: lvs: Store VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208299 (owner: 10Majavah) [18:10:58] (03PS1) 10Dzahn: admin/releases: deprecate the releasers-wikibase shell user group [puppet] - 10https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418) [18:11:53] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [18:12:54] (03CR) 10CI reject: [V:04-1] admin/releases: deprecate the releasers-wikibase shell user group [puppet] - 10https://gerrit.wikimedia.org/r/1210654 
(https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [18:13:08] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1207294 (owner: 10Ncmonitor) [18:13:13] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1207295 (owner: 10Ncmonitor) [18:14:14] (03PS4) 10Majavah: interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 [18:14:15] (03PS2) 10Majavah: P:openstack: neutron: Cleanup legacy_vlan_naming hiera key [puppet] - 10https://gerrit.wikimedia.org/r/1208306 [18:14:15] (03PS2) 10Majavah: interface::tagged: Remove legacy_vlan_naming option [puppet] - 10https://gerrit.wikimedia.org/r/1208307 [18:15:18] (03PS2) 10Dzahn: admin/releases: deprecate the releasers-wikibase shell user group [puppet] - 10https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418) [18:16:04] (03CR) 10CI reject: [V:04-1] interface::tagged: Remove legacy_vlan_naming option [puppet] - 10https://gerrit.wikimedia.org/r/1208307 (owner: 10Majavah) [18:16:37] !log silenced EtcdReplicationDown. 
f75c71c9-62d3-449f-860a-9b5e4570717a - T405950 [18:16:38] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1208307 (owner: 10Majavah) [18:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:41] T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 [18:17:02] (03PS1) 10DDesouza: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696) [18:17:58] (03PS1) 10Dzahn: releases: change group ownership of blubber releases to root [puppet] - 10https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418) [18:19:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [18:20:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P85541 and previous config saved to /var/cache/conftool/dbconfig/20251124-182011-marostegui.json [18:20:54] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [18:21:08] !log manually transferred etcd-mirror replication source to conf1008 - T405950 [18:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:43] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS trixie [18:21:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402120 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie executed with errors: - wdqs1032... [18:23:25] (03CR) 10Dzahn: "I am not sure I would mess with this in the light of these IPs probably soon pointing to the CDN. 
Then the public IPs will permanently poi" [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [18:23:32] !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during etcd ToR switch migration - T405950 [18:23:36] T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 [18:24:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [18:24:36] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:24:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11402141 (10Jclark-ctr) {F70616591} {F70616621}. They still seem to be failing for Raid configuration files. 
[18:25:05] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on conf1009.eqiad.wmnet with reason: C/D Migration [18:25:50] (03PS4) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) [18:26:38] 06SRE, 06SRE Observability: Add Druid as a Private Grafana Datasource - https://phabricator.wikimedia.org/T410933 (10herron) 03NEW [18:27:33] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 26 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7691" [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [18:27:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [18:28:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402180 (10RobH) conf1009 migrated, @brouberol: Please provide feedback on migration of wikikube-ctrl1003 and kafka-main1008 as these are the last #serviceops hosts to migrate... 
[18:31:39] !log manually transferred etcd-mirror replication source back to conf1009 - T405950 [18:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:44] T405950: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950 [18:32:15] !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during etcd ToR switch migration - T405950 (duration: 08m 43s) [18:34:34] !log begin restarts of eqiad-associated confds, navtiming, requestctl - T405950 [18:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P85542 and previous config saved to /var/cache/conftool/dbconfig/20251124-183518-marostegui.json [18:36:14] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:36:21] !log deleted EtcdReplicationDown silence. 
f75c71c9-62d3-449f-860a-9b5e4570717a - T405950 [18:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:27] (03PS1) 10Volans: labs: enable infra-tracing-nfs tracing [labs/private] - 10https://gerrit.wikimedia.org/r/1210664 (https://phabricator.wikimedia.org/T399313) [18:39:18] jclark@cumin1003 reimage (PID 1589693) is awaiting input [18:41:17] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:41:20] (03PS4) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) [18:42:10] (03PS5) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) [18:43:31] (03CR) 10Ssingh: [C:03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:44:21] jclark@cumin1003 reimage (PID 1591045) is awaiting input [18:45:10] (03CR) 10CDobbins: sre.loadbalancer: patch to fix reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:47:57] (03CR) 10CDobbins: [C:03+2] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:49:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402364 (10RobH) >>! In T405950#11402180, @RobH wrote: > conf1009 migrated, > > @brouberol: Please provide feedback on migration of wikikube-ctrl1003 and kafka-main1008 as the... 
[18:50:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410531)', diff saved to https://phabricator.wikimedia.org/P85543 and previous config saved to /var/cache/conftool/dbconfig/20251124-185026-marostegui.json [18:50:31] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [18:50:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance [18:50:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85544 and previous config saved to /var/cache/conftool/dbconfig/20251124-185050-marostegui.json [18:50:59] (03Abandoned) 10Ssingh: Revert "hiera: trafficserver: switch hcaptcha backend to anycast" [puppet] - 10https://gerrit.wikimedia.org/r/1210603 (owner: 10Ssingh) [18:52:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402389 (10RobH) IRC Echo Update (chatting with Scott in irc about this just echoing to task for history): * We want to get feedback from @brouberol on migration of kafka-main... 
[18:53:40] (03PS1) 10Bking: wdqs: use correct regex in preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406) [18:54:22] (03Merged) 10jenkins-bot: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:54:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402399 (10bking) [18:54:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402404 (10bking) [18:55:38] (03CR) 10CI reject: [V:04-1] wdqs: use correct regex in preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [18:57:30] (03PS2) 10Bking: wdqs: use correct regex in preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406) [18:57:46] (03PS1) 10Bvibber: Show "no data" message when tooltip does not contain to show [extensions/Chart] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990) [18:58:22] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003*} and A:liberica [18:58:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11402451 (10RobH) Day 9 Update: * 9 hosts moved, 10 remain - 300 hosts total at start of migration * John worked with Ben directly to migrate the (8) Data Pla... 
[18:58:27] (03CR) 10BCornwall: [C:03+1] "It looks good." [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) (owner: 10Slyngshede) [19:02:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Chart] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990) (owner: 10Bvibber) [19:04:01] (03CR) 10Xcollazo: [C:03+1] Rename targetDir to targetDirDefault [dumps] - 10https://gerrit.wikimedia.org/r/1204592 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [19:09:44] (03CR) 10Dzahn: [C:03+2] "please see https://phabricator.wikimedia.org/T410729 for a related discussion" [puppet] - 10https://gerrit.wikimedia.org/r/1024336 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:12:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85545 and previous config saved to /var/cache/conftool/dbconfig/20251124-191200-marostegui.json [19:12:06] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [19:13:32] win 12 [19:17:59] (03CR) 10BCornwall: switch wikipedia25.org from ncredir-lb to dyna (034 comments) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [19:18:15] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11402572 (10Dzahn) While other things are still being discussed here.. for now I would like to add that we have settled on the URL/domain: > The url http://wikipe... [19:19:04] (03CR) 10Dzahn: "The URL has been approved now for use with the new micro site." 
[dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [19:20:23] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210678 [19:23:04] (03CR) 10Bking: [C:03+2] wdqs: use correct regex in preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1210667 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [19:24:40] (03PS2) 10Dzahn: switch wikipedia25.org from ncredir-lb to dyna [dns] - 10https://gerrit.wikimedia.org/r/1207288 [19:24:44] (03CR) 10Dzahn: switch wikipedia25.org from ncredir-lb to dyna (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [19:25:11] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003*} and A:liberica [19:25:28] (03CR) 10Dzahn: switch wikipedia25.org from ncredir-lb to dyna (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [19:26:53] (03CR) 10Dzahn: [C:03+1] "ship it" [puppet] - 10https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833) (owner: 10Arnaudb) [19:27:01] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie [19:27:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P85547 and previous config saved to /var/cache/conftool/dbconfig/20251124-192707-marostegui.json [19:28:40] (03CR) 10Dzahn: "I just don't have the context for this to say anything." [cookbooks] - 10https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:29:33] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1028.eqiad.wmnet with OS bookworm [19:29:45] (03CR) 10Dzahn: "maybe Moritz or Simon would be best reviewers for this.. 
since it's about actual failure modes of reprepro" [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: 10Arnaudb) [19:29:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS bookworm executed with errors:... [19:30:55] (03PS1) 10Neriah: trwikisource: Create rollbacker user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) [19:31:20] (03CR) 10WMDE-leszek: [C:03+1] "I confirm that WMDE no longer intends to publish Wikibase release files to releases.wikimedia.org. Thank you for deprecating the user grou" [puppet] - 10https://gerrit.wikimedia.org/r/1210654 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [19:32:47] (03CR) 10AOkoth: [C:03+1] httpbb: move os-reports test file for services on miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:33:17] (03CR) 10AOkoth: [C:03+1] httpbb: delete tests on legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208399 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:33:26] (03CR) 10Xcollazo: Report integrity metric from Wikidata dump scripts (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [19:33:41] (03CR) 10AOkoth: [C:03+1] installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:34:20] (03CR) 10AOkoth: [C:03+1] prometheus: drop class config for role::miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1208401 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:35:34] (03CR) 10AOkoth: 
[C:03+1] site: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208402 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:38:09] bking@cumin2002 reimage (PID 4108620) is awaiting input [19:42:01] (03CR) 10ToprakM: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah) [19:42:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P85548 and previous config saved to /var/cache/conftool/dbconfig/20251124-194215-marostegui.json [19:44:46] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11402647 (10RLazarus) 05Open→03In progress Followed up with @DSmit-WMF and confirmed level 1 is what we're doing. Implementation to follow. [19:45:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11402650 (10RLazarus) [19:45:30] (03CR) 10Dzahn: [C:03+2] installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:45:46] (03PS3) 10Dzahn: installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) [19:48:13] 06SRE: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944 (10CDobbins) 03NEW [19:49:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: 10Huji) [19:50:16] (03Merged) 10jenkins-bot: Increase AbuseFilter's emergency disable threshold for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: 
10Huji) [19:50:35] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:763982|Increase AbuseFilter's emergency disable threshold for fawiki (T302227)]] [19:50:40] T302227: Increase AbuseFilter's emergency disable threshold for fawiki - https://phabricator.wikimedia.org/T302227 [19:52:24] (03CR) 10Dzahn: [C:03+2] installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:52:50] (03CR) 10Dzahn: [C:03+2] prometheus: drop class config for role::miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1208401 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:53:08] (03CR) 10Neriah: [C:03+1] labswiki: Enable sitenotice on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208478 (https://phabricator.wikimedia.org/T410702) (owner: 10BryanDavis) [19:54:37] (03CR) 10Dzahn: [C:03+2] httpbb: move os-reports test file for services on miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [19:55:03] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210687 [19:55:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11402695 (10RobH) I've chatted with @brouberol via IRC: > 11:50 kafka hosts can be shut down / disconnected from the network, but not more than one at a time, to b... [19:55:46] !log urbanecm@deploy2002 huji, urbanecm: Backport for [[gerrit:763982|Increase AbuseFilter's emergency disable threshold for fawiki (T302227)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:55:51] T302227: Increase AbuseFilter's emergency disable threshold for fawiki - https://phabricator.wikimedia.org/T302227 [19:56:04] !log urbanecm@deploy2002 huji, urbanecm: Continuing with sync [19:56:54] (03CR) 10Mszwarc: [C:03+1] Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [19:57:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410531)', diff saved to https://phabricator.wikimedia.org/P85549 and previous config saved to /var/cache/conftool/dbconfig/20251124-195723-marostegui.json [19:57:29] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [19:57:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:57:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T410531)', diff saved to https://phabricator.wikimedia.org/P85550 and previous config saved to /var/cache/conftool/dbconfig/20251124-195747-marostegui.json [20:00:07] (03CR) 10BCornwall: switch wikipedia25.org from ncredir-lb to dyna (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [20:00:18] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:763982|Increase AbuseFilter's emergency disable threshold for fawiki (T302227)]] (duration: 09m 43s) [20:02:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11402722 (10RLazarus) 05In progress→03Resolved a:03Volans Optimistically resolving. :) @Arian_Bozorg please let us know if you have any troubl... 
[20:06:48] (03CR) 10Dzahn: [C:04-2] switch wikipedia25.org from ncredir-lb to dyna (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [20:07:34] (03CR) 10Dzahn: "waiting for input first if these tests should just move to a new target" [puppet] - 10https://gerrit.wikimedia.org/r/1208399 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [20:13:47] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1210656" [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn) [20:13:51] (03CR) 10Dzahn: [C:03+2] releases: change group ownership of blubber releases to root [puppet] - 10https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [20:13:57] (03PS2) 10Dzahn: releases: change group ownership of blubber releases to root [puppet] - 10https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418) [20:14:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1030.eqiad.wmnet with OS trixie [20:15:24] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie [20:15:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402775 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1030.eqiad.wmnet with OS trixie [20:17:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410531)', diff saved to https://phabricator.wikimedia.org/P85551 and previous config saved to /var/cache/conftool/dbconfig/20251124-201739-marostegui.json [20:17:45] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [20:20:22] (03PS1) 10RLazarus: admin: Add daphnesmit to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1210695 
(https://phabricator.wikimedia.org/T410426) [20:21:17] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:23:14] (03CR) 10Dzahn: [C:03+2] releases: change group ownership of blubber releases to root [puppet] - 10https://gerrit.wikimedia.org/r/1210656 (https://phabricator.wikimedia.org/T410418) (owner: 10Dzahn) [20:25:59] (03CR) 10Dzahn: [C:03+2] "all tests on miscweb-k8s fail with:" [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [20:26:31] (03CR) 10Dzahn: [C:03+2] "unless I am doing the test wrong - have you ever done it against miscweb-k8s?" [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [20:32:20] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [20:32:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P85552 and previous config saved to /var/cache/conftool/dbconfig/20251124-203247-marostegui.json [20:36:20] PROBLEM - MariaDB Replica Lag: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 20011.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:38:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [20:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:39:45] FYI, in a couple of minutes I'm going to be 
updating the local PHP CLI installation on the deployment hosts from PHP 8.1 to 8.3. No impact expected, but wanted to mention. [20:40:07] (03CR) 10Scott French: [C:03+2] deployment_server: switch deployment hosts to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1208006 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [20:46:59] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384#11402862 (10bking) Hey Moritz and Cathal, Just wanted to add my .02 as someone who's been bitten a few times by the firmware stuff, including writing [[... [20:47:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P85553 and previous config saved to /var/cache/conftool/dbconfig/20251124-204754-marostegui.json [20:50:52] (03CR) 10Dzahn: [C:04-2] "in that case I will just abandon" [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [20:50:55] (03Abandoned) 10Dzahn: switch wikipedia25.org from ncredir-lb to dyna [dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [20:51:11] !log updated local PHP CLI installation on deploy1003 to 8.3 - T405955 [20:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:16] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [20:55:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1030.eqiad.wmnet with OS trixie [20:55:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11402890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1030.eqiad.wmnet with OS trixie completed: - wdqs1030 (*... 
[20:56:01] !log updated local PHP CLI installation on deploy2002 to 8.3 - T405955 [20:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:48] FYI, all done with the above-mentioned PHP upgrades on deployment hosts. [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T2100). nyaa~ [21:00:05] hubaishan, arlolra, AaronSchulz, danisztls, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:02] o/ [21:02:15] I can get the party started [21:02:35] arlolra: my patch can be rolled with other stuff again [21:02:56] I'll add it to mine [21:03:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410531)', diff saved to https://phabricator.wikimedia.org/P85554 and previous config saved to /var/cache/conftool/dbconfig/20251124-210302-marostegui.json [21:03:08] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [21:03:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2227.codfw.wmnet with reason: Maintenance [21:03:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85555 and previous config saved to /var/cache/conftool/dbconfig/20251124-210326-marostegui.json [21:03:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra) [21:03:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773) (owner: 10Aaron Schulz) [21:05:00] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 18 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra) [21:05:02] (03Merged) 10jenkins-bot: Mark non-wikimedia.org math APIs as deprecated in the sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773) (owner: 10Aaron Schulz) [21:05:20] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1207276|Deploy Parsoid Read Views to 18 wikis (T410564)]], [[gerrit:1206466|Mark non-wikimedia.org math APIs as deprecated in the sandbox (T409773)]] [21:05:26] T410564: Parsoid Read Views to deploy ~2025-11-24 - https://phabricator.wikimedia.org/T410564 [21:05:26] T409773: Mark /math/ APIs outside of "wikimedia.org/api/rest_v1" as deprecated - https://phabricator.wikimedia.org/T409773 [21:05:57] o/ [21:08:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:10:26] !log arlolra@deploy2002 arlolra, aaron: Backport for [[gerrit:1207276|Deploy Parsoid Read Views to 18 wikis (T410564)]], [[gerrit:1206466|Mark non-wikimedia.org math APIs as deprecated in the sandbox (T409773)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:10:32] T410564: Parsoid Read Views to deploy ~2025-11-24 - https://phabricator.wikimedia.org/T410564 [21:10:32] T409773: Mark /math/ APIs outside of "wikimedia.org/api/rest_v1" as deprecated - https://phabricator.wikimedia.org/T409773 [21:12:54] !log arlolra@deploy2002 arlolra, aaron: Continuing with sync [21:13:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:14:33] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [21:14:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403020 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie [21:16:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie [21:17:08] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207276|Deploy Parsoid Read Views to 18 wikis (T410564)]], [[gerrit:1206466|Mark non-wikimedia.org math APIs as deprecated in the sandbox (T409773)]] (duration: 11m 49s) [21:17:15] T410564: Parsoid Read Views to deploy ~2025-11-24 - https://phabricator.wikimedia.org/T410564 [21:17:15] T409773: Mark /math/ APIs outside of "wikimedia.org/api/rest_v1" as deprecated - https://phabricator.wikimedia.org/T409773 [21:17:33] who's next [21:17:35] o/ sorry was late to my window :D [21:17:42] my patch may update localization files -- do it last [21:17:55] (adds one string to english) [21:18:35] arlolra: thanks [21:19:52] hubaishan: do you want me to deploy for you? 
[21:20:01] yes [21:20:07] alrighty [21:20:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: 10Hubaishan) [21:21:21] (03Merged) 10jenkins-bot: arwiktionary: make Cite button in main VE bar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1209791 (https://phabricator.wikimedia.org/T410840) (owner: 10Hubaishan) [21:21:37] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1209791|arwiktionary: make Cite button in main VE bar (T410840)]] [21:21:42] T410840: [config] arwiktionary: make Cite button in main VE bar - https://phabricator.wikimedia.org/T410840 [21:25:08] bvibber: I can add yours to my batch [21:25:19] \o/ tx [21:26:05] !log arlolra@deploy2002 arlolra, hubaishan: Backport for [[gerrit:1209791|arwiktionary: make Cite button in main VE bar (T410840)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:26:14] OK in debug server [21:26:31] great, thanks [21:26:36] !log arlolra@deploy2002 arlolra, hubaishan: Continuing with sync [21:26:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85556 and previous config saved to /var/cache/conftool/dbconfig/20251124-212643-marostegui.json [21:26:49] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [21:30:32] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1209791|arwiktionary: make Cite button in main VE bar (T410840)]] (duration: 08m 54s) [21:30:37] T410840: [config] arwiktionary: make Cite button in main VE bar - https://phabricator.wikimedia.org/T410840 [21:31:25] danisztls: all yours [21:31:42] arlolra: thanks [21:32:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:32:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:32:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [extensions/Chart] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990) (owner: 10Bvibber) [21:32:12] whee [21:32:31] (03PS1) 10Scott French: admin: Move swfrench non-FIDO ssh key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1210705 [21:33:07] (03Merged) 10jenkins-bot: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:33:10] (03Merged) 10jenkins-bot: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1210655 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [21:33:17] (03Merged) 10jenkins-bot: Show "no data" message when tooltip does not contain to show [extensions/Chart] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210669 (https://phabricator.wikimedia.org/T401990) (owner: 10Bvibber) [21:33:38] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1208408|Pre-deploy 2025 Global Readers Survey (T410696)]], [[gerrit:1210655|Deploy experiment for 2025 Global Readers Survey (T410696)]], [[gerrit:1210669|Show "no data" message when tooltip does not contain to show (T401990)]] [21:33:44] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [21:33:45] T401990: Chart displays NaN for entries with no data - https://phabricator.wikimedia.org/T401990 [21:34:43] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [21:37:55] (03CR) 10RLazarus: [C:03+1] admin: Move swfrench non-FIDO ssh key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1210705 (owner: 10Scott French) [21:38:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [21:39:36] bvibber: is there any problem in deploying your patch via spiderpig? [21:39:52] (03CR) 10Scott French: "Thanks, Reuven!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1210705 (owner: 10Scott French) [21:40:06] (03CR) 10Scott French: [C:03+2] admin: Move swfrench non-FIDO ssh key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1210705 (owner: 10Scott French) [21:40:38] should work but it's gonna be regenerating the localization cache ;_; [21:41:26] bvibber: ok [21:41:45] really lighting a fire under my ass on my project to reduce the localization cache size by a factor of 10 (i'm up to a factor of 6 and i think i'm going to reach my goal with the next refactor) [21:41:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P85557 and previous config saved to /var/cache/conftool/dbconfig/20251124-214151-marostegui.json [21:42:00] bvibber: I'm seeing 40 MediaWiki errors in the log [21:42:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11403185 (10RobH) >>! In T407897#11399303, @Marostegui wrote: > Thanks Rob, I think the confusion was whether we ordered the right HW or not. Doing 1G is fine for this host, 10G w... [21:42:22] hmm it should be JS only changes and a new message [21:44:00] bvibber: maybe they aren't related to your patch, but they are there [21:44:29] got a linky to em in logstash? [21:44:53] bvibber: yep, but I don't have logstash perms [21:45:00] heh [21:45:43] https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors [21:48:08] nothing particularly suspicious in there i'd expect to have been affected by the message update [21:49:21] bvibber: yeah, just to make sure, anyway it's still building the images and that log is from production, right? [21:49:29] right [21:50:04] bvibber: thanks [21:54:09] danisztls: > I don't have logstash perms -- you appear to be in the "wmf" LDAP group. That should give you logstash access. 
[21:56:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1031.eqiad.wmnet with OS trixie [21:56:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P85558 and previous config saved to /var/cache/conftool/dbconfig/20251124-215659-marostegui.json [21:58:38] bd808: I get service denied due to missing privileges when I try. [22:00:05] Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251124T2200). [22:00:28] (03PS1) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableContributionTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939) [22:00:32] bvibber: it's finally on test servers [22:00:54] Hey all - is the late backport still happening? [22:01:04] danisztls: hmmm... and you authenticated with your https://ldap.toolforge.org/user/dani account? [22:01:07] !log dani@deploy2002 dani, bvibber: Backport for [[gerrit:1208408|Pre-deploy 2025 Global Readers Survey (T410696)]], [[gerrit:1210655|Deploy experiment for 2025 Global Readers Survey (T410696)]], [[gerrit:1210669|Show "no data" message when tooltip does not contain to show (T401990)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:01:08] whee [22:01:09] (03PS2) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableContributionTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939) [22:01:13] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [22:01:14] T401990: Chart displays NaN for entries with no data - https://phabricator.wikimedia.org/T401990 [22:01:26] sbassett: yeah. they just got to the staging servers. l01n update slowness. 
[22:01:29] danisztls: confirmed works [22:01:41] *l10n [22:01:42] Ok. Have one sec patch to get out but I can wait a bit. [22:02:08] bvibber: I'm getting MediaWiki internal error. [22:02:15] bd808: i think i'm going to push to finish this l10n cache shrinkage fix way before the may hackathon ;) [22:02:40] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:02:42] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:03:08] mysterious [22:03:20] "TypeError: QuickSurveys\SurveyQuestion::__construct(): Argument #1 ($questionDefinition) must be of type array, string given, called in /srv/mediawiki/php-1.46.0-wmf.3/extensions/QuickSurveys/includes/SurveyFactory.php on line" [22:03:22] i was literally looking at a tst server page on commons and it rendered my page with updated js [22:03:33] aha [22:03:42] mu fault them [22:03:44] *my [22:03:57] ;_; [22:04:12] if you break it and fix it, you get the t-shirt ;) [22:05:24] PROBLEM - MD RAID on ms-fe2014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:05:26] ACKNOWLEDGEMENT - MD RAID on ms-fe2014 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410959 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:05:32] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [22:05:32] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.181 second response time 
https://wikitech.wikimedia.org/wiki/Swift [22:05:35] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959 (10ops-monitoring-bot) 03NEW [22:05:44] danisztls: you are going to need to "exit scap" to roll back and then fix the config. [22:06:01] bd808: thanks [22:06:40] bd808: now I do a patch to fix and a new deploy? [22:07:01] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11403258 (10Andrew) [22:08:54] danisztls: you will need to revert the config changes and try https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1208408/5/wmf-config/InitialiseSettings.php again after you fix the syntax problems. [22:09:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie [22:09:59] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11403268 (10Andrew) [22:10:18] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [22:10:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11403282 (10Andrew) Assigning to myself pending a decision about hostnames [22:10:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403281 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1029.eqiad.wmnet with OS trixie [22:10:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11403283 (10BTullis) I have failed over the active namenode, so an-master1003 is now ready for the network cable move. ` b... 
[22:12:07] (03PS1) 10DDesouza: Revert "Deploy experiment for 2025 Global Readers Survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210722 [22:12:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410531)', diff saved to https://phabricator.wikimedia.org/P85559 and previous config saved to /var/cache/conftool/dbconfig/20251124-221207-marostegui.json [22:12:13] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [22:12:17] (03PS1) 10DDesouza: Revert "Pre-deploy 2025 Global Readers Survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210723 [22:12:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [22:15:47] danisztls: do you know how to revert those backports, or do you need help? [22:16:47] bd808: I didn't but I think I figured it out, I reverted on Gerrit and I need to deploy the reverts like a patch, right? [22:17:34] danisztls: yeah. that should work. There is a cli tool to do that, but spiderpig doesn't have a gui for it yet. 
[22:17:47] it == rollback in gerrit and merge [22:18:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210722 (owner: 10DDesouza) [22:18:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210723 (owner: 10DDesouza) [22:18:35] bd808: thanks [22:19:13] The cli way to do it is `scap backport --revert [change_numbers ...]` [22:19:27] (03Merged) 10jenkins-bot: Revert "Deploy experiment for 2025 Global Readers Survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210722 (owner: 10DDesouza) [22:19:28] (03Merged) 10jenkins-bot: Revert "Pre-deploy 2025 Global Readers Survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210723 (owner: 10DDesouza) [22:19:36] (03PS1) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696) [22:19:47] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1210722|Revert "Deploy experiment for 2025 Global Readers Survey"]], [[gerrit:1210723|Revert "Pre-deploy 2025 Global Readers Survey"]] [22:20:47] bvibber's config change is still in there. Let's see how horrible the build time is, but I'd expect another 20 minutes. [22:20:52] (03PS1) 10DDesouza: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210729 (https://phabricator.wikimedia.org/T410696) [22:20:54] hehe [22:21:51] (03PS2) 10DDesouza: Deploy experiment for 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210729 (https://phabricator.wikimedia.org/T410696) [22:21:52] nope. it was fast [22:22:11] "Finished build-and-push-container-images (duration: 01m 35s)" [22:23:30] Oh, we didn't use the prior container but it had been built and pushed so really there was no l10n rebuild. 
That is sort of confusing but it makes sense. [22:23:41] (03PS2) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696) [22:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:25:45] sorry about the event and thanks for the help bd808 [22:26:16] !log dani@deploy2002 dani: Backport for [[gerrit:1210722|Revert "Deploy experiment for 2025 Global Readers Survey"]], [[gerrit:1210723|Revert "Pre-deploy 2025 Global Readers Survey"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:26:49] you're doing fine danisztls :) [22:27:26] bvibber: you should probably double check your change on the debug servers [22:27:54] bd808: confirmed good on debug! [22:28:09] !log dani@deploy2002 dani: Continuing with sync [22:32:12] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11403354 (10Ladsgroup) ` spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and y... [22:34:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS trixie [22:35:07] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2001-dev.codfw.wmnet [22:35:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403357 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS trixie executed with errors: -... 
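Note on the revert discussed above: bd808 gives the CLI form as `scap backport --revert [change_numbers ...]`, whereas dani reverted in Gerrit and deployed the reverts as ordinary backports. The sketch below only assembles and prints that command for the two config changes from this window (1208408 and 1210655); it does not run scap. The helper name `build_revert_cmd` is hypothetical and not part of scap, and the sketch assumes nothing about how scap treats the order of change numbers.

```shell
#!/bin/sh
# Dry-run sketch: build the revert command quoted in the log without executing it.
# build_revert_cmd is a hypothetical helper for illustration, not a scap feature.
build_revert_cmd() {
  # Require at least one Gerrit change number.
  if [ "$#" -eq 0 ]; then
    echo "usage: build_revert_cmd CHANGE_NUMBER..." >&2
    return 1
  fi
  # Reject anything that is not purely numeric, so a typo is caught
  # before the command is ever pasted on a deployment host.
  for change in "$@"; do
    case "$change" in
      ''|*[!0-9]*)
        echo "not a Gerrit change number: $change" >&2
        return 1
        ;;
    esac
  done
  # Print the command in the form given in the log.
  echo "scap backport --revert $*"
}

# Change numbers taken from the backports approved at 21:32 in this log.
build_revert_cmd 1208408 1210655
```

Printing rather than executing keeps the example safe to run anywhere; on a deployment host the printed line would be reviewed and run by hand.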
[22:36:09] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet [22:38:32] (03PS2) 10Cwhite: opensearch: add $apt_component parameter [puppet] - 10https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) [22:39:10] (03CR) 10CI reject: [V:04-1] opensearch: add $apt_component parameter [puppet] - 10https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) (owner: 10Cwhite) [22:40:12] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [22:40:29] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210722|Revert "Deploy experiment for 2025 Global Readers Survey"]], [[gerrit:1210723|Revert "Pre-deploy 2025 Global Readers Survey"]] (duration: 20m 42s) [22:41:36] (03PS3) 10Cwhite: opensearch: add $apt_component parameter [puppet] - 10https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) [22:41:43] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet [22:41:59] backport window changes looking stable now? [22:42:36] (03CR) 10Cwhite: [C:03+2] aptrepo: add component/opensearch27 [puppet] - 10https://gerrit.wikimedia.org/r/1208499 (https://phabricator.wikimedia.org/T410795) (owner: 10Cwhite) [22:42:53] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet [22:44:57] Eh, looks like the patch I wanted to deploy went out with the scap prep from that last revert deploy. 
So I guess we’re good on that :) [22:45:37] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [22:45:41] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:46:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet [22:46:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.codfw.wmnet [22:51:48] danisztls: Are you going to try another deployment, or can sbassett take over for his security backport window? [22:53:48] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11403412 (10Ladsgroup) The query was wrong, the like should have an extra % at the end. Let me try again. 
[22:54:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210727 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [22:54:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet [22:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:54:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210729 (https://phabricator.wikimedia.org/T410696) (owner: 10DDesouza) [22:54:40] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet [22:55:24] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11403413 (10Ladsgroup) ` spark-sql (default)> select uri_path, count(*) as hits from wmf.webrequest where webrequest_source='upload' and y... 
[22:55:30] bd808: he can take over, I will deploy tomorrow since its 1 hour past the window [22:55:59] :+1: You have the con sbassett [22:56:11] 🫡 [22:56:25] (03CR) 10Cwhite: [C:03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1208500/7696/" [puppet] - 10https://gerrit.wikimedia.org/r/1208500 (https://phabricator.wikimedia.org/T410795) (owner: 10Cwhite) [22:57:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:57:54] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:58:20] danisztls: I chatted with thcipriani and he pointed out that logstash-access is a separate right these days. You can apply for it at https://idm.wikimedia.org/permissions/. You should get it to go along with your spiderpig deployment rights. [22:59:14] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [22:59:20] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:59:34] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [23:00:54] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:01:46] bd808: thanks! 
just requested [23:02:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:03:52] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2002-dev.codfw.wmnet [23:03:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2006-dev.codfw.wmnet [23:13:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet [23:13:33] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet [23:15:54] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:16:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:19:01] (03PS1) 10Kosta Harlan: hCaptcha: Allow providing a set of valid keys for site verify per action [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657) [23:19:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [23:19:54] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:21:09] RESOLVED: [2x] CoreBGPDown: Core BGP session down between 
cloudsw1-b1-codfw and cloudlb2003-dev (172.20.5.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:22:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet [23:22:46] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2010-dev.codfw.wmnet [23:24:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [23:25:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [23:25:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [23:25:49] (03PS3) 10Kosta Harlan: hCaptcha: Define valid SiteKeys for account creation and edit triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) [23:25:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [23:29:49] !log andrew@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet [23:29:54] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudnet2005-dev.codfw.wmnet [23:29:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1028.eqiad.wmnet with OS trixie [23:30:39] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie [23:30:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11403508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1029.eqiad.wmnet with OS trixie executed with errors: -... [23:35:33] (03CR) 10Andrew Bogott: [C:03+1] P:openstack: neutron: Cleanup legacy_vlan_naming hiera key [puppet] - 10https://gerrit.wikimedia.org/r/1208306 (owner: 10Majavah) [23:37:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2005-dev.codfw.wmnet [23:37:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2002-dev.codfw.wmnet [23:39:36] (03PS1) 10Joal: Bump Hadoop max container size to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966) [23:44:47] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet [23:44:52] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudnet2006-dev.codfw.wmnet [23:45:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:50:25] RESOLVED: [4x] SystemdUnitFailed: 
prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:52:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2006-dev.codfw.wmnet [23:52:32] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2003-dev.codfw.wmnet [23:59:14] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet [23:59:18] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet