[00:03:23] (CR) Dzahn: [V:+1 C:+2] "like this the diff is just some inconsistencies about "status_matches" but the default value should be 200." [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[00:04:30] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:04:54] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:06:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:06:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:08:17] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166943
[00:08:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166943 (owner: TrainBranchBot)
[00:12:53] (CR) Dzahn: [V:+1 C:+2] "noop on people* and alert*" [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[00:15:52] (CR) Dzahn: [V:+1 C:+2] "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Hiera#Puppet_enc_system" [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) (owner: BryanDavis)
[00:16:56] (CR) Dzahn: [C:+2] "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Hiera#Puppet_enc_system" [puppet] - https://gerrit.wikimedia.org/r/1166262 
(https://phabricator.wikimedia.org/T397591) (owner: BryanDavis)
[00:19:25] (CR) Dzahn: [C:+2] gitlab: Allow WMCS runners to talk to deployment-prep wikis (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166262 (https://phabricator.wikimedia.org/T397591) (owner: BryanDavis)
[00:20:04] (PS3) BryanDavis: gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936)
[00:20:39] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[00:21:03] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[00:21:15] (CR) Dzahn: [C:+2] gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra [puppet] - https://gerrit.wikimedia.org/r/1166263 (https://phabricator.wikimedia.org/T396936) (owner: BryanDavis)
[00:33:48] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166943 (owner: TrainBranchBot)
[01:08:02] (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.9 [core] (wmf/1.45.0-wmf.9) - https://gerrit.wikimedia.org/r/1166950 (https://phabricator.wikimedia.org/T392179)
[01:08:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.9 [core] (wmf/1.45.0-wmf.9) - https://gerrit.wikimedia.org/r/1166950 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[01:19:27] (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.9 [core] (wmf/1.45.0-wmf.9) - https://gerrit.wikimedia.org/r/1166950 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[01:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:57:48] RESOLVED: PuppetZeroResources: Puppet has failed generate 
resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:59:35] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0200)
[02:20:19] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[02:21:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:23:36] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[02:42:12] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0300)
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0400)
[04:04:28] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.6 (duration: 04m 24s)
[04:06:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:06:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:14:34] ops-eqiad, SRE, DBA, DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10981837 (Marostegui) @VRiley-WMF from our side the host is fine. If you or @Jclark-ctr need to work on upgrade firmwares and BIOS, please let me know so I can depool it and have it ready for it.
[04:17:41] (PS1) Marostegui: db1237: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1166959 (https://phabricator.wikimedia.org/T397279)
[04:18:15] (CR) Marostegui: [C:+2] db1237: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1166959 (https://phabricator.wikimedia.org/T397279) (owner: Marostegui)
[04:23:47] (PS1) Gerrit maintenance bot: mariadb: Promote db1222 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1166960 (https://phabricator.wikimedia.org/T398906)
[04:23:51] (PS1) Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1166961 (https://phabricator.wikimedia.org/T398906)
[04:26:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T398906
[04:26:23] T398906: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T398906
[04:26:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1222 with weight 0 T398906', diff saved to https://phabricator.wikimedia.org/P78780 and previous config saved to /var/cache/conftool/dbconfig/20250708-042646-root.json
[04:31:20] (CR) Marostegui: [C:+2] mariadb: 
Promote db1222 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1166960 (https://phabricator.wikimedia.org/T398906) (owner: Gerrit maintenance bot)
[04:33:59] (PS3) KartikMistry: machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491)
[04:36:15] !log Starting s2 eqiad failover from db1162 to db1222 - T398906
[04:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:18] T398906: Switchover s2 master (db1162 -> db1222) - https://phabricator.wikimedia.org/T398906
[04:36:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T398906', diff saved to https://phabricator.wikimedia.org/P78781 and previous config saved to /var/cache/conftool/dbconfig/20250708-043628-root.json
[04:36:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1222 to s2 primary and set section read-write T398906', diff saved to https://phabricator.wikimedia.org/P78782 and previous config saved to /var/cache/conftool/dbconfig/20250708-043654-root.json
[04:37:20] !log marostegui@dns1006 START - running authdns-update
[04:37:30] (CR) Marostegui: [C:+2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1166961 (https://phabricator.wikimedia.org/T398906) (owner: Gerrit maintenance bot)
[04:38:04] !log marostegui@dns1006 END - running authdns-update
[04:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1162 T398906', diff saved to https://phabricator.wikimedia.org/P78783 and previous config saved to /var/cache/conftool/dbconfig/20250708-043814-marostegui.json
[04:38:36] !log marostegui@dns1006 START - running authdns-update
[04:39:23] !log marostegui@dns1006 END - running authdns-update
[04:40:31] (PS1) Marostegui: db1162: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1166963 (https://phabricator.wikimedia.org/T396549) 
[04:41:00] (CR) Marostegui: [C:+2] db1162: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1166963 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[04:47:20] PROBLEM - Host an-worker1095 is DOWN: PING CRITICAL - Packet loss = 100%
[04:47:48] (PS1) Marostegui: db1237: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1166964
[04:48:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P78784 and previous config saved to /var/cache/conftool/dbconfig/20250708-044803-root.json
[04:51:54] (CR) Marostegui: [C:+2] db1237: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1166964 (owner: Marostegui)
[04:58:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78785 and previous config saved to /var/cache/conftool/dbconfig/20250708-045812-root.json
[05:03:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78786 and previous config saved to /var/cache/conftool/dbconfig/20250708-050308-root.json
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78787 and previous config saved to /var/cache/conftool/dbconfig/20250708-051318-root.json
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:18:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78788 and previous config saved to /var/cache/conftool/dbconfig/20250708-051814-root.json
[05:28:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78789 and previous config saved to /var/cache/conftool/dbconfig/20250708-052823-root.json
[05:33:10] (PS1) Giuseppe Lavagetto: Stop loggging requests that would not be rate-limited [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166967
[05:33:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78790 and previous config saved to /var/cache/conftool/dbconfig/20250708-053320-root.json
[05:33:28] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Stop loggging requests that would not be rate-limited [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166967 (owner: Giuseppe Lavagetto)
[05:33:40] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2003.wikimedia.org with reason: WIP
[05:35:14] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Feature: better logging of varnish rate-limits - oblivian@cumin1003"
[05:35:15] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: better logging of varnish rate-limits - oblivian@cumin1003
[05:35:47] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: better logging of varnish rate-limits - oblivian@cumin1003
[05:35:48] !log oblivian@cumin1003 END (PASS) - Cookbook 
sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Feature: better logging of varnish rate-limits - oblivian@cumin1003"
[05:41:35] (PS1) Giuseppe Lavagetto: Revert "Stop loggging requests that would not be rate-limited" [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166968
[05:41:42] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Revert "Stop loggging requests that would not be rate-limited" [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166968 (owner: Giuseppe Lavagetto)
[05:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:41:58] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003"
[05:41:59] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - oblivian@cumin1003
[05:42:28] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - oblivian@cumin1003
[05:42:29] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003"
[05:42:36] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003"
[05:42:37] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - oblivian@cumin1003
[05:43:04] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Reverty - 
oblivian@cumin1003
[05:43:06] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Reverty - oblivian@cumin1003"
[05:43:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78791 and previous config saved to /var/cache/conftool/dbconfig/20250708-054329-root.json
[05:48:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78792 and previous config saved to /var/cache/conftool/dbconfig/20250708-054825-root.json
[05:50:49] (PS1) Marostegui: s3 codfw: Migrate to SBR [puppet] - https://gerrit.wikimedia.org/r/1166969 (https://phabricator.wikimedia.org/T383795)
[05:51:23] (CR) Marostegui: "This is a NOOP until the change is made lively on the hosts (or mariadb is restarted)" [puppet] - https://gerrit.wikimedia.org/r/1166969 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui)
[05:51:27] (CR) Marostegui: [C:+2] s3 codfw: Migrate to SBR [puppet] - https://gerrit.wikimedia.org/r/1166969 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui)
[05:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:52:43] !log Migrate s3 codfw to SBR T383795
[05:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:46] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795
[05:53:08] (CR) Arnaudb: [C:+1] "looks good to me!" 
[puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0600)
[06:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0600).
[06:13:21] (PS1) Giuseppe Lavagetto: Fix varnish logging of rate-limiting, take 2 [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166970
[06:13:34] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Fix varnish logging of rate-limiting, take 2 [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166970 (owner: Giuseppe Lavagetto)
[06:14:27] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix varnis logging (take 2) - oblivian@cumin1003"
[06:14:28] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix varnis logging (take 2) - oblivian@cumin1003
[06:14:58] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix varnis logging (take 2) - oblivian@cumin1003
[06:15:00] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix varnis logging (take 2) - oblivian@cumin1003"
[06:16:20] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:19:25] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Exim
[06:21:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:00] (PS1) Giuseppe Lavagetto: Revert "Fix varnish logging of rate-limiting, take 2" [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1167078
[06:30:29] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Revert "Fix varnish logging of rate-limiting, take 2" [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1167078 (owner: Giuseppe Lavagetto)
[06:30:50] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Revert - oblivian@cumin1003"
[06:30:51] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Revert - oblivian@cumin1003
[06:31:22] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Revert - oblivian@cumin1003
[06:31:23] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Revert - oblivian@cumin1003"
[06:35:47] !log rebalance following reimages T382513
[06:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:49] T382513: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513
[06:36:38] (PS1) Giuseppe Lavagetto: Revert logging changes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1167079
[06:38:26] (CR) Elukey: pyrra: remove multi-dc for istio-based SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: Elukey)
[06:38:39] (CR) 
Vgutierrez: [C:+2] hiera: Remove esams and magru bgp peer overrides [puppet] - https://gerrit.wikimedia.org/r/1166870 (owner: Vgutierrez)
[06:42:09] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1167080
[06:50:26] (CR) Filippo Giunchedi: "How did you pick 5m ? The current puppet runs on alert hosts take ~3m so 5m would mean puppet-agent basically running all the time, is tha" [puppet] - https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: Herron)
[06:56:15] (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1167080 (owner: PipelineBot)
[06:58:09] (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1167080 (owner: PipelineBot)
[07:00:04] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0700).
[07:00:04] Tchanders: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:52] o/
[07:00:57] I'll deploy my own patch
[07:01:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:01:45] (CR) TrainBranchBot: [C:+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166791 (https://phabricator.wikimedia.org/T381845) (owner: Tchanders)
[07:02:06] (PS1) Volans: tox.ini: skip Python 3.10 in CI [software/spicerack] - https://gerrit.wikimedia.org/r/1167081
[07:02:25] (PS2) Volans: cookbook API: simplify -t/--task-id support [software/spicerack] - https://gerrit.wikimedia.org/r/1154787
[07:02:25] (CR) Volans: "ready for review" [software/spicerack] - https://gerrit.wikimedia.org/r/1154787 (owner: Volans)
[07:03:06] (CR) Nikerabbit: [C:+1] CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[07:03:18] (Merged) jenkins-bot: temp accounts: Separate digits in user names with hyphens [mediawiki-config] - https://gerrit.wikimedia.org/r/1166791 (https://phabricator.wikimedia.org/T381845) (owner: Tchanders)
[07:03:42] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1166791|temp accounts: Separate digits in user names with hyphens (T381845)]]
[07:03:44] T381845: Add hyphens to break temporary user names into groups of <5 digits - https://phabricator.wikimedia.org/T381845
[07:05:48] !log tchanders@deploy1003 tchanders: Backport for [[gerrit:1166791|temp accounts: Separate digits in user names with hyphens (T381845)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there.
[07:06:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:09:18] !log tchanders@deploy1003 tchanders: Continuing with sync
[07:14:44] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166791|temp accounts: Separate digits in user names with hyphens (T381845)]] (duration: 11m 02s)
[07:14:48] T381845: Add hyphens to break temporary user names into groups of <5 digits - https://phabricator.wikimedia.org/T381845
[07:17:13] My patch is done, but I won't log that the window is done, in case anyone else wants to deploy something in the next 40 minutes
[07:19:28] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 18.0
[07:22:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:22:35] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[07:26:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:30:32] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 18.0
[07:30:37] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti 
servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10982110 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done.
[07:32:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:36:41] (CR) Gmodena: [C:+2] services: mw-page-content-change-enrich: version bump image. [deployment-charts] - https://gerrit.wikimedia.org/r/1166923 (https://phabricator.wikimedia.org/T347282) (owner: Gmodena)
[07:38:17] (Merged) jenkins-bot: services: mw-page-content-change-enrich: version bump image. [deployment-charts] - https://gerrit.wikimedia.org/r/1166923 (https://phabricator.wikimedia.org/T347282) (owner: Gmodena)
[07:42:06] (CR) Fabfur: [C:+2] cache: install benthos on all cp hosts [puppet] - https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[07:42:14] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[07:42:36] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[07:45:12] !log temporary disable puppet on A:cp to apply https://gerrit.wikimedia.org/r/1135643 (T329332)
[07:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:52:00] (PS1) Marostegui: s3 eqiad: Migrate to SBR [puppet] - https://gerrit.wikimedia.org/r/1167142 (https://phabricator.wikimedia.org/T383795)
[07:52:19] (PS1) Vgutierrez: hiera: Issue dedicated certs for probenet endpoints [puppet] - 
https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596)
[07:53:03] (CR) Marostegui: "This is a NOOP until the change is made lively on the databases or we restart mariadb" [puppet] - https://gerrit.wikimedia.org/r/1167142 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui)
[07:53:07] (CR) Marostegui: [C:+2] s3 eqiad: Migrate to SBR [puppet] - https://gerrit.wikimedia.org/r/1167142 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui)
[07:53:53] (PS2) Vgutierrez: hiera: Issue dedicated certs for probenet endpoints [puppet] - https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596)
[07:54:21] !log Migrate s3 eqiad to SBR T383795
[07:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:24] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795
[07:55:27] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) (owner: Vgutierrez)
[07:55:28] (CR) Klausman: [C:+1] machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[07:55:48] !log enabling puppet on A:cp (T329332)
[07:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:45] (CR) Jgiannelos: [C:-1] "Overall other than the kafka topic, it looks OK." [deployment-charts] - https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[08:00:05] andre and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0800). 
[08:00:36] (CR) Jgiannelos: [C:-1] services: configure tegola in codfw to use maps-test (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[08:00:54] SRE, Traffic: Benthos - remove the kafka output module - https://phabricator.wikimedia.org/T398916 (Fabfur) NEW
[08:01:58] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:02:14] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:06:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:06:47] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:06:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:06:57] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:10:38] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3585 MB (3% inode=98%): /tmp 3585 MB (3% inode=98%): /var/tmp 3585 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[08:11:21] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: Fabfur)
[08:11:34] !log gmodena@deploy1003 helmfile [eqiad] START 
helmfile.d/services/mw-page-content-change-enrich: apply [08:11:42] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:11:45] !log installing postgresql-15 security updates [08:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:48] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167144 (https://phabricator.wikimedia.org/T392179) [08:14:50] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167144 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [08:15:51] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167144 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [08:16:17] !log aklapper@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.9 refs T392179 [08:16:21] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179 [08:17:41] (03PS9) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [08:21:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:51] (03CR) 10Gmodena: [C:03+2] dse: mw-content-history: version bump image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1166921 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [08:22:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:23:29] (03Merged) 10jenkins-bot: dse: mw-content-history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166921 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [08:26:00] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [08:26:11] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:26:18] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:28:30] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:28:54] (03PS1) 10Tiziano Fogli: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167145 [08:30:07] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [08:30:23] !log created a stub user "bumpuid" to move the allocation of UIDs for accounts created in Wikimedia IDM to 100000+ T355663 [08:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:26] T355663: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663 [08:30:35] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:30:44] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE
helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [08:35:04] (03PS9) 10Btullis: Add the new cephosd200[1-3] servers in codfw to their role [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) [08:36:50] (03PS3) 10Ladsgroup: tables-catalog: Mark vision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1166854 (https://phabricator.wikimedia.org/T363581) [08:36:56] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Mark vision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1166854 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [08:38:12] (03CR) 10Vgutierrez: "looks good,added some inline comments about discrepancies between regex in requestcl and here and a suggestion about how to improve one of" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [08:39:10] (03PS1) 10Ladsgroup: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167148 (https://phabricator.wikimedia.org/T398033) [08:39:22] (03PS1) 10Ladsgroup: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) [08:39:34] (03Abandoned) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [08:39:42] jouncebot: nowandnext [08:39:42] For the next 1 hour(s) and 20 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T0800) [08:39:42] In 1 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1000) [08:40:01] (03PS1) 10Hashar: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167150 [08:40:27] 06SRE, 06Infrastructure-Foundations: 
Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10982431 (10MoritzMuehlenhoff) [08:40:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10982433 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [08:42:20] (03PS2) 10Hashar: Remove specific force push to refs/sandbox/* branches [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167150 (https://phabricator.wikimedia.org/T398921) [08:43:51] (03Abandoned) 10Tiziano Fogli: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167145 (owner: 10Tiziano Fogli) [08:44:07] (03CR) 10Hashar: [V:03+2 C:03+2] Remove specific force push to refs/sandbox/* branches [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1167150 (https://phabricator.wikimedia.org/T398921) (owner: 10Hashar) [08:45:17] (03CR) 10Brouberol: Add the new cephosd200[1-3] servers in codfw to their role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [08:46:14] (03Abandoned) 10Majavah: Ensure service catalog schema matches spicerack release [puppet] - 10https://gerrit.wikimedia.org/r/931241 (https://phabricator.wikimedia.org/T339243) (owner: 10Majavah) [08:48:06] (03CR) 10CI reject: [V:04-1] Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033) (owner: 10Ladsgroup) [08:48:09] !log installing Redis security updates [08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:01] (03PS3) 10Majavah: Remove l10nupdate manifests [puppet] - 10https://gerrit.wikimedia.org/r/928582 [08:50:27] (03CR) 10Majavah: "found this while cleaning up my puppet.git clone.. this still looks relevant?" 
[puppet] - 10https://gerrit.wikimedia.org/r/928582 (owner: 10Majavah) [08:52:50] !log installing nginx security updates [08:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:02] (03Abandoned) 10Majavah: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [08:54:15] (03Abandoned) 10Majavah: P:toolforge::grid: add bash completion to exec-manage [puppet] - 10https://gerrit.wikimedia.org/r/815780 (owner: 10Majavah) [08:55:13] (03Abandoned) 10Majavah: aptrepo: cleanup haproxy update and component names [puppet] - 10https://gerrit.wikimedia.org/r/969819 (owner: 10Majavah) [08:56:12] (03PS1) 10Vgutierrez: varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) [08:58:05] (03Abandoned) 10Majavah: kerberos: manage users with custom puppet type [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [08:59:15] (03CR) 10Btullis: [V:03+1] Add the new cephosd200[1-3] servers in codfw to their role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166866 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [08:59:35] !log aklapper@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.9 refs T392179 (duration: 43m 18s) [08:59:38] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179 [09:02:46] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167152 (https://phabricator.wikimedia.org/T392179) [09:02:47] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167152 (https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [09:03:47] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167152 
(https://phabricator.wikimedia.org/T392179) (owner: 10TrainBranchBot) [09:04:38] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema-eqiad [09:08:13] (03PS2) 10Vgutierrez: varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) [09:12:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema-eqiad [09:15:10] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:15:25] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.9 refs T392179 [09:15:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:15:30] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179 [09:17:00] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:17:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:18:53] (03PS3) 10Vgutierrez: varnish: Prevent unknown clients from reaching /evt-103e/v2/events [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) [09:18:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10982612 (10cmooney) 05Open→03Resolved >>! In T398598#10980766, @Jhancock.wm wrote: > reset the tripped breaker in D3. On... 
[09:19:11] I had a quick look at lists1004, nothing out of the ordinary [09:19:26] If this croaks again in the day I'll have a more serious look [09:19:54] (03CR) 10Vgutierrez: "varnishtests are happy: `0 tests failed, 0 tests skipped, 39 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1167151 (https://phabricator.wikimedia.org/T398181) (owner: 10Vgutierrez) [09:21:53] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2329.mgmt:22 - https://phabricator.wikimedia.org/T398559#10982616 (10cmooney) 05Open→03Resolved a:03cmooney [09:21:55] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for bast2003.mgmt:22 - https://phabricator.wikimedia.org/T398557#10982619 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:02] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2219.mgmt:22 - https://phabricator.wikimedia.org/T398556#10982622 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:19] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2181.mgmt:22 - https://phabricator.wikimedia.org/T398573#10982625 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:27] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for aux-k8s-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T398572#10982628 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:33] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2213.mgmt:22 - https://phabricator.wikimedia.org/T398571#10982631 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:41] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2040.mgmt:22 - https://phabricator.wikimedia.org/T398570#10982634 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:48] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2044.mgmt:22 - https://phabricator.wikimedia.org/T398569#10982637 (10cmooney) 05Open→03Resolved a:03cmooney [09:22:54] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2182.mgmt:22 - https://phabricator.wikimedia.org/T398568#10982640 (10cmooney)
05Open→03Resolved a:03cmooney [09:23:02] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for puppetdb2003.mgmt:22 - https://phabricator.wikimedia.org/T398567#10982643 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:15] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2173.mgmt:22 - https://phabricator.wikimedia.org/T398565#10982646 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:23] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2217.mgmt:22 - https://phabricator.wikimedia.org/T398564#10982649 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:29] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2330.mgmt:22 - https://phabricator.wikimedia.org/T398563#10982652 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:37] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2320.mgmt:22 - https://phabricator.wikimedia.org/T398562#10982655 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:44] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2201.mgmt:22 - https://phabricator.wikimedia.org/T398561#10982658 (10cmooney) 05Open→03Resolved a:03cmooney [09:23:52] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for pc2016.mgmt:22 - https://phabricator.wikimedia.org/T398560#10982661 (10cmooney) 05Open→03Resolved a:03cmooney [09:30:38] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3437 MB (3% inode=98%): /tmp 3437 MB (3% inode=98%): /var/tmp 3437 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [09:38:46] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167158 [09:39:17] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167158 (owner: 10PipelineBot) [09:40:40] 06SRE, 
06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10982781 (10MoritzMuehlenhoff) [09:40:50] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167158 (owner: 10PipelineBot) [09:41:08] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:41:34] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:44:41] (03PS1) 10Ladsgroup: tables-catalog: Temporarily set categorylinks to partially public [puppet] - 10https://gerrit.wikimedia.org/r/1167159 (https://phabricator.wikimedia.org/T299951) [09:45:39] (03PS1) 10Jgiannelos: pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 [09:46:34] (03CR) 10Clément Goubert: [C:03+1] api-gateway: use ratelimit's inbuilt promethus-statsd agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166790 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [09:46:46] (03PS2) 10Jgiannelos: pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 [09:47:21] (03CR) 10Hnowlan: [C:03+1] pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 (owner: 10Jgiannelos) [09:47:31] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Temporarily set categorylinks to partially public [puppet] - 10https://gerrit.wikimedia.org/r/1167159 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [09:48:05] (03CR) 10Jgiannelos: [C:03+2] pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 (owner: 10Jgiannelos) [09:50:02] (03Merged) 10jenkins-bot: pcs: Enable profiler on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167160 (owner: 10Jgiannelos) [09:50:59] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:51:06] !log 
jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:51:22] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [09:51:31] !log installing openssl security updates on Bullseye [09:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:40] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [09:52:17] (03PS4) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) [09:53:10] !log dropping term store tables on s8 sanitarium master (T351820) [09:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:13] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [09:55:11] (03PS1) 10Zabe: Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 [09:55:57] (03CR) 10CI reject: [V:04-1] Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 (owner: 10Zabe) [09:56:18] (03PS2) 10Zabe: Remove redundant group0 config for categorylinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167162 [09:57:59] (03PS1) 10Zabe: Set categorylinks to read new in cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167164 (https://phabricator.wikimedia.org/T397912) [09:58:05] (03PS1) 10Majavah: hieradata: Bump Striker to 2025-07-08-094946-production [puppet] - 10https://gerrit.wikimedia.org/r/1167165 (https://phabricator.wikimedia.org/T355663) [09:59:46] (03CR) 10Majavah: [C:03+2] hieradata: Bump Striker to 2025-07-08-094946-production [puppet] - 10https://gerrit.wikimedia.org/r/1167165 (https://phabricator.wikimedia.org/T355663) (owner: 10Majavah) [10:00:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250708T1000) [10:01:55] (03PS1) 10Marostegui: db2157: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167168 (https://phabricator.wikimedia.org/T398928) [10:02:12] PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:03:34] (03CR) 10Marostegui: [C:03+2] db2157: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167168 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [10:03:46] it'll recover soon [10:04:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2157', diff saved to https://phabricator.wikimedia.org/P78795 and previous config saved to /var/cache/conftool/dbconfig/20250708-100434-marostegui.json [10:05:16] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10982848 (10BTullis) [10:05:48] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10982849 (10BTullis) 05Open→03Resolved a:03BTullis [10:06:14] (03PS1) 10Jcrespo: mariadb: Upgrade db1216 & db2201 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1167173 (https://phabricator.wikimedia.org/T398928) [10:07:14] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10982862 (10MoritzMuehlenhoff) [10:07:29] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [10:09:15] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [10:11:40] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [10:12:33] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:13:12] RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [10:14:36] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:14:37] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [10:16:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [10:20:33] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:21:11] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:21:12] (03CR) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [10:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1159 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78796 and previous config saved to /var/cache/conftool/dbconfig/20250708-102114-marostegui.json [10:21:35] (03PS10) 10Fabfur: varnish: replace X-Public-Cloud with new 
X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [10:21:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78797 and previous config saved to /var/cache/conftool/dbconfig/20250708-102140-root.json [10:21:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1004.eqiad.wmnet [10:25:17] (03PS1) 10Ladsgroup: api-testing: Loosen the assert on max-age header [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167176 [10:25:49] (03PS2) 10Ladsgroup: Fully get rid of tracking and updating pages [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167149 (https://phabricator.wikimedia.org/T398033)