[00:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:05:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88571 and previous config saved to /var/cache/conftool/dbconfig/20260204-000501-marostegui.json
[00:05:04] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:05:08] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1001.eqiad.wmnet with OS trixie
[00:05:16] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[00:05:18] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581394 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1001.eqiad.wmnet with OS trixie
[00:05:24] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581395 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[00:09:46] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[00:09:55] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[00:11:29] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1004.eqiad.wmnet with OS trixie
[00:11:38] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581404 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1004.eqiad.wmnet with OS trixie
[00:11:42] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[00:11:49] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581405 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[00:13:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[00:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:15:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P88572 and previous config saved to /var/cache/conftool/dbconfig/20260204-001509-marostegui.json
[00:17:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[00:17:50] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1001.eqiad.wmnet with reason: host reimage
[00:17:54] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1002.eqiad.wmnet with reason: host reimage
[00:21:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1001.eqiad.wmnet with reason: host reimage
[00:23:37] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1004.eqiad.wmnet with reason: host reimage
[00:24:14] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1003.eqiad.wmnet with reason: host reimage
[00:25:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P88573 and previous config saved to /var/cache/conftool/dbconfig/20260204-002518-marostegui.json
[00:25:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1002.eqiad.wmnet with reason: host reimage
[00:29:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1003.eqiad.wmnet with reason: host reimage
[00:30:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie
[00:30:36] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie completed: - tools-k8s-ctrl1002 (**PASS**) - Down...
[00:33:11] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:33:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie
[00:33:45] (PS1) Dzahn: wmnet: upgrade vrts from the "without multiple backends" section [dns] - https://gerrit.wikimedia.org/r/1236384
[00:33:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1004.eqiad.wmnet with reason: host reimage
[00:33:48] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581536 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie completed: - tools-k8s-ctrl1001 (**PASS**) - Down...
[00:34:19] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:35:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88574 and previous config saved to /var/cache/conftool/dbconfig/20260204-003526-marostegui.json
[00:35:30] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:35:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[00:35:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T415786)', diff saved to https://phabricator.wikimedia.org/P88575 and previous config saved to /var/cache/conftool/dbconfig/20260204-003551-marostegui.json
[00:37:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1001.eqiad.wmnet with OS trixie
[00:37:54] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581544 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1001.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1001 (**PASS**) -...
[00:38:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:40:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236385
[00:40:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236385 (owner: TrainBranchBot)
[00:40:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:41:22] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:42:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[00:42:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:42:30] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581564 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1002 (**PASS**) -...
[00:44:27] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1005.eqiad.wmnet with OS trixie
[00:44:40] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581583 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1005.eqiad.wmnet with OS trixie
[00:45:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1006.eqiad.wmnet with OS trixie
[00:45:10] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581584 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1006.eqiad.wmnet with OS trixie
[00:45:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[00:45:19] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581585 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1003.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1003 (**PASS**) -...
[00:45:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:46:07] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1007.eqiad.wmnet with OS trixie
[00:46:14] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581586 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1007.eqiad.wmnet with OS trixie
[00:49:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:49:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1004.eqiad.wmnet with OS trixie
[00:49:43] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581590 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1004.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1004 (**PASS**) -...
[00:50:09] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1008.eqiad.wmnet with OS trixie
[00:50:17] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581592 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1008.eqiad.wmnet with OS trixie
[00:51:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236385 (owner: TrainBranchBot)
[00:54:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88576 and previous config saved to /var/cache/conftool/dbconfig/20260204-005419-marostegui.json
[00:54:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:55:36] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1005.eqiad.wmnet with reason: host reimage
[00:56:28] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1006.eqiad.wmnet with reason: host reimage
[00:57:31] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1007.eqiad.wmnet with reason: host reimage
[00:59:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1005.eqiad.wmnet with reason: host reimage
[01:00:57] (PS1) Dzahn: zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119)
[01:01:09] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1008.eqiad.wmnet with reason: host reimage
[01:01:26] (CR) CI reject: [V:-1] zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:03:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1007.eqiad.wmnet with reason: host reimage
[01:05:53] (PS2) Dzahn: zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119)
[01:07:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1008.eqiad.wmnet with reason: host reimage
[01:09:16] (PS1) Ladsgroup: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236387 (https://phabricator.wikimedia.org/T414080)
[01:09:27] (PS1) Ladsgroup: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236388 (https://phabricator.wikimedia.org/T414080)
[01:09:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P88577 and previous config saved to /var/cache/conftool/dbconfig/20260204-010928-marostegui.json
[01:10:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236389
[01:10:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236389 (owner: TrainBranchBot)
[01:10:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1006.eqiad.wmnet with reason: host reimage
[01:11:20] jouncebot: nowandnext
[01:11:20] No deployments scheduled for the next 5 hour(s) and 48 minute(s)
[01:11:20] In 5 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0700)
[01:11:29] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1236386/7970/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:11:30] (CR) Ladsgroup: [C:+2] UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236387 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:11:33] (CR) Ladsgroup: [C:+2] UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236388 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:12:22] (PS3) Dzahn: zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119)
[01:15:33] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:15:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:15:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1005.eqiad.wmnet with OS trixie
[01:16:01] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581623 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1005.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1005 (**PASS**) -...
[01:16:56] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581624 (Jclark-ctr)
[01:19:26] (PS1) Dzahn: zuul: move cert paths to role level, drop host-name based config [puppet] - https://gerrit.wikimedia.org/r/1236390 (https://phabricator.wikimedia.org/T405119)
[01:19:58] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:20:37] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1236386/7970/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:20:50] (CR) Dzahn: [V:+1] "so far so good, but what about the truststore path" [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:20:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:20:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1007.eqiad.wmnet with OS trixie
[01:21:04] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581632 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1007.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1007 (**PASS**) -...
[01:21:11] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1236386/7970/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1236390 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:21:56] (Merged) jenkins-bot: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236387 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:22:20] (Merged) jenkins-bot: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236388 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:23:38] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:24:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:24:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1008.eqiad.wmnet with OS trixie
[01:24:12] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581638 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1008.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1008 (**PASS**) -...
[01:24:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P88578 and previous config saved to /var/cache/conftool/dbconfig/20260204-012436-marostegui.json
[01:25:32] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1236387|UserImpact: Remove zeros in per-article view stats (T414080)]], [[gerrit:1236388|UserImpact: Remove zeros in per-article view stats (T414080)]]
[01:25:35] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080
[01:26:49] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:27:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:27:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1006.eqiad.wmnet with OS trixie
[01:27:33] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581642 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1006.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1006 (**PASS**) -...
[01:29:37] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1236387|UserImpact: Remove zeros in per-article view stats (T414080)]], [[gerrit:1236388|UserImpact: Remove zeros in per-article view stats (T414080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:29:59] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[01:34:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236389 (owner: TrainBranchBot)
[01:36:10] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236387|UserImpact: Remove zeros in per-article view stats (T414080)]], [[gerrit:1236388|UserImpact: Remove zeros in per-article view stats (T414080)]] (duration: 10m 38s)
[01:36:13] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080
[01:39:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88579 and previous config saved to /var/cache/conftool/dbconfig/20260204-013944-marostegui.json
[01:39:48] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[01:39:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[01:39:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T415786)', diff saved to https://phabricator.wikimedia.org/P88580 and previous config saved to /var/cache/conftool/dbconfig/20260204-013958-marostegui.json
[01:41:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T415786)', diff saved to https://phabricator.wikimedia.org/P88581 and previous config saved to /var/cache/conftool/dbconfig/20260204-014127-marostegui.json
[01:48:40] FIRING: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:55:20] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581667 (Jclark-ctr) Open→Resolved
[01:56:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P88582 and previous config saved to /var/cache/conftool/dbconfig/20260204-015635-marostegui.json
[02:00:51] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:06:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T415786)', diff saved to https://phabricator.wikimedia.org/P88583 and previous config saved to /var/cache/conftool/dbconfig/20260204-020609-marostegui.json
[02:06:13] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[02:11:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P88584 and previous config saved to /var/cache/conftool/dbconfig/20260204-021144-marostegui.json
[02:13:42] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 50s)
[02:16:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P88585 and previous config saved to /var/cache/conftool/dbconfig/20260204-021617-marostegui.json
[02:26:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P88586 and previous config saved to /var/cache/conftool/dbconfig/20260204-022626-marostegui.json
[02:26:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T415786)', diff saved to https://phabricator.wikimedia.org/P88587 and previous config saved to /var/cache/conftool/dbconfig/20260204-022652-marostegui.json
[02:26:56] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[02:27:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance
[02:27:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T415786)', diff saved to https://phabricator.wikimedia.org/P88588 and previous config saved to /var/cache/conftool/dbconfig/20260204-022717-marostegui.json
[02:36:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T415786)', diff saved to https://phabricator.wikimedia.org/P88589 and previous config saved to /var/cache/conftool/dbconfig/20260204-023634-marostegui.json
[02:36:38] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[02:36:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[02:36:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T415786)', diff saved to https://phabricator.wikimedia.org/P88590 and previous config saved to /var/cache/conftool/dbconfig/20260204-023659-marostegui.json
[02:45:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T415786)', diff saved to https://phabricator.wikimedia.org/P88591 and previous config saved to /var/cache/conftool/dbconfig/20260204-024521-marostegui.json
[02:45:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[03:00:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P88592 and previous config saved to /var/cache/conftool/dbconfig/20260204-030029-marostegui.json
[03:10:35] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875#11581780 (AntiCompositeNumber)
[03:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[03:15:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P88593 and previous config saved to /var/cache/conftool/dbconfig/20260204-031537-marostegui.json
[03:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:30:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T415786)', diff saved to https://phabricator.wikimedia.org/P88594 and previous config saved to /var/cache/conftool/dbconfig/20260204-033046-marostegui.json
[03:30:49] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[03:31:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[03:31:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88595 and previous config saved to /var/cache/conftool/dbconfig/20260204-033110-marostegui.json
[03:56:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T415786)', diff saved to https://phabricator.wikimedia.org/P88596 and previous config saved to /var/cache/conftool/dbconfig/20260204-035612-marostegui.json
[03:56:16] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[04:09:29] SRE-SLO, Observability-Alerting, Patch-For-Review, SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579#11581798 (herron) >! In T414579#11581131, @tappof wrote: >> * Templates SLO manifests and allows default values (e.g. default alert state) >> * Allow...
[04:09:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T415786)', diff saved to https://phabricator.wikimedia.org/P88597 and previous config saved to /var/cache/conftool/dbconfig/20260204-040933-marostegui.json
[04:09:38] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[04:11:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P88598 and previous config saved to /var/cache/conftool/dbconfig/20260204-041121-marostegui.json
[04:19:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P88599 and previous config saved to /var/cache/conftool/dbconfig/20260204-041941-marostegui.json
[04:26:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P88600 and previous config saved to /var/cache/conftool/dbconfig/20260204-042629-marostegui.json
[04:29:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P88601 and previous config saved to /var/cache/conftool/dbconfig/20260204-042950-marostegui.json
[04:39:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T415786)', diff saved to https://phabricator.wikimedia.org/P88602 and previous config saved to /var/cache/conftool/dbconfig/20260204-043958-marostegui.json
[04:40:02] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[04:40:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[04:40:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T415786)', diff saved to https://phabricator.wikimedia.org/P88603 and previous config saved to /var/cache/conftool/dbconfig/20260204-044022-marostegui.json
[04:41:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T415786)', diff saved to https://phabricator.wikimedia.org/P88604 and previous config saved to /var/cache/conftool/dbconfig/20260204-044137-marostegui.json
[04:41:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2202.codfw.wmnet with reason: Maintenance
[04:59:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88605 and previous config saved to /var/cache/conftool/dbconfig/20260204-045953-marostegui.json
[04:59:57] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[05:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P88606 and previous config saved to /var/cache/conftool/dbconfig/20260204-051501-marostegui.json
[05:26:27] (PS1) QChris: Add .gitreview [slothslos] - https://gerrit.wikimedia.org/r/1236432
[05:26:27] (CR) QChris: [V:+2 C:+2] Add .gitreview [slothslos] - https://gerrit.wikimedia.org/r/1236432 (owner: QChris)
[05:30:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P88607 and previous config saved to /var/cache/conftool/dbconfig/20260204-053009-marostegui.json
[05:34:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:45:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88608 and previous config saved to /var/cache/conftool/dbconfig/20260204-054518-marostegui.json
[05:45:21] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[05:45:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[05:45:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T415786)', diff saved to https://phabricator.wikimedia.org/P88609 and previous config saved to /var/cache/conftool/dbconfig/20260204-054542-marostegui.json
[05:48:40] FIRING: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:05:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2212.codfw.wmnet with reason: Maintenance
[06:05:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2212 (T415786)', diff
saved to https://phabricator.wikimedia.org/P88610 and previous config saved to /var/cache/conftool/dbconfig/20260204-060516-marostegui.json [06:05:20] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:10:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T415786)', diff saved to https://phabricator.wikimedia.org/P88611 and previous config saved to /var/cache/conftool/dbconfig/20260204-061047-marostegui.json [06:10:51] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:11:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2207 with weight 0 T416300', diff saved to https://phabricator.wikimedia.org/P88612 and previous config saved to /var/cache/conftool/dbconfig/20260204-061122-marostegui.json [06:11:26] T416300: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T416300 [06:11:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T416300 [06:11:53] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1236109 (https://phabricator.wikimedia.org/T416300) (owner: 10Gerrit maintenance bot) [06:12:59] !log Starting s2 codfw failover from db2204 to db2207 - T416300 [06:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:16:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T416300', diff saved to https://phabricator.wikimedia.org/P88613 and previous config saved to 
/var/cache/conftool/dbconfig/20260204-061613-marostegui.json [06:16:17] (03CR) 10Ayounsi: [C:03+1] DNS: Enable Bird 2.18 for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1228560 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [06:16:31] (03CR) 10Ayounsi: [C:03+1] "lgtm but leaving the last call to Sukhe" [puppet] - 10https://gerrit.wikimedia.org/r/1228560 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [06:16:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2207 to s2 primary and set section read-write T416300', diff saved to https://phabricator.wikimedia.org/P88614 and previous config saved to /var/cache/conftool/dbconfig/20260204-061637-marostegui.json [06:16:41] T416300: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T416300 [06:16:58] (03CR) 10Marostegui: [C:03+2] wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1236110 (https://phabricator.wikimedia.org/T416300) (owner: 10Gerrit maintenance bot) [06:17:04] !log marostegui@dns1006 START - running authdns-update [06:17:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2204 T416300', diff saved to https://phabricator.wikimedia.org/P88615 and previous config saved to /var/cache/conftool/dbconfig/20260204-061739-marostegui.json [06:18:07] !log marostegui@dns1006 END - running authdns-update [06:20:09] (03PS1) 10Marostegui: db2204: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1236602 (https://phabricator.wikimedia.org/T415786) [06:20:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P88616 and previous config saved to /var/cache/conftool/dbconfig/20260204-062055-marostegui.json [06:30:53] 06SRE, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: On using Wikimedia Maps to build Kiwix Openstreetmap ZIMs - https://phabricator.wikimedia.org/T416374#11581910 (10Bugreporter) This does not 
need to add a whitelist. Instead you need to set a proper referer when fetching tiles. [06:31:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P88617 and previous config saved to /var/cache/conftool/dbconfig/20260204-063103-marostegui.json [06:34:46] (03CR) 10Marostegui: [C:03+2] db2204: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1236602 (https://phabricator.wikimedia.org/T415786) (owner: 10Marostegui) [06:41:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T415786)', diff saved to https://phabricator.wikimedia.org/P88618 and previous config saved to /var/cache/conftool/dbconfig/20260204-064107-marostegui.json [06:41:12] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:41:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T415786)', diff saved to https://phabricator.wikimedia.org/P88619 and previous config saved to /var/cache/conftool/dbconfig/20260204-064118-marostegui.json [06:41:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [06:44:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:56:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P88620 and previous config saved to /var/cache/conftool/dbconfig/20260204-065616-marostegui.json [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0700) [07:11:25] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P88621 and previous config saved to /var/cache/conftool/dbconfig/20260204-071124-marostegui.json [07:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [07:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:26:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T415786)', diff saved to https://phabricator.wikimedia.org/P88622 and previous config saved to /var/cache/conftool/dbconfig/20260204-072632-marostegui.json [07:26:36] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [07:26:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1253.eqiad.wmnet with reason: Maintenance [07:26:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T415786)', diff saved to https://phabricator.wikimedia.org/P88623 and previous config saved to /var/cache/conftool/dbconfig/20260204-072658-marostegui.json [07:34:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2204.codfw.wmnet with reason: Schema change [07:35:41] !log Deploy schema change on db2204 (old s2 codfw master) T415786 [07:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:45] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [07:37:35] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2212 (T415786)', diff saved to https://phabricator.wikimedia.org/P88624 and previous config saved to /var/cache/conftool/dbconfig/20260204-073735-marostegui.json [07:39:07] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [07:49:46] (03PS1) 10Muehlenhoff: kerberos::kadminserver: Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/1236622 [07:50:43] (03CR) 10Tiziano Fogli: [C:03+2] centralauth: add recording rules for grafana widgets (write) [puppet] - 10https://gerrit.wikimedia.org/r/1236233 (https://phabricator.wikimedia.org/T415035) (owner: 10Tiziano Fogli) [07:52:40] (03CR) 10Muehlenhoff: [C:03+2] kerberos::kadminserver: Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/1236622 (owner: 10Muehlenhoff) [07:52:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P88625 and previous config saved to /var/cache/conftool/dbconfig/20260204-075243-marostegui.json [07:57:36] (03PS1) 10Muehlenhoff: Make bitu-account-managers manageable in idm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1236652 [07:59:15] (03PS1) 10Slyngshede: P:idm bitu-account-managers permission [puppet] - 10https://gerrit.wikimedia.org/r/1236657 [08:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:01:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1236657 (owner: 10Slyngshede) [08:01:44] (03Abandoned) 10Muehlenhoff: Make bitu-account-managers manageable in idm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1236652 (owner: 10Muehlenhoff) [08:03:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:06] (03CR) 10Slyngshede: [C:03+2] P:idm bitu-account-managers permission [puppet] - 10https://gerrit.wikimedia.org/r/1236657 (owner: 10Slyngshede) [08:07:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P88626 and previous config saved to /var/cache/conftool/dbconfig/20260204-080751-marostegui.json [08:08:03] (03CR) 10Elukey: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236333 (owner: 10Muehlenhoff) [08:08:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [08:09:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11582089 (10elukey) 05Open→03Resolved a:03elukey [08:12:22] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1235781 (owner: 10L10n-bot) [08:13:43] (03PS1) 10Slyngshede: P:idm remove approver [puppet] - 10https://gerrit.wikimedia.org/r/1236668 [08:19:36] (03CR) 10Tiziano Fogli: "A couple of considerations:" [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [08:22:08] jouncebot: now [08:22:08] For the next 0 hour(s) and 37 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0800) [08:23:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T415786)', diff saved to https://phabricator.wikimedia.org/P88627 and previous config saved to /var/cache/conftool/dbconfig/20260204-082259-marostegui.json [08:23:03] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [08:23:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2216.codfw.wmnet with reason: Maintenance [08:23:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T415786)', diff saved to https://phabricator.wikimedia.org/P88628 and previous config saved to /var/cache/conftool/dbconfig/20260204-082324-marostegui.json [08:23:36] Amir1, urbanecm: I'm going to backport a patch that missed the train. I'll self service [08:24:31] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11582113 (10Jacob_WMDE) Hi @Dzahn, I've just emailed Katie, I'll let you know once this step is complete. 
[08:24:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T415786)', diff saved to https://phabricator.wikimedia.org/P88629 and previous config saved to /var/cache/conftool/dbconfig/20260204-082450-marostegui.json [08:25:35] (03PS1) 10Phuedx: ext.wikimediaEvents: Add Test Kitchen new external path test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) [08:29:19] (03CR) 10CI reject: [V:04-1] ext.wikimediaEvents: Add Test Kitchen new external path test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) (owner: 10Phuedx) [08:31:46] (03CR) 10Phuedx: "Recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) (owner: 10Phuedx) [08:38:16] OK. I'm not going to backport the change yet. There's a test failure that I'll need to dig into [08:38:30] (03PS1) 10Elukey: installserver: add EFI preseed config for ms-fe102[14] [puppet] - 10https://gerrit.wikimedia.org/r/1236671 (https://phabricator.wikimedia.org/T416245) [08:39:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P88630 and previous config saved to /var/cache/conftool/dbconfig/20260204-083958-marostegui.json [08:40:25] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:40:29] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:41:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:41:36] 
(03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236671 (https://phabricator.wikimedia.org/T416245) (owner: 10Elukey) [08:42:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:44:56] (03CR) 10Elukey: [C:03+2] installserver: add EFI preseed config for ms-fe102[14] [puppet] - 10https://gerrit.wikimedia.org/r/1236671 (https://phabricator.wikimedia.org/T416245) (owner: 10Elukey) [08:47:47] phuedx: hi, take your time! :) I am the one running the MW train this week. I haven't even started my daily routine yet [08:47:47] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11582175 (10elukey) @Jclark-ctr Matthew is out this week, I just merged a change that should unblock you. Lemme know how it goes! 
[08:49:41] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:50:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:51:10] FIRING: [8x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:52:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:53:49] (03PS1) 10Elukey: services: upgrade thumbor's haproxy container to Bookworm and 2.8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236673 [08:55:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P88631 and previous config saved to /var/cache/conftool/dbconfig/20260204-085506-marostegui.json [08:59:25] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:59:29] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:59:41] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0900) [09:00:29] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:01:10] FIRING: [9x] BFDdown: 
BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:02:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:05:43] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:06:10] RESOLVED: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:06:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:06:40] FIRING: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:07:54] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:08:54] (03CR) 10Gehel: [C:04-1] "See comments inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [09:10:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T415786)', diff saved to https://phabricator.wikimedia.org/P88632 and previous config saved to /var/cache/conftool/dbconfig/20260204-091015-marostegui.json [09:10:18] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:10:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:11:16] (03CR) 10Jelto: [C:03+1] "lgtm now, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:11:25] FIRING: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:12:29] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:12:43] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:12:52] phuedx: about your WikimediaEvents patch, I am not sure it is the cause of the CI failure since the error seems to be in CheckUser. It might be missing an extension [09:13:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11582231 (10Gehel) a:05Jclark-ctr→03BTullis [09:13:45] hashar: I think so too but I don't want this to block you from deploying the train. 
I'll abandon the patch for now [09:13:55] The patch that I'm trying to backport is in -wmf.14 anyway so [09:14:07] (03Abandoned) 10Phuedx: ext.wikimediaEvents: Add Test Kitchen new external path test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) (owner: 10Phuedx) [09:14:38] I might have broken it while removing recursive injection of dependencies, or something changed in CheckUser that suddenly hard require another ext [09:14:39] :/ [09:16:25] RESOLVED: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:16:58] (03CR) 10Dzahn: [C:03+2] "also needs a check if this is the active host around it - quickdatacopy only installs rsync service where needed - unless we change that" [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:17:54] RESOLVED: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:18:07] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451 (10Blake) 03NEW [09:18:23] phuedx: confirmed, I ran the CI job for WikimediaEvents @ wmf/1.46.0-wmf.13 and it fails the same way https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83/36726//console [09:18:31] and I am pretty sure that is due to CheckUser [09:27:13] (03PS1) 10Dzahn: vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) [09:27:32] (03CR) 10CI reject: [V:04-1] vrts: ensure rsync auto restart only 
on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:28:23] (03PS2) 10Dzahn: vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) [09:29:40] 06SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452 (10elukey) 03NEW [09:29:42] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1236674" [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:29:46] I am doing the train [09:30:25] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236675 (https://phabricator.wikimedia.org/T413805) [09:30:28] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236675 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:31:44] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236675 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:32:39] (03PS3) 10Dzahn: vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) [09:34:03] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cumin2003 - https://phabricator.wikimedia.org/T416385#11582293 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:34:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1251.eqiad.wmnet with reason: Maintenance [09:34:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T415786)', diff saved to https://phabricator.wikimedia.org/P88634 and previous config saved to 
/var/cache/conftool/dbconfig/20260204-093421-marostegui.json [09:34:25] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:34:30] (03PS1) 10Muehlenhoff: Add site.pp/preseed for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1236676 (https://phabricator.wikimedia.org/T461385) [09:34:34] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1236674/7972/" [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:34:46] (03PS2) 10Elukey: services: upgrade thumbor's haproxy container to Bookworm and 2.8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236673 (https://phabricator.wikimedia.org/T416452) [09:34:52] (03CR) 10Jelto: [C:03+1] vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:35:13] (03CR) 10Dzahn: [C:03+2] vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:37:53] (03Abandoned) 10Slyngshede: Meta IP location changes [dns] - 10https://gerrit.wikimedia.org/r/1216806 (owner: 10Slyngshede) [09:37:54] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.14 refs T413805 [09:37:57] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [09:38:20] !log installing openssl security updates [09:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:53] (03CR) 10Muehlenhoff: [C:03+2] Add Cumin alias for crm [puppet] - 10https://gerrit.wikimedia.org/r/1236238 (owner: 10Muehlenhoff) [09:41:35] (03CR) 10Dzahn: [C:03+2] ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 
10Dzahn) [09:43:15] (03CR) 10Elukey: [C:03+1] Add site.pp/preseed for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1236676 (https://phabricator.wikimedia.org/T461385) (owner: 10Muehlenhoff) [09:47:26] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for auth and bot (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) [09:49:31] (03CR) 10Muehlenhoff: [C:03+2] Add site.pp/preseed for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1236676 (https://phabricator.wikimedia.org/T461385) (owner: 10Muehlenhoff) [09:54:52] I am going to restart Jenkins instances [09:55:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T415786)', diff saved to https://phabricator.wikimedia.org/P88635 and previous config saved to /var/cache/conftool/dbconfig/20260204-095510-marostegui.json [09:55:14] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:56:53] * hashar waits for job to complete [09:59:26] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:59:44] (03CR) 10Jelto: [V:03+1] "I591dcb36570281234854fb3cdb90fc3386ce87a9 adds general support to set the QoS to low optionally but the default is unchanged. This change " [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [09:59:44] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cumin2003 - https://phabricator.wikimedia.org/T416385#11582363 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Jhancock.wm >>! In T416385#11580539, @Jhancock.wm wrote: > @MoritzMuehlenhoff > when you or someone you can delegate this to can, could you fil... 
[10:01:54] (03CR) 10Vgutierrez: [C:03+1] varnish: set Retry-After for cli_tool, wdqs and library policies [puppet] - 10https://gerrit.wikimedia.org/r/1230937 (https://phabricator.wikimedia.org/T415375) (owner: 10Fabfur) [10:02:31] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11582371 (10MoritzMuehlenhoff) The conf* servers are tricky to reboot, they've been often skipped in the past (as visible by th... [10:06:02] !log Restarting Gerrit instances [10:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:06:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T415786)', diff saved to https://phabricator.wikimedia.org/P88636 and previous config saved to /var/cache/conftool/dbconfig/20260204-100638-marostegui.json [10:06:42] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [10:06:57] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11582392 (10MoritzMuehlenhoff) @Reedy Due to an oversight "Bitu-account-managers" was only requestable on the test instance for Bitu. This has now been fixed, please request it on http... [10:09:08] stopping it again [10:09:32] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11582397 (10Blake) Hey Moritz, thanks, that makes sense. Does that mean we'd only reboot the codfw hosts, as eqiad will be the...
[10:10:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P88637 and previous config saved to /var/cache/conftool/dbconfig/20260204-101018-marostegui.json [10:10:42] !log Gerrit is back [10:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:11:19] !log Restarted CI Jenkins [10:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:31] (03PS1) 10Elukey: admin: add user ggalofre to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1236686 (https://phabricator.wikimedia.org/T415172) [10:14:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1236686 (https://phabricator.wikimedia.org/T415172) (owner: 10Elukey) [10:15:40] I am getting a coffee break and I'll check the logs [10:16:03] (03CR) 10Elukey: [C:03+2] admin: add user ggalofre to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1236686 (https://phabricator.wikimedia.org/T415172) (owner: 10Elukey) [10:16:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:16:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - 
https://phabricator.wikimedia.org/T414725#11582410 (10jcrespo) >>! In T414725#11580692, @Jclark-ctr wrote: > @jcrespo with eLukey and Topranks help we where able to get it to start imaging... [10:16:47] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic, 13Patch-For-Review: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11582411 (10Gehel) [10:22:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11582435 (10elukey) 05In progress→03Resolved Data access is propagating now, it will be available in ~30 mins. Going to close, please reopen and/... [10:25:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P88638 and previous config saved to /var/cache/conftool/dbconfig/20260204-102527-marostegui.json [10:27:59] ah metawiki OAuth fails with `Key cannot be empty` [10:28:00] pff [10:29:25] !log Rolling back to group0 due to an issue with OAuth on metawiki # T413805 [10:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:28] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [10:29:42] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236687 (https://phabricator.wikimedia.org/T413805) [10:29:45] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236687 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [10:30:36] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236687 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [10:31:46] T416456 [10:31:47] T416456: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty 
(/w/rest.php/oauth2/access_token) - https://phabricator.wikimedia.org/T416456 [10:33:07] (03CR) 10Ayounsi: [C:03+1] Add Nokia BGP routing policy for wikikube-worker / k8s hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1229562 (https://phabricator.wikimedia.org/T408757) (owner: 10Cathal Mooney) [10:33:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:36:42] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.14 refs T413805 [10:36:46] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [10:36:50] (03CR) 10Dzahn: [C:03+1] "oh yea, my comment was about a previous PS then" [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [10:37:16] (03PS1) 10Elukey: install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [10:38:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:38:18] (03PS2) 10Elukey: install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [10:38:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:39:12] !log upgrade cloudcumin2001 to bookworm T403153 [10:39:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:15] T403153: Upgrade cloudcumin hosts to bookworm/trixie - https://phabricator.wikimedia.org/T403153 [10:39:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11582518 (10elukey) @jcrespo Hi! I guess you refer to https://wikitech.wikimedia.org/wiki/UEFI_Boot, we can definitely add more docs together if you have time. I file... [10:40:14] (03CR) 10CI reject: [V:04-1] install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [10:40:34] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting for auth and bot (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) [10:40:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T415786)', diff saved to https://phabricator.wikimedia.org/P88639 and previous config saved to /var/cache/conftool/dbconfig/20260204-104035-marostegui.json [10:40:39] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [10:41:09] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:41:48] (03PS3) 10Fabfur: cache::upload: enable global ratelimiting for bot (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) [10:42:44] (03CR) 10Alexandros Kosiaris: [C:04-1] "Left a couple of comments for the commit message. Simply put it reads like slop and doesn't represent what the patch does. 
The change itse" [puppet] - 10https://gerrit.wikimedia.org/r/1222271 (https://phabricator.wikimedia.org/T201491) (owner: 10Divyaratann Srivastava) [10:43:15] (03PS3) 10Elukey: install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [10:44:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet [10:45:48] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:47:03] !log installing openjdk-17 security updates [10:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:32] (03CR) 10Fabfur: [C:03+2] varnish: set Retry-After for cli_tool, wdqs and library policies [puppet] - 10https://gerrit.wikimedia.org/r/1230937 (https://phabricator.wikimedia.org/T415375) (owner: 10Fabfur) [10:48:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet [10:52:29] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11582550 (10jijiki) [10:52:50] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11582551 (10jijiki) [10:53:19] (03CR) 10Alexandros Kosiaris: [C:03+1] docker_registry: move /v2/restricted to the s3 restricted backend [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [10:54:14] (03CR) 10Alexandros Kosiaris: [C:03+1] ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549 (owner: 10Majavah) [10:55:13] (03CR) 10Vgutierrez: [C:03+1] "VTCs are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:55:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE 
(2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11582558 (10Gehel) a:05Jclark-ctr→03BTullis [10:56:43] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11582569 (10Gehel) With the various investigations that have happened around Airflow, do we now have a... [10:59:05] (03PS1) 10Kosta Harlan: IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236689 (https://phabricator.wikimedia.org/T416316) [10:59:19] (03PS1) 10Kosta Harlan: IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236690 (https://phabricator.wikimedia.org/T416316) [10:59:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/IPReputation] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236690 (https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1100) [11:00:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236689 (https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [11:02:58] (03PS1) 10Zabe: Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236692 (https://phabricator.wikimedia.org/T416456) [11:03:40] (03PS4) 10Elukey: 
install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725)