[00:00:47] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Priority Backlog 📥): Update default GitLab runner image to a base image without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423971 (10thcipriani) 03NEW [00:03:07] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Priority Backlog 📥): Rebuild dev-images using a base image without mirrors.wikimedia.org in the apt sources - https://phabricator.wikimedia.org/T423972 (10thcipriani) 03NEW [00:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842185 (10phaultfinder) [00:22:38] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 25.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:28:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:28:33] (03PS1) 10C. Scott Ananian: [tests] add ParsoidLanguageConverterTest [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275560 [00:28:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:28:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275560 (owner: 10C. 
Scott Ananian) [00:28:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:29:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:29:34] (03PS1) 10C. Scott Ananian: ParsoidLanguageConverter: update lang/dir on content wrapper div [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275561 (https://phabricator.wikimedia.org/T423747) [00:29:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275561 (https://phabricator.wikimedia.org/T423747) (owner: 10C. Scott Ananian) [00:32:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2251.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:33:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2252.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:33:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2253.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:34:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2250.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:35:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2250.codfw.wmnet with OS bookworm [00:35:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - 
https://phabricator.wikimedia.org/T418911#11842215 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2250.codfw.wmnet with OS bookworm [00:35:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2251.codfw.wmnet with OS bookworm [00:35:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2252.codfw.wmnet with OS bookworm [00:35:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2251.codfw.wmnet with OS bookworm [00:35:51] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842217 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2252.codfw.wmnet with OS bookworm [00:35:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2253.codfw.wmnet with OS bookworm [00:36:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842218 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2253.codfw.wmnet with OS bookworm [00:48:58] (03PS1) 10Aude: Opt-in new accounts to the ReadingLists beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275562 (https://phabricator.wikimedia.org/T420881) [00:49:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275562 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [00:49:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device 
ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842224 (10phaultfinder) [00:51:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2253.codfw.wmnet with reason: host reimage [00:52:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2252.codfw.wmnet with reason: host reimage [00:52:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2251.codfw.wmnet with reason: host reimage [00:52:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2250.codfw.wmnet with reason: host reimage [00:59:12] (03CR) 10Zabe: "I would have done it. But we disabled read new on all wikis due to https://phabricator.wikimedia.org/T423065 yesterday and we will not ree" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1273787 (https://phabricator.wikimedia.org/T423654) (owner: 10Jforrester) [00:59:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2253.codfw.wmnet with reason: host reimage [01:02:53] !log marked 543 revisions as bad # T393237 [01:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:57] T393237: Some en.wikipedia pageviews fatal "RevisionAccessException: Failed to load data blob from {address} for revision {revision}." 
- https://phabricator.wikimedia.org/T393237 [01:03:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2252.codfw.wmnet with reason: host reimage [01:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842235 (10phaultfinder) [01:05:53] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 45771952 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:06:53] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1811304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:07:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2251.codfw.wmnet with reason: host reimage [01:09:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1275565 [01:09:33] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1275565 (owner: 10TrainBranchBot) [01:12:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2250.codfw.wmnet with reason: host reimage [01:17:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:18:08] FIRING: [9x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:18:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:18:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2253.codfw.wmnet with OS bookworm [01:18:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842240 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2253.codfw.wmnet with OS bookworm completed: - db2253 (**... [01:20:30] (03CR) 10Scott French: [C:03+1] mwscript-k8s: add --output-file flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [01:20:39] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1275565 (owner: 10TrainBranchBot) [01:20:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:23:08] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:23:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:23:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2252.codfw.wmnet with OS bookworm [01:23:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842249 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2252.codfw.wmnet with OS bookworm completed: - db2252 (**... 
[01:24:07] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:24:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:24:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2251.codfw.wmnet with OS bookworm [01:24:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842250 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2251.codfw.wmnet with OS bookworm completed: - db2251 (**... [01:25:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:28:55] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:32:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [01:32:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2250.codfw.wmnet with OS bookworm [01:32:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842268 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2250.codfw.wmnet with OS bookworm completed: - db2250 (**... 
[01:32:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842269 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2250.codfw.wmnet with OS bookworm executed with errors: -... [01:32:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842272 (10Jhancock.wm) 05Open→03Resolved [01:33:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842276 (10Jhancock.wm) @Marostegui these are done (and i double checked the raid lol) [01:59:10] FIRING: SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:59:57] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842287 (10phaultfinder) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0200) [02:01:09] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:13] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 03s) [02:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:35] PROBLEM - Host mr1-magru.oob is DOWN: 
PING CRITICAL - Packet loss = 100% [02:20:49] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 118.30 ms [02:22:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T410589)', diff saved to https://phabricator.wikimedia.org/P91238 and previous config saved to /var/cache/conftool/dbconfig/20260421-022219-ladsgroup.json [02:22:26] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:29:08] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [02:32:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P91239 and previous config saved to /var/cache/conftool/dbconfig/20260421-023228-ladsgroup.json [02:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [02:42:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P91240 and previous config saved to /var/cache/conftool/dbconfig/20260421-024237-ladsgroup.json [02:44:18] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [02:44:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842311 (10phaultfinder) [02:52:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T410589)', diff saved to https://phabricator.wikimedia.org/P91241 and previous config saved to /var/cache/conftool/dbconfig/20260421-025245-ladsgroup.json [02:52:51] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:53:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [02:53:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T410589)', diff saved to https://phabricator.wikimedia.org/P91242 and previous config saved to /var/cache/conftool/dbconfig/20260421-025311-ladsgroup.json [02:58:44] (03CR) 10RLazarus: mwscript-k8s: add --output-file flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0300) [03:09:18] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842381 
(10phaultfinder) [03:54:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0400) [04:02:33] !log mwpresync@deploy1003 Pruned MediaWiki: 1.46.0-wmf.22 (duration: 02m 30s) [04:05:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:10:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:27:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDo [04:30:40] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:43:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:53:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:57:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:59:10] 06SRE, 10dev-images, 06Infrastructure-Foundations, 06Release-Engineering-Team (Priority Backlog 📥): Rebuild dev-images using a base image without mirrors.wikimedia.org in the apt sources - https://phabricator.wikimedia.org/T423972#11842523 (10A_smart_kitten) [05:03:37] (03CR) 10Ayounsi: [C:03+1] sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) (owner: 10Elukey) [05:16:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11842532 (10Marostegui) They 
now look good to me! Thank you! [05:16:59] (03CR) 10Ayounsi: "What do you think of the suggestion inline ? (not tested)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) (owner: 10Elukey) [05:22:16] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11842545 (10ayounsi) >>! In T418899#11840735, @elukey wrote: > [...] > > ` >>>> a.components.keys() > dict_keys(['BIOS.Setup.1-1', 'EventFilters.Audit.1', 'EventFilters.Config... [05:23:08] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:25:21] (03CR) 10Marostegui: [C:03+2] cloudb1025: Add s6 [puppet] - 10https://gerrit.wikimedia.org/r/1273785 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [05:25:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:25:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb1025.eqiad.wmnet with reason: Clone s6 [05:26:21] (03PS1) 10Marostegui: clouddb1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275747 (https://phabricator.wikimedia.org/T409557) [05:26:55] (03CR) 10Marostegui: [C:03+2] clouddb1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275747 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [05:27:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb1015.eqiad.wmnet with reason: Clone s6 to clouddb1025 [05:30:11] FIRING: [2x] 
BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:30:40] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:32:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:37:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [05:40:07] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[05:52:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:55:11] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:59:10] FIRING: SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0600) [06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0600). 
[06:00:11] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:02:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:05:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:06:01] (03PS1) 10Marostegui: eqiad.yaml: Add clouddb1025 to s6 [puppet] - 10https://gerrit.wikimedia.org/r/1275748 (https://phabricator.wikimedia.org/T409557) [06:06:29] (03CR) 10Marostegui: "clouddb1025 has s6 cloned already." 
[puppet] - 10https://gerrit.wikimedia.org/r/1275748 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [06:10:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:10:40] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:11:12] (03PS1) 10Marostegui: db2144: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275749 (https://phabricator.wikimedia.org/T423874) [06:11:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Reimage to Trixie [06:11:46] (03CR) 10Marostegui: [C:03+2] db2144: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275749 (https://phabricator.wikimedia.org/T423874) (owner: 10Marostegui) [06:12:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2144.codfw.wmnet with reason: Reimage to Trixie [06:12:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2144: Reimage to Trixie [06:12:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:12:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:12:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2144: Reimage to Trixie [06:12:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2144.codfw.wmnet with OS trixie [06:12:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: 
cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:15:40] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:21:14] (03PS2) 10Ryan Kemper: growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696) [06:21:14] (03PS2) 10Ryan Kemper: growthbook: Upgrade vendored job template 1.0.1 → 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) [06:22:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:25:37] (03CR) 10Ryan Kemper: "Addressed, good catch!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) (owner: 10Ryan Kemper) [06:27:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:31:14] (03CR) 10Ecarg: "oh is it this one: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1239961" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [06:32:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:33:10] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2144.codfw.wmnet with reason: host reimage [06:33:57] (03PS1) 10Slyngshede: R:cache::text enable TCP Fast Open [puppet] - 10https://gerrit.wikimedia.org/r/1275750 (https://phabricator.wikimedia.org/T415454) [06:35:11] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:35:40] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:36:18] (03PS4) 10Ryan Kemper: growthbook: Add automation API key placeholders 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) [06:36:18] (03PS3) 10Ryan Kemper: growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696) [06:36:18] (03PS3) 10Ryan Kemper: growthbook: Bump vendored job templ 1.0.1 → 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) [06:36:53] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8444/co" [puppet] - 10https://gerrit.wikimedia.org/r/1275750 (https://phabricator.wikimedia.org/T415454) (owner: 10Slyngshede) [06:38:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2144.codfw.wmnet with reason: host reimage [06:38:49] (03PS1) 10Marostegui: Revert "db2144: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275751 [06:42:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:45:11] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:45:40] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:46:58] 06SRE, 06Traffic, 
13Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11842634 (10SLyngshede-WMF) @Naruse_shiroha we're currently working on another task, related to TCP Fast Open. To make that work, this task will need to be completed first.... [06:50:11] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:52:34] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:52:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:55:03] (03PS1) 10Muehlenhoff: redis::slave: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275752 (https://phabricator.wikimedia.org/T419976) [06:55:11] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:55:40] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:58:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275752 
(https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [07:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2144.codfw.wmnet with OS trixie [07:03:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2144: after reimage to trixie [07:03:25] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db2144: after reimage to trixie [07:03:35] (03CR) 10Marostegui: [C:03+2] Revert "db2144: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275751 (owner: 10Marostegui) [07:04:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2144: After reimage [07:04:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:04:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:04:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2144: After reimage [07:05:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:05:31] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2011: Cloning pc2021 [07:05:31] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:05:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:05:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2011: Cloning pc2021 [07:06:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Cloning pc2021 from pc2011 [07:07:41] (03PS2) 10Marostegui: eqiad.yaml: Add clouddb1025 to s6 [puppet] - 10https://gerrit.wikimedia.org/r/1275748 (https://phabricator.wikimedia.org/T409557) [07:07:41] (03PS1) 10Marostegui: mariadb: Productionize pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275753 (https://phabricator.wikimedia.org/T418973) [07:08:07] (03PS2) 10Muehlenhoff: redis::slave: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275752 (https://phabricator.wikimedia.org/T419976) [07:10:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:11:29] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275753 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [07:11:50] (03CR) 10Marostegui: mariadb: Productionize pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275753 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [07:12:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275752 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [07:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842671 (10phaultfinder) [07:16:03] (03PS1) 10Marostegui: clouddb1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275756 [07:16:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1006.eqiad.wmnet with OS bullseye [07:16:22] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root 
filesystems overnight - https://phabricator.wikimedia.org/T423690#11842673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1006.eqiad.wmnet with OS bullseye [07:16:36] (03CR) 10Marostegui: [C:03+2] clouddb1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275756 (owner: 10Marostegui) [07:17:16] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1064.eqiad.wmnet with reason: vacuum overlarge container dbs [07:17:25] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11842674 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dffb0251-03df-42f3-8b2d-f9461fa80a0f) set by mvernon@cumin... [07:18:51] (03CR) 10Elukey: [C:03+2] ipmi: rework how to use a different user [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [07:24:44] (03PS1) 10Muehlenhoff: redis::master: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275757 (https://phabricator.wikimedia.org/T419976) [07:27:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275757 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [07:28:06] (03CR) 10Elukey: sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) (owner: 10Elukey) [07:28:55] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core 
router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:32:59] (03PS2) 10Muehlenhoff: redis::master: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275757 (https://phabricator.wikimedia.org/T419976) [07:35:40] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:36:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275757 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [07:37:24] (03CR) 10FNegri: [C:03+1] eqiad.yaml: Add clouddb1025 to s6 [puppet] - 10https://gerrit.wikimedia.org/r/1275748 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [07:37:44] (03CR) 10Marostegui: [C:03+2] eqiad.yaml: Add clouddb1025 to s6 [puppet] - 10https://gerrit.wikimedia.org/r/1275748 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [07:37:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:37:58] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275753 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [07:38:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1006.eqiad.wmnet with reason: host reimage [07:39:47] !log marostegui@cumin1003 START - Cookbook 
sre.mysql.depool depool pc1011: Reimage to Trixie [07:39:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:39:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:39:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Reimage to Trixie [07:40:00] (03CR) 10JMeybohm: [C:03+2] Decom various wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T423863) (owner: 10JMeybohm) [07:40:18] (03CR) 10JMeybohm: [C:03+2] Decom various wikikube-workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275433 (https://phabricator.wikimedia.org/T423863) (owner: 10JMeybohm) [07:42:01] 10SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T423989#11842739 (10WMDE-leszek) [07:45:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1006.eqiad.wmnet with reason: host reimage [07:46:22] !log jayme@cumin1003 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1061].eqiad.wmnet [07:46:39] !log jayme@cumin1003 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1062-1063,1082-1083,1088-1092,1096-1101].eqiad.wmnet [07:46:46] !log jayme@cumin1003 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1102-1112,1166-1168].eqiad.wmnet [07:48:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be1064.eqiad.wmnet [07:48:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1064.eqiad.wmnet [07:49:14] !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1025.eqiad.wmnet,service=s6 [07:49:28] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1025.eqiad.wmnet,service=s6 [07:49:34] jayme@cumin1003 decommission (PID 3662177) 
is awaiting input [07:50:48] (03PS2) 10MusikAnimal: ext.abuseFilter.edit.js: temporary locking of CodeMirror lineWrapping [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275760 (https://phabricator.wikimedia.org/T423773) [07:51:16] !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1025.eqiad.wmnet,service=s4 [07:51:20] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1025.eqiad.wmnet,service=s4 [07:52:58] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti5006 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1275461 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [07:54:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:54:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:36] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /srv/gerrit/site_path/review_site/bin/gerrit.war daemon -d /srv/gerrit/site_path/review_site https://wikitech.wikimedia.org/wiki/Gerrit [08:03:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [08:03:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91248 and previous config saved to /var/cache/conftool/dbconfig/20260421-080314-fceratto.json [08:03:27] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:04:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:04:49] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842802 (10phaultfinder) [08:05:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275760 (https://phabricator.wikimedia.org/T423773) (owner: 10MusikAnimal) [08:05:58] (03CR) 10Elukey: [C:03+2] profile::pki::intermediates: add discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275479 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:06:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1006.eqiad.wmnet with OS bullseye [08:06:25] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11842809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1006.eqiad.wmnet with OS bullseye completed... 
[08:07:25] (03CR) 10CI reject: [V:04-1] ext.abuseFilter.edit.js: temporary locking of CodeMirror lineWrapping [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275760 (https://phabricator.wikimedia.org/T423773) (owner: 10MusikAnimal) [08:08:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [08:08:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275760 (https://phabricator.wikimedia.org/T423773) (owner: 10MusikAnimal) [08:09:19] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11842812 (10MatthewVernon) [08:09:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:09:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91249 and previous config saved to /var/cache/conftool/dbconfig/20260421-080936-fceratto.json [08:09:38] (03Merged) 10jenkins-bot: ext.abuseFilter.edit.js: temporary locking of CodeMirror lineWrapping [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275760 (https://phabricator.wikimedia.org/T423773) (owner: 10MusikAnimal) [08:09:50] (03CR) 10JMeybohm: [C:03+1] role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:10:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [08:10:52] !log 
musikanimal@deploy1003 Started scap sync-world: Backport for [[gerrit:1275760|ext.abuseFilter.edit.js: temporary locking of CodeMirror lineWrapping (T423773 T423756)]] [08:10:58] T423773: AbuseFilter editor shifts off-screen when using long lines of code with line wrap disabled - https://phabricator.wikimedia.org/T423773 [08:10:59] T423756: AbuseFilter edit window cannot be resized horizontally - https://phabricator.wikimedia.org/T423756 [08:12:34] !log musikanimal@deploy1003 musikanimal: Backport for [[gerrit:1275760|ext.abuseFilter.edit.js: temporary locking of CodeMirror lineWrapping (T423773 T423756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:14:04] (03CR) 10Elukey: [C:03+2] role::pki::multiroot: configure discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275480 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:14:04] !log musikanimal@deploy1003 musikanimal: Continuing with sync [08:14:09] (03CR) 10Muehlenhoff: [C:03+1] Add fake private secrets for discovery2026 PKI intermediate [labs/private] - 10https://gerrit.wikimedia.org/r/1275481 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:14:51] !log bootstrapping pki intermediate discovery2026 [08:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) (owner: 10JHathaway) [08:17:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T419961)', diff saved to https://phabricator.wikimedia.org/P91250 and previous config saved to /var/cache/conftool/dbconfig/20260421-081717-fceratto.json [08:17:51] (03PS1) 10Elukey: Fix pki discovery2026's filename [puppet] - 10https://gerrit.wikimedia.org/r/1275801 [08:17:54] !log musikanimal@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275760|ext.abuseFilter.edit.js: temporary locking of CodeMirror lineWrapping (T423773 T423756)]] (duration: 07m 01s) [08:18:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [08:18:06] T423773: AbuseFilter editor shifts off-screen when using long lines of code with line wrap disabled - https://phabricator.wikimedia.org/T423773 [08:18:09] T423756: AbuseFilter edit window cannot be resized horizontally - https://phabricator.wikimedia.org/T423756 [08:19:46] (03CR) 10Elukey: [V:03+2 C:03+2] Add fake private secrets for discovery2026 PKI intermediate [labs/private] - 10https://gerrit.wikimedia.org/r/1275481 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:20:18] (03CR) 10JavierMonton: [C:03+1] html-enrich - update values with latest settings from T421216 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275453 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [08:20:18] (03Abandoned) 10Elukey: profile::pki::intermediates: refresh discovery's public key [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:20:20] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1157: Security update [08:20:42] (03CR) 10Elukey: [C:03+2] Fix pki discovery2026's filename [puppet] - 10https://gerrit.wikimedia.org/r/1275801 (owner: 10Elukey) [08:25:42] 
(03PS1) 10Ayounsi: Decom cookbook: set BGP flag to False if True [cookbooks] - 10https://gerrit.wikimedia.org/r/1275806 [08:25:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [08:27:14] (03CR) 10JMeybohm: [C:03+1] Decom cookbook: set BGP flag to False if True [cookbooks] - 10https://gerrit.wikimedia.org/r/1275806 (owner: 10Ayounsi) [08:27:31] !log jayme@cumin1003 START - Cookbook sre.dns.netbox [08:30:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:34] !log jayme@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1102-1112,1166-1168].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1003" [08:32:04] !log jayme@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1102-1112,1166-1168].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1003" [08:32:04] !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:32:05] !log jayme@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1102-1112,1166-1168].eqiad.wmnet [08:32:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - 
https://phabricator.wikimedia.org/T423863#11842882 (10ops-monitoring-bot) cookbooks.sre.hosts.de... [08:32:28] 06SRE, 10Datasets-General-or-Unknown, 06tools-infrastructure-team: Move internal dumps NFS clients to clouddumps1001 - https://phabricator.wikimedia.org/T416677#11842883 (10taavi) 05Open→03Declined This does not seem relevant after moving dumps HTTPS behind LVS. [08:32:30] !log installing gst-plugins-base1.0 security updates [08:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:40] !log jayme@cumin1003 START - Cookbook sre.dns.netbox [08:33:16] 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#11842901 (10Fabfur) >>! In T352744#11840575, @ssingh wrote: >>>! In T352744#11840551, @Ladsgroup wrote: >>>>! In T352744#9413282, @MoritzMuehlenhoff wrote: >>>>>! In T352744#9413140, @jhathaway wrote: >>>... [08:33:32] (03PS1) 10Samwilson: Use canvas rather than webgl for OpenSeadragon [extensions/ProofreadPage] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275808 (https://phabricator.wikimedia.org/T423548) [08:34:52] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy mark requests from WMCS as trusted [puppet] - 10https://gerrit.wikimedia.org/r/1217466 (https://phabricator.wikimedia.org/T411503) (owner: 10Slyngshede) [08:35:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:14] (03CR) 10Ayounsi: [C:03+2] Decom cookbook: set BGP flag to False if True [cookbooks] - 10https://gerrit.wikimedia.org/r/1275806 (owner: 10Ayounsi) [08:36:43] (03CR) 10CI reject: [V:04-1] Use canvas rather than webgl for OpenSeadragon [extensions/ProofreadPage] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275808 
(https://phabricator.wikimedia.org/T423548) (owner: 10Samwilson) [08:36:44] (03CR) 10Filippo Giunchedi: designate: list all zookeeper backends in tooz_url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [08:37:38] (03CR) 10Arthur taylor: [C:03+1] "looks good to me - thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275439 (https://phabricator.wikimedia.org/T414376) (owner: 10Lucas Werkmeister (WMDE)) [08:38:11] !log jayme@cumin1003 START - Cookbook sre.dns.netbox [08:38:46] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11842934 (10elukey) @ayounsi found https://www.dell.com/community/en/conversations/rack-servers/how-to-disable-lldp-on-broadcom-57414-nic/647f8904f4ccf8a8de88349b that doesn't... [08:38:54] (03Merged) 10jenkins-bot: Decom cookbook: set BGP flag to False if True [cookbooks] - 10https://gerrit.wikimedia.org/r/1275806 (owner: 10Ayounsi) [08:39:06] !log jayme@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1062-1063,1082-1083,1088-1092,1096-1101].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1003" [08:39:11] !log jayme@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1062-1063,1082-1083,1088-1092,1096-1101].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1003" [08:39:11] !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:39:12] !log jayme@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1062-1063,1082-1083,1088-1092,1096-1101].eqiad.wmnet [08:39:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: 
decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11842950 (10ops-monitoring-bot) cookbooks.sre.hosts.de... [08:40:55] !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:40:56] !log jayme@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1061].eqiad.wmnet [08:41:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11842964 (10ops-monitoring-bot) cookbooks.sre.hosts.de... [08:41:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ProofreadPage] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275808 (https://phabricator.wikimedia.org/T423548) (owner: 10Samwilson) [08:42:15] (03CR) 10Cathal Mooney: [C:03+1] Decom cookbook: set BGP flag to False if True [cookbooks] - 10https://gerrit.wikimedia.org/r/1275806 (owner: 10Ayounsi) [08:42:55] (03PS1) 10Gkyziridis: ml-services: Deploy new version of revertrisk-multilingual model on experimental staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275810 (https://phabricator.wikimedia.org/T415892) [08:43:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5006.eqsin.wmnet to cluster eqsin02 and group 01 [08:44:14] (03PS1) 10Marostegui: db1153: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275811 (https://phabricator.wikimedia.org/T423874) [08:44:54] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11842974 (10phaultfinder) [08:45:00] (03PS1) 10Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) [08:45:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2143.codfw.wmnet,db1153.eqiad.wmnet with reason: Reimage to Trixie [08:45:05] (03CR) 10Marostegui: [C:03+2] db1153: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275811 (https://phabricator.wikimedia.org/T423874) (owner: 10Marostegui) [08:45:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1153.eqiad.wmnet with reason: Reimage to Trixie [08:45:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1153: Reimage to Trixie [08:45:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [08:45:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [08:45:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1153: Reimage to Trixie [08:45:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11842976 (10MoritzMuehlenhoff) [08:46:00] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1153.eqiad.wmnet 
with OS trixie [08:46:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5006.eqsin.wmnet to cluster eqsin02 and group 01 [08:46:35] (03CR) 10Jcrespo: "@volans could I get an ok from you, too?" [puppet] - 10https://gerrit.wikimedia.org/r/1273676 (https://phabricator.wikimedia.org/T423619) (owner: 10Jcrespo) [08:48:09] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy new version of revertrisk-multilingual model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275810 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [08:49:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11843000 (10ayounsi) [08:49:24] (03CR) 10Samwilson: "recheck" [extensions/ProofreadPage] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275808 (https://phabricator.wikimedia.org/T423548) (owner: 10Samwilson) [08:50:25] (03Merged) 10jenkins-bot: ml-services: Deploy new version of revertrisk-multilingual model on experimental staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275810 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [08:50:36] !log homer 'cr*eqiad*' commit - T423863 [08:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:48] T423863: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863 [08:51:21] (03PS1) 10Muehlenhoff: Add site.pp entries for new ncredir/tcpproxy VMs [puppet] - 10https://gerrit.wikimedia.org/r/1275813 (https://phabricator.wikimedia.org/T421863) [08:51:55] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1157: Security update [08:52:49] (03CR) 10CI reject: [V:04-1] admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [08:55:02] (03CR) 10Klausman: [C:03+1] ml-serve: remove excludeIPRanges from cni config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275354 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:55:25] FIRING: [2x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:55] (03CR) 10Klausman: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272658 (owner: 10Klausman) [08:56:27] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:57:05] (03PS1) 10STran: Deploy new categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) [08:57:21] (03PS2) 10STran: Deploy new non-emergency categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) [08:57:32] (03PS1) 10Klausman: manifests/hiera: Move ml-serve101[45] to k8s worker role [puppet] - 10https://gerrit.wikimedia.org/r/1275814 [08:57:32] (03CR) 10Klausman: "Submitting this can wait until we have figured out the (basics of the) iommu=pt question with 1012." [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [09:00:25] FIRING: [3x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:41] !log homer 'asw2-b-eqiad.mgmt.eqiad.wmnet' commit - T423863 [09:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:45] T423863: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863 [09:00:45] !log homer 'asw2-a-eqiad.mgmt.eqiad.wmnet' commit - T423863 [09:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:10] (03CR) 10JMeybohm: [C:03+2] Decom various wikikube-workers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1275442 (https://phabricator.wikimedia.org/T423863) (owner: 10JMeybohm) [09:01:14] (03PS1) 10Elukey: role::pki::multiroot: fix ocsp responder port for discovery20026 [puppet] - 10https://gerrit.wikimedia.org/r/1275818 [09:01:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1153.eqiad.wmnet with reason: host reimage [09:03:37] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91255 and previous config saved to /var/cache/conftool/dbconfig/20260421-090336-fceratto.json [09:03:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:03:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:03:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T419961)', diff saved to https://phabricator.wikimedia.org/P91256 and previous config saved to /var/cache/conftool/dbconfig/20260421-090358-fceratto.json [09:05:29] !log restarting pybal on lvs1019-1020 to clear alerts [09:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:42] !log kubectl delete node $(nodeset -e wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet) - T423863 [09:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:47] T423863: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863 [09:05:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:07:17] (03PS2) 10Elukey: role::pki::multiroot: fix ocsp responder port for discovery20026 [puppet] - 10https://gerrit.wikimedia.org/r/1275818 [09:07:36] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:08:54] (03CR) 10Elukey: "We should add more configs if those nodes transition to become k8s workers, like conftool data etc.. 
grep for ml-serve1012 in puppet, it s" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [09:09:57] (03CR) 10Klausman: "I have those edits, but wanted to do the basic role first, then add them to the cluster. But I can fold the changes in here." [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [09:10:24] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:10:24] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:10:32] (03PS2) 10Klausman: manifests/hiera: Move ml-serve101[45] to k8s worker role [puppet] - 10https://gerrit.wikimedia.org/r/1275814 [09:11:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1153.eqiad.wmnet with reason: host reimage [09:11:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [09:11:23] (03PS1) 10AikoChou: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275820 (https://phabricator.wikimedia.org/T416384) [09:11:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T419961)', diff saved to https://phabricator.wikimedia.org/P91257 and previous config saved to /var/cache/conftool/dbconfig/20260421-091124-fceratto.json [09:11:38] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:13:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to 
https://phabricator.wikimedia.org/P91258 and previous config saved to /var/cache/conftool/dbconfig/20260421-091344-fceratto.json [09:14:01] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8445/co" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [09:14:02] (03CR) 10Elukey: "Those needs to be in otherwise we'll end up into weird setups. It is not a problem since they will enter the cluster in a totally depooled" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 (owner: 10Klausman) [09:14:17] (03CR) 10JMeybohm: [C:03+1] role::pki::multiroot: fix ocsp responder port for discovery20026 [puppet] - 10https://gerrit.wikimedia.org/r/1275818 (owner: 10Elukey) [09:15:46] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:15:46] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [09:16:14] (03CR) 10Elukey: [C:03+2] role::pki::multiroot: fix ocsp responder port for discovery20026 [puppet] - 10https://gerrit.wikimedia.org/r/1275818 (owner: 10Elukey) [09:17:06] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:18:08] (03PS1) 10Ayounsi: Decom cookbook: user Homer (except for VCs) and clean BGP [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 [09:18:51] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11843112 (10Aklapper) [09:19:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T419961)', diff saved to https://phabricator.wikimedia.org/P91259 and previous config saved to /var/cache/conftool/dbconfig/20260421-091949-fceratto.json [09:20:16] (03CR) 10CI reject: [V:04-1] Decom cookbook: user Homer (except for VCs) and clean BGP [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 (owner: 10Ayounsi) [09:21:52] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, though might be an idea to wait and do a test-cookbook run on the next host we want to decom just to be sure before merging." [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 (owner: 10Ayounsi) [09:21:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11843121 (10JMeybohm) [09:21:57] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1198: Security update [09:22:02] (03PS2) 10Ayounsi: Decom cookbook: user Homer (except for VCs) and clean BGP [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 [09:22:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11843124 (10JMeybohm) a:05JMeybohm→03None [09:23:23] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:23:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91260 and previous config saved to /var/cache/conftool/dbconfig/20260421-092352-fceratto.json [09:24:46] (03PS3) 10Ayounsi: Decom cookbook: user Homer (except for VCs) and clean BGP [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 [09:24:51] (03PS1) 10Gkyziridis: ml-services: Free some cpus on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275823 [09:24:55] (03CR) 10Ayounsi: [C:03+1] Add site.pp entries for new ncredir/tcpproxy VMs [puppet] - 10https://gerrit.wikimedia.org/r/1275813 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [09:25:19] (03CR) 10Muehlenhoff: [C:03+2] Add site.pp entries for new ncredir/tcpproxy VMs [puppet] - 10https://gerrit.wikimedia.org/r/1275813 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [09:25:25] FIRING: [3x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:50] !log imported debdeploy 0.0.99.15 for trixie-wikimedia (compat release for Cumin 6) [09:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:48] (03PS1) 10Marostegui: Revert "db1153: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275824 [09:29:24] 
07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: NodeTextfileStale (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T424001 (10LSobanski) 03NEW [09:30:25] FIRING: [3x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:30] (03CR) 10JMeybohm: Decom cookbook: user Homer (except for VCs) and clean BGP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 (owner: 10Ayounsi) [09:32:26] (03PS4) 10Ayounsi: Decom cookbook: user Homer (except for VCs) and clean BGP [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 [09:33:00] (03CR) 10Gkyziridis: [C:03+2] ml-services: Free some cpus on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275823 (owner: 10Gkyziridis) [09:34:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91262 and previous config saved to /var/cache/conftool/dbconfig/20260421-093401-fceratto.json [09:34:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1153.eqiad.wmnet with OS trixie [09:34:08] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:34:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:34:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:34:52] (03CR) 10Mszwarc: [C:03+1] Deploy new non-emergency categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 
(https://phabricator.wikimedia.org/T423043) (owner: 10STran) [09:35:02] (03Merged) 10jenkins-bot: ml-services: Free some cpus on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275823 (owner: 10Gkyziridis) [09:35:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1153: after reimage to trixie [09:35:05] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db1153: after reimage to trixie [09:35:06] PROBLEM - dump of es6 in codfw on backupmon1001 is CRITICAL: Last dump for es6 at codfw (es2036) taken on 2026-04-21 09:05:42 is 23 GiB, but the previous one was 2221 GiB, a change of -99.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:35:10] (03CR) 10Mszwarc: Deploy new non-emergency categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) (owner: 10STran) [09:35:37] (03PS1) 10Atsuko: admin: Add jmoore111 to the analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/1275826 (https://phabricator.wikimedia.org/T422963) [09:35:49] (03CR) 10Mszwarc: Deploy new non-emergency categories to enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) (owner: 10STran) [09:36:37] (03CR) 10Marostegui: [C:03+2] Revert "db1153: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275824 (owner: 10Marostegui) [09:37:10] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1153: repool after maintenance [09:37:11] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db1153: repool after maintenance [09:37:11] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1275826 (https://phabricator.wikimedia.org/T422963) (owner: 10Atsuko) [09:38:02] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:38:53] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843209 (10atsuko) [09:39:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1153: repool after maintenance [09:39:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [09:39:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:39:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1153: repool after maintenance [09:39:48] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843210 (10phaultfinder) [09:40:13] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1271804 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [09:45:12] PROBLEM - dump of es7 in eqiad on backupmon1001 is CRITICAL: Last dump for es7 at eqiad (es1040) taken on 2026-04-21 09:12:47 is 23 GiB, but the previous one was 2226 GiB, a change of -99.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:45:55] !log updating debdeploy on trixie to 0.0.99.15 [09:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:53] (03CR) 10JMeybohm: [C:03+1] Decom cookbook: user Homer (except for VCs) and clean BGP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1275821 (owner: 10Ayounsi) [09:47:27] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:48:00] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#11843220 (10JMeybohm) [09:49:12] (03PS3) 10Arnaudb: envoyproxy: rebuild envoy.yaml when the placeholder is created [puppet] - 10https://gerrit.wikimedia.org/r/1275827 (https://phabricator.wikimedia.org/T421827) [09:50:25] RESOLVED: SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy5003.eqsin.wmnet [09:50:58] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:52:55] (03PS2) 10STran: Add next steps page for non-emergency "sockpuppetry" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) [09:54:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [09:54:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [09:54:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:45] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy5003.eqsin.wmnet on all recursors [09:54:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy5003.eqsin.wmnet on all recursors [09:54:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1198: Security update [09:55:23] !log jmm@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [09:55:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [09:56:06] (03PS1) 10Gkyziridis: changeprop: Remove rr-multilingual model from changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275832 (https://phabricator.wikimedia.org/T415892) [09:56:25] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843237 (10atsuko) [09:57:00] 06SRE, 06Infrastructure-Foundations, 10GitLab (CI & Job Runners), 06Release-Engineering-Team (Priority Backlog 📥): Update default GitLab runner image to a base image without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423971#11843238 (10A_smart_kitten) [09:57:47] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843241 (10atsuko) Hi @MMiller_WMF, can you please approve the access to restricted HDFS for Justin Moore? [09:58:09] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843242 (10atsuko) [09:58:29] jmm@cumin2002 makevm (PID 1218565) is awaiting input [09:58:33] (03CR) 10AikoChou: [C:03+1] changeprop: Remove rr-multilingual model from changeprop. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275832 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [09:58:48] (03PS2) 10Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) [09:59:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:59:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91267 and previous config saved to /var/cache/conftool/dbconfig/20260421-095928-fceratto.json [09:59:32] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1000) [10:00:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1007.eqiad.wmnet with OS bullseye [10:00:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [10:00:23] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11843253 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1007.eqiad.wmnet with OS bullseye [10:00:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [10:00:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T419961)', diff saved to https://phabricator.wikimedia.org/P91268 and previous config saved to /var/cache/conftool/dbconfig/20260421-100051-fceratto.json [10:01:59] (03PS1) 10STran: Add next steps 
page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) [10:02:01] (03PS1) 10STran: Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) [10:02:03] (03PS1) 10STran: Add next steps page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) [10:02:07] (03PS1) 10STran: Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) [10:02:10] (03PS1) 10STran: Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) [10:03:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy5003.eqsin.wmnet with OS trixie [10:03:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11843276 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host tcp-proxy5003.eqsin.wmnet with OS trixie [10:04:18] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:05:13] (03PS3) 10STran: Deploy new non-emergency categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) [10:05:31] (03CR) 10STran: Deploy new non-emergency 
categories to enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) (owner: 10STran) [10:05:47] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11843281 (10FCeratto-WMF) In zarcillo we have the relation `host <-> role <-> rack` and we can label replicas and candidates as depoolable (but not primary/DC masters). We can u... [10:06:22] (03CR) 10CI reject: [V:04-1] admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:07:41] !log Disabling puppet on A:cp to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1271804 - T422804 [10:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:45] T422804: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804 [10:08:28] jouncebot: nowandnext [10:08:28] For the next 0 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1000) [10:08:28] In 1 hour(s) and 51 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1200) [10:08:32] (03CR) 10Mszwarc: [C:03+1] Deploy new non-emergency categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) (owner: 10STran) [10:08:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T419961)', diff saved to https://phabricator.wikimedia.org/P91269 and previous config saved to /var/cache/conftool/dbconfig/20260421-100849-fceratto.json [10:08:58] I’ll helmfile https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1275439 in a few minutes if that’s okay, should be very safe and boring [10:09:30] 
Lucas_WMDE: yep, go ahead [10:09:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275439 (https://phabricator.wikimedia.org/T414376) (owner: 10Lucas Werkmeister (WMDE)) [10:09:52] thx [10:10:01] (03CR) 10Klausman: [C:03+1] changeprop: Remove rr-multilingual model from changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275832 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [10:10:45] (03CR) 10Michael Große: [C:03+1] Don't set href for a link that has been unset [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275543 (https://phabricator.wikimedia.org/T422907) (owner: 10Jdlrobson) [10:10:56] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:11:27] (03CR) 10Clément Goubert: [C:03+2] gateway-check: Add matchers for liftwing and recommendation-api-ng [puppet] - 10https://gerrit.wikimedia.org/r/1271804 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [10:11:50] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275439 (https://phabricator.wikimedia.org/T414376) (owner: 10Lucas Werkmeister (WMDE)) [10:12:20] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [10:12:42] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [10:13:03] there’s no way to get a Phabricator task attached to these log messages, right? 
(like --comment in mwscript-k8s --sal) [10:13:09] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [10:13:24] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [10:13:29] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [10:13:48] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [10:14:29] * Lucas_WMDE done [10:15:07] PROBLEM - dump of es7 in codfw on backupmon1001 is CRITICAL: Last dump for es7 at codfw (es2040) taken on 2026-04-21 10:03:31 is 23 GiB, but the previous one was 2226 GiB, a change of -99.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:15:21] Lucas_WMDE: Unfortunately not yet, it's been on my list of "that would be nice to implement when I get some time" [10:15:27] for about 3 years now [10:15:43] ok, thanks <3 [10:19:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P91270 and previous config saved to /var/cache/conftool/dbconfig/20260421-101857-fceratto.json [10:21:05] (03PS3) 10Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) [10:22:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1007.eqiad.wmnet with reason: host reimage [10:24:13] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:24:18] (03PS2) 10STran: Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 
(https://phabricator.wikimedia.org/T423563) [10:24:18] (03PS2) 10STran: Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) [10:24:18] (03PS2) 10STran: Add next steps page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) [10:24:18] (03PS2) 10STran: Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) [10:24:20] (03PS2) 10STran: Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) [10:24:21] (03PS1) 10STran: Enable non-emergency categories via config [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275840 (https://phabricator.wikimedia.org/T423244) [10:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843357 (10phaultfinder) [10:25:16] (03PS2) 10STran: Enable non-emergency categories via config [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275840 (https://phabricator.wikimedia.org/T423244) [10:25:16] (03PS3) 10STran: Add next steps page for non-emergency "sockpuppetry" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) [10:25:16] (03PS3) 10STran: Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) [10:25:16] (03PS3) 
10STran: Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) [10:25:18] (03PS3) 10STran: Add next steps page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) [10:25:20] (03PS3) 10STran: Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) [10:25:24] (03PS3) 10STran: Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) [10:26:51] (03CR) 10Effie Mouzeli: "We have had fewer workers in the past, so I do not expect this to make a lot of a difference, but sure it is ok to start with just the TKO" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:27:03] (03PS3) 10Effie Mouzeli: mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) [10:27:07] (03CR) 10CI reject: [V:04-1] Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) (owner: 10STran) [10:28:01] (03Merged) 10jenkins-bot: mw-mcrouter: update mcrouter module to 1.3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273739 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:28:22] (03CR) 10CI reject: [V:04-1] Enable non-emergency categories via config [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1275840 (https://phabricator.wikimedia.org/T423244) (owner: 10STran) [10:28:26] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11843361 (10elukey) On iDRAC 10 I see the following: {F77111523} [10:28:29] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843362 (10atsuko) [10:28:44] (03CR) 10CI reject: [V:04-1] admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:28:46] (03CR) 10CI reject: [V:04-1] Add next steps page for non-emergency "sockpuppetry" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) (owner: 10STran) [10:29:03] (03CR) 10CI reject: [V:04-1] Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) (owner: 10STran) [10:29:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P91271 and previous config saved to /var/cache/conftool/dbconfig/20260421-102907-fceratto.json [10:29:10] (03CR) 10CI reject: [V:04-1] Add next steps page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) (owner: 10STran) [10:29:20] (03CR) 10CI reject: [V:04-1] Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 
(https://phabricator.wikimedia.org/T423563) (owner: 10STran) [10:29:22] (03CR) 10CI reject: [V:04-1] Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [10:29:28] (03CR) 10CI reject: [V:04-1] Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [10:29:35] (03CR) 10Gkyziridis: [C:03+2] changeprop: Remove rr-multilingual model from changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275832 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [10:30:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1007.eqiad.wmnet with reason: host reimage [10:31:28] (03Merged) 10jenkins-bot: changeprop: Remove rr-multilingual model from changeprop. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275832 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [10:32:05] (03PS4) 10STran: Add next steps page for non-emergency "sockpuppetry" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) [10:32:05] (03PS4) 10STran: Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) [10:32:05] (03PS4) 10STran: Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) [10:32:06] (03PS4) 10STran: Add next steps page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) [10:32:07] (03PS4) 10STran: Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) [10:32:08] (03PS4) 10STran: Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) [10:32:42] (03PS4) 10Effie Mouzeli: mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) [10:34:01] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: switch insetup rdb* servers to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1275497 (owner: 10Effie Mouzeli) [10:35:16] (03PS4) 10Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 
(https://phabricator.wikimedia.org/T420993) [10:35:16] (03PS1) 10Elukey: kserve: apply a workaround for kubeconform [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 [10:35:27] (03CR) 10Elukey: "to keep archives happy: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1275841" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273664 (https://phabricator.wikimedia.org/T423149) (owner: 10Dpogorzelski) [10:35:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2143.codfw.wmnet,db1153.eqiad.wmnet with reason: Reimage to Trixie [10:36:28] (03PS1) 10Marostegui: db2143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275842 (https://phabricator.wikimedia.org/T423874) [10:36:41] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:36:45] (03CR) 10Elukey: "This is currently causing CI to fail for unrelated admin_ng changes, see the other change in the chain :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 (owner: 10Elukey) [10:37:02] (03CR) 10Marostegui: [C:03+2] db2143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275842 (https://phabricator.wikimedia.org/T423874) (owner: 10Marostegui) [10:37:11] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843391 (10atsuko) [10:37:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2143.codfw.wmnet with reason: Reimage to Trixie [10:37:33] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843393 (10atsuko) a:05atsuko→03MMiller_WMF [10:37:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2143: Reimage to Trixie 
[10:37:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [10:37:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [10:37:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2143: Reimage to Trixie [10:37:56] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Add Jmoore111 to analytics-admins - https://phabricator.wikimedia.org/T422963#11843396 (10atsuko) [10:38:04] (03PS1) 10Gkyziridis: ml-services: Remove old models from experimental staging that we are not working on. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275843 [10:38:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2143.codfw.wmnet with OS trixie [10:39:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T419961)', diff saved to https://phabricator.wikimedia.org/P91273 and previous config saved to /var/cache/conftool/dbconfig/20260421-103915-fceratto.json [10:39:37] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [10:39:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1223 (T419961)', diff saved to https://phabricator.wikimedia.org/P91274 and previous config saved to /var/cache/conftool/dbconfig/20260421-103945-fceratto.json [10:40:07] PROBLEM - dump of es6 in eqiad on backupmon1001 is CRITICAL: Last dump for es6 at eqiad (es1036) taken on 2026-04-21 10:01:22 is 23 GiB, but the previous one was 2221 GiB, a change of -99.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:40:46] jouncebot: now [10:40:46] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1000) [10:40:49] jouncebot: next [10:40:50] In 1 hour(s) and 19 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1200) [10:41:20] (03CR) 10Mszwarc: [C:03+1] Enable non-emergency categories via config [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275840 (https://phabricator.wikimedia.org/T423244) (owner: 10STran) [10:41:55] (03CR) 10Blake: mw-mcrouter: bump image and new config (codfw) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:42:02] (03CR) 10Mszwarc: [C:03+1] Add next steps page for non-emergency "sockpuppetry" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) (owner: 10STran) [10:42:25] (03CR) 10Mszwarc: [C:03+1] Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) (owner: 10STran) [10:42:37] (03CR) 10Mszwarc: [C:03+1] Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) (owner: 10STran) [10:42:40] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275372 (owner: 10Muehlenhoff) [10:42:52] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Remove old models from experimental staging that we are not working on. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275843 (owner: 10Gkyziridis) [10:42:52] (03CR) 10Mszwarc: [C:03+1] Add next steps page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) (owner: 10STran) [10:43:08] (03CR) 10Mszwarc: [C:03+1] Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [10:43:21] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [10:43:26] (03CR) 10Mszwarc: [C:03+1] Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [10:44:09] (03CR) 10Effie Mouzeli: mw-mcrouter: bump image and new config (codfw) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:44:14] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [10:44:18] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:45:41] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove old models from experimental staging that we are not working on. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275843 (owner: 10Gkyziridis) [10:46:35] (03CR) 10Blake: [C:03+1] mw-mcrouter: bump image and new config (codfw) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [10:47:10] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [10:47:58] (03Merged) 10jenkins-bot: ml-services: Remove old models from experimental staging that we are not working on. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275843 (owner: 10Gkyziridis) [10:48:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1007.eqiad.wmnet with OS bullseye [10:48:44] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11843407 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1007.eqiad.wmnet with OS bullseye completed... [10:49:31] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11843408 (10MatthewVernon) [10:49:35] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [10:49:55] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [10:50:46] !log jiji@deploy1003 Locking from deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie [10:50:53] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:51:07] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [10:51:16] 10SRE-swift-storage, 06Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#11843416 (10Ladsgroup) noted. thanks! 
[10:53:32] PROBLEM - gerrit process on gerrit1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [10:53:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275840 (https://phabricator.wikimedia.org/T423244) (owner: 10STran) [10:53:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [10:53:48] ^ gerrit1003 is me - T333143 [10:53:49] T333143: Move Gerrit data out of root partition - https://phabricator.wikimedia.org/T333143 [10:53:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) (owner: 10STran) [10:53:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) (owner: 10STran) [10:54:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) (owner: 10STran) [10:54:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) (owner: 10STran) [10:54:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [10:54:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [10:54:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) (owner: 10STran) [10:55:19] (03PS2) 10Elukey: kserve: apply a workaround for kubeconform [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 [10:55:19] (03PS5) 10Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) [10:55:19] (03PS1) 10Elukey: admin_ng: add bgp configs for ml-serve101[4,5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275845 [10:55:21] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: migrate data gerrit1003 to /srv/gerrit [puppet] - 
10https://gerrit.wikimedia.org/r/1273449 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [10:55:42] (03CR) 10CI reject: [V:04-1] kserve: apply a workaround for kubeconform [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 (owner: 10Elukey) [10:55:43] (03PS3) 10Elukey: kserve: apply a workaround for kubeconform [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 [10:55:43] (03PS2) 10Elukey: admin_ng: add bgp configs for ml-serve101[4,5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275845 [10:55:43] (03PS6) 10Elukey: admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) [10:55:49] (03CR) 10CI reject: [V:04-1] admin_ng: add bgp configs for ml-serve101[4,5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275845 (owner: 10Elukey) [10:56:01] I will use the next deployment window for mw-mcrouter upgrades, I have locked scap for that reason, please ping me if there is a problem with this [10:56:17] (03CR) 10CI reject: [V:04-1] admin_ng: move staging clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [10:56:24] (03CR) 10CI reject: [V:04-1] kserve: apply a workaround for kubeconform [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 (owner: 10Elukey) [10:56:27] (03CR) 10CI reject: [V:04-1] admin_ng: add bgp configs for ml-serve101[4,5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275845 (owner: 10Elukey) [10:56:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2143.codfw.wmnet with reason: host reimage [10:56:52] (03CR) 10Elukey: "https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1275845 is also a pre-requisite, otherwise calico will not able to establi" [puppet] - 10https://gerrit.wikimedia.org/r/1275814 
(owner: 10Klausman) [10:56:55] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:57:02] (03CR) 10Effie Mouzeli: [C:03+1] redis::slave: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275752 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [10:57:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy5003.eqsin.wmnet with OS trixie [10:57:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy5003.eqsin.wmnet [10:57:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11843427 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host tcp-proxy5003.eqsin.wmnet with OS trixie executed with errors: - tcp-proxy5003 (**... [10:58:08] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 (owner: 10Elukey) [10:59:18] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:59:32] RECOVERY - gerrit process on gerrit1003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /srv/gerrit/site_path/review_site/bin/gerrit.war daemon -d /srv/gerrit/site_path/review_site https://wikitech.wikimedia.org/wiki/Gerrit [10:59:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91275 and previous config saved to /var/cache/conftool/dbconfig/20260421-105945-fceratto.json [10:59:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy5003.eqsin.wmnet 
[10:59:49] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:59:51] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:00:29] (03PS1) 10Clément Goubert: gateway-check: Escape - in lua patterns [puppet] - 10https://gerrit.wikimedia.org/r/1275852 (https://phabricator.wikimedia.org/T422804) [11:00:36] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 (owner: 10Elukey) [11:00:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2143.codfw.wmnet with reason: host reimage [11:02:48] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275845 (owner: 10Elukey) [11:02:57] (03CR) 10JMeybohm: [C:03+1] gateway-check: Escape - in lua patterns [puppet] - 10https://gerrit.wikimedia.org/r/1275852 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [11:02:58] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [11:02:59] (03CR) 10Fabfur: [C:03+1] gateway-check: Escape - in lua patterns [puppet] - 10https://gerrit.wikimedia.org/r/1275852 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [11:03:29] (03CR) 10Clément Goubert: [C:03+2] gateway-check: Escape - in lua patterns [puppet] - 10https://gerrit.wikimedia.org/r/1275852 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [11:03:38] (03PS2) 10Elukey: sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) [11:04:12] (03PS3) 10Elukey: sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) [11:04:18] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit-replica in 
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:22] (03CR) 10Muehlenhoff: [C:03+2] redis::slave: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275752 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [11:05:27] jmm@cumin2002 makevm (PID 1266534) is awaiting input [11:05:54] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [11:06:09] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11843463 (10elukey) Applied and took the SCP dump, diffed with its previous config, nothing stands out. It seems that we are not able anymore to disable LLDP v... [11:07:04] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11843465 (10elukey) See T250367#11843361, I was able to disable it manually but not via Redfish. Going to test and merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+... 
[11:07:48] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:07:51] (03Merged) 10jenkins-bot: mw-mcrouter: bump image and new config (codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275376 (https://phabricator.wikimedia.org/T421360) (owner: 10Effie Mouzeli) [11:07:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy5003.eqsin.wmnet [11:08:01] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11843467 (10AnnieKim_WMDE) SSH Key has been suspended and deleted from Bitu Identity Manager and won't be used anywhere else. I believe... [11:08:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [11:09:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T419961)', diff saved to https://phabricator.wikimedia.org/P91276 and previous config saved to /var/cache/conftool/dbconfig/20260421-110903-fceratto.json [11:09:24] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [11:09:47] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843485 (10phaultfinder) [11:09:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91277 and previous config saved to /var/cache/conftool/dbconfig/20260421-110954-fceratto.json [11:10:49] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - 
https://phabricator.wikimedia.org/T414805#11843487 (10Nux) Seems like icons in the upload are >>! In T414805#11556311, @Nux wrote: > Found two problems with the migration: >... [11:11:07] !log Enabling puppet on A:cp to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1271804 - T422804 [11:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:11] T422804: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804 [11:11:51] (03PS1) 10Btullis: Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) [11:11:53] (03PS1) 10Btullis: Deploy the new Airflow version to the test-k8s instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275855 (https://phabricator.wikimedia.org/T423243) [11:11:55] (03PS1) 10Btullis: Deploy the new Airflow version to the analytics-test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275856 (https://phabricator.wikimedia.org/T423243) [11:11:58] (03PS1) 10Btullis: Deploy the new airflow version to the main instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275857 (https://phabricator.wikimedia.org/T423243) [11:12:00] (03PS1) 10Btullis: Deploy the new Airflow version to the platform-eng instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275858 (https://phabricator.wikimedia.org/T423243) [11:12:04] (03PS1) 10Btullis: Deploy the new airflow version to the search instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275859 (https://phabricator.wikimedia.org/T423243) [11:12:07] (03PS1) 10Btullis: Deploy the new Airflow version to the analytics-product instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275860 (https://phabricator.wikimedia.org/T423243) [11:12:09] (03PS1) 10Btullis: Deploy the new Airflow version to the research instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275861 
(https://phabricator.wikimedia.org/T423243) [11:12:11] (03PS1) 10Btullis: Deploy the new Airflow version to the ml instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275862 (https://phabricator.wikimedia.org/T423243) [11:12:14] (03PS1) 10Btullis: Deploy the new Airflow version to the sre instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275863 (https://phabricator.wikimedia.org/T423243) [11:12:18] (03PS1) 10Btullis: Deploy the new Airflow version to the wmde instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275864 (https://phabricator.wikimedia.org/T423243) [11:12:22] (03PS1) 10Btullis: Deploy the new Airflow version to the wikidata instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275865 (https://phabricator.wikimedia.org/T423243) [11:12:26] (03PS1) 10Btullis: Deploy the new Airflow version to the fr-tech instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275866 (https://phabricator.wikimedia.org/T423243) [11:12:51] (03CR) 10Ayounsi: [C:03+1] admin_ng: add bgp configs for ml-serve101[4,5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275845 (owner: 10Elukey) [11:15:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:17:10] (03CR) 10Atsuko: [C:03+1] Deploy the new Airflow version as the default for devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275854 (https://phabricator.wikimedia.org/T423243) (owner: 10Btullis) [11:17:50] (03PS1) 10Marostegui: Revert "db2143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275868 [11:19:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P91278 and previous config saved to /var/cache/conftool/dbconfig/20260421-111911-fceratto.json 
[11:20:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P91279 and previous config saved to /var/cache/conftool/dbconfig/20260421-112001-fceratto.json [11:21:04] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [11:21:11] (03CR) 10Marostegui: [C:03+2] Revert "db2143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1275868 (owner: 10Marostegui) [11:21:37] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [11:23:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2143.codfw.wmnet with OS trixie [11:23:47] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843533 (10phaultfinder) [11:24:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2143: after reimage to trixie [11:24:47] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) pool db2143: after reimage to trixie [11:25:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2143: repool after maintenance [11:25:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [11:26:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:26:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2143: repool after maintenance [11:26:54] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:27:25] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: 
apply [11:29:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P91281 and previous config saved to /var/cache/conftool/dbconfig/20260421-112919-fceratto.json [11:30:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P91282 and previous config saved to /var/cache/conftool/dbconfig/20260421-113010-fceratto.json [11:30:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:30:15] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:30:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:30:37] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:31:14] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:33:44] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:33:47] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:33:53] ^ me [11:34:40] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#11843571 (10MoritzMuehlenhoff) [11:35:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:38:16] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test current diff - jmm@cumin2002" [11:38:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test current diff - jmm@cumin2002" [11:38:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:38:39] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11843589 (10Ladsgroup) We will not be allowing that as reasons outlined in T414805#11623347. For your use case, as I said in T414805#... 
[11:39:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T419961)', diff saved to https://phabricator.wikimedia.org/P91283 and previous config saved to /var/cache/conftool/dbconfig/20260421-113927-fceratto.json [11:39:49] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [11:40:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:41:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy5003.eqsin.wmnet [11:41:24] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:41:31] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [11:43:26] (03PS1) 10Gkyziridis: ml-services: Remove rr-multilingual from experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275877 [11:43:37] (03PS1) 10Marostegui: instaces.yaml: Add pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275878 (https://phabricator.wikimedia.org/T418973) [11:44:36] (03CR) 10Marostegui: [C:03+2] instaces.yaml: Add pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275878 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [11:44:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843609 (10phaultfinder) [11:44:58] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [11:45:00] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11843611 (10MoritzMuehlenhoff) [11:45:03] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [11:45:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:45:04] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy5003.eqsin.wmnet on all recursors [11:45:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy5003.eqsin.wmnet on all recursors [11:45:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [11:45:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5003.eqsin.wmnet - jmm@cumin2002" [11:47:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove pc2011 and add pc2021 as replacement', diff saved to https://phabricator.wikimedia.org/P91285 and previous config saved to /var/cache/conftool/dbconfig/20260421-114718-marostegui.json [11:48:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy5003.eqsin.wmnet with OS trixie [11:48:49] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:48:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11843625 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host tcp-proxy5003.eqsin.wmnet with OS trixie [11:49:55] (03PS1) 10Tchanders: Add contextual attribute to editattemptstep instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275879 (https://phabricator.wikimedia.org/T424010) [11:50:46] !log installing Tornado security updates [11:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:12] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove rr-multilingual from experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275877 (owner: 10Gkyziridis) [11:51:20] (03CR) 10CI reject: [V:04-1] Add contextual attribute to editattemptstep instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275879 (https://phabricator.wikimedia.org/T424010) (owner: 10Tchanders) [11:52:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'add pc2021 to pc1', diff saved to https://phabricator.wikimedia.org/P91286 and previous config saved to /var/cache/conftool/dbconfig/20260421-115209-marostegui.json [11:52:37] (03Merged) 10jenkins-bot: ml-services: Remove rr-multilingual from experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275877 (owner: 10Gkyziridis) [11:53:14] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:54:43] (03PS1) 10Marostegui: pc2021: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275880 (https://phabricator.wikimedia.org/T418973) [11:55:42] (03CR) 10Marostegui: [C:03+2] pc2021: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275880 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [11:56:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc2021: Pool pc2021 into pc [11:56:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [11:56:29] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [11:56:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2021: Pool pc2021 into pc [11:57:01] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc2021: Pool pc2021 into pc [11:57:02] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [11:57:02] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [11:57:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2021: Pool pc2021 into pc [11:57:22] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc1011: Pool pc2021 into pc [11:57:22] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [11:57:22] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [11:57:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1011: Pool pc2021 into pc [11:58:49] !log jiji@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie (duration: 68m 02s) [11:59:08] jouncebot: next [11:59:08] In 0 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1200) [11:59:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843723 
(10phaultfinder) [12:00:01] (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual on experimental staging for testing. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275881 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1200) [12:02:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Pool back pc1 but with pc2021 replacing pc2011', diff saved to https://phabricator.wikimedia.org/P91287 and previous config saved to /var/cache/conftool/dbconfig/20260421-120206-marostegui.json [12:02:08] PROBLEM - Host wikikube-worker1106 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:09] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual on experimental staging for testing. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275881 (owner: 10Gkyziridis) [12:04:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [12:05:19] (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual on experimental staging for testing. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275881 (owner: 10Gkyziridis) [12:06:08] PROBLEM - Host wikikube-worker1089 is DOWN: PING CRITICAL - Packet loss = 100% [12:06:11] (03PS1) 10Effie Mouzeli: mcrouter: increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275882 [12:06:46] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:08:39] (03PS6) 10Arnaudb: envoyproxy: rebuild envoy.yaml when the placeholder is created [puppet] - 10https://gerrit.wikimedia.org/r/1275827 (https://phabricator.wikimedia.org/T421827) [12:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843788 (10phaultfinder) [12:11:58] PROBLEM - Host wikikube-worker1029 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:58] PROBLEM - Host wikikube-worker1092 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:30] (03PS1) 10Muehlenhoff: Fix Cumin alias for kerberized SSH access [puppet] - 10https://gerrit.wikimedia.org/r/1275883 [12:14:58] PROBLEM - Host wikikube-worker1112 is DOWN: PING CRITICAL - Packet loss = 100% [12:15:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1008.eqiad.wmnet with OS bullseye [12:15:40] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11843807 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1008.eqiad.wmnet with OS bullseye [12:15:45] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetserver1002.eqiad.wmnet [12:16:18] (03CR) 10JMeybohm: [C:03+1] mcrouter: increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275882 (owner: 10Effie Mouzeli) [12:16:58] PROBLEM - Host wikikube-worker1098 is DOWN: PING CRITICAL - Packet loss = 100% [12:19:58] PROBLEM - Host wikikube-worker1099 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:09] (03CR) 10Elukey: [C:03+1] mcrouter: increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275882 (owner: 10Effie Mouzeli) [12:22:14] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:22:32] jouncebot: now [12:22:32] For the next 0 hour(s) and 37 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1200) [12:22:49] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11843836 (10Jclark-ctr) Rebalanced PDU Monitoring [12:22:54] I will have another round of mw-mcrouter deployment, I will be locking scap again [12:23:05] !log jiji@deploy1003 Locking from deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie [12:23:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:25:11] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:25:40] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:28:32] !log update firmware on puppetserver1002: idrac from 6.10.30.20 to 7.20.80.50 T423282 [12:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:36] T423282: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282 [12:28:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - 
cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:29:04] !log update firmware on puppetserver1002: BIOS from 1.9.2 to 1.20.2 T423282 [12:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1002.eqiad.wmnet [12:30:11] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:30:41] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:33:09] (03CR) 10Kamila Součková: [C:03+2] P:mediawiki::php: Support component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275491 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [12:33:59] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275882 (owner: 10Effie Mouzeli) [12:36:19] (03Merged) 10jenkins-bot: mcrouter: increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275882 (owner: 10Effie Mouzeli) [12:36:37] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11843879 (10Nux) > The docs for the editor suggest using 22px (though I guess you can replace that with 20px in most cases): https://... 
[12:38:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1008.eqiad.wmnet with reason: host reimage [12:38:16] (03CR) 10Jforrester: "Ack." [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1273787 (https://phabricator.wikimedia.org/T423654) (owner: 10Jforrester) [12:38:19] (03Abandoned) 10Jforrester: ImageListPager: Make sure file and filerevision are in correct order [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1273787 (https://phabricator.wikimedia.org/T423654) (owner: 10Jforrester) [12:39:33] (03CR) 10Jforrester: "Yup, sorry for not linking it!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [12:39:37] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [12:40:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1002.eqiad.wmnet [12:40:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts puppetserver1002.eqiad.wmnet [12:41:01] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [12:42:04] (03PS4) 10Elukey: sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) [12:42:04] (03PS1) 10Elukey: sre.hosts.provision: make UncoreFrequency dynamic for iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1275889 (https://phabricator.wikimedia.org/T418899) [12:43:45] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:44:42] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host phab2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:45:02] (03CR) 
10Elukey: [C:03+2] sre.hosts.provision: skip LLDP settings for iDRAC 10+ hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1275509 (https://phabricator.wikimedia.org/T250367) (owner: 10Elukey) [12:45:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1008.eqiad.wmnet with reason: host reimage [12:45:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy5003.eqsin.wmnet with reason: host reimage [12:45:43] (03CR) 10Elukey: "tested with phab2003, all good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1275889 (https://phabricator.wikimedia.org/T418899) (owner: 10Elukey) [12:46:35] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11843928 (10MatthewVernon) [12:46:57] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11843929 (10elukey) I used test-cookbook with https://gerrit.wikimedia.org/r/1275889 and it worked, the host is now provisioned. I'll wait for Jesse's rev... 
[12:47:40] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [12:47:47] (03CR) 10Elukey: [C:03+2] kserve: apply a workaround for kubeconform [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275841 (owner: 10Elukey) [12:47:57] (03CR) 10Elukey: [C:03+2] admin_ng: add bgp configs for ml-serve101[4,5] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275845 (owner: 10Elukey) [12:49:10] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [12:49:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy5003.eqsin.wmnet with reason: host reimage [12:49:44] (03CR) 10Kamila Součková: [C:03+2] hieradata: Switch deployment hosts to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275492 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [12:52:31] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetserver1002.eqiad.wmnet [12:53:07] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11843953 (10MatthewVernon) >>! In T414805#11843879, @Nux wrote: >> The docs for the editor suggest using 22px (though I guess you can... 
[12:53:40] !log update firmware on puppetserver1002: NIC from 22.31.6 to 23.21.6 T423282 [12:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:43] T423282: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282 [12:54:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1002.eqiad.wmnet [12:56:05] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11843985 (10Jclark-ctr) [12:56:42] !log jiji@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Upgrading mw-mcrouter - effie (duration: 33m 37s) [12:57:35] ACKNOWLEDGEMENT - dump of es6 in codfw on backupmon1001 is CRITICAL: Last dump for es6 at codfw (es2036) taken on 2026-04-21 09:05:42 is 23 GiB, but the previous one was 2221 GiB, a change of -99.0 % Jcrespo Expected due to T421729 - The acknowledgement expires at: 2026-04-28 12:56:45. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:57:35] ACKNOWLEDGEMENT - dump of es6 in eqiad on backupmon1001 is CRITICAL: Last dump for es6 at eqiad (es1036) taken on 2026-04-21 10:01:22 is 23 GiB, but the previous one was 2221 GiB, a change of -99.0 % Jcrespo Expected due to T421729 - The acknowledgement expires at: 2026-04-28 12:56:45. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:57:36] ACKNOWLEDGEMENT - dump of es7 in codfw on backupmon1001 is CRITICAL: Last dump for es7 at codfw (es2040) taken on 2026-04-21 10:03:31 is 23 GiB, but the previous one was 2226 GiB, a change of -99.0 % Jcrespo Expected due to T421729 - The acknowledgement expires at: 2026-04-28 12:56:45. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:57:36] ACKNOWLEDGEMENT - dump of es7 in eqiad on backupmon1001 is CRITICAL: Last dump for es7 at eqiad (es1040) taken on 2026-04-21 09:12:47 is 23 GiB, but the previous one was 2226 GiB, a change of -99.0 % Jcrespo Expected due to T421729 - The acknowledgement expires at: 2026-04-28 12:56:45. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:58:02] (03PS1) 10STran: Support URLs in any "help method" configuration that takes a Title [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275893 (https://phabricator.wikimedia.org/T423575) [12:58:32] (03PS1) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) [12:58:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275893 (https://phabricator.wikimedia.org/T423575) (owner: 10STran) [12:59:06] (03PS1) 10Ilias Sarantopoulos: ml-services: update articletopic in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275897 (https://phabricator.wikimedia.org/T423582) [12:59:27] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [12:59:30] (03CR) 10Mszwarc: [C:03+1] Support URLs in any "help method" configuration that takes a Title [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275893 (https://phabricator.wikimedia.org/T423575) (owner: 10STran) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1300). [13:00:05] cscott, aude, and Tran: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] hi [13:00:21] o/ [13:00:59] (03CR) 10CI reject: [V:04-1] profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:02:01] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11844027 (10Jclark-ctr) T409769 this is the Ticket where they where repurposed for WDQS backend migration. [13:03:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1002.eqiad.wmnet [13:03:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts puppetserver1002.eqiad.wmnet [13:03:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1008.eqiad.wmnet with OS bullseye [13:03:23] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11844035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1008.eqiad.wmnet with OS bullseye completed... [13:04:05] cscott: you around? You're first on the list. 
[13:04:48] mine is a config change [13:05:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11844037 (10Jclark-ctr) a:03Jclark-ctr [13:05:25] RESOLVED: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:46] o/ [13:05:58] I only have half an hour so if someone else can run the window that would be great [13:06:18] if it just involves me running spider-pig, I can [13:06:18] jfc the window [13:06:29] in my defense, mine looks scary but they all go at once [13:06:41] where did the “max 6 patches” note go anyway [13:06:48] !log jayme@cumin1003 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1029,1089,1092,1098-1099,1106,1112].eqiad.wmnet [13:06:53] but anyway, if cscott isn’t around yet I’d say aude go ahead with your config change at least [13:06:57] or do you need a deployer? [13:07:24] i can deploy mine, but not comfortable with everything else [13:07:34] !bash IT HAS BEEN 0️⃣ DAYS SINCE WE BLOCKED CHROME [13:07:35] Amir1: Stored quip at https://bash.toolforge.org/quip/L2cnsJ0B8tZ8Ohr0PWH- [13:07:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [13:07:48] aude: go ahead with that and then we’ll see about the rest :) [13:07:58] ok [13:08:21] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:08:21] Tran: would you deploy your own changes as well? [13:08:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275562 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:08:42] sure if ssh will cooperate [13:09:02] oh right, I forgot about your spiderpig message earlier [13:09:07] should be okay to do it all with spiderpig [13:09:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy5003.eqsin.wmnet with OS trixie [13:09:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy5003.eqsin.wmnet [13:09:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11844044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host tcp-proxy5003.eqsin.wmnet with OS trixie completed: - tcp-proxy5003 (**PASS**) -... 
[13:09:34] I would say once aude’s deploy has reached the testservers, you can start +2ing all your changes to speed up the gate-and-submit builds [13:09:37] (03Merged) 10jenkins-bot: Opt-in new accounts to the ReadingLists beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275562 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [13:09:46] and then spiderpig them once that’s free, and then cscott can come after you [13:09:56] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1275562|Opt-in new accounts to the ReadingLists beta feature on enwiki (T420881)]] [13:09:59] T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881 [13:10:23] (if stuff is super broken and you need someone with SSH, you can ping me, I’ll still be at the keyboard, my manager will just be annoyed ^^) [13:10:49] there is a new bastion host [13:10:57] I had to update my ssh config to use it [13:11:04] by which I mean I can't ssh in to get my otp code. The bastion changed and my port connection is being rejected so I'm looking into it [13:11:14] yeah that, but just changing to 1004 didn't fix it for me [13:11:15] ah ok :/ [13:11:34] !log aude@deploy1003 aude: Backport for [[gerrit:1275562|Opt-in new accounts to the ReadingLists beta feature on enwiki (T420881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:11:51] sudo -u stran scap spiderpig-otp # /j [13:12:34] are you using bast1004 to proxyjump to deploy1003? (the deploy server still has the 3 number) [13:12:57] !log aude@deploy1003 aude: Continuing with sync [13:13:14] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11844060 (10MoritzMuehlenhoff) >>!
In T423282#11837345, @jhathaway wrote: > @MoritzMuehlenhoff I tried to reproduce the issue on Friday afternoon, but I was unabl... [13:13:29] that was it, thanks. Yes I can spiderpig my own patches [13:13:37] yay [13:13:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy5004.eqsin.wmnet [13:13:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:13:54] then I think you can start the gate-and-submit builds already [13:13:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11844062 (10Jclark-ctr) | Name | Rack | Position | | wikikube-worker1088 | A... [13:14:24] (03CR) 10STran: [C:03+2] Enable non-emergency categories via config [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275840 (https://phabricator.wikimedia.org/T423244) (owner: 10STran) [13:14:29] (03CR) 10STran: [C:03+2] Add next steps page for non-emergency "sockpuppetry" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) (owner: 10STran) [13:14:33] (03CR) 10STran: [C:03+2] Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) (owner: 10STran) [13:14:36] (03CR) 10STran: [C:03+2] Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) (owner: 10STran) [13:14:40] (03CR) 10STran: [C:03+2] Add next steps page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) 
(owner: 10STran) [13:14:43] (03CR) 10STran: [C:03+2] Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [13:14:47] (03CR) 10STran: [C:03+2] Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [13:14:50] (03CR) 10STran: [C:03+2] Deploy new non-emergency categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) (owner: 10STran) [13:14:53] (03CR) 10STran: [C:03+2] Support URLs in any "help method" configuration that takes a Title [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275893 (https://phabricator.wikimedia.org/T423575) (owner: 10STran) [13:15:44] (03Merged) 10jenkins-bot: Enable non-emergency categories via config [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275840 (https://phabricator.wikimedia.org/T423244) (owner: 10STran) [13:15:46] (03Merged) 10jenkins-bot: Add next steps page for non-emergency "sockpuppetry" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275825 (https://phabricator.wikimedia.org/T423045) (owner: 10STran) [13:15:48] (03Merged) 10jenkins-bot: Add next steps page for non-emergency "vandalism" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275834 (https://phabricator.wikimedia.org/T423563) (owner: 10STran) [13:15:50] (03Merged) 10jenkins-bot: Add next steps page for non-emergency "user dispute" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275835 (https://phabricator.wikimedia.org/T423587) (owner: 10STran) [13:15:53] (03Merged) 10jenkins-bot: Add next steps 
page for non-emergency "disruptive editing" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275836 (https://phabricator.wikimedia.org/T423579) (owner: 10STran) [13:15:55] (03Merged) 10jenkins-bot: Normalize "Something else" naming across references [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275837 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [13:16:04] (03Merged) 10jenkins-bot: Add next steps page for non-emergency "other" incidents [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275838 (https://phabricator.wikimedia.org/T423595) (owner: 10STran) [13:16:06] (03Merged) 10jenkins-bot: Support URLs in any "help method" configuration that takes a Title [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275893 (https://phabricator.wikimedia.org/T423575) (owner: 10STran) [13:16:46] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275562|Opt-in new accounts to the ReadingLists beta feature on enwiki (T420881)]] (duration: 06m 50s) [13:16:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024 (10MoritzMuehlenhoff) 03NEW [13:16:50] T420881: [Reading list web beta] Deploy beta feature to all wikipedias - https://phabricator.wikimedia.org/T420881 [13:17:11] i'm done [13:17:17] k I'll get started [13:17:23] sgtm [13:18:01] (03PS1) 10Clément Goubert: rest-gateway: Add liftwing endpoints to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275906 [13:18:32] (03PS1) 10Muehlenhoff: Add prometheus5003 [puppet] - 10https://gerrit.wikimedia.org/r/1275907 (https://phabricator.wikimedia.org/T424024) [13:18:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043)
(owner: 10STran) [13:19:30] jmm@cumin2002 makevm (PID 1353847) is awaiting input [13:20:04] (03Merged) 10jenkins-bot: Deploy new non-emergency categories to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275816 (https://phabricator.wikimedia.org/T423043) (owner: 10STran) [13:20:06] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11844115 (10Jhancock.wm) ty! [13:20:27] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1275840|Enable non-emergency categories via config (T423244)]], [[gerrit:1275825|Add next steps page for non-emergency "sockpuppetry" incidents (T423045)]], [[gerrit:1275834|Add next steps page for non-emergency "vandalism" incidents (T423563)]], [[gerrit:1275835|Add next steps page for non-emergency "user dispute" incidents (T423587)]], [[gerrit:1275836 [13:20:27] |Add next steps page for non-emergency "disruptive editing" incidents (T423579)]], [[gerrit:1275837|Normalize "Something else" naming across references (T423595)]], [[gerrit:1275838|Add next steps page for non-emergency "other" incidents (T423595)]], [[gerrit:1275816|Deploy new non-emergency categories to enwiki (T423043)]], [[gerrit:1275893|Support URLs in any "help method" configuration that takes a Title (T423575)]] [13:20:37] T423244: Add support for customizing enabled/disabled IRS categories on a per-wiki basis - https://phabricator.wikimedia.org/T423244 [13:20:38] T423045: Create support page for "Illegitimate use of multiple accounts (sockpuppetry)" incident category - https://phabricator.wikimedia.org/T423045 [13:20:38] T423563: Create support page for "Vandalism" incident category - https://phabricator.wikimedia.org/T423563 [13:20:39] T423587: Create support page for "Disputes with another user" incident category - https://phabricator.wikimedia.org/T423587 [13:20:39] T423579: Create support page for "Disruptive editing" incident category - 
https://phabricator.wikimedia.org/T423579 [13:20:40] T423595: Create support page for "Other" incident category - https://phabricator.wikimedia.org/T423595 [13:20:40] T423043: Deploy new categories for enwiki trial - https://phabricator.wikimedia.org/T423043 [13:20:41] T423575: Support URL parameters in IRS CommunityConfiguration page inputs - https://phabricator.wikimedia.org/T423575 [13:21:23] (03CR) 10AikoChou: [C:03+1] ml-services: update articletopic in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275897 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [13:22:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [13:23:05] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update articletopic in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275897 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [13:23:23] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:32] tfw a scap message barely fits in *two* IRC lines [13:23:36] jayme@cumin1003 decommission (PID 3935433) is awaiting input [13:24:32] o/ [13:24:36] (belated) [13:24:51] hi! Tran is currently deploying; do you want to self-service after that or do you need a deployer? 
[13:25:11] (03Merged) 10jenkins-bot: ml-services: update articletopic in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275897 (https://phabricator.wikimedia.org/T423582) (owner: 10Ilias Sarantopoulos) [13:25:57] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5004.eqsin.wmnet - jmm@cumin2002" [13:26:20] Lucas_WMDE: I can self service. [13:26:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5004.eqsin.wmnet - jmm@cumin2002" [13:26:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:26:27] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy5004.eqsin.wmnet on all recursors [13:26:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy5004.eqsin.wmnet on all recursors [13:26:30] alright [13:26:31] Spiderpig and I are best buds [13:26:36] * Lucas_WMDE will be afk in 4 minutes [13:26:46] (03PS1) 10Ayounsi: Remove old sandbox1-eqsin dns includes [dns] - 10https://gerrit.wikimedia.org/r/1275912 (https://phabricator.wikimedia.org/T421863) [13:27:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5004.eqsin.wmnet - jmm@cumin2002" [13:27:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5004.eqsin.wmnet - jmm@cumin2002" [13:27:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy5004.eqsin.wmnet with OS trixie [13:27:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11844197 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host tcp-proxy5004.eqsin.wmnet with OS trixie [13:28:15] (03PS1) 10Papaul: Add ge-0/0/7 to untrust zone and remove ge-0/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1275913 (https://phabricator.wikimedia.org/T421674) [13:28:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1275912 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [13:28:47] jayme@cumin1003 decommission (PID 3935433) is awaiting input [13:28:55] (03CR) 10Ayounsi: [C:03+2] Remove old sandbox1-eqsin dns includes [dns] - 10https://gerrit.wikimedia.org/r/1275912 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [13:29:12] !log ayounsi@dns1004 START - running authdns-update [13:30:41] !log ayounsi@dns1004 END - running authdns-update [13:31:14] oh, right, some of those backports include i18n changes so the deploy takes longer [13:31:22] (I was wondering where the last scap update was) [13:33:30] (03CR) 10Papaul: [C:03+2] Add ge-0/0/7 to untrust zone and remove ge-0/0/5 [homer/public] - 10https://gerrit.wikimedia.org/r/1275913 (https://phabricator.wikimedia.org/T421674) (owner: 10Papaul) [13:34:31] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:37:43] !log stran@deploy1003 stran: Backport for [[gerrit:1275840|Enable non-emergency categories via config (T423244)]], [[gerrit:1275825|Add next steps page for non-emergency "sockpuppetry" incidents (T423045)]], [[gerrit:1275834|Add next steps page for non-emergency "vandalism" incidents (T423563)]], [[gerrit:1275835|Add next steps page for non-emergency "user dispute" incidents (T423587)]], [[gerrit:1275836|Add next steps pa [13:37:43] ge for non-emergency "disruptive editing" incidents (T423579)]], [[gerrit:1275837|Normalize "Something else" naming across references (T423595)]], [[gerrit:1275838|Add next steps page for non-emergency "other" incidents (T423595)]], 
[[gerrit:1275816|Deploy new non-emergency categories to enwiki (T423043)]], [[gerrit:1275893|Support URLs in any "help method" configuration that takes a Title (T423575)]] synced to the testse [13:37:43] rvers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:37:52] T423244: Add support for customizing enabled/disabled IRS categories on a per-wiki basis - https://phabricator.wikimedia.org/T423244 [13:37:52] T423045: Create support page for "Illegitimate use of multiple accounts (sockpuppetry)" incident category - https://phabricator.wikimedia.org/T423045 [13:37:53] T423563: Create support page for "Vandalism" incident category - https://phabricator.wikimedia.org/T423563 [13:37:53] T423587: Create support page for "Disputes with another user" incident category - https://phabricator.wikimedia.org/T423587 [13:37:54] T423579: Create support page for "Disruptive editing" incident category - https://phabricator.wikimedia.org/T423579 [13:37:54] T423595: Create support page for "Other" incident category - https://phabricator.wikimedia.org/T423595 [13:37:55] T423043: Deploy new categories for enwiki trial - https://phabricator.wikimedia.org/T423043 [13:37:55] T423575: Support URL parameters in IRS CommunityConfiguration page inputs - https://phabricator.wikimedia.org/T423575 [13:38:07] testing now [13:38:31] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin remove old sandbox vlan - ayounsi@cumin1003" [13:39:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin remove old sandbox vlan - ayounsi@cumin1003" [13:39:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:00] (03PS1) 10Ayounsi: eqsin: remove sandbox ACL on now gone interface [homer/public] - 10https://gerrit.wikimedia.org/r/1275925 
(https://phabricator.wikimedia.org/T421863) [13:40:01] looks good, proceeding [13:40:07] !log stran@deploy1003 stran: Continuing with sync [13:40:57] (03PS1) 10Ayounsi: remove sandbox1-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1275926 (https://phabricator.wikimedia.org/T421863) [13:44:09] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [13:44:21] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [13:44:43] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:45:28] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host atlas5001.wikimedia.org [13:45:30] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:48:40] (03CR) 10Effie Mouzeli: [C:03+1] redis::master: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275757 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [13:49:38] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas5001.wikimedia.org - ayounsi@cumin1003" [13:50:21] (03CR) 10Ayounsi: [C:03+1] Add prometheus5003 [puppet] - 10https://gerrit.wikimedia.org/r/1275907 (https://phabricator.wikimedia.org/T424024) (owner: 10Muehlenhoff) [13:50:35] !log jayme@cumin1003 START - Cookbook sre.dns.netbox [13:51:46] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas5001.wikimedia.org - ayounsi@cumin1003" [13:51:46] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:46] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas5001.wikimedia.org on all recursors [13:51:50] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas5001.wikimedia.org on all 
recursors [13:51:57] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275840|Enable non-emergency categories via config (T423244)]], [[gerrit:1275825|Add next steps page for non-emergency "sockpuppetry" incidents (T423045)]], [[gerrit:1275834|Add next steps page for non-emergency "vandalism" incidents (T423563)]], [[gerrit:1275835|Add next steps page for non-emergency "user dispute" incidents (T423587)]], [[gerrit:127583 [13:51:57] 6|Add next steps page for non-emergency "disruptive editing" incidents (T423579)]], [[gerrit:1275837|Normalize "Something else" naming across references (T423595)]], [[gerrit:1275838|Add next steps page for non-emergency "other" incidents (T423595)]], [[gerrit:1275816|Deploy new non-emergency categories to enwiki (T423043)]], [[gerrit:1275893|Support URLs in any "help method" configuration that takes a Title (T423575)]] ( [13:51:57] duration: 31m 30s) [13:52:05] T423244: Add support for customizing enabled/disabled IRS categories on a per-wiki basis - https://phabricator.wikimedia.org/T423244 [13:52:05] T423045: Create support page for "Illegitimate use of multiple accounts (sockpuppetry)" incident category - https://phabricator.wikimedia.org/T423045 [13:52:06] T423563: Create support page for "Vandalism" incident category - https://phabricator.wikimedia.org/T423563 [13:52:07] T423587: Create support page for "Disputes with another user" incident category - https://phabricator.wikimedia.org/T423587 [13:52:08] T423579: Create support page for "Disruptive editing" incident category - https://phabricator.wikimedia.org/T423579 [13:52:08] T423595: Create support page for "Other" incident category - https://phabricator.wikimedia.org/T423595 [13:52:08] T423043: Deploy new categories for enwiki trial - https://phabricator.wikimedia.org/T423043 [13:52:09] T423575: Support URL parameters in IRS CommunityConfiguration page inputs - https://phabricator.wikimedia.org/T423575 [13:52:14] done, all yours cscott [13:52:22] !log 
ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas5001.wikimedia.org - ayounsi@cumin1003" [13:52:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas5001.wikimedia.org - ayounsi@cumin1003" [13:52:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas5001.wikimedia.org [13:52:34] (03CR) 10Muehlenhoff: [C:03+2] redis::master: Move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1275757 (https://phabricator.wikimedia.org/T419976) (owner: 10Muehlenhoff) [13:53:16] !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:53:17] !log jayme@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1029,1089,1092,1098-1099,1106,1112].eqiad.wmnet [13:53:26] ok! thanks [13:53:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11844382 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by... [13:54:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275541 (https://phabricator.wikimedia.org/T420102) (owner: 10C. Scott Ananian) [13:54:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275542 (https://phabricator.wikimedia.org/T423662) (owner: 10C. 
Scott Ananian) [13:54:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275560 (owner: 10C. Scott Ananian) [13:54:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275561 (https://phabricator.wikimedia.org/T423747) (owner: 10C. Scott Ananian) [13:54:57] (03PS1) 10Marostegui: mariadb: Productionize db2251 [puppet] - 10https://gerrit.wikimedia.org/r/1275930 (https://phabricator.wikimedia.org/T418979) [13:55:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2142: Cloning [13:55:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [13:55:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [13:55:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2142: Cloning [13:55:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Cloning [13:56:20] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2251 [puppet] - 10https://gerrit.wikimedia.org/r/1275930 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [13:56:42] (03CR) 10Kamila Součková: [C:03+2] hieradata: Switch parsoidtest1001 to component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1275493 (https://phabricator.wikimedia.org/T422964) (owner: 10Scott French) [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1400) [14:03:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11844434 (10JMeybohm) >>!
In T423863#11844382, @ops-monitoring-bot wrote: >... [14:03:20] (03PS1) 10Jelto: helmfile.d/miscweb: add values file for aux private secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275934 (https://phabricator.wikimedia.org/T414405) [14:05:35] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a28 [vendor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275541 (https://phabricator.wikimedia.org/T420102) (owner: 10C. Scott Ananian) [14:07:22] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a28 [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275542 (https://phabricator.wikimedia.org/T423662) (owner: 10C. Scott Ananian) [14:07:33] (03Merged) 10jenkins-bot: [tests] add ParsoidLanguageConverterTest [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275560 (owner: 10C. Scott Ananian) [14:07:53] (03Merged) 10jenkins-bot: ParsoidLanguageConverter: update lang/dir on content wrapper div [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275561 (https://phabricator.wikimedia.org/T423747) (owner: 10C. 
Scott Ananian) [14:08:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1009.eqiad.wmnet with OS bullseye [14:08:32] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11844472 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1009.eqiad.wmnet with OS bullseye [14:09:44] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1275541|Bump wikimedia/parsoid to 0.23.0-a28 (T420102 T421680 T422879 T422966 T423192 T423763 T423662)]], [[gerrit:1275542|Bump wikimedia/parsoid to 0.23.0-a28 (T423662)]], [[gerrit:1275560|[tests] add ParsoidLanguageConverterTest]], [[gerrit:1275561|ParsoidLanguageConverter: update lang/dir on content wrapper div (T423747)]] [14:10:17] T420102: Special:LintTemplateErrors shows parser functions without context - https://phabricator.wikimedia.org/T420102 [14:10:18] T422966: Parsoid tokenizer doesn't allow table or list markup inside language conversion brackets - https://phabricator.wikimedia.org/T422966 [14:10:19] T423192: Galleries on Parsoid don't support data attributes - https://phabricator.wikimedia.org/T423192 [14:10:19] T423763: phpunit PKSA-5jz8-6tcw-pbk4 breaks CI - https://phabricator.wikimedia.org/T423763 [14:10:20] T423662: CTT tasks week of 2026-04-17 - https://phabricator.wikimedia.org/T423662 [14:10:20] T423747: Parsoid LanguageConverter sets top level `lang` and `dir` attributes to the base language not the variant - https://phabricator.wikimedia.org/T423747 [14:11:25] !log cscott@deploy1003 cscott: Backport for [[gerrit:1275541|Bump wikimedia/parsoid to 0.23.0-a28 (T420102 T421680 T422879 T422966 T423192 T423763 T423662)]], [[gerrit:1275542|Bump wikimedia/parsoid to 0.23.0-a28 (T423662)]], [[gerrit:1275560|[tests] add ParsoidLanguageConverterTest]], [[gerrit:1275561|ParsoidLanguageConverter: update lang/dir on content wrapper 
div (T423747)]] synced to the testservers (see https://wikit [14:11:25] ech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:12:29] (03PS4) 10Herron: kafka-logging: set all codfw brokers to confluent_distribution 77 [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) [14:12:52] (03CR) 10Muehlenhoff: [C:03+2] Add prometheus5003 [puppet] - 10https://gerrit.wikimedia.org/r/1275907 (https://phabricator.wikimedia.org/T424024) (owner: 10Muehlenhoff) [14:13:04] (03CR) 10Herron: [V:03+1] "aiming to deploy on thursday" [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [14:14:09] (03PS1) 10Yahya: Enable campaignEvents on bdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275938 (https://phabricator.wikimedia.org/T424016) [14:16:03] (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275940 (https://phabricator.wikimedia.org/T424012) [14:16:39] !log cscott@deploy1003 cscott: Continuing with sync [14:16:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on db2142.codfw.wmnet,pc2011.codfw.wmnet with reason: Will be decommissioned [14:16:49] (03CR) 10Marostegui: [C:03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275940 (https://phabricator.wikimedia.org/T424012) (owner: 10Marostegui) [14:19:11] (03PS2) 10Jforrester: mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [14:19:16] (03CR) 10Jforrester: [C:03+1] mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [14:19:57] (03CR) 10Elukey: "recheck" [puppet] - 
10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:20:30] (03CR) 10Dzahn: "thanks for review - this just needs to wait until "switch day" as the lack of this rule is the one thing that keeps new jenkins from picki" [puppet] - 10https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [14:20:39] (03PS1) 10Muehlenhoff: Apply the tcp-proxy role to tcp-proxy5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1275942 (https://phabricator.wikimedia.org/T421863) [14:20:41] (03PS1) 10Muehlenhoff: Add tcp-proxy5003/5004 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1275943 (https://phabricator.wikimedia.org/T421863) [14:21:05] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy5004.eqsin.wmnet with reason: host reimage [14:21:11] (03CR) 10MdsShakil: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275938 (https://phabricator.wikimedia.org/T424016) (owner: 10Yahya) [14:21:28] (03PS1) 10Marostegui: db2142: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275945 (https://phabricator.wikimedia.org/T418979) [14:22:46] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275541|Bump wikimedia/parsoid to 0.23.0-a28 (T420102 T421680 T422879 T422966 T423192 T423763 T423662)]], [[gerrit:1275542|Bump wikimedia/parsoid to 0.23.0-a28 (T423662)]], [[gerrit:1275560|[tests] add ParsoidLanguageConverterTest]], [[gerrit:1275561|ParsoidLanguageConverter: update lang/dir on content wrapper div (T423747)]] (duration: 13m 02s) [14:22:48] (03CR) 10JMeybohm: [C:04-1] "I think it would be better to just change the issuers label (`issuers.discovery.label`) label to `discovery2026`." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:22:57] T420102: Special:LintTemplateErrors shows parser functions without context - https://phabricator.wikimedia.org/T420102 [14:22:58] T422966: Parsoid tokenizer doesn't allow table or list markup inside language conversion brackets - https://phabricator.wikimedia.org/T422966 [14:22:59] T423192: Galleries on Parsoid don't support data attributes - https://phabricator.wikimedia.org/T423192 [14:22:59] T423763: phpunit PKSA-5jz8-6tcw-pbk4 breaks CI - https://phabricator.wikimedia.org/T423763 [14:22:59] T423662: CTT tasks week of 2026-04-17 - https://phabricator.wikimedia.org/T423662 [14:23:00] T423747: Parsoid LanguageConverter sets top level `lang` and `dir` attributes to the base language not the variant - https://phabricator.wikimedia.org/T423747 [14:23:55] (03CR) 10Marostegui: [C:03+2] db2142: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275945 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [14:24:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy5004.eqsin.wmnet with reason: host reimage [14:25:26] ok done. If Test Kitchen Folks don't need the window, I've got one more quick config patch to deploy. I can wait on that if the window is needed, though. [14:26:15] (03CR) 10Elukey: "I am fine with changing the label, but don't we have to regenerate all certs anyway? What do we gain changing only the label?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:26:16] (03PS1) 10Marostegui: instances.yaml: Remove db2142, add db2251 [puppet] - 10https://gerrit.wikimedia.org/r/1275946 (https://phabricator.wikimedia.org/T418979) [14:26:27] ok, hearing nothing I'm going to proceed with the config change. 
[14:26:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274387 (owner: 10C. Scott Ananian) [14:26:54] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2142, add db2251 [puppet] - 10https://gerrit.wikimedia.org/r/1275946 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [14:28:43] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11844757 (10Nemoralis) [14:29:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2251, remove db2142 T418979', diff saved to https://phabricator.wikimedia.org/P91298 and previous config saved to /var/cache/conftool/dbconfig/20260421-142913-marostegui.json [14:29:18] T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979 [14:29:24] (03Merged) 10jenkins-bot: Increase Parsoid Read Views percentage for ruwiki to 55% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274387 (owner: 10C. 
Scott Ananian) [14:29:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T410589)', diff saved to https://phabricator.wikimedia.org/P91299 and previous config saved to /var/cache/conftool/dbconfig/20260421-142935-ladsgroup.json [14:29:37] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1274387|Increase Parsoid Read Views percentage for ruwiki to 55%]] [14:29:41] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1430) [14:30:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2251 to ms1 T418979', diff saved to https://phabricator.wikimedia.org/P91300 and previous config saved to /var/cache/conftool/dbconfig/20260421-143017-marostegui.json [14:30:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1009.eqiad.wmnet with reason: host reimage [14:31:13] !log cscott@deploy1003 cscott: Backport for [[gerrit:1274387|Increase Parsoid Read Views percentage for ruwiki to 55%]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:31:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms1 T418979', diff saved to https://phabricator.wikimedia.org/P91301 and previous config saved to /var/cache/conftool/dbconfig/20260421-143145-marostegui.json [14:32:24] !log installing gdk-pixbuf security updates [14:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:26] (03PS1) 10Marostegui: db2251: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275947 (https://phabricator.wikimedia.org/T418979) [14:33:00] (03CR) 10Marostegui: [C:03+2] db2251: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1275947 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [14:34:32] !log moving OOB link on mr1-eqiad to ge-0/0/7 [14:34:32] (03PS1) 10Marostegui: installserver: Do not format pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275948 [14:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:28] !log cscott@deploy1003 cscott: Continuing with sync [14:36:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1009.eqiad.wmnet with reason: host reimage [14:36:54] (03CR) 10Marostegui: [C:03+2] installserver: Do not format pc2021 [puppet] - 10https://gerrit.wikimedia.org/r/1275948 (owner: 10Marostegui) [14:37:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275938 (https://phabricator.wikimedia.org/T424016) (owner: 10Yahya) [14:38:49] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:49] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [14:39:14] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1274387|Increase Parsoid Read Views percentage for ruwiki to 55%]] (duration: 09m 37s) 
[14:39:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P91302 and previous config saved to /var/cache/conftool/dbconfig/20260421-143943-ladsgroup.json [14:40:33] 10SRE-SLO: Retire Pyrra - https://phabricator.wikimedia.org/T423307#11844821 (10Jdforrester-WMF) There are a handful of on-wiki links to slo.wikimedia.org (https://wikitech.wikimedia.org/wiki/Special:Search?search=insource%3A%22slo%5C.wikimedia%5C.org%22 and https://mediawiki.org/wiki/Special:Search?search=i... [14:40:41] (03Abandoned) 10Clément Goubert: rest-gateway: Add liftwing endpoints to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275906 (owner: 10Clément Goubert) [14:41:09] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:41:56] (03CR) 10Clément Goubert: envoyproxy: rebuild envoy.yaml when the placeholder is created (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275827 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [14:42:09] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:43:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11844857 (10Papaul) [14:43:51] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 61.47 ms [14:43:51] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [14:45:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy5004.eqsin.wmnet with OS trixie [14:45:01] !log jmm@cumin2002 END (PASS) -
Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy5004.eqsin.wmnet [14:45:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11844904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host tcp-proxy5004.eqsin.wmnet with OS trixie completed: - tcp-pr... [14:46:30] (03PS1) 10Daniel Kinzler: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 [14:47:17] (03CR) 10Elukey: [C:03+1] "Looks good! Remember to merge it only when the cookbooks asks you to do it :)" [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [14:48:09] (03CR) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:48:54] (03CR) 10Herron: [V:03+1] ""the cookbooks" you say! I thought this would be merge + run rolling restart broker cookbook, is there another specific cookbook for this" [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [14:49:12] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11844960 (10MoritzMuehlenhoff) Poking at this further I also noticed one other discrepancy actually: For some reason puppetserver1002 has the jdk variant of OpenJ... 
[14:49:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P91303 and previous config saved to /var/cache/conftool/dbconfig/20260421-144951-ladsgroup.json [14:51:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host prometheus5003.eqsin.wmnet [14:51:27] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:51:43] (03PS19) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [14:51:52] !log elukey@cumin1003 START - Cookbook sre.hosts.powercycle for host sretest2010 [14:53:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010 [14:53:41] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1009.eqiad.wmnet with OS bullseye [14:54:18] (03PS20) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [14:54:19] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11845021 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1009.eqiad.wmnet with OS bullseye completed... 
[14:54:59] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 33.04 ms [14:55:43] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5003.eqsin.wmnet - jmm@cumin2002" [14:55:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5003.eqsin.wmnet - jmm@cumin2002" [14:55:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:50] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache prometheus5003.eqsin.wmnet on all recursors [14:55:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5003.eqsin.wmnet on all recursors [14:56:22] (03PS21) 10Andrew Bogott: designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) [14:56:26] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5003.eqsin.wmnet - jmm@cumin2002" [14:56:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5003.eqsin.wmnet - jmm@cumin2002" [14:56:44] (03CR) 10Andrew Bogott: [C:03+2] cloudinfra hiera: remove obsolete hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1275511 (owner: 10Andrew Bogott) [14:56:50] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11845056 (10TheDJ) [14:57:01] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew 
Bogott) [14:59:33] jmm@cumin2002 makevm (PID 1420671) is awaiting input [14:59:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [15:00:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T410589)', diff saved to https://phabricator.wikimedia.org/P91304 and previous config saved to /var/cache/conftool/dbconfig/20260421-145959-ladsgroup.json [15:00:04] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1500). [15:00:17] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [15:00:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2161 (T410589)', diff saved to https://phabricator.wikimedia.org/P91305 and previous config saved to /var/cache/conftool/dbconfig/20260421-150025-ladsgroup.json [15:00:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus5003.eqsin.wmnet with OS bookworm [15:00:43] (03CR) 10Elukey: [C:03+1] "Sorry, it is "cookbook", an extra s slipped in :D" [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [15:00:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11845139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host prometheus5003.eqsin.wmnet with OS bookworm [15:01:19] !log 
aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Deploy [15:03:10] (03PS6) 10Muehlenhoff: firewall::service: Add a new parameter unrestricted_access [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) [15:03:22] !log brennen@deploy1003 Started deploy [phabricator/deployment@ce0ec30]: deploy phab2002 for T424033 [15:03:26] T424033: Deploy Phab/Phorge 2026-04-21 - https://phabricator.wikimedia.org/T424033 [15:03:31] (03CR) 10Herron: [V:03+1] "Ahh ok! Thanks I'll give this one a try" [puppet] - 10https://gerrit.wikimedia.org/r/1275932 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [15:03:48] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11845162 (10MatthewVernon) [15:04:06] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ce0ec30]: deploy phab2002 for T424033 (duration: 00m 44s) [15:04:51] !log brennen@deploy1003 Started deploy [phabricator/deployment@ce0ec30]: deploy phab1004 for T424033 [15:04:59] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Deploy [15:05:34] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ce0ec30]: deploy phab1004 for T424033 (duration: 00m 43s) [15:05:37] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [15:07:16] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [15:07:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275253 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [15:14:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [15:18:27] (03CR) 10Andrew Bogott: [C:03+2] designate: list all zookeeper backends in tooz_url [puppet] - 10https://gerrit.wikimedia.org/r/1275489 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [15:26:50] (03PS2) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) [15:28:23] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps: switch VM openstack references to version Flamingo [puppet] - 10https://gerrit.wikimedia.org/r/1273833 (owner: 10Andrew Bogott) [15:28:36] (03CR) 10Andrew Bogott: [C:03+2] Openstack: remove packages for version Dalmatian [puppet] - 10https://gerrit.wikimedia.org/r/1273834 (owner: 10Andrew Bogott) [15:28:39] (03CR) 10Andrew Bogott: [C:03+2] Openstack: remove packages for version Epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1273835 (owner: 10Andrew Bogott) [15:28:41] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:29:10] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[15:29:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11845364 (10ayounsi) [15:30:39] (03PS1) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [15:32:49] (03CR) 10CI reject: [V:04-1] profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:33:47] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:34:35] (03CR) 10JMeybohm: [C:04-1] "Sorry, I was not clear here. 
The `certificates.cert-manager.io` resources (the objects defining what certificate to request/create) do car" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275812 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:34:50] (03PS1) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [15:35:10] (03PS1) 10Dzahn: jenkins: include profile::ci::pipeline::publisher [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) [15:35:15] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:35:33] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:36:06] (03CR) 10CI reject: [V:04-1] jenkins: include profile::ci::pipeline::publisher [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) (owner: 10Dzahn) [15:36:44] (03PS2) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [15:36:44] (03PS2) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [15:37:01] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:38:49] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11845399 (10MoritzMuehlenhoff) [15:39:21] !log installing busybox updates from Trixie point release [15:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:39:37] (03CR) 10CI reject: [V:04-1] profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:40:11] (03PS1) 10Dzahn: zuul: add new public key for zuul <-> gerrit 2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275964 (https://phabricator.wikimedia.org/T395938) [15:40:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus5003.eqsin.wmnet with reason: host reimage [15:40:56] (03CR) 10CI reject: [V:04-1] zuul: add new public key for zuul <-> gerrit 2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275964 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [15:42:08] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [15:42:18] (03PS3) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) [15:42:18] (03PS3) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [15:43:09] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11845405 (10MoritzMuehlenhoff) [15:44:19] (03CR) 10Elukey: "The approach is handy but I found these:" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:45:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus5003.eqsin.wmnet with reason: host reimage [15:52:27] (03PS1) 10Muehlenhoff: Ignore .pub file for the SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/1275965 [15:53:17] (03CR) 10Dzahn: [C:03+1] "thanks:) will be handy over at I8f68cf041862b5b6055" [puppet] - 10https://gerrit.wikimedia.org/r/1275965 
(owner: 10Muehlenhoff) [15:53:56] (03CR) 10Dzahn: [C:03+2] Ignore .pub file for the SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/1275965 (owner: 10Muehlenhoff) [15:54:03] (03PS2) 10Muehlenhoff: Ignore .pub file for the SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/1275965 [15:55:37] (03CR) 10Dzahn: [C:03+2] Ignore .pub file for the SPDX check [puppet] - 10https://gerrit.wikimedia.org/r/1275965 (owner: 10Muehlenhoff) [15:55:56] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1275964 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [15:57:05] (03CR) 10Muehlenhoff: "Can't we simply move this out of the profile hierarchy and use a global Hiera value, then it's overridable everywhere?" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:57:53] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus5003.eqsin.wmnet with OS bookworm [15:57:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus5003.eqsin.wmnet [15:58:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11845456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host prometheus5003.eqsin.wmnet with OS bookworm executed with errors: - promet... [15:59:20] (03CR) 10Dzahn: [C:03+2] zuul: add new public key for zuul <-> gerrit 2026 [puppet] - 10https://gerrit.wikimedia.org/r/1275964 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1600). 
[16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:25] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 33.02 ms [16:08:27] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175#11845506 (10Clement_Goubert) [16:09:18] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:42] (03PS3) 10Jforrester: mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [16:10:27] jmm@cumin2002 reimage (PID 1472843) is awaiting input [16:11:17] (03PS1) 10Pppery: Add missing files [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1275970 (https://phabricator.wikimedia.org/T424059) [16:11:54] (03CR) 10Muehlenhoff: "Or we could fix this rather easily to make sure they are only used in profiles? For ganeti it's a straighforward patch, the only reason it" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [16:13:07] (03CR) 10CI reject: [V:04-1] mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [16:14:32] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks for the quick patch! Confirming that this fixes the issue locally for me." 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1275970 (https://phabricator.wikimedia.org/T424059) (owner: 10Pppery) [16:18:55] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:36] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps: consolidate inclusion of openstack server/client packages [puppet] - 10https://gerrit.wikimedia.org/r/1273861 (owner: 10Andrew Bogott) [16:21:14] !log brennen@deploy1003 Started deploy [phabricator/deployment@ceeecba]: deploy phab2002 for T424059 [16:21:19] T424059: 'Unhandled Exception ("Exception") / Source file "TranslatewikiCoreUk.php" failed to load' when visiting Phabricator settings pages - https://phabricator.wikimedia.org/T424059 [16:21:36] jouncebot nowandnext [16:21:36] For the next 0 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1600) [16:21:37] In 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1700) [16:21:52] doing a quick phabricator update to fix a regression in latest deploy. 
[16:22:02] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ceeecba]: deploy phab2002 for T424059 (duration: 00m 47s) [16:22:37] !log brennen@deploy1003 Started deploy [phabricator/deployment@ceeecba]: deploy phab1004 for T424059 [16:23:16] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ceeecba]: deploy phab1004 for T424059 (duration: 00m 38s) [16:23:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [16:23:35] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11845574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [16:26:49] (03CR) 10RLazarus: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [16:28:49] (03CR) 10Jforrester: "Note to self: This shouldn't be deployed until 1.46.0-wmf.26 is everywhere." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275467 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [16:29:34] (03PS2) 10Daniel Kinzler: rest gateway: update 429 response body [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275949 [16:32:26] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 4 others: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11845632 (10Krinkle) >>! In T418745#11845112, @gerritbot wrote: > Change #12... 
[16:34:18] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:03] (03PS1) 10Dduvall: zuul: Allow connections to zookeeper from executors [puppet] - 10https://gerrit.wikimedia.org/r/1275972 (https://phabricator.wikimedia.org/T424061) [16:36:31] (03CR) 10CI reject: [V:04-1] zuul: Allow connections to zookeeper from executors [puppet] - 10https://gerrit.wikimedia.org/r/1275972 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [16:37:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [16:37:59] (03PS2) 10Dduvall: zuul: Allow connections to zookeeper from executors [puppet] - 10https://gerrit.wikimedia.org/r/1275972 (https://phabricator.wikimedia.org/T424061) [16:39:18] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:40:04] 06SRE, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: Multiple wikipedia wikis have broken maps URLs in the infobox - https://phabricator.wikimedia.org/T424046#11845738 (10Aklapper) https://maps.wikimedia.org/img/osm-intl,10,6.81486,-1.42489,300x300.png?lang=ha&domain=ha.wikipedia.org&title=Juaben&revid=7... 
[16:41:56] PROBLEM - MegaRAID on db1162 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:41:57] ACKNOWLEDGEMENT - MegaRAID on db1162 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T424064 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:42:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064 (10ops-monitoring-bot) 03NEW [16:45:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [16:46:25] (03PS2) 10AikoChou: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275820 (https://phabricator.wikimedia.org/T416384) [16:48:39] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275820 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [16:50:25] (03CR) 10AikoChou: [C:03+2] "Yay thanks! 
:D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275820 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [16:51:09] (03Abandoned) 10Tchanders: Add contextual attribute to editattemptstep instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275879 (https://phabricator.wikimedia.org/T424010) (owner: 10Tchanders) [16:52:35] (03Merged) 10jenkins-bot: ml-services: update revertrisk-language-agnostic image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275820 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [16:59:10] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11845996 (10Jhancock.wm) a:03Jhancock.wm [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T1700) [17:00:37] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:01:22] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-be2005.codfw.wmnet with OS bullseye [17:01:31] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11846022 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed... [17:01:33] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11846023 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye completed... 
[17:03:30] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:45] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [17:12:10] 06SRE, 10SRE-swift-storage, 06SRE Observability: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690#11846068 (10MatthewVernon) [17:13:02] (03CR) 10Dzahn: [V:03+1 C:03+1] "thank you. looks good! https://puppet-compiler.wmflabs.org/output/1275972/8448/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1275972 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [17:13:04] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: Allow connections to zookeeper from executors [puppet] - 10https://gerrit.wikimedia.org/r/1275972 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [17:21:45] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [17:23:23] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:23:43] (03CR) 10Dzahn: [V:03+1 C:03+2] "confirmed the new rules exist in iptables:" [puppet] - 10https://gerrit.wikimedia.org/r/1275972 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [17:24:47] (03CR) 10Elukey: "I didn't get the first suggestion - if we move it out from the profile hierarchy, we'll not be able to use lookup() anymore, or am I 
missi" [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [17:25:00] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [17:25:43] (03PS4) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [17:25:43] (03PS1) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275984 (https://phabricator.wikimedia.org/T420993) [17:26:54] (03Abandoned) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275984 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [17:27:11] (03Abandoned) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275895 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [17:28:23] (03PS3) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [17:29:35] (03PS4) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [17:29:36] (03PS5) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [17:29:48] (03CR) 10Jasmine: [C:03+2] wikikube: Add wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1195350 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [17:30:04] (03CR) 10RLazarus: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [17:31:31] (03CR) 10CI reject: [V:04-1] profile::pki::get_cert: add 
lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [17:32:01] (03PS5) 10Elukey: profile::pki::get_cert: add lookup() to the label argument [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) [17:32:03] (03PS6) 10Elukey: Move netbox, debmonitor and presto to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1275960 (https://phabricator.wikimedia.org/T420993) [17:35:59] (03CR) 10RLazarus: [C:03+2] "🙃" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [17:36:39] using the infra window for a helmfile-only scap [17:36:53] (03PS2) 10Dzahn: jenkins: include profile::ci::pipeline::publisher [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) [17:39:30] (03Merged) 10jenkins-bot: mediawiki-common, mw-debug, -experimental: Drop /local/wf memcache route [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275464 (https://phabricator.wikimedia.org/T423623) (owner: 10RLazarus) [17:41:08] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1275464 T423623 [17:41:12] T423623: Drop /local/wf/ mcrouter route from production mcrouter for mw-*, no longer used - https://phabricator.wikimedia.org/T423623 [17:41:27] (no-op except in mw-debug, mostly just scapping it to clean up the diff) [17:42:53] !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1275464 T423623 (duration: 02m 30s) [17:49:46] done [17:50:15] (03PS3) 10Dzahn: jenkins: include profile::ci::pipeline::publisher [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) [17:50:41] FIRING: ConfdResourceFailed: confd resource _etc_kubernetes_pki_kube-apiserver-sa-certs.pem.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - 
https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:56:57] !log dancy@deploy1003 Installing scap version "4.250.0" for 2 host(s) [17:57:14] \o/ [17:57:45] (03CR) 10Muehlenhoff: "lookup() can be used pretty arbitrary outside of the profile hierarchy, we do the same e.g. for lookup('cluster) in the base module." [puppet] - 10https://gerrit.wikimedia.org/r/1275956 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [17:58:48] !log dancy@deploy1003 Installation of scap version "4.250.0" completed for 2 hosts [17:59:44] !log dancy@deploy1003 Started scap sync-world: Testing [18:02:34] (03PS1) 10Muehlenhoff: ganeti: Move pki::get_cert into the profile [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) [18:02:42] !log dancy@deploy1003 Finished scap sync-world: Testing (duration: 02m 58s) [18:03:03] (03CR) 10CI reject: [V:04-1] ganeti: Move pki::get_cert into the profile [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [18:06:06] 06SRE, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: Multiple wikipedia wikis have broken maps URLs in the infobox - https://phabricator.wikimedia.org/T424046#11846386 (10Benoit74) Yes, this is what we get as well. 
[18:06:20] (03PS2) 10Muehlenhoff: ganeti: Move pki::get_cert into the profile [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) [18:08:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl2004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl2004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:09:03] (03PS1) 10Dzahn: jenkins: add fake password for docker registry for PipelineLib [labs/private] - 10https://gerrit.wikimedia.org/r/1275993 (https://phabricator.wikimedia.org/T423968) [18:09:04] (03PS3) 10MusikAnimal: Promote CodeMirror 6 out of beta and use in place of CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271263 (https://phabricator.wikimedia.org/T419332) [18:09:28] (03CR) 10Dzahn: [V:03+2 C:03+2] jenkins: add fake password for docker registry for PipelineLib [labs/private] - 10https://gerrit.wikimedia.org/r/1275993 (https://phabricator.wikimedia.org/T423968) (owner: 10Dzahn) [18:09:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl2004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl2004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:09:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1275992 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [18:10:01] (03PS1) 10Jasmine: wikikube: Add wikikube-ctrl200[4-5] to cluster_nodes: following [0] [puppet] - 10https://gerrit.wikimedia.org/r/1275994 (https://phabricator.wikimedia.org/T390861) [18:10:39] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11846409 (10MoritzMuehlenhoff) [18:10:46] 06SRE, 10Scap, 06serviceops-radar, 
06Release-Engineering-Team (Seen): Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207#11846414 (10dancy) 05Open→03Resolved a:03dancy A lot of stuff has happened since this ticket was originally filed. Scap's automat... [18:12:19] I want to be transparent that I'm going to deploy https://phabricator.wikimedia.org/T259059 now. This involves like 10 patches, could be risky… but I think it actually helps that there was no train this week (as we don't have only some wikis relying on code that got removed, etc.) [18:12:33] of course I will test thoroughly and revert as necessary [18:12:50] Sounds good. [18:12:58] Best of luck [18:13:20] thanks! [18:13:24] oh that's very exciting to hear. Good luck! [18:14:02] (03Abandoned) 10RLazarus: mw-wikifunctions: Set $MCROUTER_SERVER in values-${ENV}.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267915 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [18:16:38] (03PS1) 10David Martin: Turn on import of references inside Wikidata statements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275996 (https://phabricator.wikimedia.org/T405849) [18:16:39] I don't think this necessarily needs to be an "emergency deployments only" situation, but note that with a WMF holiday tomorrow, if the ten risky patches create any issues that don't become obvious immediately, response tomorrow will be limited -- in that sense today is exactly like a Friday :) [18:18:02] (03PS1) 10MusikAnimal: Promote CM6 out of beta, remove CM5 modules, and add v6 aliases [extensions/CodeMirror] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275997 (https://phabricator.wikimedia.org/T373720) [18:19:34] understood. It should either immediately look good, or obviously not going to work, hehe. 
I figure today is better than a Thursday though [18:20:16] (03PS1) 10MusikAnimal: Hooks: remove temporary CodeMirror code following promotion from beta [extensions/CodeEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275998 (https://phabricator.wikimedia.org/T419332) [18:20:46] (03PS1) 10MusikAnimal: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275999 (https://phabricator.wikimedia.org/T419332) [18:20:49] in general, a Thursday is better because we're all at work the next day :) but if today is better for you, there's no rule against it afaik [18:21:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11846474 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty can we use a disk from decom server to replace? [18:21:14] (03PS1) 10MusikAnimal: ext.math.editpage: update CodeMirror RL module [extensions/Math] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276000 (https://phabricator.wikimedia.org/T373720) [18:21:17] (we don't have no-deploy Fridays because we want a quiet Friday -- we do it because we want a quiet *Saturday*) [18:21:41] FIRING: [2x] ProbeDown: Service wikikube-ctrl2004:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl2004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:21:50] gotcha, thanks [18:22:04] (03CR) 10Scott French: [C:03+1] wikikube: Add wikikube-ctrl200[4-5] to cluster_nodes: following [0] [puppet] - 10https://gerrit.wikimedia.org/r/1275994 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [18:22:22] (03PS1) 10MusikAnimal: ext.abuseFilter.edit: target newly updated CodeMirror modules [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1276001 (https://phabricator.wikimedia.org/T399673) [18:22:24] jasmine_: ^ I suspect this is due to your turn-up [18:22:44] (03PS1) 10MusikAnimal: VisualEditor.CodeMirror.less: remove CM5 styles [skins/Timeless] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276002 [18:22:50] (03CR) 10CI reject: [V:04-1] CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275999 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [18:23:03] PROBLEM - Etcd cluster health on wikikube-ctrl2004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [18:23:08] musikanimal: if you're about to deploy please hold off [18:23:18] sure, no problem [18:23:24] I'm still cherry-picking anyway [18:23:24] yes, sorry, fixing in a moment - thanks! [18:23:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-ctrl2004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl2004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:23:41] !incidents [18:23:41] 7856 (UNACKED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) [18:23:44] !ack [18:23:45] 7856 (ACKED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) [18:23:50] (03PS1) 10MusikAnimal: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/TemplateStyles] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276004 (https://phabricator.wikimedia.org/T419332) [18:24:16] (03CR) 10Jasmine: [C:03+2] wikikube: Add wikikube-ctrl200[4-5] to cluster_nodes: following [0] [puppet] - 10https://gerrit.wikimedia.org/r/1275994 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [18:24:26] musikanimal: great, thank you [18:24:29] (03PS1) 10MusikAnimal: DescriptionField: use new 
module name for loading CodeMirror [extensions/CommunityRequests] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276005 [18:24:51] (03PS1) 10MusikAnimal: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/Scribunto] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276006 (https://phabricator.wikimedia.org/T419332) [18:25:13] (03PS1) 10MusikAnimal: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/JsonConfig] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276007 (https://phabricator.wikimedia.org/T419332) [18:25:35] rzl: Good point about the Earth Day holiday. I keep forgetting about that. [18:25:40] jasmine_: I see you're working on wikikube-ctrl200[4-5], is the alert related? [18:25:51] (03PS1) 10MusikAnimal: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/Gadgets] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276008 (https://phabricator.wikimedia.org/T419332) [18:26:21] fabfur: yes, apologies about that - should resolved soon) [18:26:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:26:39] jasmine_: ack nx [18:26:42] *tnx [18:26:43] (03PS1) 10Andrew Bogott: Add upstream repos for openstack flamingo and gazpacho [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) [18:26:45] (03PS1) 10Andrew Bogott: Remove openstack::[client|server]packages::flamingo::bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1276010 [18:26:45] (03PS1) 10Andrew Bogott: Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) [18:27:57] (03CR) 10CI reject: [V:04-1] Openstack: get osbpo packages from 
apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:28:44] (03CR) 10CI reject: [V:04-1] CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/Gadgets] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276008 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [18:29:41] (03PS1) 10MusikAnimal: mw.FormDataTransport.test: Update expected API call for POSTed calls [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276012 (https://phabricator.wikimedia.org/T423529) [18:30:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11846548 (10Jclark-ctr) I do need some assistance adding the network ip's and back in. [18:30:48] (03PS2) 10Andrew Bogott: Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) [18:31:09] (03CR) 10MusikAnimal: "recheck" [extensions/Gadgets] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276008 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [18:31:41] (03CR) 10CI reject: [V:04-1] Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:33:22] (03CR) 10CI reject: [V:04-1] ext.math.editpage: update CodeMirror RL module [extensions/Math] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276000 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [18:33:34] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1275961/8451/contint1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) (owner: 10Dzahn) [18:33:42] (03PS3) 10Andrew 
Bogott: Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) [18:34:19] (03CR) 10MusikAnimal: "recheck" [extensions/Math] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276000 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [18:35:04] jasmine_: it doesn't look like 2004 has joined the etcd cluster yet, which is blocking kube-publish-sa-cert.service, which is in turn blocking kube-apiserver.service [18:35:07] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:35:09] (03PS4) 10Dzahn: jenkins: include profile::ci::pipeline::publisher [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) [18:39:07] (03PS2) 10David Martin: Wikifunctions: Turn on import of references inside Wikidata statements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275996 (https://phabricator.wikimedia.org/T404652) [18:42:00] (03CR) 10Andrew Bogott: "I'm unsure what to do (if anything) about ListShellHook since there are a zillion packages we need from these repos, which may change from" [puppet] - 10https://gerrit.wikimedia.org/r/1276009 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:42:49] (03CR) 10Andrew Bogott: [C:04-2] "Do not merge, we are probably going to import all this into apt.wm.o" [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:47:31] jasmine_: I suspect (but am not sure) your second patch (https://gerrit.wikimedia.org/r/1275994) may make changes to the firewall rules on the control-plane hosts once puppet runs there, which may unblock 2004 joining the etcd cluster [18:47:46] have you run puppet agent on the existing ctrl hosts and 2004 after merging that? 
[18:48:48] ah good point, I ran it after [0], will rerun following [1] [18:48:48] [0] - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195350 [18:48:48] [1] - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275994 [18:49:10] rerunning now) [18:50:32] jasmine_: just to confirm one more thing: you ran `member add` before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195350, right? [18:51:20] yes confirmed [18:53:11] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T423872#11846623 (10Jclark-ctr) 05Open→03Resolved [18:53:30] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:54:36] jasmine_: when those puppet runs are done, mind if I restart etcd on 2004? [18:55:09] yes pls do, ty! [18:56:04] (although sorry to clarify, it's still running) [18:56:32] oh! hmmmm ... already did [18:56:51] so, it is currently unhappy - trying to assess why [18:56:51] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11846642 (10Aklapper) [18:58:56] Looks like the Systemd start for kube-scheduler is failing on the puppet run too [19:01:42] !incidents [19:01:42] 7856 (ACKED) [2x] ProbeDown sre (wikikube-ctrl2004:6443 probes/custom codfw) [19:02:35] swfrench-wmf: should I remove it from the etcd cluster for the time being? [19:03:11] jasmine_: perhaps? has puppet run on 2001-2003 and 2004 yet? 
[19:04:10] I ran it on the master nodes, yet to run on the worker nodes [19:04:24] (03PS1) 10Ssingh: varnish: do not set CSP policy for beta [puppet] - 10https://gerrit.wikimedia.org/r/1276017 (https://phabricator.wikimedia.org/T420604) [19:05:36] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1276017 (https://phabricator.wikimedia.org/T420604) (owner: 10Ssingh) [19:07:55] jasmine_: so yeah, this is going to take a bit of time to sort out. I'd recommend unwinding what you've done so far to the extent possible (and what you can't undo, silence). [19:09:31] makes sense, will remove the node from the cluster and revert both patches [19:11:11] jasmine_: I'm not sure that reverting the patches will do anything (i.e., I don't know if that's actually going to absent any of the new configuration / services) [19:13:18] hm, okay in that case 2004 is removed from the cluster, I should follow this with running puppet on all 200[1-3] ? [19:14:02] 2005 also included in the patches, seems okay on icinga [19:14:09] etcd cluster health wise [19:14:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275112 (https://phabricator.wikimedia.org/T328207) (owner: 10Pppery) [19:15:39] jasmine_: 2005 is still serviceops_ferm, so etcd is not (yet) attempting to run there [19:16:09] (not the rest of the control plane stack) [19:16:13] s/not/nor/ [19:18:34] so, given that puppet is disabled on 2005, if you were to partially revert your changes (i.e., just the 2005 parts), it would probably be ok [19:19:17] but reverting the 2004 parts at this point is likely to leave a mess [19:19:38] I believe they're both `master_stacked` ? 
[0] [19:19:38] [0] - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195350 [19:21:11] jasmine_: you disabled puppet on 2005, so that was never applied there. if you revert the intent (in puppet) before reenabling puppet there, then it will be as if it was never master_stacked on that host. what I can't answer is what effect that has for other hosts (i.e., configuration changes they may have applied as a result of the patch) [19:21:44] oh right that makes sense [19:25:53] so will leave those patches, as is - 2004 is now removed from the etcd cluster, I should disable puppet on 2004 too and then silence alerts on both hosts? [19:28:03] or I guess that doesn't really unblock kube-publish-sa-cert.service, which is in turn blocking kube-apiserver.service? [19:29:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:30:22] jasmine_: right, 2004 is going to be non-functional until it can join the etcd cluster. what disabling puppet will likely achieve is preventing puppet from attempting to start those services again, which is probably a good thing. 
[19:30:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:32:02] !log Created cusi_user, cusi_case, and cusi_signal on commonswiki on the extension1 database cluster - T424084 [19:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:06] T424084: Enable Suggested Investigations on Wikimedia Commons - https://phabricator.wikimedia.org/T424084 [19:32:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:32:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:34:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:35:10] (03CR) 10Dzahn: [C:03+2] jenkins: include profile::ci::pipeline::publisher [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) (owner: 10Dzahn) [19:35:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:35:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
[19:35:36] (03CR) 10Dzahn: [C:03+2] jenkins: include profile::ci::pipeline::publisher [puppet] - 10https://gerrit.wikimedia.org/r/1275961 (https://phabricator.wikimedia.org/T423968) (owner: 10Dzahn) [19:37:12] can I assume it's safe to do a deploy now? I'm going to wait until after the backport window to do https://gerrit.wikimedia.org/r/q/topic:%22codemirror6-wmf/1.46.0-wmf.24%22 but there's one small, non-risky patch I'd like to deploy now if that's OK [19:37:13] puppet is now disabled on 2004 [19:37:15] !log contint1003 - re-enabling puppet T418521 [19:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:19] T418521: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521 [19:39:43] jasmine_: just to confirm two things: (1) you've not reverted anything yet, correct? (2) are you around for a bit to try something? [19:40:05] I have a theory as to what's happening [19:40:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:40:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:40:40] yes 1 confirmed, and yes re 3 [19:40:42] 2* [19:40:55] jasmine_: in short, I believe this is an artifact of having 2005 in the SRV record while 2004 is trying to bootstrap [19:41:16] can you remove 2005 from the SRV record? [19:41:23] and we can try adding 2004 to the cluster again? 
[19:43:20] (03PS1) 10Aude: Opt-in new accounts to ReadingLists beta feature on all Wikipedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) [19:43:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:44:21] jasmine_: basically, the theory is that with 2005 present in the SRV record, 2004 observes a mismatch in the expected set of peers when it contacts the first existing peer (i.e., when it tries to learn the state of the cluster). [19:44:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276021 (https://phabricator.wikimedia.org/T420881) (owner: 10Aude) [19:45:19] jasmine_: it then fails with something like `"msg":"discovery failed","error":"error validating peerURLs [...] 
member count is unequal"` [19:45:32] 2005 SRV revert patch: https://gerrit.wikimedia.org/r/c/operations/dns/+/1276023 [19:45:55] ah yeah that makes sense [19:46:04] +1 [19:46:23] (03PS1) 10Jasmine: wmnet: remove wikikube-ctrl2005 from SRV records [dns] - 10https://gerrit.wikimedia.org/r/1276023 (https://phabricator.wikimedia.org/T390861) [19:46:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:47:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:47:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:47:42] !log jasmine@dns1004 START - running authdns-update [19:47:52] (03CR) 10Scott French: [C:03+1] wmnet: remove wikikube-ctrl2005 from SRV records [dns] - 10https://gerrit.wikimedia.org/r/1276023 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [19:48:00] (03CR) 10Jasmine: [C:03+2] wmnet: remove wikikube-ctrl2005 from SRV records [dns] - 10https://gerrit.wikimedia.org/r/1276023 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [19:48:12] wikibugs is very slow today :) [19:48:57] (03PS1) 10Dreamy Jazz: CheckUser Suggested Investigations: Enable on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276024 (https://phabricator.wikimedia.org/T424084) [19:49:00] (03PS1) 10Dreamy Jazz: Remove unused wgCheckUserUserAgentTableMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276025 [19:49:02] jouncebot: nowandnext [19:49:02] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [19:49:02] In 0 hour(s) and 10 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T2000) [19:49:14] !log 
jasmine@dns1004 END - running authdns-update [19:49:28] jasmine_: cool, now let's wait 5 minutes [19:49:59] then you can try `member add` again, and then we can try to start etcd.service on 2004 [19:50:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:50:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:50:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276024 (https://phabricator.wikimedia.org/T424084) (owner: 10Dreamy Jazz) [19:51:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276025 (owner: 10Dreamy Jazz) [19:51:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:51:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:54:37] alright, now that 5 minutes have passed, _etcd-server-ssl._tcp.k8s3.codfw.wmnet resolves to only 2001-2004 [19:54:38] sounds good, proceeding now [19:54:50] ah yes [19:54:55] nice [19:54:56] jasmine_: you're running `member add`? 
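The theory discussed above (a joining node sees a peer set that does not match the running cluster) can be illustrated with a small sketch. This is not etcd's actual validation code; `validate_peers` and `peer` are hypothetical helpers, and the 2001-2003 hostnames are inferred from the "2001-2004" remark in the log rather than quoted from it.

```python
def validate_peers(discovered, existing):
    """Compare the peers a joining node expects (for example from DNS SRV
    discovery) with the peers the running cluster reports.
    Hypothetical helper: a simplified stand-in, not etcd's implementation."""
    discovered, existing = set(discovered), set(existing)
    if len(discovered) != len(existing):
        return "member count is unequal"
    if discovered != existing:
        return "peer sets differ"
    return "ok"


def peer(n):
    # Assumed naming pattern, based on wikikube-ctrl2004.codfw.wmnet in the log.
    return f"https://wikikube-ctrl{n}.codfw.wmnet:2380"


cluster = [peer(n) for n in range(2001, 2005)]     # 2001-2004, once `member add` has run
srv_before = [peer(n) for n in range(2001, 2006)]  # SRV record still lists 2005
srv_after = [peer(n) for n in range(2001, 2005)]   # 2005 removed from the SRV record

print(validate_peers(srv_before, cluster))  # member count is unequal
print(validate_peers(srv_after, cluster))   # ok
```

Under these assumptions, dropping 2005 from the SRV record is what brings the discovered set back in line with the member list.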
[19:56:29] yes - `etcdctl --endpoints https://$(hostname -f):2379 member add "${NEW_FQDN%%.*}" --peer-urls="https://${NEW_FQDN}:2380"` [19:56:50] ok to proceed? [19:57:13] depending on the value of NEW_FQDN, yes :) [19:58:00] (I'm kidding) [19:58:20] if it's the exact same command you ran before, then you should be good [19:58:31] x) okay done [19:58:57] great, I see it in `member list` (as unstarted, as expected) [19:59:29] nice [19:59:44] now, on 2004 we can try to restart etcd.service again. would you like to do that, or shall I? [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T2000). [20:00:05] Pppery and Dreamy_Jazz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] here [20:00:12] \o [20:00:35] Pppery do you need a deployer (I always forget who has deployment access)? [20:00:44] I have never had deployment access [20:00:48] People often think I do though [20:00:51] :D [20:01:02] I can deploy for you. Let me look at the change [20:01:03] swfrench-wmf: is it `sudo systemctl start etcd.service`? [20:01:07] Maybe I should actually become a deployer someday [20:01:07] jasmine_: since there are patches being deployed in this backport window, let's not go any further than just restarting etcd (i.e., let's pause there and not put the new control plane node into service until the window is done) [20:01:25] sounds good) [20:01:25] jasmine_: yes, that should do it [20:01:36] on 2004 just to confirm? 
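For readers unfamiliar with the `${NEW_FQDN%%.*}` expansion in the `member add` command quoted above: it strips everything from the first dot onward, so the member name is the short hostname. A minimal Python sketch of the same derivation follows; `member_add_args` is a hypothetical helper, and the port default simply mirrors the 2380 seen in the log.

```python
def member_add_args(new_fqdn, peer_port=2380):
    """Build the `member add` arguments from a host's FQDN.
    Hypothetical helper mirroring the shell command in the log."""
    # Equivalent of ${NEW_FQDN%%.*}: keep only the part before the first dot.
    name = new_fqdn.split(".", 1)[0]
    peer_url = f"https://{new_fqdn}:{peer_port}"
    return ["member", "add", name, f"--peer-urls={peer_url}"]


print(member_add_args("wikikube-ctrl2004.codfw.wmnet"))
```

For the host in this log the name comes out as `wikikube-ctrl2004`, which matches the entry later shown in `member list`.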
[20:01:49] jasmine_: yes, only 2004 [20:02:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11847035 (10Jclark-ctr) a:03Jclark-ctr [20:02:39] jasmine_: `50362fa49b3e1f7c, started, wikikube-ctrl2004, https://wikikube-ctrl2004.codfw.wmnet:2380, https://wikikube-ctrl2004.codfw.wmnet:2379, false` [20:02:40] \o/ [20:02:53] nice :) [20:03:03] RECOVERY - Etcd cluster health on wikikube-ctrl2004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [20:03:07] \o/ [20:03:25] (03CR) 10Dreamy Jazz: [C:03+2] Diqwiki: change project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275112 (https://phabricator.wikimedia.org/T328207) (owner: 10Pppery) [20:03:26] jasmine_: alright, let's leave it as is until the backport window is over [20:03:50] (03CR) 10Dreamy Jazz: [C:03+2] CheckUser Suggested Investigations: Enable on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276024 (https://phabricator.wikimedia.org/T424084) (owner: 10Dreamy Jazz) [20:03:52] (03CR) 10Dreamy Jazz: [C:03+2] Remove unused wgCheckUserUserAgentTableMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276025 (owner: 10Dreamy Jazz) [20:04:16] (03Merged) 10jenkins-bot: Diqwiki: change project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275112 (https://phabricator.wikimedia.org/T328207) (owner: 10Pppery) [20:04:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276025 (owner: 10Dreamy Jazz) [20:04:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276024 (https://phabricator.wikimedia.org/T424084) (owner: 10Dreamy Jazz) [20:04:41] (03Merged) 10jenkins-bot: CheckUser Suggested Investigations: Enable on commonswiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276024 (https://phabricator.wikimedia.org/T424084) (owner: 10Dreamy Jazz) [20:04:45] (03Merged) 10jenkins-bot: Remove unused wgCheckUserUserAgentTableMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276025 (owner: 10Dreamy Jazz) [20:05:15] Scap failed with `scap: error: extra arguments found: --backport` [20:05:26] hmmm.. [20:05:31] =/ [20:05:38] Lemme roll back scap [20:05:41] https://spiderpig.wikimedia.org/jobs/1805 [20:05:41] RESOLVED: ConfdResourceFailed: confd resource _etc_kubernetes_pki_kube-apiserver-sa-certs.pem.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:05:56] !log dancy@deploy1003 Installing scap version "4.249.0" for 2 host(s) [20:06:21] Thanks, ready to go again when you're done rolling back scap [20:06:41] RESOLVED: [2x] ProbeDown: Service wikikube-ctrl2004:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl2004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:23] jasmine_: heh, well I guess that fixed it anyway even without reenabling puppet :) [20:07:37] !log dancy@deploy1003 Installation of scap version "4.249.0" completed for 2 hosts [20:07:46] Dreamy_Jazz: Back atcha [20:08:10] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1275112|Diqwiki: change project namespace (T328207)]], [[gerrit:1276025|Remove unused wgCheckUserUserAgentTableMigrationStage config]], [[gerrit:1276024|CheckUser Suggested Investigations: Enable on commonswiki (T424084)]] [20:08:15] T328207: Change Namespace Aliases on diq.wikipedia - https://phabricator.wikimedia.org/T328207 [20:08:16] T424084: Enable Suggested Investigations on
Wikimedia Commons - https://phabricator.wikimedia.org/T424084 [20:08:17] ` Etcd cluster health on wikikube-ctrl2004 is OK: The etcd server is healthy` < :') (amazing) thank you swfrench-wmf, much much appreciated [20:08:58] Scap looks to be working again :D [20:09:08] jasmine_: actually, never mind: it looks like kube-publish-sa-cert.service will still need a manual poke, which should(?) happen naturally when you run puppet after reenabling it [20:09:33] on 2004 right? [20:09:48] !log dreamyjazz@deploy1003 pppery, dreamyjazz: Backport for [[gerrit:1275112|Diqwiki: change project namespace (T328207)]], [[gerrit:1276025|Remove unused wgCheckUserUserAgentTableMigrationStage config]], [[gerrit:1276024|CheckUser Suggested Investigations: Enable on commonswiki (T424084)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:09:48] jasmine_: yes, exactly [20:09:52] looking [20:10:38] Looks good. Since it's a namespace change you should probably run namespaceDupes once syncing is done [20:11:13] My changes appear fine, thanks for the info on that [20:11:58] !log dreamyjazz@deploy1003 pppery, dreamyjazz: Continuing with sync [20:15:49] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275112|Diqwiki: change project namespace (T328207)]], [[gerrit:1276025|Remove unused wgCheckUserUserAgentTableMigrationStage config]], [[gerrit:1276024|CheckUser Suggested Investigations: Enable on commonswiki (T424084)]] (duration: 07m 38s) [20:15:54] T328207: Change Namespace Aliases on diq.wikipedia - https://phabricator.wikimedia.org/T328207 [20:15:54] T424084: Enable Suggested Investigations on Wikimedia Commons - https://phabricator.wikimedia.org/T424084 [20:16:47] !log Running `mwscript-k8s maintenance/namespaceDupes.php --wiki=diqwiki --fix` [20:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:56] 06SRE, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: 
Multiple wikipedia wikis have broken maps URLs in the infobox - https://phabricator.wikimedia.org/T424046#11847073 (10A_smart_kitten) FWIW, for https://ha.wikipedia.org/wiki/Juaben, when I copy the exact request made by my browser from the browser's 'N... [20:17:14] 764 links to fix, 761 were resolvable, 3 were deleted. [20:17:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11847074 (10wiki_willy) a:03VRiley-WMF [20:19:10] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:41] 10 pages need fixing. Will post on the phab task [20:21:44] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T421989#11847081 (10wiki_willy) a:03Jclark-ctr [20:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-ctrl2004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl2004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:27:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11847115 (10Marostegui) Yes please and it can be swapped anytime. Thanks! 
[20:27:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T424064#11847128 (10Marostegui) p:05Triage→03Medium [20:28:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11847137 (10wiki_willy) a:03Jclark-ctr [20:28:55] !log Evening UTC backport window done [20:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:14] swfrench-wmf: If you need it, the backport window is done so should be free for you to finish [20:29:28] Dreamy_Jazz: thank you very much :) [20:35:23] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [20:41:01] jclark@cumin1003 netbox (PID 4001155) is awaiting input [20:42:15] !log dancy@deploy1003 Installing scap version "4.250.1" for 2 host(s) [20:44:06] !log dancy@deploy1003 Installation of scap version "4.250.1" completed for 2 hosts [20:49:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276012 (https://phabricator.wikimedia.org/T423529) (owner: 10MusikAnimal) [20:49:31] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb1016 to eqiad - jclark@cumin1003" [20:50:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb1016 to eqiad - jclark@cumin1003" [20:50:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:51] (03Merged) 10jenkins-bot: mw.FormDataTransport.test: Update expected API call for POSTed calls [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276012
(https://phabricator.wikimedia.org/T423529) (owner: 10MusikAnimal) [20:50:58] (03Abandoned) 10Kosta Harlan: hCaptcha: Emit Prometheus counter on health check failover [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [20:51:09] !log musikanimal@deploy1003 Started scap sync-world: Backport for [[gerrit:1276012|mw.FormDataTransport.test: Update expected API call for POSTed calls (T423529 T421288)]] [20:51:15] T423529: mw.FormDataTransport upload and uploadChunk failing in CI - https://phabricator.wikimedia.org/T423529 [20:51:15] T421288: Action API: prefer the action parameter to be given as a query parameter, even for POST requests - https://phabricator.wikimedia.org/T421288 [20:52:45] !log musikanimal@deploy1003 musikanimal: Backport for [[gerrit:1276012|mw.FormDataTransport.test: Update expected API call for POSTed calls (T423529 T421288)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:52:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host rdb1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:53:02] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host rdb1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:53:44] !log musikanimal@deploy1003 musikanimal: Continuing with deployment [20:57:36] !log musikanimal@deploy1003 Finished scap sync-world: Backport for [[gerrit:1276012|mw.FormDataTransport.test: Update expected API call for POSTed calls (T423529 T421288)]] (duration: 06m 27s) [20:57:41] T423529: mw.FormDataTransport upload and uploadChunk failing in CI - https://phabricator.wikimedia.org/T423529 [20:57:42] T421288: Action API: prefer the action parameter to be given as a query parameter, even for POST requests - https://phabricator.wikimedia.org/T421288 [21:00:04] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260421T2100) [21:01:34] (03CR) 10MusikAnimal: "recheck" [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275999 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:05:59] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host rdb1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:06:07] (03PS2) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [21:06:23] (03PS2) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) [21:07:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb1015.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART [21:08:33] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Don't prevent opening links present in the hCaptcha popup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275429 (https://phabricator.wikimedia.org/T408812) (owner: 10Harroyo-wmf) [21:10:51] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1015.eqiad.wmnet with OS trixie [21:11:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11847361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie [21:12:12] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:17:28] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host rdb1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:23:23] FIRING: [10x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:18] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:29:50] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1016.eqiad.wmnet with OS trixie [21:30:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11847429 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie [21:43:14] okay, I'm starting deployment of https://phabricator.wikimedia.org/T259059 now. 
This may take a while, FYI [21:43:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275997 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [21:43:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [skins/Timeless] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276002 (owner: 10MusikAnimal) [21:43:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/TemplateStyles] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276004 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:43:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/CommunityRequests] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276005 (owner: 10MusikAnimal) [21:44:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/CodeEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275998 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:44:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276001 (https://phabricator.wikimedia.org/T399673) (owner: 10MusikAnimal) [21:44:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276007 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:44:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/Scribunto] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276006 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal)
[21:44:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/Gadgets] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276008 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:44:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/Math] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276000 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [21:44:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275999 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:44:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271263 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:45:11] (03Merged) 10jenkins-bot: Promote CodeMirror 6 out of beta and use in place of CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271263 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:46:39] jclark@cumin1003 reimage (PID 4007848) is awaiting input [21:47:04] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1016.eqiad.wmnet with OS trixie [21:47:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11847490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie executed with errors: - rdb... 
[21:48:10] (03Merged) 10jenkins-bot: Promote CM6 out of beta, remove CM5 modules, and add v6 aliases [extensions/CodeMirror] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275997 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [21:48:13] (03Merged) 10jenkins-bot: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/UploadWizard] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275999 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:49:52] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.2.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276061 (https://phabricator.wikimedia.org/T422679) [21:51:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11847519 (10Jclark-ctr) [21:52:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11847528 (10Jclark-ctr) @Clement_Goubert i am having issues with these failing to image this is error on console. They might be missing a partman │ Error while se...
[21:54:12] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.2.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276061 (https://phabricator.wikimedia.org/T422679) (owner: 10Clare Ming) [21:54:13] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:56:07] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276061 (https://phabricator.wikimedia.org/T422679) (owner: 10Clare Ming) [21:56:14] (03Merged) 10jenkins-bot: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/Gadgets] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276008 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:56:16] (03Merged) 10jenkins-bot: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/Scribunto] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276006 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:56:18] (03Merged) 10jenkins-bot: ext.math.editpage: update CodeMirror RL module [extensions/Math] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276000 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [21:56:20] (03Merged) 10jenkins-bot: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/JsonConfig] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276007 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [21:57:38] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:58:03] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:59:02] (03Merged) 10jenkins-bot: ext.abuseFilter.edit: target newly updated CodeMirror modules [extensions/AbuseFilter] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276001 (https://phabricator.wikimedia.org/T399673) 
(owner: 10MusikAnimal) [21:59:05] (03Merged) 10jenkins-bot: DescriptionField: use new module name for loading CodeMirror [extensions/CommunityRequests] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276005 (owner: 10MusikAnimal) [21:59:15] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [22:01:33] (03Merged) 10jenkins-bot: Hooks: remove temporary CodeMirror code following promotion from beta [extensions/CodeEditor] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275998 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [22:01:34] (03Merged) 10jenkins-bot: CodeEditorHooks: remove temporary code for CodeMirror beta feature [extensions/TemplateStyles] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276004 (https://phabricator.wikimedia.org/T419332) (owner: 10MusikAnimal) [22:01:37] (03Merged) 10jenkins-bot: VisualEditor.CodeMirror.less: remove CM5 styles [skins/Timeless] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1276002 (owner: 10MusikAnimal) [22:02:08] !log musikanimal@deploy1003 Started scap sync-world: Backport for [[gerrit:1275997|Promote CM6 out of beta, remove CM5 modules, and add v6 aliases (T373720)]], [[gerrit:1276002|VisualEditor.CodeMirror.less: remove CM5 styles]], [[gerrit:1276004|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276005|DescriptionField: use new module name for loading CodeMirror]], [[gerrit:1275998|Ho [22:02:08] oks: remove temporary CodeMirror code following promotion from beta (T419332)]], [[gerrit:1276001|ext.abuseFilter.edit: target newly updated CodeMirror modules (T399673)]], [[gerrit:1276007|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276006|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276008|CodeEditorHooks: remove temporary code fo [22:02:08] r CodeMirror beta feature (T419332)]], 
[[gerrit:1276000|ext.math.editpage: update CodeMirror RL module (T373720)]], [[gerrit:1275999|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1271263|Promote CodeMirror 6 out of beta and use in place of CodeEditor (T419332 T259059)]] [22:02:14] T373720: Deprecate use of CodeMirror 5 - https://phabricator.wikimedia.org/T373720 [22:02:15] T419332: Replacing CodeEditor with CodeMirror by MW 1.46 - https://phabricator.wikimedia.org/T419332 [22:02:15] T399673: Add CodeMirror mode for AbuseFilter syntax - https://phabricator.wikimedia.org/T399673 [22:02:16] T259059: Upgrade to CodeMirror 6 - https://phabricator.wikimedia.org/T259059 [22:03:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet - https://phabricator.wikimedia.org/T423863#11847614 (10Jclark-ctr) Servers have been Physcially removed from Rows A ,C D [22:18:58] !log musikanimal@deploy1003 musikanimal: Backport for [[gerrit:1275997|Promote CM6 out of beta, remove CM5 modules, and add v6 aliases (T373720)]], [[gerrit:1276002|VisualEditor.CodeMirror.less: remove CM5 styles]], [[gerrit:1276004|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276005|DescriptionField: use new module name for loading CodeMirror]], [[gerrit:1275998|Hooks: remove [22:18:58] temporary CodeMirror code following promotion from beta (T419332)]], [[gerrit:1276001|ext.abuseFilter.edit: target newly updated CodeMirror modules (T399673)]], [[gerrit:1276007|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276006|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276008|CodeEditorHooks: remove temporary code for CodeMirror [22:18:58] beta feature (T419332)]], [[gerrit:1276000|ext.math.editpage: update CodeMirror RL module (T373720)]], 
[[gerrit:1275999|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1271263|Promote CodeMirror 6 out of beta and use in place of CodeEditor (T419332 T259059)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:19:03] T373720: Deprecate use of CodeMirror 5 - https://phabricator.wikimedia.org/T373720 [22:19:03] T419332: Replacing CodeEditor with CodeMirror by MW 1.46 - https://phabricator.wikimedia.org/T419332 [22:19:04] T399673: Add CodeMirror mode for AbuseFilter syntax - https://phabricator.wikimedia.org/T399673 [22:19:04] T259059: Upgrade to CodeMirror 6 - https://phabricator.wikimedia.org/T259059 [22:23:30] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [22:25:22] !log musikanimal@deploy1003 musikanimal: Continuing with deployment [22:25:27] weee!!!! [22:26:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on cloudelastic1011:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:31:04] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1015.eqiad.wmnet with OS trixie [22:31:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11847707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie executed with errors: - 
rdb... [22:37:25] !log musikanimal@deploy1003 Finished scap sync-world: Backport for [[gerrit:1275997|Promote CM6 out of beta, remove CM5 modules, and add v6 aliases (T373720)]], [[gerrit:1276002|VisualEditor.CodeMirror.less: remove CM5 styles]], [[gerrit:1276004|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276005|DescriptionField: use new module name for loading CodeMirror]], [[gerrit:1275998|H [22:37:25] ooks: remove temporary CodeMirror code following promotion from beta (T419332)]], [[gerrit:1276001|ext.abuseFilter.edit: target newly updated CodeMirror modules (T399673)]], [[gerrit:1276007|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276006|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1276008|CodeEditorHooks: remove temporary code f [22:37:25] or CodeMirror beta feature (T419332)]], [[gerrit:1276000|ext.math.editpage: update CodeMirror RL module (T373720)]], [[gerrit:1275999|CodeEditorHooks: remove temporary code for CodeMirror beta feature (T419332)]], [[gerrit:1271263|Promote CodeMirror 6 out of beta and use in place of CodeEditor (T419332 T259059)]] (duration: 35m 16s) [22:37:30] T373720: Deprecate use of CodeMirror 5 - https://phabricator.wikimedia.org/T373720 [22:37:31] T419332: Replacing CodeEditor with CodeMirror by MW 1.46 - https://phabricator.wikimedia.org/T419332 [22:37:31] T399673: Add CodeMirror mode for AbuseFilter syntax - https://phabricator.wikimedia.org/T399673 [22:37:32] T259059: Upgrade to CodeMirror 6 - https://phabricator.wikimedia.org/T259059 [22:37:43] FINISHED!!! 
[22:37:59] that was one hour deployment [22:39:54] (03PS1) 10BryanDavis: deployment-prep: Remove hiera for deployment-deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1276068 (https://phabricator.wikimedia.org/T421244) [23:15:25] !log denisse@deploy1003 Started deploy [librenms/librenms@4a0466d]: Upgrade LibreNMS to 26.4.0 - T423229 [23:15:44] !log denisse@deploy1003 Finished deploy [librenms/librenms@4a0466d]: Upgrade LibreNMS to 26.4.0 - T423229 (duration: 00m 18s) [23:23:56] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11847856 (10Bawolff) [23:28:59] 06SRE, 10SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T423989#11847887 (10Dzahn) I did the removal. Leaving the addition for clinic duty for now. [23:34:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/nngkzgw8 on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:35:58] 06SRE, 10SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T423989#11847946 (10Dzahn) Ok, just doing it based on T420459#11794832 https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMDE_Group has been edited as requested. [23:37:04] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11847954 (10Dzahn) I added Katia to approvers for WMDE requests at https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMDE_Group per request on T423... 
[23:37:45] 06SRE, 10SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T423989#11847960 (10Dzahn) 05Open→03Resolved a:03Dzahn [23:39:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1276081 [23:39:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1276081 (owner: 10TrainBranchBot) [23:51:25] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1276081 (owner: 10TrainBranchBot)