[00:05:14] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1015.eqiad.wmnet with OS trixie [00:05:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11938752 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie [00:06:33] (03PS1) 10Sbisson: Log editing_start and article_saved events for control group [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289446 (https://phabricator.wikimedia.org/T422146) [00:07:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289446 (https://phabricator.wikimedia.org/T422146) (owner: 10Sbisson) [00:09:19] (03PS3) 10Ladsgroup: 404.php: Force a redirect to /wiki/ in very obvious cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) [00:09:22] (03CR) 10Ladsgroup: 404.php: Force a redirect to /wiki/ in very obvious cases (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [00:09:32] (03CR) 10Ladsgroup: 404.php: Force a redirect to /wiki/ in very obvious cases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [00:16:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:20:01] (03PS1) 10Papaul: Fix partman to use 
standard-efi for new rdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/1289448 (https://phabricator.wikimedia.org/T418916) [00:22:55] (03CR) 10Papaul: [C:03+2] Fix partman to use standard-efi for new rdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/1289448 (https://phabricator.wikimedia.org/T418916) (owner: 10Papaul) [00:24:42] (03PS3) 10Ladsgroup: Limit $wgThumbLimits to three options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [00:25:51] jouncebot: nowandnext [00:25:51] No deployments scheduled for the next 5 hour(s) and 34 minute(s) [00:25:51] In 5 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T0600) [00:25:56] cool cool [00:26:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [00:27:30] (03CR) 10Ladsgroup: Limit $wgThumbLimits to three options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [00:32:52] pt1979@cumin1003 reimage (PID 1743822) is awaiting input [00:33:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [00:35:24] (03Merged) 10jenkins-bot: Limit $wgThumbLimits to three options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [00:36:38] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1287441|Limit $wgThumbLimits to three options (T426328)]] [00:36:42] T426328: [1.48] 3rd parties must update ThumbLimits config - https://phabricator.wikimedia.org/T426328 [00:38:38] 
!log ladsgroup@deploy1003 jdlrobson, ladsgroup: Backport for [[gerrit:1287441|Limit $wgThumbLimits to three options (T426328)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:39:02] !log ladsgroup@deploy1003 jdlrobson, ladsgroup: Continuing with deployment [00:43:10] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287441|Limit $wgThumbLimits to three options (T426328)]] (duration: 06m 33s) [00:43:14] T426328: [1.48] 3rd parties must update ThumbLimits config - https://phabricator.wikimedia.org/T426328 [00:43:35] (03CR) 10DDesouza: [C:03+2] miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289440 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [00:44:38] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11938830 (10Ladsgroup) m1 main dbproxy on codfw. I wouldn't fail it over at middle of night since it's codfw but maybe tomorrow if DBAs think we should. 
[00:45:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [00:45:17] FIRING: KubernetesCalicoDown: wikikube-worker1246.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1246.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:46:00] (03Merged) 10jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289440 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [00:46:08] (03Merged) 10jenkins-bot: 404.php: Force a redirect to /wiki/ in very obvious cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [00:46:34] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1288274|404.php: Force a redirect to /wiki/ in very obvious cases (T129433)]] [00:46:40] T129433: Improve design for wiki-facing error pages - https://phabricator.wikimedia.org/T129433 [00:47:20] !log pt1979@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1015.eqiad.wmnet with reason: host reimage [00:48:28] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1288274|404.php: Force a redirect to /wiki/ in very obvious cases (T129433)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:50:33] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [00:54:14] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1015.eqiad.wmnet with reason: host reimage [00:54:45] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288274|404.php: Force a redirect to /wiki/ in very obvious cases (T129433)]] (duration: 08m 10s) [00:54:48] T129433: Improve design for wiki-facing error pages - https://phabricator.wikimedia.org/T129433 [01:09:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1289454 [01:09:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1289454 (owner: 10TrainBranchBot) [01:10:33] !log pt1979@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1003" [01:13:38] pt1979@cumin1003 reimage (PID 1743822) is awaiting input [01:18:12] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1289454 (owner: 10TrainBranchBot) [01:23:38] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7011.* [01:26:02] (03CR) 10Ottomata: "> "We add some logic to output_kafka.conf.erbto see if we can parse meta.stream from the message. If it is found, set the topic to be $dc." 
[puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [01:26:08] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp7011.magru.wmnet [01:29:53] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir7003.* [01:30:46] !log pt1979@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1003" [01:30:47] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1015.eqiad.wmnet with OS trixie [01:30:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11938893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie completed: - rdb1015 (**PAS... [01:31:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11938896 (10Papaul) @Jclark-ctr partman fixed 1015 is done you can install 1016. thanks [01:32:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11938899 (10Papaul) [01:36:10] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp7011.magru.wmnet [01:36:51] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11938900 (10Ladsgroup) The ones marked as "running" have been restarted now. I will some extra ones tomorrow for the containers that fell into cracks. 
[01:41:15] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7011.* [02:05:16] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:05:26] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:06:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:26] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:17:33] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [02:17:46] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [02:17:47] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [02:17:59] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [02:18:00] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [02:18:14] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [02:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:15:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:51:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:11] 06SRE, 10Wikimedia-Mailing-lists: New mailing list for the latam tech community - https://phabricator.wikimedia.org/T426803#11938979 (10Arcstur) [03:56:16] 06SRE, 10Wikimedia-Mailing-lists: New mailing list for the latam tech community - https://phabricator.wikimedia.org/T426803#11938981 (10Arcstur) [04:11:09] (03PS1) 10KartikMistry: Update cxserver to 2026-05-20-034002-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289463 (https://phabricator.wikimedia.org/T388690) [04:45:18] FIRING: KubernetesCalicoDown: wikikube-worker1246.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1246.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:47:01] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (pki-root1002), Fresh: 136 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:03:59] (03CR) 10Kevin Bazira: "Thank you for working on this. I've added some comments." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [05:22:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11939019 (10Marostegui) >>! In T426791#11938830, @Ladsgroup wrote: > m1 main dbproxy on codfw. I wouldn't fail it over at middle of night since it's codfw but maybe tomorrow if DBAs think we should... [05:23:34] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11939021 (10Marostegui) p:05Triage→03Medium [05:27:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2022.codfw.wmnet,pc[1012,1022].eqiad.wmnet with reason: Maintenance on pc1 [05:27:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1012.eqiad.wmnet: Maintenance on pc2 [05:28:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:29:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:29:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1012.eqiad.wmnet: Maintenance on pc2 [05:32:15] (03PS1) 10Marostegui: mariadb: Productionize pc1022. [puppet] - 10https://gerrit.wikimedia.org/r/1289733 (https://phabricator.wikimedia.org/T418973) [05:34:13] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize pc1022. 
[puppet] - 10https://gerrit.wikimedia.org/r/1289733 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [05:37:51] (03PS1) 10Marostegui: instances.yaml: Remove pc1011 [puppet] - 10https://gerrit.wikimedia.org/r/1289734 (https://phabricator.wikimedia.org/T426806) [05:38:38] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove pc1011 [puppet] - 10https://gerrit.wikimedia.org/r/1289734 (https://phabricator.wikimedia.org/T426806) (owner: 10Marostegui) [05:41:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove pc1011 from dbctl T426806', diff saved to https://phabricator.wikimedia.org/P92647 and previous config saved to /var/cache/conftool/dbconfig/20260520-054146-marostegui.json [05:41:51] T426806: decommission pc1011.eqiad.wmnet - https://phabricator.wikimedia.org/T426806 [05:46:58] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T0600) [06:00:38] (03PS1) 10Arthur taylor: Disable support for PHP-serialized EntityData on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) [06:01:07] 06SRE, 06DBA: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11939087 (10Marostegui) >>! In T426750#11937397, @FCeratto-WMF wrote: > moving the task back to DBA: the host is up (and with an updated kernel) but before pooling in we should decide if we want to clone it or trust the crash recov... 
[06:02:06] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on ms-backup[2003-2004].codfw.wmnet with reason: restart [06:03:08] (03PS1) 10Marostegui: installserver: Do not format pc1022 [puppet] - 10https://gerrit.wikimedia.org/r/1289737 (https://phabricator.wikimedia.org/T418973) [06:06:07] (03CR) 10Marostegui: [C:03+2] installserver: Do not format pc1022 [puppet] - 10https://gerrit.wikimedia.org/r/1289737 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [06:07:35] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 18 hosts with reason: restart [06:17:50] (03PS1) 10Marostegui: mariadb: Decommission pc1011 [puppet] - 10https://gerrit.wikimedia.org/r/1289738 (https://phabricator.wikimedia.org/T426806) [06:21:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc1011.eqiad.wmnet [06:22:11] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission pc1011 [puppet] - 10https://gerrit.wikimedia.org/r/1289738 (https://phabricator.wikimedia.org/T426806) (owner: 10Marostegui) [06:27:07] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [06:27:28] (03PS1) 10Effie Mouzeli: php8.3-icu72: Rebuild to pick up new PHP packages (8.3.31) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289442 (owner: 10Scott French) [06:27:32] (03PS1) 10Effie Mouzeli: php8.3: Rebuild to pick up new PHP packages (8.3.31) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289441 (owner: 10Scott French) [06:31:04] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 18 hosts with reason: restart [06:31:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [06:31:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet [06:31:48] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [06:32:45] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T426809 (10LSobanski) 03NEW [06:32:55] marostegui@cumin1003 decommission (PID 1787525) is awaiting input [06:32:59] !log failover Ganeti cluster in drmrs02 to ganeti6002 [06:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:36] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T426809#11939140 (10LSobanski) Also LibericaStaleConfig, LibericaUnhealthyRealserverPooled and UnmergedPuppetChanges, all with the same error. [06:34:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:34:28] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:34:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, nit inline" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289441 (owner: 10Scott French) [06:35:14] PROBLEM - ganeti-wconfd running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [06:35:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:35:36] !log marostegui@cumin1003 
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:35:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1011.eqiad.wmnet [06:36:35] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1011.eqiad.wmnet - https://phabricator.wikimedia.org/T426806#11939161 (10Marostegui) Ready for DC-Ops [06:36:38] jmm@cumin2002 drain-node (PID 3426525) is awaiting input [06:39:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:44:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet [06:44:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [06:45:10] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1249: Repooling after boot [06:45:20] 06SRE, 06DBA: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11939194 (10ops-monitoring-bot) Starting pool of db1249 by fceratto@cumin1003: Repooling after boot [06:46:00] 06SRE, 06DBA: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11939195 (10FCeratto-WMF) ok, pooling and removing silence [06:46:19] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1249.eqiad.wmnet [06:46:20] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1249.eqiad.wmnet [06:46:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:49] jmm@cumin2002 drain-node (PID 3434000) is awaiting input 
[06:50:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [06:50:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [06:51:11] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1289743 [06:51:35] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289744 [06:52:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1289743 (owner: 10Matthias Mullie) [06:52:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289744 (owner: 10Matthias Mullie) [06:53:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [06:55:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] - https://phabricator.wikimedia.org/T424680#11939248 (10MoritzMuehlenhoff) [06:58:15] jmm@cumin2002 drain-node (PID 3434000) is awaiting input [07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T0700). Please do the needful. [07:00:05] matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:08] o/ [07:02:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1289743 (owner: 10Matthias Mullie) [07:02:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289744 (owner: 10Matthias Mullie) [07:03:32] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1289743 (owner: 10Matthias Mullie) [07:03:33] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289744 (owner: 10Matthias Mullie) [07:03:38] 06SRE, 06DBA: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11939263 (10FCeratto-WMF) 05Open→03Resolved a:03FCeratto-WMF [07:04:22] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1289743|Squashed diff to master]], [[gerrit:1289744|Squashed diff to master]] [07:04:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1055.eqiad.wmnet [07:04:50] (03CR) 10Slyngshede: [C:03+1] idp: restrict growthbook UI login to the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [07:06:27] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1289743|Squashed diff to master]], [[gerrit:1289744|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:09:09] !log remove haveged [07:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1055.eqiad.wmnet [07:11:04] !log mlitn@deploy1003 mlitn: Continuing with deployment [07:11:23] (03PS1) 10Matthias Mullie: Fix wordmark dimensions [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1289748 [07:11:31] (03PS1) 10Matthias Mullie: Fix wordmark dimensions [extensions/ReaderExperiments] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289749 [07:15:02] (03CR) 10Brouberol: [C:03+2] idp: restrict growthbook UI login to the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [07:15:13] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289743|Squashed diff to master]], [[gerrit:1289744|Squashed diff to master]] (duration: 10m 51s) [07:15:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:16:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1289748 (owner: 10Matthias Mullie) [07:16:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289749 (owner: 10Matthias Mullie) [07:17:08] (03Merged) 10jenkins-bot: Fix wordmark dimensions [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1289748 (owner: 10Matthias Mullie) [07:17:09] (03Merged) 10jenkins-bot: Fix wordmark dimensions [extensions/ReaderExperiments] 
(wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289749 (owner: 10Matthias Mullie) [07:17:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1056.eqiad.wmnet [07:17:36] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1289748|Fix wordmark dimensions]], [[gerrit:1289749|Fix wordmark dimensions]] [07:17:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [07:18:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [07:19:26] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1289748|Fix wordmark dimensions]], [[gerrit:1289749|Fix wordmark dimensions]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:19:58] !log mlitn@deploy1003 mlitn: Continuing with deployment [07:20:45] PROBLEM - Host hcaptcha-proxy6002 is DOWN: PING CRITICAL - Packet loss = 100% [07:20:55] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:03] PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100% [07:21:31] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:22:10] FIRING: [3x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.38 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:22:10] (03PS1) 10Brouberol: idp: restrict growthbook-next UI login to users in the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289752 
(https://phabricator.wikimedia.org/T420691) [07:23:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1056.eqiad.wmnet [07:23:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [07:24:05] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289748|Fix wordmark dimensions]], [[gerrit:1289749|Fix wordmark dimensions]] (duration: 06m 28s) [07:24:31] I'm all done in case there are other deploys [07:24:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [07:25:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [07:25:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1092.eqiad.wmnet [07:25:27] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:25:31] RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 96.47 ms [07:25:47] RECOVERY - Host hcaptcha-proxy6002 is UP: PING OK - Packet loss = 0%, RTA = 87.68 ms [07:25:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2092.codfw.wmnet [07:27:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org [07:27:09] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:27:10] FIRING: [3x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.38 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:27:15] 
PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:27:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1057.eqiad.wmnet [07:27:32] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2187: Upgrading db2187.codfw.wmnet [07:27:59] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2187: Upgrading db2187.codfw.wmnet [07:29:13] RESOLVED: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:29:15] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:29:23] (03CR) 10Elukey: [C:03+2] profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [07:29:31] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:30:39] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1249: Repooling after boot [07:30:44] 06SRE, 06DBA: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11939321 (10ops-monitoring-bot) Completed pooling of db1249 by fceratto@cumin1003: Repooling after boot [07:30:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [07:32:00] cwilliams@cumin1003 major-upgrade (PID 1820489) is awaiting input [07:32:09] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ms-be1092.eqiad.wmnet [07:32:10] RESOLVED: [2x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.38 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:32:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1093.eqiad.wmnet [07:32:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1057.eqiad.wmnet [07:32:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2092.codfw.wmnet [07:32:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2093.codfw.wmnet [07:32:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [07:34:26] Hello, I'll have a bit of delay to run the MediaWiki train. I woke up late this morning and haven't finished my morning routine yet. 
[07:34:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1058.eqiad.wmnet [07:35:23] (03PS1) 10Elukey: profile::cache::haproxy: fix typo in top_10000_ips_requestctl_webrequest_source [puppet] - 10https://gerrit.wikimedia.org/r/1289754 [07:36:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [07:37:44] (03CR) 10Elukey: [C:03+2] profile::cache::haproxy: fix typo in top_10000_ips_requestctl_webrequest_source [puppet] - 10https://gerrit.wikimedia.org/r/1289754 (owner: 10Elukey) [07:37:53] (03CR) 10Slyngshede: [C:03+1] profile::cache::haproxy: fix typo in top_10000_ips_requestctl_webrequest_source [puppet] - 10https://gerrit.wikimedia.org/r/1289754 (owner: 10Elukey) [07:38:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1093.eqiad.wmnet [07:39:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1094.eqiad.wmnet [07:39:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2093.codfw.wmnet [07:39:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2094.codfw.wmnet [07:40:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1058.eqiad.wmnet [07:43:25] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2187.codfw.wmnet with OS trixie [07:44:43] (03CR) 10Slyngshede: [C:03+1] idp: restrict growthbook-next UI login to users in the growthbook LDAP groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289752 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [07:45:00] (03CR) 10Brouberol: [C:03+2] idp: restrict growthbook-next UI login to users in the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289752 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [07:45:13] PROBLEM - PyBal backends health check 
on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:45:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1094.eqiad.wmnet [07:45:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1095.eqiad.wmnet [07:46:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2094.codfw.wmnet [07:46:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2095.codfw.wmnet [07:46:13] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:48:31] (03CR) 10Muehlenhoff: [C:03+2] Switch install1005 / the installserver role at large to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1289187 (owner: 10Muehlenhoff) [07:49:14] (03PS1) 10Elukey: haproxy: Enable use_webrequest_ipreputation flag for cp7002/cp7012 [puppet] - 10https://gerrit.wikimedia.org/r/1289808 (https://phabricator.wikimedia.org/T402512) [07:51:40] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:51:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1024.eqiad.wmnet with reason: Reboot [07:51:48] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289808 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [07:52:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2095.codfw.wmnet [07:52:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
ms-be2096.codfw.wmnet [07:52:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1095.eqiad.wmnet [07:52:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1096.eqiad.wmnet [07:53:18] (03CR) 10Slyngshede: [C:03+1] haproxy: Enable use_webrequest_ipreputation flag for cp7002/cp7012 [puppet] - 10https://gerrit.wikimedia.org/r/1289808 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [07:55:01] (03PS1) 10Muehlenhoff: Make ganeti1055/1056 Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1289890 (https://phabricator.wikimedia.org/T424680) [07:55:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1022.eqiad.wmnet with reason: Reboot [07:58:07] (03CR) 10Fabfur: [C:03+1] haproxy: Enable use_webrequest_ipreputation flag for cp7002/cp7012 [puppet] - 10https://gerrit.wikimedia.org/r/1289808 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [07:59:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2096.codfw.wmnet [07:59:13] FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:59:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1096.eqiad.wmnet [07:59:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1097.eqiad.wmnet [08:00:04] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T0800) [08:00:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1023.eqiad.wmnet with reason: Reboot [08:00:55] FIRING: [3x] SystemdUnitFailed: 
update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:01:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1005.wikimedia.org [08:02:06] (03CR) 10Elukey: [C:03+2] haproxy: Enable use_webrequest_ipreputation flag for cp7002/cp7012 [puppet] - 10https://gerrit.wikimedia.org/r/1289808 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [08:02:38] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [08:07:39] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [08:08:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1005.wikimedia.org [08:09:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1097.eqiad.wmnet [08:10:44] (03CR) 10Muehlenhoff: [C:03+2] Make ganeti1055/1056 Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1289890 (https://phabricator.wikimedia.org/T424680) (owner: 10Muehlenhoff) [08:11:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1026.eqiad.wmnet with reason: Reboot [08:11:43] o/ [08:12:02] I am checking the backend logs [08:12:06] (03CR) 10Muehlenhoff: [C:03+2] use_linux612_on_bookworm: Bump kernel to 6.12.88 [puppet] - 10https://gerrit.wikimedia.org/r/1289279 (owner: 10Muehlenhoff) [08:12:07] (03CR) 10Ilias Sarantopoulos: [C:04-1] ml-services: Deploy qwen3-14b model in experimental ns. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [08:15:27] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:17:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1029.eqiad.wmnet with reason: Reboot [08:17:49] Let's try! [08:18:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3004.wikimedia.org [08:18:46] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289892 (https://phabricator.wikimedia.org/T423912) [08:18:49] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289892 (https://phabricator.wikimedia.org/T423912) (owner: 10TrainBranchBot) [08:20:00] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289892 (https://phabricator.wikimedia.org/T423912) (owner: 10TrainBranchBot) [08:20:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy[2005-2008].codfw.wmnet with reason: Reboot [08:21:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy[2006-2008].codfw.wmnet with reason: Reboot [08:24:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:24:37] !log imported openjdk-8u492-ga-1~deb11u1 to component/jdk8 for bookworm (forward port 
of latest Java 8 security release) [08:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3004.wikimedia.org [08:25:40] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2187.codfw.wmnet with OS trixie [08:26:03] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.3 refs T423912 [08:26:07] T423912: 1.47.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T423912 [08:26:42] !log installing Java 11 security updates [08:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:55] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2187: Migration of db2187.codfw.wmnet completed [08:35:27] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp700[5-6].magru.wmnet} and A:cp [08:38:41] (03PS1) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289894 (https://phabricator.wikimedia.org/T426425) [08:41:29] (03CR) 10Elukey: "Tested with a spicerack shell script crafted to add the wmfroot user to a target set of hosts, the logic works." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [08:41:39] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [08:42:50] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [08:45:18] FIRING: KubernetesCalicoDown: wikikube-worker1246.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1246.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:46:25] (03PS2) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289894 (https://phabricator.wikimedia.org/T426425) [08:46:49] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7005.magru.wmnet [08:47:02] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: fasw2-c8a-codfw:xe-0/0/47 low RX power - https://phabricator.wikimedia.org/T426824 (10ayounsi) 03NEW p:05Triage→03Medium [08:47:32] (03Abandoned) 10Arnaudb: envoyproxy: update verify-envoy-config logic [puppet] - 10https://gerrit.wikimedia.org/r/1278482 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [08:48:53] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: fasw2-c8a-codfw:xe-0/0/47 low RX power - https://phabricator.wikimedia.org/T426824#11939523 (10ayounsi) [08:50:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4004.wikimedia.org [08:53:51] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [08:53:55] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2001.codfw.wmnet [08:54:32] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [08:54:55] !log sfaci@deploy1003 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [08:55:02] (03PS1) 10Mszwarc: Update UserInfoCard to be enabled by default for certain user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289895 (https://phabricator.wikimedia.org/T426021) [08:57:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4004.wikimedia.org [08:59:01] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2001.codfw.wmnet [08:59:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11939566 (10MLechvien-WMF) @Jclark-ctr are this and {T418922} (which had the same issue IIRC) unblocked now? [09:01:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5004.wikimedia.org [09:01:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:02:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2242: Upgrading db2242.codfw.wmnet [09:03:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2242: Upgrading db2242.codfw.wmnet [09:04:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2242.codfw.wmnet with OS trixie [09:04:37] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 207947 [09:04:54] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2001.codfw.wmnet [09:04:55] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2001.codfw.wmnet [09:04:59] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [09:05:03] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2002.codfw.wmnet [09:05:08] !log ayounsi@cumin1003 END (PASS) - Cookbook 
sre.network.peering (exit_code=0) with action 'email' for AS: 207947 [09:06:28] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 12389 [09:07:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12389 [09:08:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5004.wikimedia.org [09:09:59] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [09:13:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6003.wikimedia.org [09:13:24] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2187: Migration of db2187.codfw.wmnet completed [09:13:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [09:14:27] (03CR) 10Lucas Werkmeister (WMDE): "Actually, let’s split this up, so the announcement can say “this change is alread" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [09:14:56] (03CR) 10Lucas Werkmeister (WMDE): "sorry, hit Enter too soon >.<" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [09:15:08] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2002.codfw.wmnet [09:15:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [09:16:08] (03PS2) 10Gkyziridis: ml-services: Deploy qwen3-14b model in experimental ns. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) [09:18:55] !log temporarily drop ganeti2030 from the codfw cluster T426199 [09:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:59] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [09:19:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [09:19:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6003.wikimedia.org [09:19:58] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2242.codfw.wmnet with reason: host reimage [09:20:46] (03PS3) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289894 (https://phabricator.wikimedia.org/T426425) [09:21:18] PROBLEM - ganeti-confd running on ganeti2030 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:21:18] PROBLEM - ganeti-noded running on ganeti2030 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:21:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install7002.wikimedia.org [09:21:50] FIRING: ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:23:01] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [09:23:24] jmm@cumin2002 drain-node (PID 3537397) is awaiting input [09:23:55] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06Traffic: Decommision hosts cp2041 - cp2042 - 
https://phabricator.wikimedia.org/T426828 (10Fabfur) 03NEW [09:24:11] (03PS2) 10Arthur taylor: Disable support for PHP-serialized EntityData on Beta / Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) [09:24:12] (03PS1) 10Arthur taylor: Disable support for PHP-serialized EntityData on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) [09:24:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2242.codfw.wmnet with reason: host reimage [09:24:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [09:25:08] (03CR) 10CI reject: [V:04-1] Disable support for PHP-serialized EntityData on Beta / Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [09:26:44] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker-eqiad [09:26:48] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1001.eqiad.wmnet [09:27:14] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2002.codfw.wmnet [09:27:14] (03CR) 10Btullis: [C:03+2] [airflow-wikidata]: Add a connection for the wikidata-platform S3 user [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289385 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [09:27:16] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2002.codfw.wmnet [09:27:22] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2003.codfw.wmnet [09:28:00] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve2002:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [09:28:02] (03PS4) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289894 (https://phabricator.wikimedia.org/T426425) [09:28:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install7002.wikimedia.org [09:28:26] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7006.magru.wmnet [09:28:26] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp700[5-6].magru.wmnet} and A:cp [09:28:27] !log btullis@cumin1003 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: Reboot Presto nodes [09:29:15] (03Merged) 10jenkins-bot: [airflow-wikidata]: Add a connection for the wikidata-platform S3 user [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289385 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [09:30:03] FIRING: [2x] KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:30:03] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [09:30:06] RECOVERY - Host wikikube-worker1246 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [09:30:12] (03CR) 10Hnowlan: [C:03+1] logstash: restore sampling of webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [09:30:19] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [09:30:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [09:30:50] (03PS3) 10Arthur 
taylor: Disable support for PHP-serialized EntityData on Beta / Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) [09:30:50] (03PS2) 10Arthur taylor: Disable support for PHP-serialized EntityData on Wikidata production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) [09:30:50] (03CR) 10Majavah: [C:03+1] mariadb: Migrate mariadb internal ferm rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [09:31:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [09:31:50] FIRING: [2x] ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:32:27] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2003.codfw.wmnet [09:33:00] RESOLVED: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node ml-serve2002:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [09:33:35] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [09:33:56] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [09:35:03] RESOLVED: [2x] KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:36:20] PROBLEM - Host wikikube-worker1246 is DOWN: PING CRITICAL - Packet loss = 100% [09:36:24] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2003.codfw.wmnet [09:36:26] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2003.codfw.wmnet [09:36:32] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2004.codfw.wmnet [09:37:24] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [09:37:34] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [09:38:58] RECOVERY - Host wikikube-worker1246 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [09:40:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [09:40:33] (03PS1) 10Ayounsi: Add depool policy for wdqs::scholarly and netmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1289900 (https://phabricator.wikimedia.org/T327300) [09:41:03] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [09:41:37] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
ml-serve2004.codfw.wmnet [09:42:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [09:42:27] (03CR) 10Ayounsi: [C:03+2] Add depool policy for wdqs::scholarly and netmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1289900 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [09:43:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2242.codfw.wmnet with OS trixie [09:43:57] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [09:44:00] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [09:44:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [09:46:00] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2242: Migration of db2242.codfw.wmnet completed [09:47:21] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2004.codfw.wmnet [09:47:23] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2004.codfw.wmnet [09:47:28] (03CR) 10Lucas Werkmeister (WMDE): Disable support for PHP-serialized EntityData on Wikidata production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289898 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [09:47:29] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2005.codfw.wmnet [09:48:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [09:48:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [09:48:46] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM, should be okay to deploy at any time (ideally before Friday when we send the announcement). 
Do you want to schedule it for a backpor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [09:50:31] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [09:51:50] RESOLVED: [2x] ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:52:22] (03CR) 10Arthur taylor: "Sure. I would be around for the window today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [09:52:27] (03CR) 10Elukey: "I am using a lot istio gateway logs to debug all services with Ingress enabled on k8s, this would impact us a lot :( Is it really needed?" [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [09:52:35] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2005.codfw.wmnet [09:52:43] (03PS1) 10Kosta Harlan: hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) [09:52:57] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [09:53:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [09:53:31] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS 
bookworm [09:56:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1001.eqiad.wmnet [09:56:58] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet [09:57:20] FIRING: [2x] ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:57:25] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2005.codfw.wmnet [09:57:27] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2005.codfw.wmnet [09:57:33] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2006.codfw.wmnet [09:57:42] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Rebooting clouddb1015 T415165 [09:57:46] T415165: Install a clouddb hosts with Debian Trixie - https://phabricator.wikimedia.org/T415165 [09:58:17] (03CR) 10Lucas Werkmeister (WMDE): hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1000) [10:02:20] RESOLVED: [2x] ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:02:37] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:02:37] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2006.codfw.wmnet [10:02:55] !log fnegri@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1015.eqiad.wmnet with OS trixie [10:03:20] FIRING: ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:03:35] RESOLVED: ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:03:37] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:03:43] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1001.eqiad.wmnet [10:03:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1001.eqiad.wmnet [10:03:53] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1002.eqiad.wmnet [10:04:07] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host dse-k8s-worker1002.eqiad.wmnet [10:04:14] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy qwen3-14b model in experimental ns. 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:04:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dborch1002.wikimedia.org with reason: Reboot [10:05:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2048.codfw.wmnet [10:07:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet [10:07:20] (03Abandoned) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [10:07:58] (03PS2) 10Kosta Harlan: hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) [10:08:20] FIRING: ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:08:55] PROBLEM - Host dse-k8s-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:09:07] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2006.codfw.wmnet [10:09:09] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2006.codfw.wmnet [10:09:15] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2007.codfw.wmnet [10:09:26] hashar: T426832 warrants rollback, IMO. 
[10:09:27] T426832: userrights-interwiki fails with server error - https://phabricator.wikimedia.org/T426832 [10:09:38] (it means "we are unable to hide personally identifying stuff from the wikis") [10:09:40] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "I don’t know anything about hCaptcha but the approach here seems worth a try." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [10:10:42] RECOVERY - Host dse-k8s-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms [10:11:03] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1002.eqiad.wmnet [10:11:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1002.eqiad.wmnet [10:11:10] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1003.eqiad.wmnet [10:11:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1003.eqiad.wmnet [10:11:43] fnegri@cumin1003 reimage (PID 1921948) is awaiting input [10:12:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet [10:12:25] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp701[3-4].magru.wmnet} and A:cp [10:12:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2048.codfw.wmnet [10:13:12] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs-test1001.eqiad.wmnet with reason: host reimage [10:13:27] (03CR) 10Hnowlan: [C:03+1] "Given the enormous volumes we're seeing here, we need to do something unfortunately. 
But that said, sampling will still offer some signal " [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [10:13:39] !log fceratto@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on es2039.codfw.wmnet with reason: Maintenance [10:14:20] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2007.codfw.wmnet [10:15:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2049.codfw.wmnet [10:17:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs-test1001.eqiad.wmnet with reason: host reimage [10:18:20] RESOLVED: ProbeDown: Service ganeti2030:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:43] 06SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 10Event-Platform, 13Patch-For-Review: Flink Page View: Create K8s resources - https://phabricator.wikimedia.org/T426425#11939866 (10JMonton-WMF) [10:19:56] 06SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 10Event-Platform, 13Patch-For-Review: Flink Page View: Create K8s resources - https://phabricator.wikimedia.org/T426425#11939867 (10JMonton-WMF) [10:20:00] jmm@cumin2002 drain-node (PID 3574668) is awaiting input [10:20:50] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2007.codfw.wmnet [10:20:51] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2007.codfw.wmnet [10:20:57] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2008.codfw.wmnet [10:21:15] (03CR) 10Slyngshede: [C:03+2] Geo-maps: Update Meta PoPs [dns] - 
10https://gerrit.wikimedia.org/r/1282956 (owner: 10Slyngshede) [10:21:22] !log slyngshede@dns1004 START - running authdns-update [10:21:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) (owner: 10Codename Noreste) [10:22:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet [10:22:37] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-codfw [10:23:00] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1003.eqiad.wmnet [10:23:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1003.eqiad.wmnet [10:23:07] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1004.eqiad.wmnet [10:23:18] !log slyngshede@dns1004 END - running authdns-update [10:23:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1004.eqiad.wmnet [10:24:11] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7013.magru.wmnet [10:24:30] PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [10:24:57] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [10:25:38] PROBLEM - BFD status on lsw1-a7-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:26:00] RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 32.04 ms [10:27:21] !log fnegri@cumin1003 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host clouddb1015.eqiad.wmnet with OS trixie [10:27:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet [10:27:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2049.codfw.wmnet [10:28:28] !log fnegri@cumin1003 START - Cookbook sre.hosts.reimage for host clouddb1015.eqiad.wmnet with OS trixie [10:30:45] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1004.eqiad.wmnet [10:30:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1004.eqiad.wmnet [10:30:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2050.codfw.wmnet [10:30:51] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1005.eqiad.wmnet [10:31:03] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2008.codfw.wmnet [10:31:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2242: Migration of db2242.codfw.wmnet completed [10:31:29] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:32:03] (03PS7) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [10:32:38] RECOVERY - BFD status on lsw1-a7-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:32:53] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host dse-k8s-worker1005.eqiad.wmnet [10:33:01] (03CR) 10Federico Ceratto: "I added a 5 second sleep, logic to do the change in reverse direction and tests for both kind of changes." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [10:34:34] (03CR) 10Elukey: "Asked to the team, all good! I just realized that wmfmariadbpy is a debian package, so the new deps will be installed automatically, we ju" [puppet] - 10https://gerrit.wikimedia.org/r/1287378 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto) [10:35:26] jmm@cumin2002 drain-node (PID 3585810) is awaiting input [10:36:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet [10:36:10] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2008.codfw.wmnet [10:36:11] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2008.codfw.wmnet [10:36:17] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2009.codfw.wmnet [10:36:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [10:38:26] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade Kafka to version 3.x - https://phabricator.wikimedia.org/T416669#11939912 (10brouberol) [10:39:20] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:39:43] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2162: Upgrading db2162.codfw.wmnet [10:40:13] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2162: Upgrading db2162.codfw.wmnet [10:40:24] (03PS1) 10Brouberol: preseed: drop configuration for kafka-jumbo100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) [10:40:26] (03PS1) 10Brouberol: preseed: prepare kafka-jumbo brokers for re-imaging [puppet] - 10https://gerrit.wikimedia.org/r/1289909 (https://phabricator.wikimedia.org/T426835) [10:40:38] PROBLEM - 
BFD status on lsw1-c2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:40:57] !log fnegri@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1015.eqiad.wmnet with reason: host reimage [10:41:22] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2009.codfw.wmnet [10:41:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet [10:41:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2050.codfw.wmnet [10:42:20] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2162.codfw.wmnet with OS trixie [10:42:41] (03CR) 10CI reject: [V:04-1] preseed: drop configuration for kafka-jumbo100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:43:04] (03CR) 10Elukey: "My understanding from https://phabricator.wikimedia.org/T390215#11784911 is that there is a specific service that logs a ton of logs, that" [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [10:43:08] (03PS1) 10Brouberol: kafka-jumbo: upgrade broker to jdk21 [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) [10:44:25] (03CR) 10Marostegui: "Since this is a completely new thing, @cwilliams@wikimedia.org do you want to take a look at it and review it?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [10:44:43] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1015.eqiad.wmnet with reason: host reimage [10:45:02] (03PS2) 10Brouberol: preseed: drop configuration for kafka-jumbo100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) [10:45:02] (03PS2) 10Brouberol: preseed: prepare kafka-jumbo brokers for re-imaging [puppet] - 10https://gerrit.wikimedia.org/r/1289909 (https://phabricator.wikimedia.org/T426835) [10:45:02] (03PS2) 10Brouberol: kafka-jumbo: upgrade broker to jdk21 [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) [10:45:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1011.eqiad.wmnet - https://phabricator.wikimedia.org/T426806#11939927 (10VRiley-WMF) a:03VRiley-WMF [10:46:06] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2009.codfw.wmnet [10:46:08] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2009.codfw.wmnet [10:46:14] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2010.codfw.wmnet [10:46:38] RECOVERY - BFD status on lsw1-c2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:47:05] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy qwen3-14b model in experimental ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:47:15] (03CR) 10Btullis: "Is it worth removing them from site.pp at the same time?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:47:40] (03CR) 10Elukey: [C:03+1] preseed: prepare kafka-jumbo brokers for re-imaging [puppet] - 10https://gerrit.wikimedia.org/r/1289909 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:47:50] (03CR) 10Btullis: [C:03+1] preseed: prepare kafka-jumbo brokers for re-imaging [puppet] - 10https://gerrit.wikimedia.org/r/1289909 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:47:53] (03CR) 10Elukey: [C:03+1] preseed: drop configuration for kafka-jumbo100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:48:25] (03CR) 10Brouberol: "yes, good spot! Doing" [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:48:27] (03CR) 10Elukey: [C:03+1] kafka-jumbo: upgrade broker to jdk21 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:48:32] (03PS3) 10Brouberol: preseed: drop configuration for kafka-jumbo100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) [10:48:32] (03PS3) 10Brouberol: preseed: prepare kafka-jumbo brokers for re-imaging [puppet] - 10https://gerrit.wikimedia.org/r/1289909 (https://phabricator.wikimedia.org/T426835) [10:48:32] (03PS3) 10Brouberol: kafka-jumbo: upgrade broker to jdk21 [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) [10:48:53] (03CR) 10Btullis: "You're just upgrading this one as a canary, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:49:10] PROBLEM - Host cloudlb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [10:49:16] (03Merged) 10jenkins-bot: ml-services: Deploy qwen3-14b model in experimental ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:49:17] (03CR) 10Btullis: [C:03+1] preseed: drop configuration for kafka-jumbo100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:49:26] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:49:43] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Copied votes on follow-up patch sets have been updated:" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [10:50:10] (03PS4) 10Brouberol: kafka-jumbo: upgrade kafka-jumbo1010 to jdk21 [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) [10:50:13] (03CR) 10Brouberol: "Yep!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:50:23] (03CR) 10Brouberol: kafka-jumbo: upgrade kafka-jumbo1010 to jdk21 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:50:25] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) (owner: 10Raymond Ndibe) [10:50:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:50:42] RECOVERY - Host cloudlb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 32.55 ms [10:50:50] PROBLEM - Bird Internet Routing Daemon on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:51:10] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:51:21] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2010.codfw.wmnet [10:51:35] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:51:38] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1005.eqiad.wmnet [10:51:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1005.eqiad.wmnet [10:51:45] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1006.eqiad.wmnet [10:52:25] (03CR) 10Btullis: [C:03+1] kafka-jumbo: upgrade kafka-jumbo1010 to jdk21 [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:52:50] RECOVERY - Bird Internet Routing Daemon on cloudlb2002-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:52:53] (03CR) 10Brouberol: [C:03+2] preseed: drop configuration for kafka-jumbo100[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/1289908 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:52:57] (03CR) 10Brouberol: [C:03+2] preseed: prepare kafka-jumbo brokers for re-imaging [puppet] - 10https://gerrit.wikimedia.org/r/1289909 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:53:01] (03CR) 10Brouberol: [C:03+2] kafka-jumbo: upgrade kafka-jumbo1010 to jdk21 [puppet] - 10https://gerrit.wikimedia.org/r/1289910 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [10:53:28] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:53:46] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host dse-k8s-worker1006.eqiad.wmnet [10:55:28] PROBLEM - BFD status on lsw1-d2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:55:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:55:55] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1288521 (https://phabricator.wikimedia.org/T424209) (owner: 10Raymond Ndibe) [10:55:58] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2010.codfw.wmnet [10:56:00] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2010.codfw.wmnet [10:56:06] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2011.codfw.wmnet [10:56:10] RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:57:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto an-presto cluster: Reboot Presto nodes [10:58:14] PROBLEM - Host cloudlb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [10:58:26] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:58:31] brouberol@cumin1003 reimage (PID 1966150) is awaiting input [10:58:52] {{done}} [10:59:42] RECOVERY - Host cloudlb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [10:59:45] !log failover Ganeti cluster in codfw to ganeti2048 [10:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1100) [11:00:28] RECOVERY - BFD status on lsw1-d2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:00:50] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2162.codfw.wmnet with reason: host reimage [11:00:50] PROBLEM - Bird Internet Routing Daemon on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:00:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:01:10] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2011.codfw.wmnet [11:01:25] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:01:31] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1010.eqiad.wmnet with OS trixie [11:01:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd-codfw [11:02:18] PROBLEM - ganeti-wconfd running on ganeti2047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:02:28] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:02:50] RECOVERY - Bird Internet Routing Daemon on 
cloudlb2002-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:03:19] jouncebot: nowandnext [11:03:19] For the next 0 hour(s) and 56 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1100) [11:03:19] In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1300) [11:03:30] Any objections to me syncing a config patch? [11:04:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11940000 (10Jclark-ctr) a:05Clement_Goubert→03Jclark-ctr @MLechvien-WMF i believe so looks like @Papaul noticed the missing part in puppet and updated both [11:04:27] !log btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes [11:04:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [11:04:58] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2162.codfw.wmnet with reason: host reimage [11:04:58] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-eqiad [11:05:14] PROBLEM - Host cloudlb2003-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:05:28] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:05:33] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1016.eqiad.wmnet with OS trixie [11:05:41] !log klausman@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2011.codfw.wmnet [11:05:42] !log klausman@cumin2002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2011.codfw.wmnet [11:05:43] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [11:05:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11940002 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie [11:05:52] RECOVERY - Host cloudlb2003-dev is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [11:05:52] !log jiji@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1249-1289,1291-1327,1375-1384].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [11:05:53] (03CR) 10CI reject: [V:04-1] hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [11:05:54] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:05:56] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7014.magru.wmnet [11:05:57] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp701[3-4].magru.wmnet} and A:cp [11:06:02] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1249-1252].eqiad.wmnet [11:06:19] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp700[7-8].magru.wmnet} and A:cp [11:06:25] RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:06:39] (03PS1) 10Majavah: bird: Create anycast-healthchecker run directory with tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1289919 (https://phabricator.wikimedia.org/T426837) [11:06:50] PROBLEM - Bird Internet Routing Daemon on cloudlb2003-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:07:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [11:07:56] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:08:02] (03CR) 10CI reject: [V:04-1] hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [11:08:17] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1249-1252].eqiad.wmnet [11:08:24] hashar andre: merging mediawiki-config patches appears to be broken due to “no composer.lock file present” in lint and test jobs [11:08:27] (03PS1) 10Btullis: [dse-k8s-wdqs-test] Stop configuring a second volume group for RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/1289921 (https://phabricator.wikimedia.org/T425653) [11:08:28] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:08:50] RECOVERY - Bird Internet Routing Daemon on cloudlb2003-dev is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:09:55] (03PS2) 10Majavah: bird: Create anycast-healthchecker run directory with tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1289919 (https://phabricator.wikimedia.org/T426837) [11:10:56] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:11:07] (03PS2) 10Btullis: [dse-k8s-wdqs-test] Stop configuring a second volume group for RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/1289921 (https://phabricator.wikimedia.org/T425653) [11:11:14] PROBLEM - Host cloudlb2004-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:11:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1055.eqiad.wmnet [11:11:28] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:12:38] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1006.eqiad.wmnet [11:12:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1006.eqiad.wmnet [11:12:44] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1007.eqiad.wmnet [11:12:44] RECOVERY - Host cloudlb2004-dev is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms [11:13:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:13:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2047.codfw.wmnet [11:13:50] PROBLEM - Bird Internet Routing Daemon on cloudlb2004-dev is 
CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:13:56] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:14:15] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8567/co" [puppet] - 10https://gerrit.wikimedia.org/r/1289919 (https://phabricator.wikimedia.org/T426837) (owner: 10Majavah) [11:14:50] RECOVERY - Bird Internet Routing Daemon on cloudlb2004-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:14:56] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:15:28] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:15:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:54] FIRING: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:16:09] RESOLVED: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:16:12] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1010.eqiad.wmnet 
with reason: host reimage [11:16:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1055.eqiad.wmnet [11:16:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet [11:16:25] FIRING: [6x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:16:40] RESOLVED: [6x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:17:14] PROBLEM - Host cloudlb1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:17:21] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1016.eqiad.wmnet with reason: host reimage [11:17:56] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7007.magru.wmnet [11:18:14] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1055 [11:18:30] PROBLEM - Host ml-staging-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:18:47] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1249-1252].eqiad.wmnet [11:18:49] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1249-1252].eqiad.wmnet [11:18:59] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1253-1256].eqiad.wmnet [11:19:06] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:19:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1055 [11:19:43] RECOVERY - Host cloudlb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:19:56] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1010.eqiad.wmnet with reason: host reimage [11:20:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1055.eqiad.wmnet to cluster codfw and group A [11:20:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1055.eqiad.wmnet to cluster codfw and group A [11:20:41] PROBLEM - Bird Internet Routing Daemon on cloudlb1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:20:54] FIRING: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:20:57] RECOVERY - Host ml-staging-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms [11:21:11] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1253-1256].eqiad.wmnet [11:21:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet [11:22:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2047.codfw.wmnet [11:22:05] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:22:19] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2162.codfw.wmnet with OS trixie [11:22:41] 
RECOVERY - Bird Internet Routing Daemon on cloudlb1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:23:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1016.eqiad.wmnet with reason: host reimage [11:24:18] PROBLEM - Host cloudlb1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:24:28] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2162: Migration of db2162.codfw.wmnet completed [11:24:56] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:25:09] RESOLVED: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2003-dev (172.20.5.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:25:42] RECOVERY - Host cloudlb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [11:25:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1055.eqiad.wmnet to cluster codfw and group A [11:25:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1055.eqiad.wmnet to cluster codfw and group A [11:27:40] PROBLEM - Bird Internet Routing Daemon on cloudlb1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:28:32] (03PS1) 10Btullis: [airflow-wikidata] - Add the new S3 credentials to extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289924 (https://phabricator.wikimedia.org/T426764) [11:28:40] RECOVERY - Bird Internet Routing Daemon on cloudlb1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:28:40] (03CR) 10CI reject: [V:04-1] [airflow-wikidata] - Add the new S3 credentials to extra_secrets 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1289924 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [11:28:44] (03PS2) 10Btullis: [airflow-wikidata] - Add the new S3 credentials to extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289924 (https://phabricator.wikimedia.org/T426764) [11:28:52] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1253-1256].eqiad.wmnet [11:28:54] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1253-1256].eqiad.wmnet [11:29:04] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1257-1260].eqiad.wmnet [11:29:13] RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:29:17] FIRING: [4x] KubernetesCalicoDown: dse-k8s-worker1006.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:30:09] FIRING: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2004-dev (172.20.5.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:30:54] RESOLVED: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2004-dev (172.20.5.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:31:16] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1257-1260].eqiad.wmnet [11:31:52] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Stop using the wrapper for 
the firewall check [cookbooks] - 10https://gerrit.wikimedia.org/r/1289927 [11:31:58] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:32:02] RESOLVED: [3x] KubernetesCalicoDown: wikikube-worker1250.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:32:26] (03PS1) 10Mszwarc: Fix UserGroupManager::getUserAutopromoteGroups with interwiki users [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289928 (https://phabricator.wikimedia.org/T426832) [11:32:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet [11:35:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet [11:36:54] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1289919 (https://phabricator.wikimedia.org/T426837) (owner: 10Majavah) [11:37:18] PROBLEM - Host ml-staging-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:38:39] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1257-1260].eqiad.wmnet [11:38:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1257-1260].eqiad.wmnet [11:38:51] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1261-1264].eqiad.wmnet [11:39:03] FIRING: KafkaBrokerUnavailable: One or more Kafka brokers unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:39:34] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host kafka-jumbo1010.eqiad.wmnet with OS trixie [11:39:36] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [11:40:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [11:40:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1016.eqiad.wmnet with OS trixie [11:40:30] RECOVERY - Host ml-staging-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [11:40:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11940208 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie completed: - rdb1016 (**PAS... [11:40:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet [11:40:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet [11:40:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [11:41:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11940212 (10Jclark-ctr) 05Open→03Resolved [11:41:08] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1261-1264].eqiad.wmnet [11:41:20] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:41:56] (03PS1) 10Brouberol: Upgrade kafka-jumbo1011 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289933 
(https://phabricator.wikimedia.org/T426835) [11:42:12] Would anyone mind if I deploy a fix for T426832? [11:42:13] T426832: userrights-interwiki fails with server error - https://phabricator.wikimedia.org/T426832 [11:42:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [11:42:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1007.eqiad.wmnet [11:43:04] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:43:26] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1258: Upgrading db1258.eqiad.wmnet [11:43:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1258: Upgrading db1258.eqiad.wmnet [11:44:03] RESOLVED: KafkaBrokerUnavailable: One or more Kafka brokers unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:45:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289928 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [11:47:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [11:47:57] cwilliams@cumin1003 major-upgrade (PID 2006056) is awaiting input [11:48:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [11:48:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet [11:48:20] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[11:49:07] (03PS1) 10Dpogorzelski: ml-serve: update kserve/knative on prod codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289935 (https://phabricator.wikimedia.org/T426823) [11:49:49] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1007.eqiad.wmnet [11:49:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1007.eqiad.wmnet [11:49:56] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1008.eqiad.wmnet [11:51:07] (03Merged) 10jenkins-bot: Fix UserGroupManager::getUserAutopromoteGroups with interwiki users [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289928 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [11:51:19] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1261-1264].eqiad.wmnet [11:51:20] Msz2001: +1 :) [11:51:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1261-1264].eqiad.wmnet [11:51:31] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1265-1268].eqiad.wmnet [11:51:34] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1289928|Fix UserGroupManager::getUserAutopromoteGroups with interwiki users (T426832)]] [11:51:36] I was having lunch but I am back now [11:51:38] T426832: userrights-interwiki fails with server error - https://phabricator.wikimedia.org/T426832 [11:51:40] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:58] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host 
dse-k8s-worker1008.eqiad.wmnet [11:52:12] PROBLEM - Host cloudlb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:52:28] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:52:39] (03PS1) 10Muehlenhoff: ganeti: Remove validate-ganeti-firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289936 [11:52:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet [11:53:32] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1289928|Fix UserGroupManager::getUserAutopromoteGroups with interwiki users (T426832)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:53:42] RECOVERY - Host cloudlb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [11:54:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1265-1268].eqiad.wmnet [11:54:26] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [11:54:28] (03PS6) 10Daniel Kinzler: Move Makefiles to standard location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) [11:54:38] (03CR) 10Daniel Kinzler: Move Makefiles to standard location (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [11:54:44] PROBLEM - Host dse-k8s-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:50] PROBLEM - Bird Internet Routing Daemon on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:54:52] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:55:28] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:55:30] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [11:55:32] RECOVERY - Host dse-k8s-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms [11:55:47] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [11:55:50] RECOVERY - Bird Internet Routing Daemon on cloudlb2002-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:55:52] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:55:55] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [11:57:48] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-04-24 - 2026-05-15), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11940270 (10kostajh) 05Open→03Resolved [11:58:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet [11:58:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet [11:58:38] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289928|Fix UserGroupManager::getUserAutopromoteGroups with interwiki users (T426832)]] (duration: 07m 04s) [11:58:42] T426832: userrights-interwiki fails with server error - https://phabricator.wikimedia.org/T426832 [11:59:23] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7008.magru.wmnet [11:59:23] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp700[7-8].magru.wmnet} and A:cp [12:00:55] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:01:03] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1011 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289933 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [12:01:32] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1265-1268].eqiad.wmnet [12:01:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1265-1268].eqiad.wmnet [12:01:37] (03CR) 10Elukey: [C:03+1] [dse-k8s-wdqs-test] Stop configuring a second volume group for RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/1289921 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [12:01:44] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1269-1272].eqiad.wmnet [12:02:02] FIRING: [4x] KubernetesCalicoDown: wikikube-worker1261.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:02:45] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1011.eqiad.wmnet with OS trixie [12:03:01] hashar: I’m going to backport a wmf.3 patch, ok? 
[12:04:14] (03PS1) 10Kosta Harlan: Revert "ApiEditPage: Update request in main context before calling attemptSave()" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289938 (https://phabricator.wikimedia.org/T426751) [12:04:30] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1269-1272].eqiad.wmnet [12:04:31] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp701[5-6].magru.wmnet} and A:cp [12:04:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289938 (https://phabricator.wikimedia.org/T426751) (owner: 10Kosta Harlan) [12:04:59] !log btullis@cumin1003 END (FAIL) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=99) rolling reboot on A:cephosd-eqiad [12:05:17] RESOLVED: [4x] KubernetesCalicoDown: wikikube-worker1261.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:05:42] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1008.eqiad.wmnet [12:05:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1008.eqiad.wmnet [12:05:49] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1009.eqiad.wmnet [12:06:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet [12:06:42] kostajh: apologies for the question, do you know if there's a task yet for the error in test/lint for mediawiki-config patches? 
[12:06:48] at first glance it looks like some version of T416518 [12:06:48] T416518: Disable Composer 2.9 functionality to randomly block existing configurations from working - https://phabricator.wikimedia.org/T416518 [12:06:57] (or rather, what T416518 would aim to prevent, IIUC) [12:09:14] A_smart_kitten: I don’t know yet, I didn’t get around to looking yet [12:09:35] hashar: what’s the easiest way to restart the tests when there’s a single flaky test on a build? e.g. for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1289938 . Do I need to abandon/restore? [12:09:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2162: Migration of db2162.codfw.wmnet completed [12:09:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:10:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet [12:11:04] kostajh: that's in a test job, so 'recheck' will work just fine [12:11:22] (03CR) 10Brouberol: [C:03+1] [airflow-wikidata] - Add the new S3 credentials to extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289924 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [12:11:38] although i don't think that `test` failure will affect `gate-and-submit` passing, so not really needed even [12:11:54] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1269-1272].eqiad.wmnet [12:11:56] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1269-1272].eqiad.wmnet [12:12:02] FIRING: [7x] KubernetesCalicoDown: dse-k8s-worker1008.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:12:06] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1273-1276].eqiad.wmnet [12:12:07] PROBLEM - Host 
aux-k8s-etcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:29] (03CR) 10CI reject: [V:04-1] Revert "ApiEditPage: Update request in main context before calling attemptSave()" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289938 (https://phabricator.wikimedia.org/T426751) (owner: 10Kosta Harlan) [12:12:38] yup `recheck` would do [12:13:30] and lookup whether the tests has a Phabricator task, else create one for it [12:14:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1273-1276].eqiad.wmnet [12:14:51] but recheck needs to wait for everything to finish first [12:15:03] whereas I would like to have a short circuit (similar to pressing “rebase” on a patch) [12:15:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes [12:15:09] there is no way to rerun a single test [12:15:17] RESOLVED: [7x] KubernetesCalicoDown: dse-k8s-worker1008.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:15:18] I mean to restart all of them [12:15:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289938 (https://phabricator.wikimedia.org/T426751) (owner: 10Kosta Harlan) [12:15:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet [12:15:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet [12:15:59] RECOVERY - Host aux-k8s-etcd2005 is UP: PING OK - Packet loss = 0%, RTA = 30.81 ms [12:16:00] e.g. I pressed +2 on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1289938 at 2:04, at 2:05 I could see that one of the jobs failed due to being unable to clone a repo. 
I want to restart the whole process. But I don’t see how to do that, other than abandoning and restoring the patch [12:16:06] ah yeah then the short circuit is to rebase/amend the commit message, that would create a new patchset and Zuul would cancel all the jobs [12:16:22] Ok, I’ll just do a simple commit message change next time, good point [12:16:37] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7015.magru.wmnet [12:17:02] then in this case that is the job in the `test` pipeline that failed [12:17:09] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1011.eqiad.wmnet with reason: host reimage [12:17:13] and the change is already in `gate-and-submit` so I guess it will be merged [12:17:21] (03Merged) 10jenkins-bot: Revert "ApiEditPage: Update request in main context before calling attemptSave()" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289938 (https://phabricator.wikimedia.org/T426751) (owner: 10Kosta Harlan) [12:17:25] yeah it did [12:17:46] cause all jobs passed in `gate-and-submit`, then Zuul votes Verified +1 (which erases the previously set V-1) [12:17:48] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1289938|Revert "ApiEditPage: Update request in main context before calling attemptSave()" (T426751)]] [12:17:52] T426751: Stuck in FancyCaptcha challenge loop on VisualEditor - https://phabricator.wikimedia.org/T426751 [12:19:42] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1289938|Revert "ApiEditPage: Update request in main context before calling attemptSave()" (T426751)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:19:46] kostajh et al (FYI), filed T426845 [12:19:46] T426845: mediawiki-config CI blocked by Composer automatic-security-blocking (PKSA-v5yj-8nmz-sk2q, PKSA-ft77-7h5f-p3r6, PKSA-b14r-zh1d-vdrc) - https://phabricator.wikimedia.org/T426845 [12:20:55] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:47] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1273-1276].eqiad.wmnet [12:21:49] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1273-1276].eqiad.wmnet [12:22:00] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1277-1280].eqiad.wmnet [12:22:02] FIRING: [7x] KubernetesCalicoDown: dse-k8s-worker1008.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:22:14] !log kharlan@deploy1003 kharlan: Continuing with deployment [12:23:52] A_smart_kitten: thanks [12:23:59] hashar: any ideas for unbreaking T426845 ? 
[12:24:15] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1277-1280].eqiad.wmnet [12:25:13] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1011.eqiad.wmnet with reason: host reimage [12:25:17] RESOLVED: [6x] KubernetesCalicoDown: wikikube-worker1270.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:25:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet [12:26:26] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289938|Revert "ApiEditPage: Update request in main context before calling attemptSave()" (T426751)]] (duration: 08m 37s) [12:26:29] T426751: Stuck in FancyCaptcha challenge loop on VisualEditor - https://phabricator.wikimedia.org/T426751 [12:26:35] !log taavi@cumin1003 START - Cookbook sre.hosts.reboot-single for host clouddumps1002.wikimedia.org [12:27:21] (03PS1) 10Elukey: sre.hosts.provision: enable IOMMU for multi-GPU Supermicro hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1289942 (https://phabricator.wikimedia.org/T421461) [12:29:06] kostajh: ah that is composer 2.x breaking on security update [12:29:31] that ties our build to a changing external entity (a security advisory being released in packagist.org) [12:31:04] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:31:07] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [12:31:40] jmm@cumin2002 drain-node (PID 3664581) is awaiting input [12:31:45] FWIW my idea would be a patch to bump symfony/yaml in mediawiki-config's composer.json to 7.4.12 (which FWICS seems to be the 7.* version of that package that isn't marked as being vulnerable). 
i'm afraid i'm not personally able to look much into that possibility at the moment though [12:32:14] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:33:10] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:33:13] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [12:33:31] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:33:42] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1258: Upgrading db1258.eqiad.wmnet [12:33:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1258: Upgrading db1258.eqiad.wmnet [12:34:23] PROBLEM - Host ml-serve1015 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:47] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1277-1280].eqiad.wmnet [12:34:49] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1277-1280].eqiad.wmnet [12:34:59] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1281-1284].eqiad.wmnet [12:35:47] !log taavi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1002.wikimedia.org [12:35:51] (03CR) 10Klausman: [C:03+1] sre.hosts.provision: enable IOMMU for multi-GPU Supermicro hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1289942 (https://phabricator.wikimedia.org/T421461) (owner: 10Elukey) [12:35:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1009.eqiad.wmnet [12:35:55] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
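The bump suggested at 12:31:45 comes down to checking whether the pinned symfony/yaml release is older than 7.4.12, the version described as not being marked vulnerable. A minimal sketch of that comparison, assuming plain dotted numeric versions with an optional leading "v"; the helper names and the example pinned version are illustrative assumptions, and real Composer constraints (ranges, stability flags) are richer than this.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v7.4.12' or '7.4.12' into (7, 4, 12) so parts compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def needs_bump(pinned: str, safe_version: str) -> bool:
    """True when the pinned version is older than the version believed safe."""
    return parse_version(pinned) < parse_version(safe_version)


# Hypothetical pinned version; 7.4.12 is the target named in the discussion.
print(needs_bump("v7.4.6", "v7.4.12"))
print(needs_bump("7.4.12", "7.4.12"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "7.4.6" sorts after "7.4.12", which would wrongly report that no bump is needed.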
[12:36:04] !log taavi@cumin1003 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org [12:36:48] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1258.eqiad.wmnet with OS trixie [12:37:06] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "(CI is blocked by T426845)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [12:37:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet [12:37:27] kostajh: I replied https://phabricator.wikimedia.org/T426845#11940461 and reached the same conclusion as A_smart_kitten : simply upgrade symfony/yaml in mediawiki-config, it is just a dev requirement [12:37:31] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:37:46] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1281-1284].eqiad.wmnet [12:39:15] PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:17] (03PS1) 10Lucas Werkmeister (WMDE): Disable Composer audit.block-insecure option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289943 (https://phabricator.wikimedia.org/T416518) [12:40:09] hashar, kostajh: I went for the big hammer anyway ^ [12:40:29] (03CR) 10Majavah: [C:03+1] Disable Composer audit.block-insecure option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289943 (https://phabricator.wikimedia.org/T416518) (owner: 10Lucas Werkmeister (WMDE)) [12:40:33] I don’t object to upgrading symfony/yaml, but also, letting third parties break our deployments at arbitrary times sounds like madness to me [12:40:47] given the repo only has require-dev dependencies, I think it is fine to disable the audit thing indeed [12:40:53] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host 
ml-serve1015.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:40:59] (at least in mediawiki/core, it was usually only breaking merges, not immediate deployments – except on backport branches I guess) [12:41:01] RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 32.06 ms [12:41:01] (03CR) 10Ssingh: [C:03+1] ml-serve(grpc): step 1, etcd data for DNS Discovery [puppet] - 10https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:41:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:42:02] FIRING: [4x] KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:42:09] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1011.eqiad.wmnet with OS trixie [12:42:23] Lucas_WMDE: thanks [12:42:28] (03CR) 10Hashar: [C:03+2] Disable Composer audit.block-insecure option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289943 (https://phabricator.wikimedia.org/T416518) (owner: 10Lucas Werkmeister (WMDE)) [12:42:31] +2 ed [12:42:31] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1009.eqiad.wmnet [12:42:31] CI is green yay [12:42:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1009.eqiad.wmnet [12:42:32] \o/ [12:42:33] thanks [12:42:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet [12:42:39] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1010.eqiad.wmnet [12:42:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti 
node ganeti2039.codfw.wmnet [12:42:52] (03CR) 10A smart kitten: "Possibly a silly question, bear with me... what will take care of automatically updating vulnerable composer dependencies in the `mediawik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289943 (https://phabricator.wikimedia.org/T416518) (owner: 10Lucas Werkmeister (WMDE)) [12:42:58] and if you had a deployment going on you can +2 it again (it would be enqueued behind that fix) [12:43:01] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289944 [12:43:02] then scap sync both [12:43:10] s/scap sync/scap deploy/ [12:43:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1010.eqiad.wmnet [12:43:24] (03Merged) 10jenkins-bot: Disable Composer audit.block-insecure option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289943 (https://phabricator.wikimedia.org/T416518) (owner: 10Lucas Werkmeister (WMDE)) [12:44:23] hashar: I’m not deploying anything at the moment [12:44:25] I think kostajh was? 
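The change merged at 12:43:24 disables Composer's audit.block-insecure option for mediawiki-config. A rough sketch of what such a change amounts to, assuming the option is stored under config.audit in composer.json as its dotted name suggests; the helper name and the sample file contents are hypothetical, and this is not the actual patch.

```python
import json


def disable_block_insecure(composer_json_text: str) -> str:
    """Return composer.json text with config.audit.block-insecure set to false."""
    data = json.loads(composer_json_text)
    # Create the nested "config" and "audit" sections only if they are missing,
    # so any existing settings are left untouched.
    audit = data.setdefault("config", {}).setdefault("audit", {})
    audit["block-insecure"] = False
    return json.dumps(data, indent=4) + "\n"


# Hypothetical minimal composer.json with a single dev requirement.
sample = '{"require-dev": {"symfony/yaml": "7.4.6"}}'
print(disable_block_insecure(sample))
```

This only switches off the blocking behaviour; as raised in the review comment at 12:42:52, something else still has to update or flag vulnerable dependencies.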
[12:44:31] !log taavi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1001.wikimedia.org [12:44:42] Lucas_WMDE: I can deploy the config patch for Wikibase + hCaptcha if you like [12:44:57] (03CR) 10Hashar: "recheck after having disabled composer audit I178d54cedc007cb6f9cbbc487eec2384ede717ad / T426845" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [12:45:21] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1281-1284].eqiad.wmnet [12:45:23] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1281-1284].eqiad.wmnet [12:45:33] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1285-1288].eqiad.wmnet [12:45:34] (03PS1) 10Brouberol: Upgrade kafka-jumbo1012 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289945 (https://phabricator.wikimedia.org/T426835) [12:45:35] yup https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1289902 [12:45:36] (03PS1) 10Brouberol: Upgrade kafka-jumbo1013 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289946 (https://phabricator.wikimedia.org/T426835) [12:45:38] (03PS1) 10Brouberol: Upgrade kafka-jumbo1014 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289947 (https://phabricator.wikimedia.org/T426835) [12:45:40] (03PS1) 10Brouberol: Upgrade kafka-jumbo1015 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289948 (https://phabricator.wikimedia.org/T426835) [12:45:42] (03PS1) 10Brouberol: Upgrade kafka-jumbo1016 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289949 (https://phabricator.wikimedia.org/T426835) [12:45:43] Ok, I’ll go ahead with that one [12:45:46] (03PS1) 10Brouberol: Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) [12:45:50] (03PS1) 
10Brouberol: Upgrade kafka-jumbo1018 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289951 (https://phabricator.wikimedia.org/T426835) [12:45:52] ok, thanks [12:45:54] (03CR) 10Lucas Werkmeister (WMDE): "[Search](https://gerrit.wikimedia.org/r/q/project:operations/mediawiki-config+author:tools.libraryupgrader@tools.wmflabs.org) suggests Lib" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289943 (https://phabricator.wikimedia.org/T416518) (owner: 10Lucas Werkmeister (WMDE)) [12:45:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [12:46:17] (03CR) 10A smart kitten: "s/automatically updating/automatically updating (or notifying folks about)/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289943 (https://phabricator.wikimedia.org/T416518) (owner: 10Lucas Werkmeister (WMDE)) [12:46:32] (03CR) 10Ssingh: [C:03+1] ml-serve(grpc): step 2, add entry to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:46:41] (03CR) 10Ssingh: [C:03+1] ml-serve(grpc): step 3, add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:47:02] FIRING: [8x] KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:47:51] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1285-1288].eqiad.wmnet [12:47:53] RECOVERY - Host ml-serve1015 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [12:48:18] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db[2183-2184].codfw.wmnet with reason: 
restart [12:48:29] (03CR) 10CI reject: [V:04-1] Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [12:49:44] (03Merged) 10jenkins-bot: hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289902 (https://phabricator.wikimedia.org/T426829) (owner: 10Kosta Harlan) [12:49:56] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1010.eqiad.wmnet [12:49:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1010.eqiad.wmnet [12:50:03] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1011.eqiad.wmnet [12:50:25] (03PS1) 10Reedy: composer.json: Upgrading symfony/yaml (v7.4.6 => v7.4.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289953 (https://phabricator.wikimedia.org/T426845) [12:51:13] (03PS2) 10Brouberol: Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) [12:52:02] RESOLVED: [6x] KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:52:14] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1289902|hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers (T426829)]] [12:52:18] T426829: New users unable to create Wikidata items: Incorrect or missing CAPTCHA - https://phabricator.wikimedia.org/T426829 [12:52:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet [12:53:19] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1258.eqiad.wmnet with reason: host reimage [12:54:10] !log kharlan@deploy1003 kharlan: 
Backport for [[gerrit:1289902|hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers (T426829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:55:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1011.eqiad.wmnet [12:55:45] !log fnegri@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1015.eqiad.wmnet with OS trixie [12:55:48] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1285-1288].eqiad.wmnet [12:55:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1285-1288].eqiad.wmnet [12:55:55] (03CR) 10Ssingh: ml-serve(grpc): step 2, add entry to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:56:00] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1289,1291-1293].eqiad.wmnet [12:56:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet [12:56:42] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks for patching it. 
As with all-things-Bird, consider a controlled rollout with Puppet disabled on C:bird, even though thi" [puppet] - 10https://gerrit.wikimedia.org/r/1289919 (https://phabricator.wikimedia.org/T426837) (owner: 10Majavah) [12:57:29] !log kharlan@deploy1003 kharlan: Continuing with deployment [12:57:43] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet [12:57:57] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1015.eqiad.wmnet [12:57:58] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1015.eqiad.wmnet [12:58:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1289,1291-1293].eqiad.wmnet [12:58:20] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp7016.magru.wmnet [12:58:20] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp701[5-6].magru.wmnet} and A:cp [12:58:24] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289944 (owner: 10Muehlenhoff) [12:58:59] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1258.eqiad.wmnet with reason: host reimage [12:59:38] (03PS4) 10JHathaway: mariadb: Migrate mariadb internal ferm rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) [12:59:42] (03CR) 10Ladsgroup: [C:03+2] mariadb: Migrate mariadb internal ferm rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [12:59:45] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Migrate mariadb internal ferm rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) 
[12:59:59] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [13:00:03] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1300). Please do the needful. [13:00:03] stephanebisson, codders, and codenamenoreste: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:13] o/ [13:00:51] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-ctrl100[45] implementation tracking - https://phabricator.wikimedia.org/T418920#11940588 (10MLechvien-WMF) a:03jasmine_ [13:01:07] (03CR) 10Gehel: [C:03+1] Upgrade kafka-jumbo1012 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289945 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:01:19] (03CR) 10Gehel: [C:03+1] Upgrade kafka-jumbo1013 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289946 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:01:27] I see there's a config change being deployed atm. I can do my patch when this is done. 
[13:01:29] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1011.eqiad.wmnet [13:01:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1011.eqiad.wmnet [13:01:31] (03CR) 10Gehel: [C:03+1] Upgrade kafka-jumbo1014 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289947 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:01:36] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1012.eqiad.wmnet [13:01:40] (03CR) 10Gehel: [C:03+1] Upgrade kafka-jumbo1015 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289948 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:01:41] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289902|hCaptcha: Exempt Wikibase entity namespaces from edit/create triggers (T426829)]] (duration: 09m 31s) [13:01:45] T426829: New users unable to create Wikidata items: Incorrect or missing CAPTCHA - https://phabricator.wikimedia.org/T426829 [13:01:47] (03CR) 10Gehel: [C:03+1] Upgrade kafka-jumbo1016 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289949 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:01:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet [13:01:52] (03CR) 10Gehel: [C:03+1] Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:01:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet [13:01:59] (03CR) 10Gehel: [C:03+1] Upgrade kafka-jumbo1018 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289951 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:02:21] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on 
P{cp403[7-8].ulsfo.wmnet} and A:cp [13:02:40] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:02:42] also fine for me to wait for a bit [13:03:05] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1012 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289945 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:03:16] (03CR) 10Ayounsi: [C:03+1] sre.ganeti.addnode: Stop using the wrapper for the firewall check [cookbooks] - 10https://gerrit.wikimedia.org/r/1289927 (owner: 10Muehlenhoff) [13:03:40] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host dse-k8s-worker1012.eqiad.wmnet [13:03:51] I’m done with my deploy [13:04:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289446 (https://phabricator.wikimedia.org/T422146) (owner: 10Sbisson) [13:04:13] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "root@db2168:~# diff /tmp/old /tmp/new" [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [13:04:51] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1012.eqiad.wmnet with OS trixie [13:05:23] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:59] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:06:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [13:06:37] o/ [13:06:43] I missed the beginning of the window, sorry [13:06:56] stephanebisson: I think you can go ahead now? 
[13:07:10] think that's already running [13:07:11] Lucas_WMDE I have just started [13:07:16] (03Merged) 10jenkins-bot: Log editing_start and article_saved events for control group [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289446 (https://phabricator.wikimedia.org/T422146) (owner: 10Sbisson) [13:07:16] ok thanks [13:07:27] (03CR) 10Muehlenhoff: [C:03+2] sre.ganeti.addnode: Stop using the wrapper for the firewall check [cookbooks] - 10https://gerrit.wikimedia.org/r/1289927 (owner: 10Muehlenhoff) [13:07:30] (03PS3) 10Brouberol: Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) [13:07:30] (03PS2) 10Brouberol: Upgrade kafka-jumbo1018 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289951 (https://phabricator.wikimedia.org/T426835) [13:07:30] (03PS1) 10Brouberol: Set JDK21 as default for all kafka-jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/1289959 (https://phabricator.wikimedia.org/T426835) [13:07:42] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1289446|Log editing_start and article_saved events for control group (T422146)]] [13:07:46] T422146: Experiment config and schema registration (Article Guidance initial intervention) - https://phabricator.wikimedia.org/T422146 [13:08:33] Lucas_WMDE: for my patch, I can't test beta because my IP is blocked (currently tethering, mobile). Do you have an IP that works for beta? 
[13:08:35] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:08:50] (03CR) 10CI reject: [V:04-1] Upgrade kafka-jumbo1017 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289950 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:08:59] (03PS1) 10Marostegui: wmf-pt-kill: New trixie version [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/1289960 (https://phabricator.wikimedia.org/T426842) [13:09:38] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1289446|Log editing_start and article_saved events for control group (T422146)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:09:48] (03PS10) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [13:10:47] codders: yes, but mwdebug testing doesn’t work on beta anyways [13:10:51] so that test would only be after the deployment is done [13:10:54] ah. okay [13:10:54] (03CR) 10FNegri: [C:03+1] wmf-pt-kill: New trixie version [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/1289960 (https://phabricator.wikimedia.org/T426842) (owner: 10Marostegui) [13:11:21] (03CR) 10Filippo Giunchedi: "I was on the fence too about ruamel specifically, I changed the dependency to pyyaml instead. 
It means output files won't be exactly the s" [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [13:11:27] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:11:37] (03CR) 10Marostegui: [V:03+2 C:03+2] wmf-pt-kill: New trixie version [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/1289960 (https://phabricator.wikimedia.org/T426842) (owner: 10Marostegui) [13:11:47] jmm@cumin2002 drain-node (PID 3691683) is awaiting input [13:12:11] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4037.ulsfo.wmnet [13:13:03] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:14:53] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1012.eqiad.wmnet [13:14:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1012.eqiad.wmnet [13:15:01] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1013.eqiad.wmnet [13:15:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1013.eqiad.wmnet [13:15:37] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289446|Log editing_start and article_saved events for control group (T422146)]] (duration: 07m 55s) [13:15:42] T422146: Experiment config and schema registration (Article Guidance initial intervention) - https://phabricator.wikimedia.org/T422146 [13:15:44] I'm done [13:15:53] FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:00] (03CR) 10Federico Ceratto: [C:03+2] hiera: Revert changes on test-s4 master [puppet] - 
10https://gerrit.wikimedia.org/r/1249276 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto) [13:16:01] nice. should I? Or will you Lucas? [13:16:12] codders: do you have deployment access? I forget [13:16:17] yeah, I can do it [13:16:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [13:16:22] okay, go ahead! [13:16:40] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [13:17:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arthurtaylor@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [13:17:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1055.eqiad.wmnet to cluster codfw and group A [13:17:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1055.eqiad.wmnet to cluster codfw and group A [13:17:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1258.eqiad.wmnet with OS trixie [13:17:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1055.eqiad.wmnet to cluster eqiad and group A [13:17:56] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [13:18:10] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.update-replication (exit_code=97) [13:18:15] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [13:18:19] (03Merged) 10jenkins-bot: Disable support for PHP-serialized EntityData on Beta / Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289736 (https://phabricator.wikimedia.org/T98035) (owner: 10Arthur taylor) [13:18:26] !log root@cumin1003 START - Cookbook sre.hosts.reimage for host db2183.codfw.wmnet with OS trixie [13:18:32] !log jmm@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/thumbor: apply [13:18:36] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [13:18:45] !log arthurtaylor@deploy1003 Started scap sync-world: Backport for [[gerrit:1289736|Disable support for PHP-serialized EntityData on Beta / Test Wikidata (T98035)]] [13:18:49] !log root@cumin1003 START - Cookbook sre.hosts.move-vlan for host db2183 [13:18:49] T98035: [Task] Drop support for php-serialized output from Special:EntityData - https://phabricator.wikimedia.org/T98035 [13:18:56] PROBLEM - Host wikikube-worker1291 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:00] FIRING: [5x] KubernetesCalicoDown: dse-k8s-worker1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:19:15] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1012.eqiad.wmnet with reason: host reimage [13:19:15] (03PS1) 10CWilliams: sre.mysql.pool: Add support for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) [13:19:21] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1258: Migration of db1258.eqiad.wmnet completed [13:19:35] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [13:19:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [13:19:57] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [13:20:06] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [13:20:36] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289967 [13:20:43] !log arthurtaylor@deploy1003 arthurtaylor: Backport for [[gerrit:1289736|Disable support for PHP-serialized EntityData on Beta / Test Wikidata (T98035)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:20:55] FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:22] jmm@cumin2002 addnode (PID 3700004) is awaiting input [13:21:34] seems okay to me on test.wikidata.org [13:21:39] proceeding [13:21:54] root@cumin1003 reimage (PID 2045728) is awaiting input [13:22:00] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: Add support for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: 10CWilliams) [13:22:01] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1013.eqiad.wmnet [13:22:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1013.eqiad.wmnet [13:22:07] !log arthurtaylor@deploy1003 arthurtaylor: Continuing with deployment [13:22:09] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1014.eqiad.wmnet [13:22:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [13:22:35] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [13:22:37] !log root@cumin1003 START - Cookbook sre.dns.netbox [13:22:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [13:22:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [13:22:49] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [13:22:58] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [13:23:21] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on kafka-jumbo1012.eqiad.wmnet with reason: host reimage [13:24:15] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host dse-k8s-worker1014.eqiad.wmnet [13:24:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [13:25:25] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [13:26:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1055.eqiad.wmnet to cluster eqiad and group A [13:26:14] !log arthurtaylor@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289736|Disable support for PHP-serialized EntityData on Beta / Test Wikidata (T98035)]] (duration: 07m 26s) [13:26:18] T98035: [Task] Drop support for php-serialized output from Special:EntityData - https://phabricator.wikimedia.org/T98035 [13:26:30] k. done. 
Seems to work on test.wikidata.org [13:26:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11940802 (10MoritzMuehlenhoff) [13:26:55] !log root@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host db2183 - root@cumin1003" [13:27:01] !log root@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host db2183 - root@cumin1003" [13:27:01] !log root@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:01] !log root@cumin1003 START - Cookbook sre.dns.wipe-cache db2183.codfw.wmnet 6.0.192.10.in-addr.arpa 6.0.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:27:05] !log root@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2183.codfw.wmnet 6.0.192.10.in-addr.arpa 6.0.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:27:06] !log root@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2183 [13:27:17] yay [13:27:30] thanks for your support Lucas_WMDE! [13:27:31] I’ll wait a few minutes to see if codenamenoreste shows up [13:28:22] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] composer.json: Upgrading symfony/yaml (v7.4.6 => v7.4.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289953 (https://phabricator.wikimedia.org/T426845) (owner: 10Reedy) [13:28:29] Reedy: wanna deploy ^ now? [13:28:43] Can do... 
[13:28:54] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1012.eqiad.wmnet [13:29:31] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [13:29:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [13:30:10] root@cumin1003 reimage (PID 2045728) is awaiting input [13:30:24] 10SRE-SLO: Sloth dashboard performance improvement - https://phabricator.wikimedia.org/T425564#11940847 (10tappof) The Sloth dashboard is already using rows for SLO services. Nested rows, which would allow forecasting panels to be computed on demand, will be available starting with Grafana 13. In the meantime, I... [13:30:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1056.eqiad.wmnet [13:31:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [13:31:43] !log root@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2183 [13:31:43] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host db2183 [13:34:18] (03CR) 10Reedy: [C:03+2] composer.json: Upgrading symfony/yaml (v7.4.6 => v7.4.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289953 (https://phabricator.wikimedia.org/T426845) (owner: 10Reedy) [13:34:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [13:35:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudelastic1012.eqiad.wmnet [13:35:14] (03Merged) 10jenkins-bot: composer.json: Upgrading symfony/yaml (v7.4.6 => v7.4.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289953 (https://phabricator.wikimedia.org/T426845) (owner: 10Reedy) [13:35:58] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1056 [13:36:26] !log 
reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1289953|composer.json: Upgrading symfony/yaml (v7.4.6 => v7.4.12) (T426845)]] [13:36:29] T426845: mediawiki-config CI blocked by Composer automatic-security-blocking (PKSA-v5yj-8nmz-sk2q, PKSA-ft77-7h5f-p3r6, PKSA-b14r-zh1d-vdrc) - https://phabricator.wikimedia.org/T426845 [13:36:47] oh look, vendor [13:37:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1056 [13:37:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [13:38:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1056.eqiad.wmnet [13:38:24] !log reedy@deploy1003 reedy: Backport for [[gerrit:1289953|composer.json: Upgrading symfony/yaml (v7.4.6 => v7.4.12) (T426845)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:38:28] !log installing krb5 security updates [13:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:55] !log reedy@deploy1003 reedy: Continuing with deployment [13:40:13] Reedy: judging by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/513358 it’s used for “parsing event schemas used by EventBus” [13:40:25] so, might be limited to trusted input [13:40:34] (probably still a good idea to upgrade it in core+vendor) [13:40:35] AFAIK Translate uses it at least too [13:40:48] thank you for the composer / yaml fixes! 
[13:41:18] Reedy: https://codesearch.wmcloud.org/deployed/?q=Symfony%5C%5CComponent%5C%5CYaml&files=&excludeFiles=&repos= shows no Translate (but core, DonationInterface, WikiLambda, CirrusSearch) [13:41:41] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1014.eqiad.wmnet [13:41:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1014.eqiad.wmnet [13:41:48] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1015.eqiad.wmnet [13:41:53] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1012.eqiad.wmnet with OS trixie [13:41:56] Thought it did at one point... [13:42:01] Maybe it did, and was removed for other reasons [13:42:17] in theory (judging by the task description of T416518), shouldn't libup be able to update the dep in core? /genq [13:42:17] T416518: Disable Composer 2.9 functionality to randomly block existing configurations from working - https://phabricator.wikimedia.org/T416518 [13:42:18] (03CR) 10Ssingh: [C:03+1] "🚢 it!" [puppet] - 10https://gerrit.wikimedia.org/r/1275750 (https://phabricator.wikimedia.org/T415454) (owner: 10Slyngshede) [13:42:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11940907 (10Marostegui) [13:42:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [13:42:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1015.eqiad.wmnet [13:42:34] (03CR) 10Slyngshede: [V:03+1 C:03+2] R:cache::text enable TCP Fast Open [puppet] - 10https://gerrit.wikimedia.org/r/1275750 (https://phabricator.wikimedia.org/T415454) (owner: 10Slyngshede) [13:42:44] A_smart_kitten: I expect so, I guess it just hasn’t run yet? 
this vuln dropped today right [13:42:49] yep [13:42:54] I'm not sure we have it run on vendor either [13:43:03] Reedy: there’s romaricdrigon/metayaml, maybe that’s what you remembered [13:43:05] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289953|composer.json: Upgrading symfony/yaml (v7.4.6 => v7.4.12) (T426845)]] (duration: 06m 39s) [13:43:08] (git log -S Symfony yielded nothing) [13:43:10] T426845: mediawiki-config CI blocked by Composer automatic-security-blocking (PKSA-v5yj-8nmz-sk2q, PKSA-ft77-7h5f-p3r6, PKSA-b14r-zh1d-vdrc) - https://phabricator.wikimedia.org/T426845 [13:43:13] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1011.eqiad.wmnet [13:43:17] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1014.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:43:18] i suppose i would just like to see that it definitely works :p [13:43:39] (given that that was a reason for being able to safely disable auto-security-blocking in core) [13:43:45] I still don’t see codenamenoreste around, so I’ll close the window [13:43:49] !log UTC afternoon backport+config window done [13:43:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [13:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:00] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-05-18-230044 to 2026-05-19-171108 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289974 [13:44:00] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-05-19-145724 to 2026-05-19-223625 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289975 (https://phabricator.wikimedia.org/T426409) [13:44:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:44:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet [13:44:33] (03CR) 10Herron: [C:03+2] grafana-dashboard-reporter: initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1286507 (https://phabricator.wikimedia.org/T425795) (owner: 10Herron) [13:45:19] slyngs: ready for me to puppet merge multiple? [13:45:25] Please do [13:45:27] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:45:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 252 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1345, relocating_shards: 0, initializing_shards: 25, unassigned_shar [13:45:45] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.22041327489042 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:45:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 252 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1345, 
relocating_shards: 0, initializing_shards: 25, unassigned_shar [13:45:45] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.22041327489042 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:45:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 252 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1345, relocating_shards: 0, initializing_shards: 25, unassigned_shar [13:45:45] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.22041327489042 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:45:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 252 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1345, relocating_shards: 0, initializing_shards: 25, unassigned_shar [13:45:46] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.22041327489042 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:45:53] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 251 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, 
discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1346, relocating_shards: 0, initializing_shards: 25, unassigned_shar [13:45:53] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.28303068252974 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:46:15] herron: If you could let me know when it's merge. I need to test some stuff :-) [13:46:20] done! [13:46:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1395, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 190, delayed_unassig [13:46:45] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.2420262664165 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:46:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1395, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 190, delayed_unassig [13:46:45] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.2420262664165 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:46:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status 
cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1395, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 190, delayed_unassig [13:46:45] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.2420262664165 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:46:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1395, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 190, delayed_unassig [13:46:46] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.2420262664165 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:46:53] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1450, relocating_shards: 0, initializing_shards: 19, unassigned_shards: 130, delayed_unassig [13:46:53] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 90.6816760475297 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:47:09] (03CR) 10Kamila Součková: [C:03+1] "LGTM" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [13:47:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet [13:48:07] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1013 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289946 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [13:48:10] herron: Thank you very much :-) [13:48:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [13:48:23] slyngs: np! [13:48:44] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1015.eqiad.wmnet [13:48:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1015.eqiad.wmnet [13:48:51] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1016.eqiad.wmnet [13:49:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1016.eqiad.wmnet [13:49:33] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1013.eqiad.wmnet with OS trixie [13:49:38] !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2183.codfw.wmnet with reason: host reimage [13:49:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudelastic1011.eqiad.wmnet [13:50:47] (03PS7) 10Herron: grafana: add dashboard reporter plugin [puppet] - 10https://gerrit.wikimedia.org/r/1286986 [13:52:15] (03CR) 10Herron: [C:03+2] grafana: add dashboard reporter plugin [puppet] - 10https://gerrit.wikimedia.org/r/1286986 (owner: 10Herron) [13:52:17] (03PS1) 10Mszwarc: Fix newFromUserIdentity calls with interwiki users [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) [13:52:29] !log 
btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes [13:53:26] jmm@cumin2002 drain-node (PID 3720268) is awaiting input [13:53:33] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2183.codfw.wmnet with reason: host reimage [13:53:55] !log slyngshede@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4038.ulsfo.wmnet [13:53:55] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp403[7-8].ulsfo.wmnet} and A:cp [13:54:43] (03CR) 10Mszwarc: "Unless somebody deploys this patch earlier, I'll do it EU evening" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [13:55:36] (03CR) 10Btullis: [C:03+2] [airflow-wikidata] - Add the new S3 credentials to extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289924 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [13:55:46] (03CR) 10Btullis: [C:03+2] [dse-k8s-wdqs-test] Stop configuring a second volume group for RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/1289921 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [13:56:07] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1016.eqiad.wmnet [13:56:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1016.eqiad.wmnet [13:56:13] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1017.eqiad.wmnet [13:56:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1017.eqiad.wmnet [13:57:48] (03Merged) 10jenkins-bot: [airflow-wikidata] - Add the new S3 credentials to extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289924 (https://phabricator.wikimedia.org/T426764) (owner: 
10Btullis) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1400) [14:00:21] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd100[4-5].eqiad.wmnet} and (A:cephosd-codfw or A:cephosd-eqiad) [14:00:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet [14:01:37] (03CR) 10Ssingh: "Yes that's a fair question. @bblack@wikimedia.org: any thoughts on this? I remember we had a similar discussion for Gerrit on changing con" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [14:01:40] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1010.eqiad.wmnet [14:03:06] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1017.eqiad.wmnet [14:03:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1017.eqiad.wmnet [14:03:12] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1018.eqiad.wmnet [14:03:13] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-05-18-230044 to 2026-05-19-171108 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289974 (owner: 10Jforrester) [14:03:18] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-05-19-145724 to 2026-05-19-223625 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289975 (https://phabricator.wikimedia.org/T426409) (owner: 10Jforrester) [14:03:29] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:04:24] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1013.eqiad.wmnet with reason: host reimage [14:04:45] PROBLEM - OpenSearch health check for 
shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1343, relocating_shards: 0, initializing_shards: 25, unassigned_shar [14:04:45] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.09517845961177 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:04:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1343, relocating_shards: 0, initializing_shards: 25, unassigned_shar [14:04:45] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.09517845961177 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:04:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1343, relocating_shards: 0, initializing_shards: 25, unassigned_shar [14:04:45] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.09517845961177 
https://wikitech.wikimedia.org/wiki/Search%23Administration [14:04:46] (03CR) 10CI reject: [V:04-1] Fix newFromUserIdentity calls with interwiki users [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [14:04:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1258: Migration of db1258.eqiad.wmnet completed [14:04:52] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:04:53] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1343, relocating_shards: 0, initializing_shards: 25, unassigned_shar [14:04:53] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.09517845961177 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:04:53] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 254 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1343, relocating_shards: 0, initializing_shards: 25, unassigned_shar [14:04:53] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.09517845961177 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:05:32] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 
2026-05-18-230044 to 2026-05-19-171108 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289974 (owner: 10Jforrester) [14:05:35] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-05-19-145724 to 2026-05-19-223625 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289975 (https://phabricator.wikimedia.org/T426409) (owner: 10Jforrester) [14:06:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2041.codfw.wmnet [14:06:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet [14:06:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1362, relocating_shards: 0, initializing_shards: 25, unassigned_shards: 210, delayed_unassig [14:06:45] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.28490920475892 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1362, relocating_shards: 0, initializing_shards: 25, unassigned_shards: 210, delayed_unassig [14:06:45] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.28490920475892 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:45] 
RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1362, relocating_shards: 0, initializing_shards: 25, unassigned_shards: 210, delayed_unassig [14:06:45] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.28490920475892 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:53] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1363, relocating_shards: 0, initializing_shards: 25, unassigned_shards: 209, delayed_unassig [14:06:53] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.34752661239825 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:53] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 786, active_shards: 1364, relocating_shards: 0, initializing_shards: 25, unassigned_shards: 208, delayed_unassig [14:06:53] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.41014402003756 
https://wikitech.wikimedia.org/wiki/Search%23Administration [14:07:06] (03PS1) 10Atsuko: eventstreams: convert configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289978 (https://phabricator.wikimedia.org/T348763) [14:07:33] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:08] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [14:08:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1018.eqiad.wmnet [14:08:32] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:08:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [14:08:47] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:09:27] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1013.eqiad.wmnet with reason: host reimage [14:09:27] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T426809#11941063 (10hnowlan) a:03tappof [14:09:31] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:10:07] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:10:38] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [14:10:45] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [14:11:36] !log uploaded trixie-packaged memkeys on apt1002 [14:11:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:56] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:12:03] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:12:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2042.codfw.wmnet [14:14:18] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1018.eqiad.wmnet [14:14:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1018.eqiad.wmnet [14:14:24] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1019.eqiad.wmnet [14:14:48] (03CR) 10Btullis: [C:03+1] eventstreams: convert configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289978 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:15:16] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:16:24] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2183.codfw.wmnet with OS trixie [14:17:29] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:17:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudelastic1010.eqiad.wmnet [14:18:20] jmm@cumin2002 drain-node (PID 3736677) is awaiting input [14:19:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet [14:19:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1019.eqiad.wmnet [14:20:41] PROBLEM - Host ml-serve1015 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:00] !log dancy@deploy1003 Installing scap version "4.266.0" for 2 host(s) [14:21:25] FIRING: [10x] SystemdUnitFailed: 
prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:28] !log jiji@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1249-1289,1291-1327,1375-1384].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:22:45] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [14:22:52] !log dancy@deploy1003 Installation of scap version "4.266.0" completed for 2 hosts [14:23:05] (03PS1) 10Daniel Kinzler: rest-gateway: tighten rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) [14:23:44] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:24:29] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:24:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet [14:25:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2042.codfw.wmnet [14:26:10] 06SRE, 06Traffic, 13Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11941096 (10Cuthead) 05Open→03Resolved a:03Cuthead [14:26:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd100[4-5].eqiad.wmnet} and (A:cephosd-codfw or A:cephosd-eqiad) [14:26:24] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
dse-k8s-wdqs-test2001.codfw.wmnet with reason: host reimage [14:26:25] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:29] RECOVERY - Host ml-serve1015 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:26:40] 06SRE, 06Traffic, 13Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11941098 (10Cuthead) Thanks everyone. [14:27:22] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: enable IOMMU for multi-GPU Supermicro hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1289942 (https://phabricator.wikimedia.org/T421461) (owner: 10Elukey) [14:27:53] 06SRE, 06Traffic, 13Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11941105 (10SLyngshede-WMF) 05Resolved→03Open p:05Triage→03Medium @Cuthead Sorry, I'll just reopen this. I still need to do the second half of the caching servers, t... 
[14:27:54] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1013.eqiad.wmnet with OS trixie [14:28:03] FIRING: KafkaBrokerUnavailable: One or more Kafka brokers unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [14:28:17] FIRING: [2x] KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:28:22] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs-test1001.eqiad.wmnet with reason: host reimage [14:29:03] FIRING: [2x] KubernetesCalicoDown: ml-serve1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:29:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2043.codfw.wmnet [14:29:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs-test2001.codfw.wmnet with reason: host reimage [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1430) [14:30:36] jclark@cumin1003 provision (PID 2102227) is awaiting input [14:31:24] (03CR) 10Mszwarc: "recheck" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [14:33:03] RESOLVED: KafkaBrokerUnavailable: One or more Kafka brokers unavailable 
for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [14:33:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet [14:33:21] (03CR) 10ArielGlenn: [C:03+2] Move Makefiles to standard location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [14:33:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs-test1001.eqiad.wmnet with reason: host reimage [14:34:40] PROBLEM - Host dse-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:05] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:36:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11941163 (10Jhancock.wm) @bking can we rack these in rows e and f in codfw? 
[14:36:08] (03Merged) 10jenkins-bot: Move Makefiles to standard location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [14:36:43] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1019.eqiad.wmnet [14:36:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1019.eqiad.wmnet [14:36:50] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1024.eqiad.wmnet [14:38:17] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1019.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:38:27] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:38:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet [14:38:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2043.codfw.wmnet [14:39:03] (03CR) 10LWatson: Make image browsing available in Beta and TestWiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) (owner: 10Kimberly Sarabia) [14:39:14] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1014 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289947 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [14:39:24] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [14:39:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2044.codfw.wmnet [14:40:52] RECOVERY - Host dse-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.03 ms [14:40:54] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host 
dse-k8s-worker1024.eqiad.wmnet [14:41:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11941199 (10bking) Hello @Jhancock.wm , per above I have requested one in each row, avoid row D if possible since it alrea... [14:41:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet [14:41:56] (03CR) 10Atsuko: [C:03+2] eventstreams: convert configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289978 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:42:00] RECOVERY - Host wikikube-worker1291 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:42:25] !log pt1979@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: router upgrade, T416562] [14:42:30] T416562: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562 [14:42:32] !log pt1979@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: router upgrade, T416562] [14:42:41] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1014.eqiad.wmnet with OS trixie [14:42:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet [14:42:59] !log installing rsync security updates [14:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:17] RESOLVED: [3x] KubernetesCalicoDown: dse-k8s-worker1019.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:43:21] (03PS2) 10Cwhite: logstash: add sampling to page-analytics.discovery.wmnet istio logs [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) [14:43:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [14:43:51] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:43:53] !log pt1979@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt with reason: switch refresh [14:44:02] 06SRE, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11941228 (10BTullis) 05Open→03Resolved I'm closing this ticket, for now. We have made it so that the k8s API blackbox check no longer pages the core... [14:44:12] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1257: Upgrading db1257.eqiad.wmnet [14:44:13] (03Merged) 10jenkins-bot: eventstreams: convert configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289978 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:44:41] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1257: Upgrading db1257.eqiad.wmnet [14:44:46] PROBLEM - Host dse-k8s-ctrl2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:52] PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:18] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#11941238 (10MoritzMuehlenhoff) [14:45:20] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-master1004.eqiad.wmnet [14:45:27] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#11941239 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [14:45:34] RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.85 ms [14:45:46] RECOVERY - Host dse-k8s-ctrl2002 is UP: PING OK - Packet loss = 0%, RTA = 30.64 ms [14:45:48] RECOVERY - Host 
ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.01 ms [14:46:20] (03CR) 10ArielGlenn: [C:03+1] "Double checked the limits, I think this is good to go." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) (owner: 10Daniel Kinzler) [14:47:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet [14:47:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [14:48:09] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:48:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet [14:48:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2044.codfw.wmnet [14:48:37] cwilliams@cumin1003 major-upgrade (PID 2118406) is awaiting input [14:49:48] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS trixie [14:50:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:50:39] (03CR) 10Kamila Součková: [C:03+1] "The CI diff looks a little confusing (and so will the prod diff), but it's just the usual "removal of Apr2026 + ordering makes it diff the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) (owner: 10Daniel Kinzler) [14:51:15] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:51:48] (03PS1) 10Gkyziridis: api-gateway: Configure qwen3-14b in api gateway. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) [14:51:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1004.eqiad.wmnet [14:52:20] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:52:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [14:53:28] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:54:36] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1024.eqiad.wmnet [14:54:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1024.eqiad.wmnet [14:54:43] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1025.eqiad.wmnet [14:54:47] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [14:54:54] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [14:55:12] (03CR) 10Hnowlan: [C:03+1] logstash: add sampling to page-analytics.discovery.wmnet istio logs [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [14:55:26] (03PS2) 10Gkyziridis: api-gateway: Configure qwen3-14b in api gateway. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) [14:56:08] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:56:26] 06SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 10Event-Platform, 13Patch-For-Review: Flink Page View: Create K8s resources - https://phabricator.wikimedia.org/T426425#11941291 (10brouberol) I created the s3 user, added the keypair into the `webrequest-page-view-next` secrets and created the... [14:56:47] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host dse-k8s-worker1025.eqiad.wmnet [14:57:06] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1014.eqiad.wmnet with reason: host reimage [14:57:12] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 14 hosts with reason: restart [14:57:17] (03PS3) 10Gkyziridis: api-gateway: Configure qwen3-14b in api gateway. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) [14:57:46] (03CR) 10Hnowlan: "Should this be routed via the rest-gateway @claime?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [14:59:49] (03PS1) 10Fabfur: hiera: using haproxy-awslc on cp2043-cp2044 [puppet] - 10https://gerrit.wikimedia.org/r/1289997 (https://phabricator.wikimedia.org/T419825) [15:00:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet [15:00:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2033.codfw.wmnet [15:00:20] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1294-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [15:00:29] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1294-1297].eqiad.wmnet [15:00:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet [15:01:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289997 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:02:26] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1257.eqiad.wmnet with reason: host reimage [15:02:41] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1294-1297].eqiad.wmnet [15:04:16] 10SRE-tools, 10Observability-Alerting, 06SRE Observability (FY2025/2026-Q1): Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#11941331 (10LSobanski) Untagging #sre-tools, please loop us back in if needed. [15:04:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11941332 (10Jhancock.wm) i can do that. 
wanted to make sure there wasn't some other reason we can't rack them in e or f. w... [15:04:38] (03PS1) 10Fabfur: hiera: using haproxy-awslc on cp3074,cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1289998 (https://phabricator.wikimedia.org/T419825) [15:04:44] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1014.eqiad.wmnet with reason: host reimage [15:04:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes [15:05:14] (03PS1) 10Muehlenhoff: Bitu: Adapt approvers for growthbook-readonly and growthbook-elevatedacccess [puppet] - 10https://gerrit.wikimedia.org/r/1289999 [15:05:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11941340 (10bking) NP, thanks for your help on this! Feel free to ping me in IRC (inflatador) if you need anything else. [15:06:19] (03CR) 10Hashar: "@mszwarc@wikimedia.org thanks for the fix & backport. Feel free to backport it now (if nothing else is happening currently). 
We can sync o" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [15:06:31] jouncebot: now [15:06:31] No deployments scheduled for the next 1 hour(s) and 53 minute(s) [15:06:34] jmm@cumin2002 drain-node (PID 3770242) is awaiting input [15:06:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2033.codfw.wmnet [15:07:01] (03CR) 10Marostegui: sre.mysql.pool: Add support for downtime (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: 10CWilliams) [15:07:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:08:07] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1025.eqiad.wmnet [15:08:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1025.eqiad.wmnet [15:08:14] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1026.eqiad.wmnet [15:08:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1257.eqiad.wmnet with reason: host reimage [15:09:58] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs-scholarly,name=codfw [15:11:11] (03CR) 10Aleksandar Mastilovic: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [15:11:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:16] !log blake@cumin1003 START - 
Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1294-1297].eqiad.wmnet [15:13:17] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1294-1297].eqiad.wmnet [15:13:29] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1298-1301].eqiad.wmnet [15:14:02] FIRING: [5x] KubernetesCalicoDown: dse-k8s-worker1025.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:14:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2033.codfw.wmnet [15:15:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2033.codfw.wmnet [15:15:17] RESOLVED: [5x] KubernetesCalicoDown: dse-k8s-worker1025.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:15:40] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1298-1301].eqiad.wmnet [15:17:18] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [15:20:45] !log Restarted Jenkins CI due to Java upgrade which causes integration/pipelinelib to not be loadable. 
[15:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:59] (03CR) 10Mpostoronca: [C:03+1] Update UserInfoCard to be enabled by default for certain user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289895 (https://phabricator.wikimedia.org/T426021) (owner: 10Mszwarc) [15:23:07] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1014.eqiad.wmnet with OS trixie [15:23:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [15:25:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1257.eqiad.wmnet with OS trixie [15:25:28] !log bking@cumin2002 START - Cookbook sre.wdqs.reboot [15:25:29] (03CR) 10Elukey: [C:03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [15:25:57] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [15:26:02] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1015 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289948 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [15:26:16] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1298-1301].eqiad.wmnet [15:26:17] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1298-1301].eqiad.wmnet [15:26:27] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1302-1305].eqiad.wmnet [15:26:56] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 279 threshold =0.15 breach: cluster_name: 
cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1391, relocating_shards: 0, initializing_shards: 0, unassigned_shard [15:26:56] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.2934131736527 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:26:56] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 279 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1391, relocating_shards: 0, initializing_shards: 0, unassigned_shard [15:26:56] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.2934131736527 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:27:09] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1257: Migration of db1257.eqiad.wmnet completed [15:27:48] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 279 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1391, relocating_shards: 0, initializing_shards: 0, unassigned_shard [15:27:48] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.2934131736527 
https://wikitech.wikimedia.org/wiki/Search%23Administration [15:27:48] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 279 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1391, relocating_shards: 0, initializing_shards: 0, unassigned_shard [15:27:48] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.2934131736527 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:27:48] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 279 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1391, relocating_shards: 0, initializing_shards: 0, unassigned_shard [15:27:48] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 83.2934131736527 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:27:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [15:29:15] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1302-1305].eqiad.wmnet [15:29:31] !log failover Ganeti master in codfw02 to ganeti2033 [15:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:44] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:29:52] (03PS1) 10BCornwall: Remove cp2041/cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/1290006 (https://phabricator.wikimedia.org/T426828) [15:30:08] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1015.eqiad.wmnet with OS trixie [15:30:38] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [15:31:36] (03PS7) 10Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) [15:31:38] PROBLEM - ganeti-wconfd running on ganeti2034 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [15:31:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-codfw and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:31:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:31:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-22-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/1 {#G24090478750000381}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:32:04] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [15:32:10] FIRING: [2x] BFDdown: BFD session down between 
cr4-ulsfo and 198.35.26.128 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:32:45] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[2041-2042].codfw.wmnet [15:33:27] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828#11941439 (10BCornwall) [15:34:27] (03CR) 10BCornwall: [C:03+1] wmnet: Add new CNAMEs for Wikifunctions replacement evaluators [dns] - 10https://gerrit.wikimedia.org/r/1289393 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [15:34:56] (03PS6) 10Ladsgroup: mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) [15:35:04] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:35:05] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2016.codfw.wmnet [15:35:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs20[28-31] - https://phabricator.wikimedia.org/T423312#11941449 (10Jhancock.wm) a:05bking→03Jhancock.wm [15:36:31] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1302-1305].eqiad.wmnet [15:36:33] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1302-1305].eqiad.wmnet [15:36:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:36:44] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1306-1309].eqiad.wmnet [15:37:13] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2023.codfw.wmnet [15:37:19] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2024.codfw.wmnet [15:38:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1026.eqiad.wmnet [15:38:31] !log brett@cumin2002 START - Cookbook sre.dns.netbox [15:39:35] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1306-1309].eqiad.wmnet [15:39:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:30] ^^ that's known, CODFW is depooled [15:41:03] (03PS3) 10Btullis: [airflow-sre] Add a new cephfs PVC for data transfer purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288881 (https://phabricator.wikimedia.org/T380626) [15:41:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2016.codfw.wmnet [15:41:39] RESOLVED: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:41:48] !log brett@cumin2002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns6002.wikimedia.org} and (A:dnsbox) [15:41:48] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot 
begin reboot of dns6002.wikimedia.org [15:41:50] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "root@dbproxy2006:~# diff /tmp/new_iptables /tmp/old_iptables" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:41:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:41:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-22-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/1 {#G24090478750000381}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:41:52] (03CR) 10Cwhite: [C:03+2] logstash: add sampling to page-analytics.discovery.wmnet istio logs [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [15:41:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:42:10] RESOLVED: [2x] BFDdown: BFD session down between cr4-ulsfo and 198.35.26.128 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:43:03] (03CR) 10Kosta Harlan: Update UserInfoCard to be enabled by default for certain user groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289895 (https://phabricator.wikimedia.org/T426021) (owner: 10Mszwarc) [15:43:11] RECOVERY - PyBal backends health 
check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:43:13] 06SRE, 06Data-Engineering (Q4 FS25/26 April 1st - June 30st), 10Event-Platform, 13Patch-For-Review: Flink Page View: Create K8s resources - https://phabricator.wikimedia.org/T426425#11941499 (10brouberol) [15:43:18] !log btullis@cumin1003 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema [15:44:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2023.codfw.wmnet [15:44:16] (03PS1) 10Brouberol: dse-k8s-eqiad: define the webrequest-page-view-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290010 (https://phabricator.wikimedia.org/T426425) [15:44:21] brett@cumin2002 decommission (PID 3791380) is awaiting input [15:44:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2024.codfw.wmnet [15:44:27] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1015.eqiad.wmnet with reason: host reimage [15:44:28] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly,name=codfw [15:44:53] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs-scholarly,name=eqiad [15:44:53] (03CR) 10Jclark-ctr: [C:03+2] wdqs: Add config for net-new wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1289428 (https://phabricator.wikimedia.org/T423314) (owner: 10Bking) [15:44:56] (03CR) 10JavierMonton: [C:03+1] dse-k8s-eqiad: define the webrequest-page-view-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290010 (https://phabricator.wikimedia.org/T426425) (owner: 10Brouberol) [15:44:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host matomo1003.eqiad.wmnet [15:45:10] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1024.eqiad.wmnet [15:45:18] !log bking@cumin2002 END 
(PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [15:45:22] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1026.eqiad.wmnet [15:45:23] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "yup, they just moved up and down the chain which is noop. Moving forward." [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:45:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1026.eqiad.wmnet [15:45:27] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1023.eqiad.wmnet [15:45:31] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1027.eqiad.wmnet [15:45:35] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[2041-2042].codfw.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [15:45:59] (03CR) 10Clément Goubert: "Yes, I need to make more clear the api-gateway is now deprecated." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [15:46:05] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:46:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1027.eqiad.wmnet [15:46:10] FIRING: [3x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:46:18] (03CR) 10Clément Goubert: [C:04-1] "api-gateway is deprecated, this should be routed through the rest-gateway." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [15:46:31] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[2041-2042].codfw.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [15:46:31] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:46:33] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cp[2041-2042].codfw.wmnet [15:46:47] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828#11941515 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[2041-2042].codfw.wmnet` - cp2041.codfw.wmnet (**FAIL... 
[15:47:51] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:47:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:53] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1306-1309].eqiad.wmnet [15:48:55] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1306-1309].eqiad.wmnet [15:48:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1003.eqiad.wmnet [15:49:05] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1310-1313].eqiad.wmnet [15:49:08] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:49:22] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1015.eqiad.wmnet with reason: host reimage [15:49:52] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:50:18] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [15:50:33] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org [15:50:34] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns6002.wikimedia.org} and (A:dnsbox) [15:50:46] (03PS1) 10Ejegg: Restore mistakenly-deleted messages [extensions/DonationInterface] (wmf/1.47.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/1290012 (https://phabricator.wikimedia.org/T111677) [15:50:48] (03PS1) 10Ejegg: Restore translations of mistakenly deleted messages [extensions/DonationInterface] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290013 (https://phabricator.wikimedia.org/T111677) [15:51:10] FIRING: [9x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:51:17] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1310-1313].eqiad.wmnet [15:51:26] (03PS2) 10JHathaway: mariadb: Rename profile::mariadb::ferm to profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) [15:51:36] !log brett@cumin2002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and not P{dns6002.wikimedia.org} and not A:magru and (A:dnsbox) [15:51:37] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org [15:51:38] (03CR) 10Ejegg: [C:03+2] Restore mistakenly-deleted messages [extensions/DonationInterface] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290012 (https://phabricator.wikimedia.org/T111677) (owner: 10Ejegg) [15:51:46] (03CR) 10Ejegg: [C:03+2] Restore translations of mistakenly deleted messages [extensions/DonationInterface] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290013 (https://phabricator.wikimedia.org/T111677) (owner: 10Ejegg) [15:52:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1024.eqiad.wmnet [15:52:08] (03CR) 10JHathaway: "@Ladsgroup@gmail.com pcc looks good, this is ready for you review" [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:52:18] (03CR) 10JHathaway: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:52:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1023.eqiad.wmnet [15:52:25] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: define the webrequest-page-view-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1290010 (https://phabricator.wikimedia.org/T426425) (owner: 10Brouberol) [15:52:37] (03PS1) 10Sbisson: Enable AG on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290014 (https://phabricator.wikimedia.org/T426871) [15:52:44] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1027.eqiad.wmnet [15:52:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1027.eqiad.wmnet [15:52:51] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1028.eqiad.wmnet [15:53:36] (03CR) 10Majavah: [C:03+1] Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:54:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:54:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:55:27] jouncebot: nowandnext [15:55:27] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [15:55:27] In 1 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1700) [15:55:33] trying to fix an UBN [15:55:38] (03CR) 10Urbanecm: [C:03+2] Fix newFromUserIdentity calls with interwiki users [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [15:56:10] RESOLVED: [13x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:56:40] FIRING: [5x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:56:42] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [15:56:56] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1428, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 240, delayed_unassign [15:56:56] s: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.50898203592814 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:56:56] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: 
cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1428, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 240, delayed_unassign [15:56:57] s: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.50898203592814 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:57:47] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly,name=eqiad [15:57:50] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1670, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_ [15:57:50] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:57:50] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1670, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_ [15:57:50] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:57:50] RECOVERY - OpenSearch health 
check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 826, active_shards: 1670, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_ [15:57:50] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:59:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [15:59:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema [15:59:52] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1310-1313].eqiad.wmnet [15:59:54] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1310-1313].eqiad.wmnet [16:00:04] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1314-1317].eqiad.wmnet [16:00:17] FIRING: [3x] KubernetesCalicoDown: wikikube-worker1308.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:00:18] (03Merged) 10jenkins-bot: Fix newFromUserIdentity calls with interwiki users [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289980 (https://phabricator.wikimedia.org/T426832) (owner: 10Mszwarc) [16:00:49] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1289980|Fix newFromUserIdentity calls with interwiki users 
(T426832)]] [16:00:52] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:00:56] T426832: userrights-interwiki fails with server error - https://phabricator.wikimedia.org/T426832 [16:01:25] RESOLVED: [17x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:02:16] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1314-1317].eqiad.wmnet [16:02:54] !log urbanecm@deploy1003 urbanecm, mszwarc: Backport for [[gerrit:1289980|Fix newFromUserIdentity calls with interwiki users (T426832)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:03:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290014 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [16:05:03] RESOLVED: [3x] KubernetesCalicoDown: wikikube-worker1308.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:05:06] (03CR) 10JHathaway: "@Ladsgroup@gmail.com after rebasing the PCC output looks good, ready for your review" [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:05:13] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev [16:05:43] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org [16:05:51] !log urbanecm@deploy1003 urbanecm, mszwarc: Continuing with deployment [16:07:00] (03CR) 
10Ladsgroup: [C:03+1] "The change scares me a bit. I downloaded it and grep'ed to make sure there is no references to the old class and it's good. But haven't ch" [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:07:17] (03CR) 10Majavah: [C:03+1] "ship it" [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:07:26] !log pt1979@cumin1003 START - Cookbook sre.hosts.remove-downtime for cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt [16:07:28] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt [16:07:52] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1015.eqiad.wmnet with OS trixie [16:08:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet [16:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:52] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1314-1317].eqiad.wmnet [16:09:54] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1314-1317].eqiad.wmnet [16:10:02] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289980|Fix newFromUserIdentity calls with interwiki users (T426832)]] (duration: 09m 12s) [16:10:04] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1318-1321].eqiad.wmnet [16:10:05] T426832: userrights-interwiki fails with server error - https://phabricator.wikimedia.org/T426832 [16:10:34] !log pt1979@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on cr4-ulsfo,cr4-ulsfo IPv6,cr4-ulsfo.mgmt with reason: switch refresh [16:12:16] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1318-1321].eqiad.wmnet [16:12:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1257: Migration of db1257.eqiad.wmnet completed [16:12:39] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [16:14:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet [16:16:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:37] (03PS1) 10Clément Goubert: gateway-check: inference post-migration cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1290019 (https://phabricator.wikimedia.org/T422937) [16:19:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:19:54] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1318-1321].eqiad.wmnet [16:19:56] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1318-1321].eqiad.wmnet [16:20:04] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1322-1324].eqiad.wmnet [16:20:43] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org [16:22:19] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1322-1324].eqiad.wmnet [16:22:39] FIRING: CoreBGPDown: Core BGP session down between 
cr2-eqsin and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Confed_ulsfo&var-bgp_neighbor=cr4-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:22:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1028.eqiad.wmnet [16:26:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:26:11] (03PS2) 10Scott French: php8.3: Rebuild to pick up new PHP packages (8.3.31) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289441 [16:26:11] (03PS2) 10Scott French: php8.3-icu72: Rebuild to pick up new PHP packages (8.3.31) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289442 [16:26:26] (03CR) 10Blake: [C:03+1] "Thanks for setting all this up, it made the migration very smooth 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1290019 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [16:27:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) (owner: 10Codename Noreste) [16:27:32] (03CR) 10Scott French: [V:03+2] "https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1289441/comments/8804d0e9_8f4152f6" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289441 (owner: 10Scott French) [16:28:00] (03CR) 10Scott French: [V:03+2] 
"https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1289442/comments/fb5622df_0a2e4cf5" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289442 (owner: 10Scott French) [16:29:31] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker1028.eqiad.wmnet [16:29:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker1028.eqiad.wmnet [16:29:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker-eqiad [16:29:36] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1322-1324].eqiad.wmnet [16:29:38] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1322-1324].eqiad.wmnet [16:29:46] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1325-1327].eqiad.wmnet [16:31:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:31:29] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1325-1327].eqiad.wmnet [16:33:21] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev [16:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:55] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org [16:36:12] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_ulsfo 
and not P{cp4037.ulsfo.wmnet} and not P{cp4038.ulsfo.wmnet} and A:cp [16:36:47] (03PS4) 10Clément Goubert: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [16:37:18] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_ulsfo and A:cp [16:37:30] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [16:38:07] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [16:39:00] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1325-1327].eqiad.wmnet [16:39:02] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1325-1327].eqiad.wmnet [16:39:02] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[1294-1327].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:43:54] !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-test-eqiad [16:47:34] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4045.ulsfo.wmnet [16:47:45] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4039.ulsfo.wmnet [16:49:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:49:36] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:49:44] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:49:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 
(Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:49:55] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org [16:50:35] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7002.magru.wmnet} and A:liberica [16:50:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-22-ulsfo:ethernet-1/56 (Core: cr4-ulsfo:et-0/0/1 {#G24090478750000399}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:51:10] FIRING: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:52:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:52:55] (03PS1) 10Dzahn: codesearch: fix invalid calendar format on cleanup systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1290025 (https://phabricator.wikimedia.org/T421147) [16:53:23] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker-codfw [16:53:27] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker2001.codfw.wmnet [16:53:35] (03CR) 10Dzahn: [C:03+2] codesearch: fix invalid calendar format on cleanup systemd timer [puppet] - 
10https://gerrit.wikimedia.org/r/1290025 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [16:53:52] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2007.codfw.wmnet [16:54:13] jouncebot: nowandnext [16:54:13] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [16:54:14] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1700) [16:54:19] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7002.magru.wmnet} and A:liberica [16:54:57] FYI, I'll be starting prep work for the upcoming MediaWiki infra window shortly. please do not start any new MediaWiki deployments. [16:55:28] (03CR) 10Scott French: [V:03+2 C:03+2] "Thanks for the reviews!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289441 (owner: 10Scott French) [16:55:43] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3-icu72: Rebuild to pick up new PHP packages (8.3.31) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289442 (owner: 10Scott French) [16:56:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:10] FIRING: [6x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:56:36] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:44] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:57:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:58:13] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7001.magru.wmnet} and A:liberica [16:58:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker2001.codfw.wmnet [16:59:14] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [16:59:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:00:05] swfrench-wmf: gettimeofday() says it's time for MediaWiki infrastructure (UTC late). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1700) [17:00:14] o/ [17:00:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-22-ulsfo:ethernet-1/56 (Core: cr4-ulsfo:et-0/0/1 {#G24090478750000399}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:00:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2007.codfw.wmnet [17:00:56] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2008.codfw.wmnet [17:00:59] I'll be deploying in ~ 5m [17:01:10] RESOLVED: [6x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:02:02] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7001.magru.wmnet} and A:liberica [17:02:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:03:43] (03CR) 10CWilliams: "This was pushed up to run the tests and for visibility, tt wasn't really ready for a proper review as just; I now have tox setup locally, " [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: 10CWilliams) [17:04:02] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs4010.ulsfo.wmnet} and A:liberica [17:04:08] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org [17:04:34] PROBLEM - Host asw1-eqsin is DOWN: CRITICAL - Time to live exceeded (10.132.128.4) [17:04:38] PROBLEM - Host durum5003 is DOWN: CRITICAL - Time to live exceeded (10.132.2.9) [17:05:06] PROBLEM - Host doh5004 
is DOWN: CRITICAL - Time to live exceeded (103.102.166.99) [17:05:12] uh [17:05:25] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker2001.codfw.wmnet [17:05:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker2001.codfw.wmnet [17:05:28] PROBLEM - SSH on install5004 is CRITICAL: connect to address 103.102.166.104 and port 22: No route to host https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:05:28] PROBLEM - HTTP on install5004 is CRITICAL: connect to address 103.102.166.104 and port 80: No route to host https://wikitech.wikimedia.org/wiki/Install_servers [17:05:33] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker2002.codfw.wmnet [17:05:38] PROBLEM - Host prometheus5003 is DOWN: CRITICAL - Time to live exceeded (10.132.2.5) [17:05:44] PROBLEM - Host ncredir5004 is DOWN: CRITICAL - Time to live exceeded (10.132.2.8) [17:05:52] PROBLEM - Squid on install5004 is CRITICAL: connect to address 103.102.166.104 and port 8080: No route to host https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:05:56] PROBLEM - SSH on tcp-proxy5003 is CRITICAL: connect to address 10.132.2.3 and port 22: No route to host https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:05:57] FIRING: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:06:02] (03CR) 10Ilias Sarantopoulos: "just a note that this service is only deployed in eqiad so we shouldn't be using the discovery endpoint to reach it. 
We should use https:/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [17:06:14] RECOVERY - Host durum5003 is UP: PING OK - Packet loss = 0%, RTA = 239.80 ms [17:06:22] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 243.73 ms [17:06:28] RECOVERY - HTTP on install5004 is OK: HTTP OK: HTTP/1.1 200 OK - 244 bytes in 0.493 second response time https://wikitech.wikimedia.org/wiki/Install_servers [17:06:28] RECOVERY - SSH on install5004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u10 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:06:36] RECOVERY - Host prometheus5003 is UP: PING OK - Packet loss = 0%, RTA = 242.68 ms [17:06:38] holding deployment [17:06:44] RECOVERY - Host doh5004 is UP: PING OK - Packet loss = 0%, RTA = 255.41 ms [17:06:44] RECOVERY - Host ncredir5004 is UP: PING OK - Packet loss = 0%, RTA = 243.98 ms [17:06:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2008.codfw.wmnet [17:06:52] RECOVERY - Squid on install5004 is OK: TCP OK - 0.252 second response time on 103.102.166.104 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:06:53] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2010.codfw.wmnet [17:06:56] RECOVERY - SSH on tcp-proxy5003 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:06:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:18] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs4010.ulsfo.wmnet} and A:liberica [17:07:24] (03PS5) 10Clément Goubert: 
rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [17:07:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqsin and cr4-ulsfo (198.35.26.129) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Confed_ulsfo&var-bgp_neighbor=cr4-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:07:55] FIRING: [13x] BFDdown: BFD session down between cr1-codfw and 198.35.26.202 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:08:12] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [17:08:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [17:08:23] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: T426560 - bking@cumin2002 [17:08:34] (03PS6) 10Clément Goubert: rest-gateway: Configure qwen3-14b in rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [17:08:35] (03CR) 10Ilias Sarantopoulos: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 
(https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [17:08:40] RESOLVED: BFDdown: BFD session down between cr1-codfw and 198.35.26.202 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:08:42] (03CR) 10Clément Goubert: "Acknowledged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [17:09:29] !ack [17:09:30] All incidents are already acked. [17:10:25] !log pt1979@cumin1003 START - Cookbook sre.hosts.remove-downtime for cr4-ulsfo,cr4-ulsfo IPv6,cr4-ulsfo.mgmt [17:10:27] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr4-ulsfo,cr4-ulsfo IPv6,cr4-ulsfo.mgmt [17:10:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker2002.codfw.wmnet [17:10:57] RESOLVED: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:03] (03CR) 10Ilias Sarantopoulos: "that was fast! thanks a lot for the help claime!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [17:11:57] RESOLVED: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:41] (03CR) 10Clément Goubert: "FYI this also needs Id724a916831fa6254e82c856570677f81990dbb3 to be deployed so the endpoints can be reached from outside WMF prod." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289996 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [17:13:12] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [17:13:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [17:13:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2010.codfw.wmnet [17:13:28] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2011.codfw.wmnet [17:14:55] !log pt1979@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mr1-ulsfo,mr1-ulsfo IPv6,mr1-ulsfo.oob,mr1-ulsfo.oob IPv6 with reason: switch refresh [17:15:28] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:17:22] !log swfrench@deploy1003 Started scap sync-world: Rebuild to pick up new production image [17:17:31] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker2002.codfw.wmnet [17:17:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker2002.codfw.wmnet [17:17:39] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker2003.codfw.wmnet [17:19:08] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org [17:19:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:19:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2011.codfw.wmnet [17:19:54] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2012.codfw.wmnet [17:20:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-test-eqiad [17:21:25] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:22:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker2003.codfw.wmnet [17:24:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:24:25] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:26:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2012.codfw.wmnet [17:26:05] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2013.codfw.wmnet [17:27:25] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4040.ulsfo.wmnet [17:27:30] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4046.ulsfo.wmnet [17:27:59] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org [17:28:24] !log btullis@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker2003.codfw.wmnet [17:28:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host dse-k8s-worker2003.codfw.wmnet [17:28:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker-codfw [17:28:26] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs [17:28:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:31:02] (03CR) 10Bearloga: [C:03+1] Bitu: Adapt approvers for growthbook-readonly and growthbook-elevatedacccess [puppet] - 10https://gerrit.wikimedia.org/r/1289999 (owner: 10Muehlenhoff) [17:31:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:42] FIRING: [2x] ProbeDown: Service aqs2001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:32:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2013.codfw.wmnet [17:32:39] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2014.codfw.wmnet [17:34:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from grafana.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:36:42] RESOLVED: [4x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:54] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:38:54] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:38:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2014.codfw.wmnet [17:39:00] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2015.codfw.wmnet [17:39:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:39:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from grafana.discovery.wmnet in eqiad 
#page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:41:51] (03PS2) 10Neriah: Enable 'flood' user group at en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) [17:42:37] FIRING: [2x] ProbeDown: Service aqs2001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:42:37] (03CR) 10CI reject: [V:04-1] Enable 'flood' user group at en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [17:42:59] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org [17:43:18] !log disabled puppet on grafana* to temporarily fix file ownership issue on /etc/grafana/provisioning/plugins/mahendrapaipuri-dashboardreporter-app.yaml [17:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:44:46] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.40 ms [17:44:48] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.76 ms [17:45:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2015.codfw.wmnet [17:45:17] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2021.codfw.wmnet [17:45:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:45:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942050 (10ssingh) Yeah I should have been more careful in resolving this, my bad. @Jhancock.wm: While the DIMM was replaced, we still need to look at the RAID thing. [17:45:54] !log swfrench@deploy1003 Finished scap sync-world: Rebuild to pick up new production image (duration: 28m 32s) [17:46:44] (03CR) 10VadymTS1: [C:04-1] Enable 'flood' user group at en.wikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [17:47:00] (03CR) 10Codename Noreste: Enable 'flood' user group at en.wikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [17:47:37] RESOLVED: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:48:12] (03CR) 10Codename Noreste: Enable 'flood' user group at en.wikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [17:48:51] (03PS3) 10Neriah: Enable 'flood' user group at en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) [17:48:59] (03CR) 10Neriah: Enable 'flood' user group at en.wikiversity (032 
comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [17:49:39] (03CR) 10Neriah: [C:04-1] "STALLED: per T426882#11942055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290032 (https://phabricator.wikimedia.org/T426882) (owner: 10Neriah) [17:51:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2021.codfw.wmnet [17:51:37] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2022.codfw.wmnet [17:53:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:53:37] FIRING: [9x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:52] RESOLVED: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:56:30] !log pt1979@cumin1003 START - Cookbook sre.hosts.remove-downtime for mr1-ulsfo,mr1-ulsfo IPv6,mr1-ulsfo.oob,mr1-ulsfo.oob IPv6 [17:56:33] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mr1-ulsfo,mr1-ulsfo IPv6,mr1-ulsfo.oob,mr1-ulsfo.oob IPv6 [17:56:51] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns2005.wikimedia.org [17:57:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2022.codfw.wmnet [17:57:54] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2025.codfw.wmnet [17:59:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in 
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:00:28] RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:01:09] (03PS1) 10Dbrant: docroot: Remove non-wikipedias from digital asset links. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) [18:01:46] !log pt1979@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: no reason specified, T416562] [18:01:50] T416562: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562 [18:01:51] !log pt1979@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: no reason specified, T416562] [18:04:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2025.codfw.wmnet [18:07:36] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4047.ulsfo.wmnet [18:08:52] FIRING: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:55] FIRING: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:24] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4041.ulsfo.wmnet [18:11:51] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of 
dns2006.wikimedia.org [18:13:52] RESOLVED: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:53] (03PS1) 10Reedy: Update symfony/* [vendor] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290037 (https://phabricator.wikimedia.org/T426861) [18:16:03] jouncebot: nowandnext [18:16:03] No deployments scheduled for the next 1 hour(s) and 43 minute(s) [18:16:03] In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T2000) [18:16:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:16:16] PROBLEM - Host 2620:0:860:4:208:80:153:107 is DOWN: CRITICAL - Host Unreachable (2620:0:860:4:208:80:153:107) [18:16:34] (03CR) 10VadymTS1: Allow Vector 2022 font size changes in namespace 100 for enwiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288370 (https://phabricator.wikimedia.org/T423766) (owner: 10Pppery) [18:17:00] RECOVERY - Host 2620:0:860:4:208:80:153:107 is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms [18:18:52] FIRING: [8x] ProbeDown: Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:21:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:23:52] RESOLVED: [8x] 
ProbeDown: Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:24:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942148 (10ssingh) a:05ssingh→03Jhancock.wm [18:25:28] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:26:11] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org [18:28:07] (03PS1) 10Dwisehaupt: Shift fundraising db read handle and add frdb analytics handle [dns] - 10https://gerrit.wikimedia.org/r/1290039 [18:28:52] FIRING: [9x] ProbeDown: Service aqs2004-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:22] (03CR) 10Ssingh: "I think you should squash this and the parent commit (Add lvs1017 to high-traffic1). 
Correct me if I am wrong but my reasoning is that if " [puppet] - 10https://gerrit.wikimedia.org/r/1286517 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:29:44] (03CR) 10Jgreen: [C:03+1] Shift fundraising db read handle and add frdb analytics handle [dns] - 10https://gerrit.wikimedia.org/r/1290039 (owner: 10Dwisehaupt) [18:29:59] (03CR) 10Ssingh: [C:03+1] "+1, with the above caveat of not reimaging _until_ the next commit in the chain is merged together." [puppet] - 10https://gerrit.wikimedia.org/r/1286517 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:31:06] (03CR) 10Reedy: [C:03+2] Update symfony/* [vendor] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290037 (https://phabricator.wikimedia.org/T426861) (owner: 10Reedy) [18:31:44] (03CR) 10Ssingh: [C:03+1] "Nice job on taking care of the MED." [puppet] - 10https://gerrit.wikimedia.org/r/1286522 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:32:57] (03CR) 10Ssingh: "hieradata/hosts/lvs1016.yaml can also be rm'ed I think" [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:33:10] (03CR) 10Ssingh: "Ignore this, already done in the next commit." 
[puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:33:15] (03CR) 10Ssingh: [C:03+1] Remove lvs1016, promote lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:33:32] (03CR) 10Ssingh: [C:03+1] Remove lvs1016 hieradata, demote to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:33:52] RESOLVED: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:18] (03Merged) 10jenkins-bot: Update symfony/* [vendor] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290037 (https://phabricator.wikimedia.org/T426861) (owner: 10Reedy) [18:34:22] (03CR) 10Ssingh: [C:03+1] Remove cp2041/cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/1290006 (https://phabricator.wikimedia.org/T426828) (owner: 10BCornwall) [18:34:44] (03CR) 10Brouberol: [C:03+2] Upgrade kafka-jumbo1016 to JDK21 [puppet] - 10https://gerrit.wikimedia.org/r/1289949 (https://phabricator.wikimedia.org/T426835) (owner: 10Brouberol) [18:36:12] (03CR) 10Dwisehaupt: [C:03+2] Shift fundraising db read handle and add frdb analytics handle [dns] - 10https://gerrit.wikimedia.org/r/1290039 (owner: 10Dwisehaupt) [18:36:31] !log dwisehaupt@dns1004 START - running authdns-update [18:37:11] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS trixie [18:38:25] !log dwisehaupt@dns1004 END - running authdns-update [18:38:40] 06SRE, 06Traffic: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#11942203 (10ssingh) Without knowing the details of this, I wanted to point out that the drmrs refresh is upcoming in Q1/Q2 of 
FY2026 and drmrs like all edge sites, is on Liberica. If there is any rede... [18:38:52] FIRING: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:41:11] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org [18:41:48] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1290037|Update symfony/* (T426861)]] [18:41:52] T426861: symfony/yaml security issues blocking vendor - https://phabricator.wikimedia.org/T426861 [18:43:44] !log reedy@deploy1003 reedy: Backport for [[gerrit:1290037|Update symfony/* (T426861)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:43:52] RESOLVED: [8x] ProbeDown: Service aqs2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:44:13] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:44:44] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:45:02] !log reedy@deploy1003 reedy: Continuing with deployment [18:46:10] FIRING: [2x] BFDdown: BFD session down between asw1-by27-esams and 185.15.59.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-by27-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown 
[18:46:23] (03CR) 10Ejegg: [V:03+2 C:03+2] Restore mistakenly-deleted messages [extensions/DonationInterface] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290012 (https://phabricator.wikimedia.org/T111677) (owner: 10Ejegg) [18:46:33] (03CR) 10Ejegg: [V:03+2 C:03+2] Restore translations of mistakenly deleted messages [extensions/DonationInterface] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290013 (https://phabricator.wikimedia.org/T111677) (owner: 10Ejegg) [18:47:32] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4048.ulsfo.wmnet [18:49:15] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290037|Update symfony/* (T426861)]] (duration: 07m 28s) [18:49:20] T426861: symfony/yaml security issues blocking vendor - https://phabricator.wikimedia.org/T426861 [18:49:44] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:50:06] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4042.ulsfo.wmnet [18:51:10] RESOLVED: [2x] BFDdown: BFD session down between asw1-by27-esams and 185.15.59.34 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-by27-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:53:52] FIRING: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:55] FIRING: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:54:13] RESOLVED: [2x] JobUnavailable: 
Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:11] (03CR) 10Daniel Kinzler: rest-gateway: tighten rate limits (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) (owner: 10Daniel Kinzler) [18:56:19] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org [18:58:52] RESOLVED: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:59:44] jouncebot: nowandnext [18:59:44] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [18:59:44] In 0 hour(s) and 0 minute(s): DonationInterface i18n restoration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1900) [18:59:50] Ha, timing. [18:59:53] ejegg: Over to you. [19:00:00] 🎉 [19:00:01] thanks! [19:00:05] ejegg: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for DonationInterface i18n restoration deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T1900). [19:00:29] https://github.com/wikimedia/mediawiki/commit/095f788dbe4fecf76dfb106430d82e9309f891c3 https://github.com/wikimedia/mediawiki/commit/7abd7b042d12e4aa5f34d69ee38c533309852911 [19:00:35] The magic bumps did work as expected [19:00:44] oh good [19:00:46] But in future please do not ever force-merge patches (bypassing CI) or merge out of a deployment window (bypassing RelEng). 
[19:01:26] James_F: ahh, I was advised I might want to do this one outside of a window because of the i18n wait [19:01:44] 06SRE, 10Wikimedia-Mailing-lists: New mailing list for the latam tech community - https://phabricator.wikimedia.org/T426803#11942346 (10Dzahn) Sounds good to me! Does anyone here have concerns with the suggested names due to [[ https://meta.wikimedia.org/wiki/Mailing_lists/Standardization | Mailing lists/Stan... [19:02:02] Yes, that's great; but the window started 2 mins ago and you merged 18 minutes ago. [19:02:24] Right, I had C+2ed a couple hours ago [19:02:31] and thought it would work its way through CI [19:02:54] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1016.eqiad.wmnet with OS trixie [19:02:56] In normal times it would have ended up being randomly deployed by whatever deployer was deploying when it happened to merge. [19:03:03] But as that didn't happen I thought maybe it wasn't running on that branch, or for pure-json [19:03:06] Yay for multiple parallel UBNs. [19:03:40] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS trixie [19:03:48] ut-oh, do you need me to revert to deal with those James_F ? [19:04:04] I think Reedy has fixed the prod vendor breakages. [19:04:07] (03CR) 10Gergő Tisza: "Can we keep the projects which aren't split by language (Wikidata etc)? Those are more impactful, and there aren't many of them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [19:04:37] FIRING: [8x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:42] I didn't touch .2, but neither is this patch [19:05:45] (03CR) 10Jforrester: "recheck" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [19:05:53] (03CR) 10Jforrester: "recheck" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289342 (https://phabricator.wikimedia.org/T419413) (owner: 10Michael Große) [19:06:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:07:15] (03CR) 10CI reject: [V:04-1] Skip init.test.js test if VisualEditor not installed [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [19:09:00] !log ejegg@deploy1003 Started scap sync-world: Backport for [[gerrit:1290012|Restore mistakenly-deleted messages (T111677)]], [[gerrit:1290013|Restore translations of mistakenly deleted messages (T111677)]] [19:09:04] T111677: Some messages in the Donation extensions are outdated and should be removed - https://phabricator.wikimedia.org/T111677 [19:09:37] RESOLVED: [8x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:10:31] (03CR) 10BCornwall: "Oh, yeah, that hunk should probably have been in this commit. Out of laziness I'll just keep it there. Thanks for the eyes. :)" [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [19:11:19] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org [19:12:51] (03PS2) 10Dbrant: docroot: Remove non-wikipedias from digital asset links. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) [19:14:44] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:14:56] (03PS1) 10Ssingh: P:cache::haproxy: guard webrequest IP reputation data for beta [puppet] - 10https://gerrit.wikimedia.org/r/1290047 (https://phabricator.wikimedia.org/T426822) [19:15:55] FIRING: [8x] ProbeDown: Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:10] FIRING: [2x] BFDdown: BFD session down between asw1-bw27-esams and 185.15.59.2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-bw27-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:16:52] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1290047 (https://phabricator.wikimedia.org/T426822) (owner: 10Ssingh) [19:17:05] (03CR) 10Dbrant: "whoops, done!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [19:17:55] (03CR) 10Ssingh: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1290047/8568/" [puppet] - 10https://gerrit.wikimedia.org/r/1290047 (https://phabricator.wikimedia.org/T426822) (owner: 10Ssingh) [19:19:44] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:45] (03CR) 10Gergő Tisza: "There are two classes of APIs:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [19:20:55] RESOLVED: [8x] ProbeDown: Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:10] RESOLVED: [2x] BFDdown: BFD session down between asw1-bw27-esams and 185.15.59.2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-bw27-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:26:04] (03PS2) 10Cwhite: grafana: set group on dashboard reporter provision yaml file [puppet] - 10https://gerrit.wikimedia.org/r/1290038 (https://phabricator.wikimedia.org/T425795) (owner: 10Herron) [19:26:23] !log ejegg@deploy1003 ejegg: Backport for [[gerrit:1290012|Restore mistakenly-deleted messages (T111677)]], [[gerrit:1290013|Restore translations of mistakenly deleted messages (T111677)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:26:27] T111677: Some messages in the Donation extensions are outdated and should be removed - https://phabricator.wikimedia.org/T111677 [19:26:28] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns3004.wikimedia.org [19:26:37] FIRING: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:27:18] !log ejegg@deploy1003 ejegg: Continuing with deployment [19:27:26] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4049.ulsfo.wmnet [19:31:37] RESOLVED: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:53] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4043.ulsfo.wmnet [19:33:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942438 (10Jhancock.wm) I put in a different drive from a different manufacturer. this new one should be the same as the old one. lemme know if that works. you might need to manually add it to the... 
[19:37:55] FIRING: [8x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:18] !log ejegg@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290012|Restore mistakenly-deleted messages (T111677)]], [[gerrit:1290013|Restore translations of mistakenly deleted messages (T111677)]] (duration: 30m 19s) [19:39:22] T111677: Some messages in the Donation extensions are outdated and should be removed - https://phabricator.wikimedia.org/T111677 [19:39:31] ok, all done [19:39:46] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2012.codfw.wmnet [19:41:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:41:28] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org [19:42:55] RESOLVED: [8x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:43:30] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791#11942458 (10Jhancock.wm) @Marostegui i do have a spare drive but i'm honestly having a hard time telling which physical drive is down. i have mixed indicators from the server and the error in the m... [19:43:34] ejegg: thank you ! 
:] [19:44:13] :) thanks for the guidance [19:46:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942475 (10BCornwall) @Jhancock.wm Thanks for doing that - however, I feel that there might be some sort of firmware thing going on. Upon reboot I'm seeing this: {F82982529} I'm still seeing th... [19:47:55] FIRING: [9x] ProbeDown: Service aqs1016-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:48:10] FIRING: [11x] ProbeDown: Service aqs1016-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:49:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2012.codfw.wmnet [19:49:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942488 (10Jhancock.wm) i'll pull the server out and take a look. could be the card or a cable to it. 
[19:49:52] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [19:49:52] PROBLEM - pybal on lvs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:49:58] PROBLEM - PyBal backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:50:40] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org [19:52:55] RESOLVED: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:53:17] brouberol@cumin1003 reimage (PID 2243888) is awaiting input [19:53:36] (03CR) 10Aleksandar Mastilovic: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [19:54:50] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:55:07] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942505 (10BCornwall) Actually, I just created a virtual disk in the storage section of idrac and it seems to be attached now. Is that the appropriate way forward with new disks or have I stumbled... 
[19:57:47] ACKNOWLEDGEMENT - MD RAID on lvs2012 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T426899 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:57:57] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T426899 (10ops-monitoring-bot) 03NEW [19:58:37] FIRING: [9x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:24] PROBLEM - Host lvs2012 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:22] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1011.eqiad.wmnet [20:00:35] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T2000). [20:00:35] codenamenoreste: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:01:07] !log brouberol@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1016.eqiad.wmnet with OS trixie [20:02:09] I'm here to deploy a user right configuration change for mediawiki.org, restricting the changetags user right to admins and bots by default [20:02:09] but even then, it might possibly be moot because of https://phabricator.wikimedia.org/T355639#11942514 [20:02:56] !log brouberol@cumin1003 START - Cookbook sre.hosts.provision for host kafka-jumbo1016.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:03:37] FIRING: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281901 (https://phabricator.wikimedia.org/T424413) (owner: 10Codename Noreste) [20:05:40] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns4004.wikimedia.org [20:06:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1011.eqiad.wmnet [20:06:54] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1012.eqiad.wmnet [20:07:13] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4050.ulsfo.wmnet [20:08:37] RESOLVED: [8x] ProbeDown: Service aqs1016-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:30] (03PS1) 10Kosta Harlan: hCaptcha: Exempt CommunityRequests pages from edit/create triggers [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1290055 (https://phabricator.wikimedia.org/T426897) [20:10:40] (03PS1) 10Dwisehaupt: Add frdb1004 back as the fundraisingdb-read dns handle [dns] - 10https://gerrit.wikimedia.org/r/1290056 [20:12:53] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [20:12:53] (03CR) 10Jgreen: [C:03+1] Add frdb1004 back as the fundraisingdb-read dns handle [dns] - 10https://gerrit.wikimedia.org/r/1290056 (owner: 10Dwisehaupt) [20:12:53] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [20:12:56] are there any deployers available? if not, I'll see tomorrow morning in my time zone [20:13:37] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4044.ulsfo.wmnet [20:13:37] FIRING: [8x] ProbeDown: Service aqs1017-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:13:37] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_ulsfo and not P{cp4037.ulsfo.wmnet} and not P{cp4038.ulsfo.wmnet} and A:cp [20:13:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1012.eqiad.wmnet [20:13:54] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1013.eqiad.wmnet [20:13:58] (03CR) 10Dwisehaupt: [C:03+2] Add frdb1004 back as the fundraisingdb-read dns handle [dns] - 10https://gerrit.wikimedia.org/r/1290056 (owner: 10Dwisehaupt) [20:14:00] RECOVERY - Host lvs2012 is UP: PING OK - Packet loss = 0%, RTA = 32.96 ms [20:14:50] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on 
lvs2012 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [20:14:52] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [20:14:52] PROBLEM - pybal on lvs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:14:56] PROBLEM - PyBal backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [20:15:03] !log dwisehaupt@dns1005 START - running authdns-update [20:15:03] ^can be ignored [20:15:40] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on lvs2012.codfw.wmnet with reason: Maintenance [20:15:45] !log brett@cumin2002 START - Cookbook sre.hosts.provision for host lvs2012.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:16:23] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [20:16:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:25] I need a patch deployed today [20:16:32] it's for mediawiki.org [20:16:40] !log dwisehaupt@dns1005 END - running authdns-update [20:17:47] ACKNOWLEDGEMENT - MD RAID on lvs2012 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: 
https://phabricator.wikimedia.org/T426902 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:17:52] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T426902 (10ops-monitoring-bot) 03NEW [20:18:22] !log brouberol@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-jumbo1016.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:18:37] RESOLVED: [8x] ProbeDown: Service aqs1017-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:00] brett@cumin2002 provision (PID 3977880) is awaiting input [20:19:09] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS trixie [20:20:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1013.eqiad.wmnet [20:20:38] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1014.eqiad.wmnet [20:21:44] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns4004.wikimedia.org [20:23:52] FIRING: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:11] (03PS2) 10Scott French: httpd*: Align tag with apache2 version and fix -cas Depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1290054 [20:25:11] (03CR) 10Scott French: [V:03+2] "Built locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1290054 (owner: 10Scott French) [20:26:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host wdqs1014.eqiad.wmnet [20:26:58] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1015.eqiad.wmnet [20:28:06] (03CR) 10Scott French: [V:03+2] "Balthazar: Could I ask you to review the httpd-cas change? Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1290054 (owner: 10Scott French) [20:28:52] RESOLVED: [8x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:31:25] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 5a7a5c65495f25f5fda1a08b746598b509102755, dns.git is a619484c0241a9530a3e2715d0b8103f86c08148) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [20:33:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1015.eqiad.wmnet [20:33:24] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1016.eqiad.wmnet [20:33:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs2012.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:33:52] FIRING: [11x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:55] FIRING: [12x] ProbeDown: 
Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:34:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [20:36:44] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org [20:37:26] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1016.eqiad.wmnet with OS trixie [20:37:54] codenamenoreste: Did you get assistance? [20:38:52] FIRING: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:38:55] FIRING: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:34] codenamenoreste: I can help you deploy if you still want to do it in this window [20:39:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1016.eqiad.wmnet [20:39:48] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1017.eqiad.wmnet [20:41:09] dancy I have patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1287433 [20:41:34] OK. Do you have a way to test that change once it hits testservers? 
[20:42:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:42:14] I have WikimediaDebug [20:42:35] Alright, pressing the button [20:43:52] RESOLVED: [8x] ProbeDown: Service aqs1019-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from releases.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=releases.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:46:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1017.eqiad.wmnet [20:46:43] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1018.eqiad.wmnet [20:47:00] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4051.ulsfo.wmnet [20:47:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:47:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) (owner: 10Codename Noreste) [20:48:50] (03Merged) 10jenkins-bot: Restrict the changetags user right to bots and sysops on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) (owner: 10Codename Noreste) [20:48:52] FIRING: [8x] 
ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:20] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1287433|Restrict the changetags user right to bots and sysops on mediawiki.org (T355445)]] [20:49:20] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [20:49:24] T355445: "Changetags" right only for bots and administrators in MediaWiki.org - https://phabricator.wikimedia.org/T355445 [20:50:55] I can't use WikimediaDebug right now, go ahead and deploy the change... [20:52:08] !log dancy@deploy1003 codenamenoreste, dancy: Backport for [[gerrit:1287433|Restrict the changetags user right to bots and sysops on mediawiki.org (T355445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:53:04] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org [20:53:44] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [20:53:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1018.eqiad.wmnet [20:53:51] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1019.eqiad.wmnet [20:53:52] RESOLVED: [8x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:05] !log dancy@deploy1003 codenamenoreste, dancy: Continuing with deployment [20:58:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [20:58:52] FIRING: [8x] ProbeDown: Service aqs1021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T2100) [21:00:06] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287433|Restrict the changetags user right to bots and sysops on mediawiki.org (T355445)]] (duration: 10m 45s) [21:00:10] T355445: "Changetags" right only for bots and administrators in MediaWiki.org - https://phabricator.wikimedia.org/T355445 [21:00:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1019.eqiad.wmnet [21:00:50] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1020.eqiad.wmnet [21:03:52] RESOLVED: [8x] ProbeDown: Service aqs1021-a:7000 has failed probes 
(tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:07:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1020.eqiad.wmnet [21:07:48] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1021.eqiad.wmnet [21:08:04] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org [21:08:52] FIRING: [10x] ProbeDown: Service aqs1021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:08:55] FIRING: [10x] ProbeDown: Service aqs1021-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:11:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:13:52] RESOLVED: [8x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:14:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1021.eqiad.wmnet [21:14:41] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1022.eqiad.wmnet [21:16:03] (03CR) 10BryanDavis: "Cherry-picked to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud in an attempt to unblock an unblock request." [puppet] - 10https://gerrit.wikimedia.org/r/1290047 (https://phabricator.wikimedia.org/T426822) (owner: 10Ssingh) [21:16:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2012.codfw.wmnet with OS bullseye [21:18:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:18:52] FIRING: [9x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:18:55] FIRING: [9x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:21:07] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942792 (10BCornwall) After creating the virtual disk, re-provisioning (for good measure, though no changes were made), then re-imaging, we're back in business. We might follow-up regarding the us... 
[21:21:25] FIRING: [2x] SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942795 (10BCornwall) 05Open→03Resolved [21:21:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1022.eqiad.wmnet [21:22:29] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org [21:23:52] FIRING: [8x] ProbeDown: Service aqs1023-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:55] RESOLVED: [8x] ProbeDown: Service aqs1023-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:25:16] (03PS3) 10Krinkle: docroot: Remove non-wikipedias from digital asset links. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [21:26:13] 10ops-codfw, 06SRE, 06DC-Ops: Too low optic power on - pfw1-codfw:xe-7/2/0 (Core: cr2-codfw:xe-0/0/1:0 {#122503}) - https://phabricator.wikimedia.org/T426671#11942810 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [21:26:45] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp4052.ulsfo.wmnet [21:26:45] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_ulsfo and A:cp [21:27:00] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: fasw2-c8a-codfw:xe-0/0/47 low RX power - https://phabricator.wikimedia.org/T426824#11942813 (10Jhancock.wm) i can get this one in the morning if Jeff or Dallas is around and want to coordinate. [21:27:02] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1025.eqiad.wmnet [21:27:16] (03CR) 10Krinkle: [C:03+1] docroot: Remove non-wikipedias from digital asset links. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290035 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [21:33:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1025.eqiad.wmnet [21:33:23] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1026.eqiad.wmnet [21:33:52] FIRING: [8x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:55] FIRING: [8x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:37:29] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org [21:38:52] RESOLVED: [8x] ProbeDown: Service aqs1024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:40:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1026.eqiad.wmnet [21:40:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from releases.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=releases.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:41:20] 10ops-codfw, 06DC-Ops, 06Traffic: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912 (10BCornwall) 03NEW [21:41:21] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is 
CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:43:10] FIRING: [2x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.5 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b12-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:43:52] FIRING: [8x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:44:55] FIRING: [8x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:45:21] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:45:59] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2018.codfw.wmnet [21:46:25] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp6009.drmrs.wmnet} and A:cp [21:48:06] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp6001.drmrs.wmnet} and A:cp [21:48:10] RESOLVED: [2x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.5 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b12-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:48:52] RESOLVED: [8x] ProbeDown: Service aqs1025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:51:36] !log brett@cumin2002 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org [21:51:37] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and not P{dns6002.wikimedia.org} and not A:magru and (A:dnsbox) [21:52:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2018.codfw.wmnet [21:52:18] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2019.codfw.wmnet [21:53:52] FIRING: [9x] ProbeDown: Service aqs1025-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:54:48] 10ops-codfw, 06DC-Ops, 06Traffic: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912#11942911 (10BCornwall) Chatted with @Papaul and I'm told that it's a requirement for us to set each drive as their own virtual disk in RAID0 for the drives to be accessible/online.... 
[21:55:09] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:aqs [21:55:55] RESOLVED: [8x] ProbeDown: Service aqs1026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from releases.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=releases.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:58:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2019.codfw.wmnet [21:58:28] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2020.codfw.wmnet [21:58:56] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp6001.drmrs.wmnet [21:58:56] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp6001.drmrs.wmnet} and A:cp [21:58:59] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp6009.drmrs.wmnet [21:58:59] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp6009.drmrs.wmnet} and A:cp [21:59:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T2200) [22:00:28] RESOLVED: JobUnavailable: Reduced availability for job 
atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:01:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from releases.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=releases.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:02:15] (03CR) 10Bking: "@jclark@wikimedia.org per Slack conversation at https://wikimedia.slack.com/archives/C055QGPTC69/p1779228915463289 , we think we want to r" [puppet] - 10https://gerrit.wikimedia.org/r/1289428 (https://phabricator.wikimedia.org/T423314) (owner: 10Bking) [22:04:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2020.codfw.wmnet [22:09:23] (03PS1) 10Jdlrobson: Migrate Swedish to same preference values as other wikis (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290075 (https://phabricator.wikimedia.org/T426880) [22:16:01] (03PS2) 10JHathaway: redfish: add add_account method for RedfishDell [software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [22:17:05] (03CR) 10Krinkle: 404.php: Force a redirect to /wiki/ in very obvious cases (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [22:17:29] (03CR) 10JHathaway: "if possible I think we should try to keep one implementation between vendors, pushed a patch that makes the add_account function generic, " [software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [22:19:16] 
(03CR) 10CI reject: [V:04-1] redfish: add add_account method for RedfishDell [software/spicerack] - 10https://gerrit.wikimedia.org/r/1287905 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [22:23:15] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2017.codfw.wmnet [22:24:50] (03PS5) 10Ladsgroup: Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [22:24:50] (03PS1) 10Ladsgroup: wikireplicas: Migrate from ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) [22:25:58] (03CR) 10Ladsgroup: "I accidentally rebased it when I pulled it. sorry" [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [22:26:15] (03PS2) 10Ladsgroup: wikireplicas: Migrate from ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) [22:28:26] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [22:29:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2017.codfw.wmnet [22:29:49] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2026.codfw.wmnet [22:31:19] (03PS3) 10Ladsgroup: wikireplicas: Migrate from ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) [22:31:30] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [22:33:53] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (3 nodes at a time) for 
ElasticSearch cluster search_codfw: T426560 - bking@cumin2002 [22:36:17] (03PS1) 10Ladsgroup: mariadb: Migrate public dbproxies to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) [22:36:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2026.codfw.wmnet [22:36:38] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2027.codfw.wmnet [22:37:28] (03PS2) 10Ladsgroup: mariadb: Migrate public dbproxies to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) [22:37:37] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [22:38:45] (03CR) 10Ladsgroup: "The PCC is a bit spicy: https://puppet-compiler.wmflabs.org/output/1290078/6808/" [puppet] - 10https://gerrit.wikimedia.org/r/1290078 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [22:41:25] (03PS3) 10Ladsgroup: mariadb: Migrate public dbproxies to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) [22:41:34] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290080 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [22:43:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2027.codfw.wmnet [22:44:58] jouncebot: nowandnext [22:44:58] For the next 0 hour(s) and 15 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260520T2200) [22:44:58] In 7 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T0600) [22:44:58] In 7 hour(s) and 15 minute(s): Primary database switchover 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260521T0600) [22:55:45] (03PS1) 10Krinkle: errorpage: Fix missing `` tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290086 (https://phabricator.wikimedia.org/T129433) [22:56:01] (03PS2) 10Krinkle: errorpage: Fix unclosed bold tag in 404.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290086 (https://phabricator.wikimedia.org/T129433) [22:56:17] (03CR) 10Krinkle: 404.php: Force a redirect to /wiki/ in very obvious cases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [22:56:56] Amir1: looking to deploy? [22:57:20] planning to but need to figure out something first [22:57:47] oh oopsie, I missed that [22:57:47] I could sync the 404 fix quickly if you like [22:57:56] I can do it [22:57:59] okay [22:58:01] Thanks for fixing! [22:58:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290086 (https://phabricator.wikimedia.org/T129433) (owner: 10Krinkle) [22:58:50] I tested it but didn't notice everything is bold, I guess two seconds was too fast :D [22:59:29] (03Merged) 10jenkins-bot: errorpage: Fix unclosed bold tag in 404.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290086 (https://phabricator.wikimedia.org/T129433) (owner: 10Krinkle) [22:59:52] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1290086|errorpage: Fix unclosed bold tag in 404.php (T129433)]] [22:59:56] T129433: Improve design for wiki-facing error pages - https://phabricator.wikimedia.org/T129433 [23:00:01] (03CR) 10Ladsgroup: "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290086 (https://phabricator.wikimedia.org/T129433) (owner: 10Krinkle) [23:01:58] !log ladsgroup@deploy1003 ladsgroup, krinkle: Backport for [[gerrit:1290086|errorpage: Fix unclosed bold tag in 404.php (T129433)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:02:49] !log ladsgroup@deploy1003 ladsgroup, krinkle: Continuing with deployment [23:02:55] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11943144 (10Papaul) I sent a follow up email on this and Engineer said he will get back with me [23:04:43] Amir1: I noticed it in gerrit email about merged patches. It had a ± line for a line that no longer had an (intended) change, so I took a closer look and noticed it [23:04:58] then in Firefox view-source:https://en.wikipedia.org/foo.txt it makes a nice red squiggle. [23:05:23] makes sense. 
Sorry for missing it [23:05:28] np [23:07:01] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290086|errorpage: Fix unclosed bold tag in 404.php (T129433)]] (duration: 07m 09s) [23:07:05] T129433: Improve design for wiki-facing error pages - https://phabricator.wikimedia.org/T129433 [23:07:29] !log wikiadmin2023@10.64.48.159(svwiki)> delete from user_properties where up_value = '2' and up_property = 'thumbsize'; Query OK, 215 rows affected (0.018 sec) (T426880 and T376152) [23:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:35] T426880: PHP Warning: Undefined array key 0 - https://phabricator.wikimedia.org/T426880 [23:07:35] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152 [23:09:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290075 (https://phabricator.wikimedia.org/T426880) (owner: 10Jdlrobson) [23:10:03] (03Merged) 10jenkins-bot: Migrate Swedish to same preference values as other wikis (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290075 (https://phabricator.wikimedia.org/T426880) (owner: 10Jdlrobson) [23:10:26] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1290075|Migrate Swedish to same preference values as other wikis (1/2) (T426880)]] [23:10:49] 249 px? [23:11:00] Is that just to make me sad? [23:11:15] haha, it has no impact and I'll remove it in five minutes [23:11:28] (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1290076/2) [23:11:31] Oh, just to see that it's coming from there? [23:11:40] * James_F nods. 
[23:11:45] yeah [23:12:23] !log ladsgroup@deploy1003 jdlrobson, ladsgroup: Backport for [[gerrit:1290075|Migrate Swedish to same preference values as other wikis (1/2) (T426880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:14:56] !log ladsgroup@deploy1003 jdlrobson, ladsgroup: Continuing with deployment [23:19:02] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290075|Migrate Swedish to same preference values as other wikis (1/2) (T426880)]] (duration: 08m 35s) [23:19:07] T426880: PHP Warning: Undefined array key 0 - https://phabricator.wikimedia.org/T426880 [23:21:23] I still can't believe we let wikis have different thumb arrays, so with global preferences, someone with 300px thumb pref would end up with a random thumb size in svwiki because that's what the array value for that key is [23:24:16] (03PS2) 10Ladsgroup: Migrate Swedish to same preference values as other wikis (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290076 (https://phabricator.wikimedia.org/T426880) (owner: 10Jdlrobson) [23:24:18] (03PS2) 10Neriah: Disable wgUseFilePatrol in ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290088 (https://phabricator.wikimedia.org/T426905) [23:24:25] (03PS3) 10Jdlrobson: Migrate Swedish to same preference values as other wikis (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290076 (https://phabricator.wikimedia.org/T426880) [23:24:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290076 (https://phabricator.wikimedia.org/T426880) (owner: 10Jdlrobson) [23:25:34] (03Merged) 10jenkins-bot: Migrate Swedish to same preference values as other wikis (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290076 (https://phabricator.wikimedia.org/T426880) (owner: 10Jdlrobson) [23:26:01] !log ladsgroup@deploy1003 Started scap 
sync-world: Backport for [[gerrit:1290076|Migrate Swedish to same preference values as other wikis (2/2) (T426880)]] [23:26:05] T426880: PHP Warning: Undefined array key 0 - https://phabricator.wikimedia.org/T426880 [23:27:08] (03CR) 10RLazarus: [C:03+1] "LGTM for the others!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1290054 (owner: 10Scott French) [23:27:52] !log ladsgroup@deploy1003 ladsgroup, jdlrobson: Backport for [[gerrit:1290076|Migrate Swedish to same preference values as other wikis (2/2) (T426880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:28:32] !log ladsgroup@deploy1003 ladsgroup, jdlrobson: Continuing with deployment [23:32:38] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290076|Migrate Swedish to same preference values as other wikis (2/2) (T426880)]] (duration: 06m 37s) [23:32:43] T426880: PHP Warning: Undefined array key 0 - https://phabricator.wikimedia.org/T426880 [23:40:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1290089 [23:40:10] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1290089 (owner: 10TrainBranchBot) [23:52:23] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1290089 (owner: 10TrainBranchBot) [23:56:36] (03PS1) 10Krinkle: Enable wmgUseUrlShortenerLegacy on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290093 (https://phabricator.wikimedia.org/T107188)