[00:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:09:28] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268104 [01:09:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268104 (owner: TrainBranchBot) [01:19:15] RECOVERY - ps1-by27-esams-infeed-load-tower-B-single-phase on ps1-by27-esams is OK: SNMP OK - ps1-by27-esams-infeed-load-tower-B-single-phase 1193 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:21:59] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268104 (owner: TrainBranchBot) [01:24:08] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11788942 (Jclark-ctr) a:Jclark-ctr [02:00:36] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:06:52] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 15s) [02:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:12] FIRING: CertAlmostExpired: Certificate for service 
opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:32:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:37:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:43:05] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:43:59] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 03 Jun 2026 06:56:12 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:48:05] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:48:55] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 03 Jun 2026 06:56:12 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:36:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:38:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:39:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:43:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:46:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:47:13] RECOVERY - PyBal backends health check on lvs2014 is OK: 
PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:48:52] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:50:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:51:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:52:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:53:52] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:55:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:55:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:56:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:57:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:39:25] FIRING: 
SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:51:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1022.eqiad.wmnet with reason: Patch clouddb1022 [05:52:00] (PS1) Marostegui: clouddb1022.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268117 [05:53:27] (CR) Marostegui: [C:+2] clouddb1022.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268117 (owner: Marostegui) [06:09:28] (PS1) Marostegui: Revert "clouddb1022.yaml: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268118 [06:10:56] (CR) Marostegui: [C:+2] Revert "clouddb1022.yaml: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268118 (owner: Marostegui) [06:41:57] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare [06:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:17] (Abandoned) Hashar: gerrit: adjust idleTimeout on Jetty [puppet] - https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: Arnaudb) [07:35:08] (CR) Kgraessle: Set live configuration for Extension:PersonalDashboard on English Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415) (owner: Kgraessle) [07:36:09] (CR) Kgraessle: Set live configuration for Extension:PersonalDashboard on English Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415) (owner: Kgraessle) [07:37:01] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415) (owner: Kgraessle) [07:40:17] hi, deploying mine if that's ok? 
[07:41:58] (CR) TrainBranchBot: [C:+2] "Approved by kgraessle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415) (owner: Kgraessle) [07:42:53] (Merged) jenkins-bot: Set live configuration for Extension:PersonalDashboard on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415) (owner: Kgraessle) [07:43:27] !log kgraessle@deploy1003 Started scap sync-world: Backport for [[gerrit:1264631|Set live configuration for Extension:PersonalDashboard on English Wikipedia (T421415)]] [07:43:31] T421415: Set live configuration for Extension:PersonalDashboard on English Wikipedia - https://phabricator.wikimedia.org/T421415 [07:59:35] !log kgraessle@deploy1003 kgraessle: Backport for [[gerrit:1264631|Set live configuration for Extension:PersonalDashboard on English Wikipedia (T421415)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:59:38] T421415: Set live configuration for Extension:PersonalDashboard on English Wikipedia - https://phabricator.wikimedia.org/T421415 [08:02:50] !log kgraessle@deploy1003 kgraessle: Continuing with sync [08:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:22] !log kgraessle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264631|Set live configuration for Extension:PersonalDashboard on English Wikipedia (T421415)]] (duration: 31m 54s) [08:15:25] T421415: Set live configuration for Extension:PersonalDashboard on English Wikipedia - https://phabricator.wikimedia.org/T421415 [08:23:00] (CR) Hashar: [C:-1] gerrit: update sshd timeouts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1266149 (https://phabricator.wikimedia.org/T417996) (owner: Arnaudb) [08:33:46] katherine_g: checking if you're finished, before i start mine? 
[08:34:02] urbanecm: I'm done, go ahead [08:34:05] perf [08:34:12] (PS2) Urbanecm: [Growth] Decrease user impact limits back to the defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1268061 (https://phabricator.wikimedia.org/T422288) [08:34:17] (CR) Urbanecm: [C:+2] [Growth] Decrease user impact limits back to the defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1268061 (https://phabricator.wikimedia.org/T422288) (owner: Urbanecm) [08:35:11] (Merged) jenkins-bot: [Growth] Decrease user impact limits back to the defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1268061 (https://phabricator.wikimedia.org/T422288) (owner: Urbanecm) [08:35:40] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1268061|[Growth] Decrease user impact limits back to the defaults (T422288 T341599)]] [08:35:45] T422288: Massive performance problems apparently related to the GrowthExperiments extension - https://phabricator.wikimedia.org/T422288 [08:35:45] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [08:36:23] SRE, DBA, Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11789152 (Od1n) I was still encountering the issue, and I’ve just resolved it by making an edit to [[ https://fr.wikipedia.org/wiki/MediaWiki:Group-sysop.js | MediaWiki:Group-s... [08:37:01] SRE, DBA, Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11789154 (Marostegui) Is this good to be closed? [08:37:18] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1268061|[Growth] Decrease user impact limits back to the defaults (T422288 T341599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:38:51] (PS1) Urbanecm: SECURITY: Protect ApiEchoNotifications with a new user right [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268196 (https://phabricator.wikimedia.org/T420154) [08:40:02] !log urbanecm@deploy1003 urbanecm: Continuing with sync [08:40:37] (PS2) Urbanecm: SECURITY: Protect ApiEchoNotifications with a new user right [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268196 (https://phabricator.wikimedia.org/T420154) [08:40:37] (PS1) Urbanecm: [i18n] Correct the action message [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268197 (https://phabricator.wikimedia.org/T420154) [08:40:46] (PS1) Urbanecm: refactor: Use a trait to check for reading permissions [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268198 (https://phabricator.wikimedia.org/T420154) [08:40:55] (PS1) Urbanecm: Create a new grant for the echo-read-notifications [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268199 (https://phabricator.wikimedia.org/T420154) [08:40:59] (CR) Urbanecm: [C:+2] SECURITY: Protect ApiEchoNotifications with a new user right [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268196 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:41:07] (CR) Urbanecm: [C:+2] [i18n] Correct the action message [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268197 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:41:10] (CR) Urbanecm: [C:+2] refactor: Use a trait to check for reading permissions [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268198 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:41:17] (CR) Urbanecm: [C:+2] Create a new grant for the echo-read-notifications [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268199 
(https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:46:31] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268061|[Growth] Decrease user impact limits back to the defaults (T422288 T341599)]] (duration: 10m 50s) [08:46:36] T422288: Massive performance problems apparently related to the GrowthExperiments extension - https://phabricator.wikimedia.org/T422288 [08:46:36] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [08:47:13] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268196 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:47:13] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268197 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:47:15] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268198 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:47:16] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268199 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:48:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1023.eqiad.wmnet with reason: Maintenance [08:53:46] (Merged) jenkins-bot: SECURITY: Protect ApiEchoNotifications with a new user right [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268196 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:53:52] (Merged) jenkins-bot: [i18n] Correct the action message [extensions/Echo] (wmf/1.46.0-wmf.22) 
- https://gerrit.wikimedia.org/r/1268197 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:53:58] (Merged) jenkins-bot: refactor: Use a trait to check for reading permissions [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268198 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:54:03] (Merged) jenkins-bot: Create a new grant for the echo-read-notifications [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268199 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [08:54:36] SRE, DBA, Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11789178 (A_smart_kitten) >>! In T422130#11789154, @Marostegui wrote: > Is this good to be closed? Judging by T422130#11782760, I guess it's just waiting for followups to be f... [08:55:50] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1268196|SECURITY: Protect ApiEchoNotifications with a new user right (T420154)]], [[gerrit:1268197|[i18n] Correct the action message (T420154)]], [[gerrit:1268198|refactor: Use a trait to check for reading permissions (T420154)]], [[gerrit:1268199|Create a new grant for the echo-read-notifications (T420154)]] [08:59:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1169: test [09:00:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1169: test [09:00:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Pool db1169', diff saved to https://phabricator.wikimedia.org/P90257 and previous config saved to /var/cache/conftool/dbconfig/20260406-090040-marostegui.json [09:01:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1152: Upgrade [09:01:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [09:01:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:01:24] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1152: Upgrade [09:05:39] PROBLEM - statsv Varnishkafka log producer on cp2045 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:06:39] RECOVERY - statsv Varnishkafka log producer on cp2045 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:10:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [09:10:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:15:16] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1268196|SECURITY: Protect ApiEchoNotifications with a new user right (T420154)]], [[gerrit:1268197|[i18n] Correct the action message (T420154)]], [[gerrit:1268198|refactor: Use a trait to check for reading permissions (T420154)]], [[gerrit:1268199|Create a new grant for the echo-read-notifications (T420154)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:15:42] !log urbanecm@deploy1003 urbanecm: Continuing with sync [09:19:32] (PS1) Marostegui: db2142: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268200 (https://phabricator.wikimedia.org/T418561) [09:20:05] (CR) Marostegui: [C:+2] db2142: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268200 (https://phabricator.wikimedia.org/T418561) (owner: Marostegui) [09:22:57] (PS1) Urbanecm: Respect the echo-read-notifications right in user interface [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268201 (https://phabricator.wikimedia.org/T420154) [09:23:06] (PS1) Urbanecm: Grant new 'echo-read-notifications' right to all users [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268202 (https://phabricator.wikimedia.org/T422297) [09:23:42] (CR) Urbanecm: [C:+2] Respect the echo-read-notifications right in user interface [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268201 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [09:23:45] (CR) Urbanecm: [C:+2] Grant new 'echo-read-notifications' right to all users [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268202 (https://phabricator.wikimedia.org/T422297) (owner: Urbanecm) [09:23:57] FIRING: ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext-next:4455 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:06] !incidents [09:26:07] 7812 (UNACKED) ProbeDown sre (10.2.2.7 ip4 mw-api-ext-next:4455 probes/service http_mw-api-ext-next_ip4 eqiad) [09:26:12] !ack 7812 [09:26:13] 7812 (ACKED) ProbeDown sre (10.2.2.7 ip4 mw-api-ext-next:4455 probes/service http_mw-api-ext-next_ip4 eqiad) [09:26:15] FIRING: PHPFPMTooBusy: Not enough idle 
PHP-FPM workers for Mediawiki mw-api-ext releases routed via next at eqiad: 0% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:26:34] !incidents [09:26:35] 7812 (ACKED) ProbeDown sre (10.2.2.7 ip4 mw-api-ext-next:4455 probes/service http_mw-api-ext-next_ip4 eqiad) [09:26:35] 7813 (UNACKED) PHPFPMTooBusy sre (mw-api-ext next eqiad) [09:26:41] !ack 7813 [09:26:41] 7813 (ACKED) PHPFPMTooBusy sre (mw-api-ext next eqiad) [09:26:52] (CR) Urbanecm: Respect the echo-read-notifications right in user interface [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268201 (https://phabricator.wikimedia.org/T420154) (owner: Urbanecm) [09:26:55] (CR) Urbanecm: Grant new 'echo-read-notifications' right to all users [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268202 (https://phabricator.wikimedia.org/T422297) (owner: Urbanecm) [09:27:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Upgrade [09:27:11] removed +2 from backports, waiting. [09:27:37] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268196|SECURITY: Protect ApiEchoNotifications with a new user right (T420154)]], [[gerrit:1268197|[i18n] Correct the action message (T420154)]], [[gerrit:1268198|refactor: Use a trait to check for reading permissions (T420154)]], [[gerrit:1268199|Create a new grant for the echo-read-notifications (T420154)]] (duration: 31m 47s) [09:27:49] urbanecm: you are about to backport I reckon ? [09:28:08] effie: i have two pending backports, but i aborted. 
[09:28:31] I hope you will not wait for long, need a little bit to assess what is up [09:28:35] tx [09:29:18] np, thanks for taking a look! [09:30:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2142.codfw.wmnet with OS trixie [09:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via next at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:31:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via next at eqiad: 0% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:32:53] !incidents [09:32:53] 7812 (ACKED) ProbeDown sre (10.2.2.7 ip4 mw-api-ext-next:4455 probes/service http_mw-api-ext-next_ip4 eqiad) [09:32:53] 7813 (RESOLVED) PHPFPMTooBusy sre (mw-api-ext next eqiad) [09:33:22] (CR) CI reject: [V:-1] Grant new 'echo-read-notifications' right to all users [extensions/Echo] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268202 (https://phabricator.wikimedia.org/T422297) (owner: Urbanecm) [09:33:57] RESOLVED: ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext-next:4455 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:26] urbanecm: please go ahead [09:35:42] effie: Did you do anything? [09:35:42] that was quick! 
i'll need to finish it up later. thanks for unblocking though! [09:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via next at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:38:01] slyngs: I did not, see -sre [09:39:21] Awesome, thank you :-) [09:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:26] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2142.codfw.wmnet with reason: host reimage [09:54:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2142.codfw.wmnet with reason: host reimage [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T1000) [10:06:06] SRE, SRE-swift-storage, Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#11789260 (A_smart_kitten) >>! In T383053#10471670, @MatthewVernon wrote: > Reported as [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=... 
[10:11:33] (PS1) Marostegui: Revert "db2142: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268205 [10:12:06] (CR) Marostegui: [C:+2] Revert "db2142: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268205 (owner: Marostegui) [10:12:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2142.codfw.wmnet with OS trixie [10:13:45] SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363 (SGupta-WMF) NEW [10:15:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [10:15:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [10:42:12] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare [10:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:02:01] (PS2) D3r1ck01: Remove unused JWT for bot password temporary config [mediawiki-config] - https://gerrit.wikimedia.org/r/1247960 (https://phabricator.wikimedia.org/T422367) [11:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:15] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=s3 [11:23:19] (PS1) Marostegui: pc5: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268207 (https://phabricator.wikimedia.org/T422368) [11:24:24] !log 
marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [11:24:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:24:37] (CR) Marostegui: [C:+2] pc5: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1268207 (https://phabricator.wikimedia.org/T422368) (owner: Marostegui) [11:26:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on pc2015.codfw.wmnet with reason: Maintenance [11:26:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance [11:28:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc1015.eqiad.wmnet with OS trixie [11:29:25] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc2015.codfw.wmnet with OS trixie [11:43:20] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1015.eqiad.wmnet with reason: host reimage [11:43:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:46:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:48:02] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2015.codfw.wmnet with reason: host reimage [11:49:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1015.eqiad.wmnet with reason: host reimage [11:52:27] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook 
[11:52:57] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:53:20] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:53:22] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:53:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2015.codfw.wmnet with reason: host reimage [11:56:22] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:58:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:01:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:03:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:03:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:08] !log marostegui@cumin1003 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host pc1015.eqiad.wmnet with OS trixie [12:06:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:06:27] SRE, SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11789489 (Aklapper) > Wikitech Username: Sg912 @SGupta-WMF: That's unlikely. Please also [link your LDAP account to your Phab account](https://phabricator.wikimedia.org/settings/pane... [12:11:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2015.codfw.wmnet with OS trixie [12:11:44] (PS1) Marostegui: Revert "pc5: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268211 [12:13:13] (CR) Marostegui: [C:+2] Revert "pc5: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1268211 (owner: Marostegui) [12:13:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [12:13:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [12:23:21] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:29:13] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11789530 (Jclark-ctr) Updated iDRAC and expander firmware, and opened a Dell support ticket. 
[12:29:40] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11789532 (Jclark-ctr) [12:30:18] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11789535 (Jclark-ctr) [12:34:37] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11789546 (Jclark-ctr) SR 224787352 [12:38:47] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 465.26 ms [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1023.eqiad.wmnet with reason: Maintenance [13:10:54] SRE, Infrastructure-Foundations, Patch-For-Review: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11789623 (LSobanski) The alert `MirrorHighLag` has started firing 1 month ago. Would it make sense to disable it at this point? ===== Labels `lang=ini alertname=MirrorHighL... 
[13:15:35] SRE, SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11789637 (SGupta-WMF) @Aklapper Thanks for the link , I completed the needful [13:20:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Maintenance [13:42:23] SRE, ServiceOps new, Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11789662 (LucasWerkmeister) Declined→Open I don’t see how that’s a reason to close the task? I don’t really care wheth... [14:00:46] (PS1) Bking: opensearch-ipoid: remove version pin [deployment-charts] - https://gerrit.wikimedia.org/r/1268222 (https://phabricator.wikimedia.org/T422378) [14:06:30] (CR) JHathaway: [C:+2] puppetserver: emit info before deploying code [puppet] - https://gerrit.wikimedia.org/r/1260686 (owner: Hashar) [14:15:25] PROBLEM - MD RAID on ml-serve1001 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:15:26] ACKNOWLEDGEMENT - MD RAID on ml-serve1001 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T422382 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:15:31] ops-eqiad, SRE, DC-Ops: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382 (ops-monitoring-bot) NEW [14:17:23] (PS1) Vgutierrez: aptrepo: Add missing update block for haproxy32 [puppet] - https://gerrit.wikimedia.org/r/1268223 (https://phabricator.wikimedia.org/T421402) [14:18:14] (CR) Majavah: [C:+1] aptrepo: Add missing update block 
for haproxy32 [puppet] - https://gerrit.wikimedia.org/r/1268223 (https://phabricator.wikimedia.org/T421402) (owner: Vgutierrez) [14:19:03] (CR) Vgutierrez: [C:+2] aptrepo: Add missing update block for haproxy32 [puppet] - https://gerrit.wikimedia.org/r/1268223 (https://phabricator.wikimedia.org/T421402) (owner: Vgutierrez) [14:19:32] (CR) Eevans: [C:+2] restbase: upgrade to Cassandra 4.1.11 [puppet] - https://gerrit.wikimedia.org/r/1266387 (https://phabricator.wikimedia.org/T418417) (owner: Eevans) [14:20:22] ops-eqiad, SRE, DC-Ops: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11789845 (Jclark-ctr) a:Jclark-ctr This server is out of warranty will look to at spare drives [14:24:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:26:23] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [14:26:26] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [14:27:05] !log fetch haproxy 3.2.15 on thirdparty/haproxy32 (trixie-wikimedia) - T421402 [14:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:07] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [14:28:16] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[6001,6009].*} and A:cp - 3.2.15 upgrade (T421402) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T1430) [14:37:15] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:37] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:40:14] !log vgutierrez@cumin1003 END (PASS) - Cookbook 
sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[6001,6009].*} and A:cp - 3.2.15 upgrade (T421402) [14:40:16] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [14:42:12] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare [14:45:01] (PS1) Pppery: Move createwithcontentmodel to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) [14:45:10] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.90 ms [14:45:17] (PS2) Pppery: Move createwithcontentmodel to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) [14:45:18] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.73 ms [14:46:22] FIRING: GnmiTargetDown: lsw1-b7-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [14:46:35] SRE, ServiceOps new, Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11789898 (jijiki) Open→In progress [14:47:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:47:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T419635)', diff saved to https://phabricator.wikimedia.org/P90263 and previous config saved to /var/cache/conftool/dbconfig/20260406-144734-fceratto.json [14:47:38] T419635: Drop il_to column from imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T419635 [14:47:43] ops-codfw, SRE, DC-Ops: Power Supply - Status - issue on logstash2036:9290 - https://phabricator.wikimedia.org/T422310#11789900 (Jhancock.wm) Open→Resolved a:Jhancock.wm psu2 is rapid blinking even after reseated power, trying new port, and replacing the cable. out of warranty, checki... [14:48:59] ops-codfw, SRE, DC-Ops: Power Supply - Status - issue on cirrussearch2080:9290 - https://phabricator.wikimedia.org/T422309#11789908 (Jhancock.wm) Open→Resolved a:Jhancock.wm there was a bad psu in T422310 that was causing power fluctuations in the cabinet. it was replaced in that tick... [14:49:38] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [14:51:22] RESOLVED: GnmiTargetDown: lsw1-b7-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [14:53:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:53:33] !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate lvs vip for dumps-lb.eqiad - taavi@cumin1003" [14:53:39] !log taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate lvs vip for dumps-lb.eqiad - taavi@cumin1003" [14:53:39] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T419635)', diff saved to https://phabricator.wikimedia.org/P90264 and previous config saved to 
/var/cache/conftool/dbconfig/20260406-145344-fceratto.json [14:53:48] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:54:01] Hah, just as I said that WDQS hasn't lost service [14:54:06] Looking now [14:56:02] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:57:22] SRE, Data-Persistence, ServiceOps new, Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11789935 (jijiki) [14:58:49] (CR) JHathaway: P:base: Make nftables::set resources always defined (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1266205 (owner: Majavah) [15:01:14] (CR) Jasmine: [C:+1] mw-web: downsize for multi-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) (owner: Blake) [15:01:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:02:22] (CR) Jasmine: "Ah Thanks for catching - fixed!" 
[puppet] - https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine) [15:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:22] (CR) JHathaway: nftables: Fix issues around virtual resource dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260721 (owner: Majavah) [15:03:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P90265 and previous config saved to /var/cache/conftool/dbconfig/20260406-150353-fceratto.json [15:05:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:06:01] SRE, Data-Persistence, ServiceOps new, Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11789973 (jijiki) >>! In T421168#11789662, @LucasWerkmeister wrote: > I don’t see how that’s a reason to... 
[15:06:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:07:02] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:10:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:10:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:10:41] (PS1) Vgutierrez: mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 [15:11:32] (PS1) Dzahn: phabricator: disable dump file creation also in eqiad [puppet] - https://gerrit.wikimedia.org/r/1268230 (https://phabricator.wikimedia.org/T422327) [15:12:02] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:12:40] (CR) CI reject: [V:-1] mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 (owner: Vgutierrez) [15:13:34] (CR) Dzahn: [C:+2] phabricator: disable dump file creation also in eqiad [puppet] - https://gerrit.wikimedia.org/r/1268230 (https://phabricator.wikimedia.org/T422327) (owner: Dzahn) [15:14:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P90266 and previous 
config saved to /var/cache/conftool/dbconfig/20260406-151401-fceratto.json [15:14:37] (PS2) Vgutierrez: mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 [15:15:02] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:16:37] (CR) CI reject: [V:-1] mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 (owner: Vgutierrez) [15:20:44] (CR) JHathaway: [C:+1] profile::pki::intermediates: refresh discovery's public key [puppet] - https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [15:23:53] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11790096 (VRiley-WMF) Ticket number for Dell is 224788872. Onsite tech will be coming onsite to check devices. 
[15:24:01] (CR) Jasmine: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine) [15:24:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T419635)', diff saved to https://phabricator.wikimedia.org/P90267 and previous config saved to /var/cache/conftool/dbconfig/20260406-152409-fceratto.json [15:24:12] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:24:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:27:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:27:28] ops-eqiad, SRE, DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11790122 (Papaul) @jcrespo hello We have an issue with the serial number on this sever and we have some update from SM on how to fix it but we will have to t... [15:28:02] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:29:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:29:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T419635)', diff saved to https://phabricator.wikimedia.org/P90268 and previous config saved to /var/cache/conftool/dbconfig/20260406-152908-fceratto.json [15:30:04] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T1530). 
[15:31:02] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:31:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:32:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:35:24] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:35:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T419635)', diff saved to https://phabricator.wikimedia.org/P90269 and previous config saved to /var/cache/conftool/dbconfig/20260406-153526-fceratto.json [15:35:30] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:36:24] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:37:02] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:39:08] (CR) Jasmine: role::kubernetes::worker: add sophroid to the lvs pools (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine) [15:45:35] !log fceratto@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P90270 and previous config saved to /var/cache/conftool/dbconfig/20260406-154534-fceratto.json [15:55:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P90271 and previous config saved to /var/cache/conftool/dbconfig/20260406-155542-fceratto.json [16:00:34] (PS3) Vgutierrez: mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 [16:02:38] (CR) CI reject: [V:-1] mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 (owner: Vgutierrez) [16:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:47] (PS4) Vgutierrez: mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 [16:05:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T419635)', diff saved to https://phabricator.wikimedia.org/P90272 and previous config saved to /var/cache/conftool/dbconfig/20260406-160551-fceratto.json [16:05:54] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:06:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:06:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T419635)', diff saved to https://phabricator.wikimedia.org/P90273 and previous config saved to /var/cache/conftool/dbconfig/20260406-160615-fceratto.json [16:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq 
in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:52] (CR) BCornwall: [C:+1] mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 (owner: Vgutierrez) [16:12:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T419635)', diff saved to https://phabricator.wikimedia.org/P90274 and previous config saved to /var/cache/conftool/dbconfig/20260406-161232-fceratto.json [16:12:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:17:48] (PS1) DLynch: VisualEditor editcheck suggestion feedback is always consolidated [mediawiki-config] - https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) [16:17:51] (PS6) Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role [puppet] - https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) [16:18:08] (CR) Vgutierrez: [C:+2] mtail:cache_haproxy: Support status code 0 [puppet] - https://gerrit.wikimedia.org/r/1268229 (owner: Vgutierrez) [16:19:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) (owner: DLynch) [16:22:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P90275 and previous config saved to /var/cache/conftool/dbconfig/20260406-162241-fceratto.json [16:23:39] (CR) Medelius: VisualEditor editcheck suggestion feedback is always consolidated (1 comment) [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) (owner: DLynch) [16:29:52] (CR) DLynch: VisualEditor editcheck suggestion feedback is always consolidated (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) (owner: DLynch) [16:30:55] (PS2) DLynch: VisualEditor editcheck suggestion feedback is always consolidated [mediawiki-config] - https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) [16:31:53] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [16:31:57] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [16:32:08] (CR) Medelius: [C:+1] VisualEditor editcheck suggestion feedback is always consolidated [mediawiki-config] - https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) (owner: DLynch) [16:32:29] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [16:32:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P90276 and previous config saved to /var/cache/conftool/dbconfig/20260406-163249-fceratto.json [16:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:28] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11790335 (Jhancock.wm) @MatthewVernon these came in. 
any objections to me racking them in the new cage, rows E and F? [16:42:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T419635)', diff saved to https://phabricator.wikimedia.org/P90277 and previous config saved to /var/cache/conftool/dbconfig/20260406-164257-fceratto.json [16:43:01] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:43:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1267946 (https://phabricator.wikimedia.org/T422275) (owner: Aude) [16:43:04] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [16:43:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Maintenance [16:43:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T419635)', diff saved to https://phabricator.wikimedia.org/P90278 and previous config saved to /var/cache/conftool/dbconfig/20260406-164323-fceratto.json [16:48:37] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11790427 (Jhancock.wm) [16:49:40] ops-codfw, SRE, collaboration-services, DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11790430 (Jhancock.wm) a:Jhancock.wm [16:50:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T419635)', diff saved to https://phabricator.wikimedia.org/P90279 and previous config saved to /var/cache/conftool/dbconfig/20260406-165005-fceratto.json [16:50:09] T419635: Drop il_to column from imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T419635 [16:52:20] SRE, SRE-Access-Requests, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11790453 (DSantamaria) Approved! [16:59:44] (CR) VolkerE: [C:+1] Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1267946 (https://phabricator.wikimedia.org/T422275) (owner: Aude) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T1700) [17:00:04] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T1700). [17:00:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P90280 and previous config saved to /var/cache/conftool/dbconfig/20260406-170013-fceratto.json [17:10:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P90281 and previous config saved to /var/cache/conftool/dbconfig/20260406-171021-fceratto.json [17:16:13] !log import trafficserver 9.2.13-1wm1 into trixie-wikimedia - T422328 [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:10] (CR) Bking: [C:+1] Remove support for old Elastic releases [puppet] - https://gerrit.wikimedia.org/r/1247917 (https://phabricator.wikimedia.org/T388607) (owner: Muehlenhoff) [17:20:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T419635)', diff saved to https://phabricator.wikimedia.org/P90282 and previous config saved to /var/cache/conftool/dbconfig/20260406-172030-fceratto.json [17:20:33] T419635: Drop il_to column from imagelinks table in wmf 
production - https://phabricator.wikimedia.org/T419635 [17:20:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [17:20:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1223 (T419635)', diff saved to https://phabricator.wikimedia.org/P90283 and previous config saved to /var/cache/conftool/dbconfig/20260406-172055-fceratto.json [17:21:16] (03CR) 10Bking: [C:03+1] Mark WDQS spec tests to run on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1239596 (owner: 10Muehlenhoff) [17:29:38] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp7001.magru.wmnet} and A:cp - 9.2.13 Upgrade () [17:30:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T419635)', diff saved to https://phabricator.wikimedia.org/P90284 and previous config saved to /var/cache/conftool/dbconfig/20260406-173056-fceratto.json [17:31:00] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [17:34:51] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp7001.magru.wmnet} and A:cp - 9.2.13 Upgrade () [17:37:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp7009.magru.wmnet} and A:cp - 9.2.13 Upgrade () [17:41:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P90285 and previous config saved to /var/cache/conftool/dbconfig/20260406-174104-fceratto.json [17:42:44] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp7009.magru.wmnet} and A:cp - 9.2.13 Upgrade () [17:47:59] (03PS1) 10Bking: opensearch-on-k8s: associate `OpenSearchCert` alerts with the correct prom instance [alerts] - 
10https://gerrit.wikimedia.org/r/1268255 (https://phabricator.wikimedia.org/T419289) [17:51:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P90286 and previous config saved to /var/cache/conftool/dbconfig/20260406-175111-fceratto.json [17:52:35] (03PS1) 10JHathaway: bastions: add bast4006 [puppet] - 10https://gerrit.wikimedia.org/r/1268257 (https://phabricator.wikimedia.org/T418993) [17:53:02] (03PS2) 10JHathaway: bastions: add bast4006 [puppet] - 10https://gerrit.wikimedia.org/r/1268257 (https://phabricator.wikimedia.org/T418993) [17:53:07] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1268257 (https://phabricator.wikimedia.org/T418993) (owner: 10JHathaway) [17:53:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11790718 (10Jclark-ctr) @btullis drive will be on site tomorrow please advise when I can replace it [17:53:42] (03PS1) 10Dzahn: jenkins: switch firewall provider to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1268258 (https://phabricator.wikimedia.org/T422350) [17:54:45] (03CR) 10Dzahn: [C:03+1] "could see it was missing in firewall rules on apt1002 - looks good; unless there was some reason it wasn't ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/1268257 (https://phabricator.wikimedia.org/T418993) (owner: 10JHathaway) [17:55:50] (03CR) 10Dzahn: [C:03+2] "we can't use docker and nftables as firewall provider at the same time - it leads to puppet changes on every run and other tickets like ht" [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [17:56:21] (03PS2) 10Dzahn: jenkins: switch firewall provider to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1268258 (https://phabricator.wikimedia.org/T422350) [17:57:16] (03CR) 10Dzahn: [C:03+2] "breaking 
"puppet change on every run"-cycle" [puppet] - 10https://gerrit.wikimedia.org/r/1268258 (https://phabricator.wikimedia.org/T422350) (owner: 10Dzahn) [17:57:51] (03PS1) 10Andrew Bogott: Revert "magnum/codfw1dev: try using the same chart repo as eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/1268259 [18:01:00] (03CR) 10Andrew Bogott: [C:03+2] Revert "magnum/codfw1dev: try using the same chart repo as eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/1268259 (owner: 10Andrew Bogott) [18:01:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T419635)', diff saved to https://phabricator.wikimedia.org/P90287 and previous config saved to /var/cache/conftool/dbconfig/20260406-180118-fceratto.json [18:01:22] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:01:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [18:03:50] (03CR) 10Herron: [C:03+1] thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [18:03:59] (03CR) 10Herron: [C:03+1] thanos/compact: assign prometheus instances to compactors [puppet] - 10https://gerrit.wikimedia.org/r/1265429 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [18:08:34] (03PS1) 10Andrew Bogott: wmcs-dnsleaks: allow for user-created records under .az as well as .org [puppet] - 10https://gerrit.wikimedia.org/r/1268260 (https://phabricator.wikimedia.org/T421025) [18:09:36] (03CR) 10Andrew Bogott: [C:03+2] wmcs-dnsleaks: allow for user-created records under .az as well as .org [puppet] - 10https://gerrit.wikimedia.org/r/1268260 (https://phabricator.wikimedia.org/T421025) (owner: 10Andrew Bogott) [18:16:37] (03CR) 10Dzahn: [V:03+1 C:03+2] "Need to follow-up on this because the docker.io package only 
installs the docker daemon, but the client has been moved into separate `dock" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [18:19:58] (03PS1) 10Dzahn: ci::docker: also install docker-cli when installing docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1268262 (https://phabricator.wikimedia.org/T418109) [18:20:45] (03CR) 10Dzahn: [V:03+1 C:03+2] "this kind of thing feels like a reason to not have specialized classes to install docker just for CI vs using the global docker profile di" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [18:21:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:23:37] (03CR) 10Dzahn: [C:03+2] ci::docker: also install docker-cli when installing docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1268262 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [18:25:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:25:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:37:06] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [18:37:09] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [18:39:08] !log gitlab-runner1004 - reimaging with --move-vlan T421717 [18:39:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:11] T421717: Collaboration Services: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421717 [18:40:02] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner1004.eqiad.wmnet with OS bookworm [18:40:23] !log dzahn@cumin2002 START - Cookbook sre.hosts.move-vlan for host gitlab-runner1004 [18:42:12] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare [18:42:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:42:41] (03PS1) 10Dzahn: docker_registry: update IP of gitlab-runner1004 for jwt-auth'ed hosts [puppet] - 10https://gerrit.wikimedia.org/r/1268265 (https://phabricator.wikimedia.org/T421717) [18:43:26] dzahn@cumin2002 reimage (PID 2996862) is awaiting input [18:43:50] (03CR) 10Dzahn: [C:03+2] docker_registry: update IP of gitlab-runner1004 for jwt-auth'ed hosts [puppet] - 10https://gerrit.wikimedia.org/r/1268265 (https://phabricator.wikimedia.org/T421717) (owner: 10Dzahn) [18:45:07] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:45:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:46:01] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:46:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:49:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL 
- CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:49:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:49:46] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner1004 - dzahn@cumin2002" [18:49:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner1004 - dzahn@cumin2002" [18:49:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:52] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache gitlab-runner1004.eqiad.wmnet 141.48.64.10.in-addr.arpa 1.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:49:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab-runner1004.eqiad.wmnet 141.48.64.10.in-addr.arpa 1.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:49:56] !log dzahn@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host gitlab-runner1004 [18:50:27] ryankemper: wdqs servers reimages or known issue? 
[18:50:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host gitlab-runner1004 [18:50:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host gitlab-runner1004 [18:51:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:51:58] mutante: known issue (servers getting slammed) [18:52:01] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:52:15] ryankemper: ack! [18:55:15] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5005.eqsin.wmnet} and A:liberica [18:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:58:34] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11790930 (10VRiley-WMF) Thank you, I will report this back. 
Thanks again [18:58:56] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5005.eqsin.wmnet} and A:liberica [19:02:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:49] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [19:09:34] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [19:09:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [19:10:35] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [19:11:24] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11790955 (10VRiley-WMF) Gathered another set of logs and have emailed dell [19:16:29] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [19:17:19] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [19:19:29] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [19:20:15] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [19:28:26] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5004.eqsin.wmnet} and A:liberica [19:28:41] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1004.eqiad.wmnet with OS bookworm [19:29:50] (03CR) 10Bking: [C:03+2] "self-merging, as most folks are out 
today and the blast radius is low." [alerts] - 10https://gerrit.wikimedia.org/r/1268255 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [19:32:10] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5004.eqsin.wmnet} and A:liberica [19:49:00] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1268270 [19:51:14] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1268270 (owner: 10Ahmon Dancy) [19:52:05] !log [wdqs] Restarted `wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service` on `wdqs1012` to clear systemdunitfailed alert [19:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:08] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1268270 (owner: 10Ahmon Dancy) [19:53:10] (03CR) 10Eevans: [C:03+2] aqs: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266388 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [19:57:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T2000). [20:00:05] kemayo and aude: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] hi [20:00:43] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [20:00:45] o/ [20:00:46] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [20:02:17] I can deploy my patch. Want me to get yours bundled in as well, aude? [20:02:24] yes please [20:02:56] mine is for the beta cluster [20:03:25] but i can check test wiki to make sure it is good there too [20:03:31] aude: Yours doesn't want to merge because of the depends-on. [20:03:37] checking [20:04:32] It said: "Change '1267946' has dependency '1267923' targeting the master branch of MediaWiki code project 'mediawiki/extensions/ReadingLists', but the dependency is not present in live train branch: wmf/1.46.0-wmf.22. Master dependencies must be cherry-picked to all live train branches. To avoid this error you will need to cherry-pick and merge the dependency into that branch. This can be done directly from the Gerrit UI. Then you [20:04:32] can restart this backport operation." [20:04:39] (03PS2) 10Aude: Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267946 (https://phabricator.wikimedia.org/T422275) [20:04:58] (03CR) 10Tacsipacsi: Move createwithcontentmodel to autoconfirmed (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) (owner: 10Pppery) [20:05:20] the patch is okay to deploy.
we just want the dependency to be on the beta cluster [20:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) (owner: 10DLynch) [20:05:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267946 (https://phabricator.wikimedia.org/T422275) (owner: 10Aude) [20:05:42] i removed the depends-on [20:06:00] Okay, I am going ahead. [20:06:03] thanks [20:06:41] (03Merged) 10jenkins-bot: VisualEditor editcheck suggestion feedback is always consolidated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268242 (https://phabricator.wikimedia.org/T420123) (owner: 10DLynch) [20:06:45] (03Merged) 10jenkins-bot: Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267946 (https://phabricator.wikimedia.org/T422275) (owner: 10Aude) [20:07:05] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1268242|VisualEditor editcheck suggestion feedback is always consolidated (T420123)]], [[gerrit:1267946|Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster (T422275)]] [20:07:12] T420123: MW Feedback link points locally at Dewiki and to the wrong namespace elsewhere - https://phabricator.wikimedia.org/T420123 [20:07:12] T422275: Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster - https://phabricator.wikimedia.org/T422275 [20:08:52] !log kemayo@deploy1003 kemayo, aude: Backport for [[gerrit:1268242|VisualEditor editcheck suggestion 
feedback is always consolidated (T420123)]], [[gerrit:1267946|Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster (T422275)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:09:15] aude: let me know when you've tested yours [20:09:26] checking [20:10:24] looks good [20:10:57] (03PS3) 10Pppery: Move createwithcontentmodel to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) [20:11:16] (03PS4) 10Pppery: Move createwithcontentmodel to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) [20:11:21] (03CR) 10Pppery: Move createwithcontentmodel to autoconfirmed (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) (owner: 10Pppery) [20:11:55] !log kemayo@deploy1003 kemayo, aude: Continuing with sync [20:15:41] (03CR) 10Tacsipacsi: Move createwithcontentmodel to autoconfirmed (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) (owner: 10Pppery) [20:18:02] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268242|VisualEditor editcheck suggestion feedback is always consolidated (T420123)]], [[gerrit:1267946|Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster (T422275)]] (duration: 10m 56s) [20:18:06] T420123: MW Feedback link points locally at Dewiki and to the wrong namespace elsewhere - https://phabricator.wikimedia.org/T420123 [20:18:07] T422275: Set $wgReadingListsEnableBetaQuickSurvey to true for beta cluster - https://phabricator.wikimedia.org/T422275 [20:19:35] (03PS1) 10DLynch: VisualEditorSuggestionFeedback: undo the addition of Talk to the URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268271 (https://phabricator.wikimedia.org/T420123) [20:19:47] (03PS1) 10Ahmon Dancy: 
InitialiseSettings-dev.php: Disable IPReputation and TestKitchen in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1268272 [20:19:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268271 (https://phabricator.wikimedia.org/T420123) (owner: 10DLynch) [20:20:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268271 (https://phabricator.wikimedia.org/T420123) (owner: 10DLynch) [20:21:04] (03CR) 10Ahmon Dancy: [C:03+2] InitialiseSettings-dev.php: Disable IPReputation and TestKitchen in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1268272 (owner: 10Ahmon Dancy) [20:21:12] (03Merged) 10jenkins-bot: VisualEditorSuggestionFeedback: undo the addition of Talk to the URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268271 (https://phabricator.wikimedia.org/T420123) (owner: 10DLynch) [20:21:24] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1268271|VisualEditorSuggestionFeedback: undo the addition of Talk to the URL (T420123)]] [20:21:48] looks good on the beta cluster too. Thanks for deploying my patch! [20:22:00] aude: No problem! [20:22:01] (03Merged) 10jenkins-bot: InitialiseSettings-dev.php: Disable IPReputation and TestKitchen in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1268272 (owner: 10Ahmon Dancy) [20:22:59] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1268271|VisualEditorSuggestionFeedback: undo the addition of Talk to the URL (T420123)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:24:20] !log kemayo@deploy1003 kemayo: Continuing with sync [20:28:31] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268271|VisualEditorSuggestionFeedback: undo the addition of Talk to the URL (T420123)]] (duration: 07m 07s) [20:28:34] T420123: MW Feedback link points locally at Dewiki and to the wrong namespace elsewhere - https://phabricator.wikimedia.org/T420123 [20:41:02] Kemayo: i guess you're finished? :) [20:41:51] urbanecm: indeed! I guess I could have said something rather than letting logmsgbot speak for me. [20:42:03] no worries, i just prefer doublechecking [20:42:09] (03CR) 10Urbanecm: [C:03+2] Respect the echo-read-notifications right in user interface [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268201 (https://phabricator.wikimedia.org/T420154) (owner: 10Urbanecm) [20:42:15] (03CR) 10Urbanecm: [C:03+2] Grant new 'echo-read-notifications' right to all users [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268202 (https://phabricator.wikimedia.org/T422297) (owner: 10Urbanecm) [20:43:28] (03Merged) 10jenkins-bot: Respect the echo-read-notifications right in user interface [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268201 (https://phabricator.wikimedia.org/T420154) (owner: 10Urbanecm) [20:43:40] (03Merged) 10jenkins-bot: Grant new 'echo-read-notifications' right to all users [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268202 (https://phabricator.wikimedia.org/T422297) (owner: 10Urbanecm) [20:44:10] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1268201|Respect the echo-read-notifications right in user interface (T420154)]], [[gerrit:1268202|Grant new 'echo-read-notifications' right to all users (T422297)]] [20:44:15] T420154: CVE-2026-5266: Notifications (Echo) API can be used by any OAuth tool - https://phabricator.wikimedia.org/T420154 [20:44:15] T422297: Echo fails to fetch 
notifications for temporary accounts - https://phabricator.wikimedia.org/T422297 [20:45:44] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1268201|Respect the echo-read-notifications right in user interface (T420154)]], [[gerrit:1268202|Grant new 'echo-read-notifications' right to all users (T422297)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:46:25] !log urbanecm@deploy1003 urbanecm: Continuing with sync [20:50:40] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268201|Respect the echo-read-notifications right in user interface (T420154)]], [[gerrit:1268202|Grant new 'echo-read-notifications' right to all users (T422297)]] (duration: 06m 30s) [20:50:44] T420154: CVE-2026-5266: Notifications (Echo) API can be used by any OAuth tool - https://phabricator.wikimedia.org/T420154 [20:50:44] T422297: Echo fails to fetch notifications for temporary accounts - https://phabricator.wikimedia.org/T422297 [20:50:53] * urbanecm done [20:53:39] claiming mw-experimental [20:53:52] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [20:54:35] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [20:54:45] !log urbanecm@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [20:55:42] !log urbanecm@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [21:00:04] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T2100). 
[21:00:09] !log Locking mw-experimental@eqiad [21:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:43] !log Unlocking mw-experimental@eqiad [21:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:42] !log dancy@deploy1003 Installing scap version "4.244.0" for 2 host(s) [21:15:34] !log dancy@deploy1003 Installation of scap version "4.244.0" completed for 2 hosts [21:19:35] (03CR) 10Bking: [C:03+2] Remove support for old Elastic releases [puppet] - 10https://gerrit.wikimedia.org/r/1247917 (https://phabricator.wikimedia.org/T388607) (owner: 10Muehlenhoff) [21:20:07] FIRING: ProbeDown: Service aqs1022-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1022-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:25:07] RESOLVED: ProbeDown: Service aqs1022-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1022-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:25:28] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [21:25:32] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [21:26:15] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [21:55:37] (03PS1) 10Andrew Bogott: Make cloudcephmon2007-dev a real cloudcephmon [puppet] - 10https://gerrit.wikimedia.org/r/1268280 (https://phabricator.wikimedia.org/T420282) [21:58:39] Hey all - going to deploy 
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1268273 now via spiderpig
[21:59:08] (PS1) SBassett: Check if $res->message is null within ApiAuthManagerHelper [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268281 (https://phabricator.wikimedia.org/T422320)
[21:59:33] (CR) SBassett: [C:+1] Check if $res->message is null within ApiAuthManagerHelper [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268281 (https://phabricator.wikimedia.org/T422320) (owner: SBassett)
[21:59:45] (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268281 (https://phabricator.wikimedia.org/T422320) (owner: SBassett)
[22:05:10] (PS2) Andrew Bogott: Make cloudcephmon2007-dev a real cloudcephmon [puppet] - https://gerrit.wikimedia.org/r/1268280 (https://phabricator.wikimedia.org/T420282)
[22:06:20] (PS3) Andrew Bogott: Make cloudcephmon2007-dev a real cloudcephmon [puppet] - https://gerrit.wikimedia.org/r/1268280 (https://phabricator.wikimedia.org/T420282)
[22:08:41] (CR) CI reject: [V:-1] Check if $res->message is null within ApiAuthManagerHelper [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268281 (https://phabricator.wikimedia.org/T422320) (owner: SBassett)
[22:09:52] (CR) Andrew Bogott: [C:+2] Make cloudcephmon2007-dev a real cloudcephmon [puppet] - https://gerrit.wikimedia.org/r/1268280 (https://phabricator.wikimedia.org/T420282) (owner: Andrew Bogott)
[22:10:13] (Merged) jenkins-bot: Check if $res->message is null within ApiAuthManagerHelper [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1268281 (https://phabricator.wikimedia.org/T422320) (owner: SBassett)
[22:12:16] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1268281|Check if $res->message is null within ApiAuthManagerHelper (T422320)]]
[22:12:20] T422320: Android & iOS app login broken: "Could not extract login status" - https://phabricator.wikimedia.org/T422320
[22:13:55] !log sbassett@deploy1003 sbassett: Backport for [[gerrit:1268281|Check if $res->message is null within ApiAuthManagerHelper (T422320)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:14:19] !log sbassett@deploy1003 sbassett: Continuing with sync
[22:18:34] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268281|Check if $res->message is null within ApiAuthManagerHelper (T422320)]] (duration: 06m 18s)
[22:18:37] T422320: Android & iOS app login broken: "Could not extract login status" - https://phabricator.wikimedia.org/T422320
[22:42:12] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
[22:51:37] (PS1) Andrew Bogott: Take cloudcephosd2004-dev out of service [puppet] - https://gerrit.wikimedia.org/r/1268286 (https://phabricator.wikimedia.org/T420282)
[22:53:17] (CR) Andrew Bogott: [C:+2] Take cloudcephosd2004-dev out of service [puppet] - https://gerrit.wikimedia.org/r/1268286 (https://phabricator.wikimedia.org/T420282) (owner: Andrew Bogott)
[22:56:06] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003
[22:56:09] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417
[22:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260406T2300)
[23:12:15] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner1003.eqiad.wmnet with OS bookworm
[23:12:47] !log dzahn@cumin2002 START - Cookbook sre.hosts.move-vlan for host gitlab-runner1003
[23:15:46] (PS1) Dzahn: docker_registry: update IP of gitlab-runner1003 for jwt-auth'ed hosts [puppet] - https://gerrit.wikimedia.org/r/1268288 (https://phabricator.wikimedia.org/T421717)
[23:15:50] dzahn@cumin2002 reimage (PID 3133281) is awaiting input
[23:16:10] (CR) CI reject: [V:-1] docker_registry: update IP of gitlab-runner1003 for jwt-auth'ed hosts [puppet] - https://gerrit.wikimedia.org/r/1268288 (https://phabricator.wikimedia.org/T421717) (owner: Dzahn)
[23:16:12] (PS2) Dzahn: docker_registry: update IP of gitlab-runner1003 for jwt-auth'ed hosts [puppet] - https://gerrit.wikimedia.org/r/1268288 (https://phabricator.wikimedia.org/T421717)
[23:18:54] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[23:20:22] (CR) Dzahn: [C:+2] docker_registry: update IP of gitlab-runner1003 for jwt-auth'ed hosts [puppet] - https://gerrit.wikimedia.org/r/1268288 (https://phabricator.wikimedia.org/T421717) (owner: Dzahn)
[23:24:58] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner1003 - dzahn@cumin2002"
[23:25:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner1003 - dzahn@cumin2002"
[23:25:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:25:05] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache gitlab-runner1003.eqiad.wmnet 184.32.64.10.in-addr.arpa 4.8.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[23:25:08] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab-runner1003.eqiad.wmnet 184.32.64.10.in-addr.arpa 4.8.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[23:25:09] !log dzahn@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host gitlab-runner1003
[23:25:14] !log gitlab: reimaging trusted runners with --move-vlan parameter which changed their IPs - verified was showing up as online after the change and using the new IPs (T421717)
[23:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:17] T421717: Collaboration Services: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421717
[23:25:33] !log dzahn@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host gitlab-runner1003
[23:25:33] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host gitlab-runner1003
[23:39:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1268290
[23:39:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1268290 (owner: TrainBranchBot)
[23:40:39] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage
[23:42:35] ops-codfw, SRE, cloud-services-team, DC-Ops: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282#11791747 (Andrew)
[23:43:44] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage
[23:51:27] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1268290 (owner: TrainBranchBot)
[23:56:24] (PS1) Zabe: Start reading from the new file tables on more large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1268291 (https://phabricator.wikimedia.org/T416548)