[00:00:25] The good old days when I didn't have +2 in that repo, so it was impossible to ever make that mistake.
[00:00:30] Ha.
[00:00:47] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:00:47] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:06:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:07:47] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:10:47] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:11:47] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:19:16] (PS1) Jforrester: Provide abstractwiki-rust, using Trixie-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340)
[00:20:17] (PS2) Jforrester: Provide abstractwiki-rust, using Trixie-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340)
[00:20:42] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289005 (owner: Jforrester)
[00:20:43] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289006 (owner: Jforrester)
[00:20:43] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289007 (owner: Jforrester)
[00:22:29] (CR) Jdlrobson: [C:+1] ThumbLimits: Harmonize svwiki large size with the rest of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1289008 (https://phabricator.wikimedia.org/T376152) (owner: Ladsgroup)
[00:23:06] (Merged) jenkins-bot: IS: Drop wgGraphDefaultVegaVer, never used any more [mediawiki-config] - https://gerrit.wikimedia.org/r/1289005 (owner: Jforrester)
[00:23:08] (Merged) jenkins-bot: IS: Drop wgEnableSpecialMute, ignored since MW 1.46 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289006 (owner: Jforrester)
[00:23:11] (Merged) jenkins-bot: IS: Drop wgDiscussionTools_visualenhancements_*, ignored since 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289007 (owner: Jforrester)
[00:23:28] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1289005|IS: Drop wgGraphDefaultVegaVer, never used any more]], [[gerrit:1289006|IS: Drop wgEnableSpecialMute, ignored since MW 1.46]], [[gerrit:1289007|IS: Drop wgDiscussionTools_visualenhancements_*, ignored since 2025]]
[00:24:10] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab2002.wikimedia.org with reason: T426563
[00:25:12] !log ladsgroup@deploy1003 ladsgroup, jforrester: Backport for [[gerrit:1289005|IS: Drop wgGraphDefaultVegaVer, never used any more]], [[gerrit:1289006|IS: Drop wgEnableSpecialMute, ignored since MW 1.46]], [[gerrit:1289007|IS: Drop wgDiscussionTools_visualenhancements_*, ignored since 2025]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:25:39] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org
[00:26:24] !log ladsgroup@deploy1003 ladsgroup, jforrester: Continuing with deployment
[00:30:28] Whee.
[00:30:35] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289005|IS: Drop wgGraphDefaultVegaVer, never used any more]], [[gerrit:1289006|IS: Drop wgEnableSpecialMute, ignored since MW 1.46]], [[gerrit:1289007|IS: Drop wgDiscussionTools_visualenhancements_*, ignored since 2025]] (duration: 07m 08s)
[00:32:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org
[00:40:09] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289008 (https://phabricator.wikimedia.org/T376152) (owner: Ladsgroup)
[00:40:36] (PS1) Dzahn: tcpircbot (logmsgbot): replace deploy2002 with deploy2003 [puppet] - https://gerrit.wikimedia.org/r/1289019 (https://phabricator.wikimedia.org/T426222)
[00:42:37] (CR) CI reject: [V:-1] ThumbLimits: Harmonize svwiki large size with the rest of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1289008 (https://phabricator.wikimedia.org/T376152) (owner: Ladsgroup)
[00:49:13] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289008 (https://phabricator.wikimedia.org/T376152) (owner: Ladsgroup)
[00:51:56] (Merged) jenkins-bot: ThumbLimits: Harmonize svwiki large size with the rest of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1289008 (https://phabricator.wikimedia.org/T376152) (owner: Ladsgroup)
[00:52:11] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1289008|ThumbLimits: Harmonize svwiki large size with the rest of wikis (T376152)]]
[00:52:15] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152
[00:54:12] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1289008|ThumbLimits: Harmonize svwiki large size with the rest of wikis (T376152)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:54:35] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[00:55:24] (PS5) Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112)
[00:56:23] (CR) Aleksandar Mastilovic: Presto memory tuning, resource groups (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[00:56:30] (CR) Aleksandar Mastilovic: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[00:57:20] (CR) CI reject: [V:-1] Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[00:58:47] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289008|ThumbLimits: Harmonize svwiki large size with the rest of wikis (T376152)]] (duration: 06m 36s)
[00:58:51] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152
[00:59:11] (PS6) Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112)
[01:00:19] (CR) Aleksandar Mastilovic: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic)
[01:05:08] (PS1) Jasmine: k8s: add wikikube-worker2331 [puppet] - https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688)
[01:09:12] (PS1) TrainBranchBot: Branch commit for wmf/1.47.0-wmf.3 [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289023 (https://phabricator.wikimedia.org/T423912)
[01:09:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.47.0-wmf.3 [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289023 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot)
[01:09:29] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1289024
[01:09:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1289024 (owner: TrainBranchBot)
[01:19:12] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:21:30] (Merged) jenkins-bot: Branch commit for wmf/1.47.0-wmf.3 [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289023 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot)
[01:21:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1289024 (owner: TrainBranchBot)
[01:24:12] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:31:18] (PS1) DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1289031 (https://phabricator.wikimedia.org/T344471)
[01:34:54] (CR) DDesouza: [C:+2] miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1289031 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[01:37:23] (Merged) jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1289031 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0200)
[02:00:09] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[02:00:22] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[02:00:24] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[02:00:36] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[02:00:37] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[02:00:52] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[02:01:23] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:08:02] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 39s)
[02:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:10] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:34:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:32] (CR) RLazarus: "Building and testing locally, this doesn't have the right version:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: Jforrester)
[02:46:25] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:49:12] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:09] (CR) Scott French: [C:+1] "Thanks, Jasmine!" [puppet] - https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) (owner: Jasmine)
[02:50:27] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0300)
[03:01:52] (PS1) TrainBranchBot: testwikis to 1.47.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289054 (https://phabricator.wikimedia.org/T423912)
[03:01:55] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289054 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot)
[03:02:51] (Merged) jenkins-bot: testwikis to 1.47.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289054 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot)
[03:03:18] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.47.0-wmf.3 refs T423912
[03:03:22] T423912: 1.47.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T423912
[03:09:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287433 (https://phabricator.wikimedia.org/T355445) (owner: Codename Noreste)
[03:10:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1281901 (https://phabricator.wikimedia.org/T424413) (owner: Codename Noreste)
[03:17:47] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:19:49] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:20:47] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:25:47] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:25:47] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:25:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:25:47] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:26:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:26:47] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:28:47] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:28:47] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:29:49] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:29:49] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:33:41] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:34:41] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:34:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:39:48] looking
[03:40:06] FIRING: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[03:41:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:41:42] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.47.0-wmf.3 refs T423912 (duration: 38m 23s)
[03:41:46] T423912: 1.47.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T423912
[03:43:47] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:46:49] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:54:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0400)
[04:02:42] !log mwpresync@deploy1003 Pruned MediaWiki: 1.46.0-wmf.26 (duration: 02m 40s)
[04:05:17] (PS1) C. Scott Ananian: Forward-compatibility for serialization of ContentHolder in ParserOutput [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289070 (https://phabricator.wikimedia.org/T423701)
[04:05:28] (PS1) C. Scott Ananian: ParsoidLanguageConverter: don't convert TOC if __NOCONTENTCONVERT__ [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289071 (https://phabricator.wikimedia.org/T424773)
[04:05:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289070 (https://phabricator.wikimedia.org/T423701) (owner: C. Scott Ananian)
[04:06:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289071 (https://phabricator.wikimedia.org/T424773) (owner: C. Scott Ananian)
[04:14:08] (CR) CI reject: [V:-1] Forward-compatibility for serialization of ContentHolder in ParserOutput [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289070 (https://phabricator.wikimedia.org/T423701) (owner: C. Scott Ananian)
[04:38:21] RESOLVED: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[04:50:30] (PS1) JHathaway: Rename scap::ferm to scap::firewall [puppet] - https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089)
[04:51:09] (CR) JHathaway: "I think I tracked down the issue!" [puppet] - https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) (owner: JHathaway)
[05:09:17] PROBLEM - mysqld processes on pc2014 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:09:17] PROBLEM - MariaDB Replica Lag: pc4 on pc2014 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[05:09:17] PROBLEM - MariaDB Replica SQL: pc4 on pc2014 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[05:09:17] PROBLEM - MariaDB Replica IO: pc4 on pc2014 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[05:09:17] PROBLEM - MariaDB Event Scheduler pc4 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[05:09:17] PROBLEM - MariaDB read only pc4 on pc2014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:10:56] Downtime expired
[05:11:18] 👍
[05:11:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2014.codfw.wmnet with reason: Maintenance on pc4
[05:12:37] (PS1) Marostegui: pc2014: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1289136
[05:13:57] (CR) Marostegui: [C:+2] pc2014: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1289136 (owner: Marostegui)
[05:17:06] ops-codfw, SRE, DBA, DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11934315 (Marostegui) Thank you @Jhancock.wm!
[05:24:23] (PS1) Marostegui: db2159: Make host candidate master for s7 [puppet] - https://gerrit.wikimedia.org/r/1289141 (https://phabricator.wikimedia.org/T426383)
[05:25:47] (CR) Marostegui: "Changes made in dbctl and orch*" [puppet] - https://gerrit.wikimedia.org/r/1289141 (https://phabricator.wikimedia.org/T426383) (owner: Marostegui)
[05:25:49] (CR) Marostegui: [C:+2] db2159: Make host candidate master for s7 [puppet] - https://gerrit.wikimedia.org/r/1289141 (https://phabricator.wikimedia.org/T426383) (owner: Marostegui)
[05:26:42] ops-codfw, SRE, DBA, DC-Ops, Patch-For-Review: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11934329 (Marostegui) Open→Resolved I've made db2159 candidate master for s7. This can be closed. Thanks for the help Jenn
[05:27:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:28:07] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:38:15] (CR) Phuedx: [C:+1] Remove `wgTestKitchenExperimentStreamNames` [mediawiki-config] - https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) (owner: Santiago Faci)
[05:53:33] (CR) Tiziano Fogli: [C:+2] logstash/thanos-qfe: add event.start [puppet] - https://gerrit.wikimedia.org/r/1287827 (owner: Tiziano Fogli)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0600)
[06:00:04] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0600)
[06:08:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T426087
[06:08:45] T426087: Switchover s5 master (db1210 -> db1230) - https://phabricator.wikimedia.org/T426087
[06:09:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db1230 with weight 0 T426087', diff saved to https://phabricator.wikimedia.org/P92574 and previous config saved to /var/cache/conftool/dbconfig/20260519-060929-fceratto.json
[06:12:24] (CR) Federico Ceratto: [C:+2] mariadb: Promote db1230 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1286412 (https://phabricator.wikimedia.org/T426087) (owner: Gerrit maintenance bot)
[06:14:01] !log Starting s5 eqiad failover from db1210 to db1230 - T426087
[06:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:05] T426087: Switchover s5 master (db1210 -> db1230) - https://phabricator.wikimedia.org/T426087
[06:14:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T426087', diff saved to https://phabricator.wikimedia.org/P92575 and previous config saved to /var/cache/conftool/dbconfig/20260519-061435-fceratto.json
[06:15:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db1230 to s5 primary and set section read-write T426087', diff saved to https://phabricator.wikimedia.org/P92576 and previous config saved to /var/cache/conftool/dbconfig/20260519-061524-fceratto.json
[06:17:23] (CR) Federico Ceratto: [C:+2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1286413 (https://phabricator.wikimedia.org/T426087) (owner: Gerrit maintenance bot)
[06:18:19] !log fceratto@dns1005 START - running authdns-update
[06:19:56] !log fceratto@dns1005 END - running authdns-update
[06:20:53] (PS1) Marostegui: instances.yaml: Remove pc2014 [puppet] - https://gerrit.wikimedia.org/r/1289166 (https://phabricator.wikimedia.org/T426595)
[06:20:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db1210 T426087', diff saved to https://phabricator.wikimedia.org/P92577 and previous config saved to /var/cache/conftool/dbconfig/20260519-062056-fceratto.json
[06:21:00] T426087: Switchover s5 master (db1210 -> db1230) - https://phabricator.wikimedia.org/T426087
[06:21:30] (CR) Marostegui: [C:+2] instances.yaml: Remove pc2014 [puppet] - https://gerrit.wikimedia.org/r/1289166 (https://phabricator.wikimedia.org/T426595) (owner: Marostegui)
[06:22:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[06:22:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove pc2014 from dbctl T426595', diff saved to https://phabricator.wikimedia.org/P92578 and previous config saved to /var/cache/conftool/dbconfig/20260519-062227-marostegui.json
[06:22:31] T426595: decommission pc2014.codfw.wmnet - https://phabricator.wikimedia.org/T426595
[06:24:08] (PS1) Marostegui: mariadb: Remove pc2014 from puppet [puppet] - https://gerrit.wikimedia.org/r/1289169 (https://phabricator.wikimedia.org/T426595)
[06:24:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc2014.codfw.wmnet
[06:24:48] (CR) Marostegui: [C:+2] mariadb: Remove pc2014 from puppet [puppet] - https://gerrit.wikimedia.org/r/1289169 (https://phabricator.wikimedia.org/T426595) (owner: Marostegui)
[06:27:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:28:11] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1057.eqiad.wmnet with OS bookworm
[06:28:34] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1210: Repooling after switchover
[06:28:49] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1058.eqiad.wmnet with OS bookworm
[06:29:13] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[06:30:10] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:32:25] SRE, Infrastructure-Foundations, Mail, Product Safety and Integrity, and 2 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11934418 (kostajh) Open→Resolved >>! In T426105#11933806, @jhathaway wrote: > Someone from Yahoo was kind enough to reach out to me...
[06:33:09] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2014.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:33:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2014.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:33:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:33:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2014.codfw.wmnet
[06:33:36] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission pc2014.codfw.wmnet - https://phabricator.wikimedia.org/T426595#11934420 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1003 for hosts: `pc2014.codfw.wmnet` - pc2014.codfw.wmnet (**PASS**) - Downtimed...
[06:33:38] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2014.codfw.wmnet - https://phabricator.wikimedia.org/T426595#11934421 (10Marostegui) a:05Marostegui→03None [06:38:12] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2014.codfw.wmnet - https://phabricator.wikimedia.org/T426595#11934436 (10Marostegui) Ready for DC-Ops [06:39:51] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1057.eqiad.wmnet with reason: host reimage [06:40:40] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [06:41:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [06:42:55] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1289170 (https://phabricator.wikimedia.org/T426703) [06:44:24] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1057.eqiad.wmnet with reason: host reimage [06:44:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s1 T426703 [06:44:50] T426703: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T426703 [06:45:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2203 with weight 0 T426703', diff saved to https://phabricator.wikimedia.org/P92581 and previous config saved to /var/cache/conftool/dbconfig/20260519-064500-fceratto.json [06:45:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [06:46:25] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:30] (03PS2) 
Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1288994 [06:48:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [06:49:35] (CR) Federico Ceratto: [C:+2] mariadb: Promote db2203 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1289170 (https://phabricator.wikimedia.org/T426703) (owner: Gerrit maintenance bot) [06:50:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:50:21] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [06:51:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1011.eqiad.wmnet: Maintenance on pc1 [06:51:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:51:35] !log Starting s1 codfw failover from db2212 to db2203 - T426703 [06:51:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:51:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011.eqiad.wmnet: Maintenance on pc1 [06:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:41] T426703: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T426703 [06:51:50] (PS1) Muehlenhoff: mariadb::ferm: Switch to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1289171 [06:52:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2021.codfw.wmnet,pc[1011,1021].eqiad.wmnet with reason: Maintenance on pc1 [06:52:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2203 to s1 primary T426703', diff saved to https://phabricator.wikimedia.org/P92583 and previous config saved to /var/cache/conftool/dbconfig/20260519-065224-fceratto.json [06:54:39] !log installing qemu security updates [06:54:44] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:47] (PS1) Marostegui: mariadb: Productionize pc1021 [puppet] - https://gerrit.wikimedia.org/r/1289173 (https://phabricator.wikimedia.org/T418973) [06:54:50] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1289171 (owner: Muehlenhoff) [06:56:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db2212 T426703', diff saved to https://phabricator.wikimedia.org/P92584 and previous config saved to /var/cache/conftool/dbconfig/20260519-065637-fceratto.json [06:56:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1003.eqiad.wmnet [06:57:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: Maintenance [06:59:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1057.eqiad.wmnet with OS bookworm [07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0700). [07:00:05] matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:02:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1003.eqiad.wmnet [07:02:37] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1056.eqiad.wmnet with OS bookworm [07:02:43] (CR) Marostegui: "FYI" [puppet] - https://gerrit.wikimedia.org/r/1289173 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui) [07:02:45] (CR) Marostegui: [C:+2] mariadb: Productionize pc1021 [puppet] - https://gerrit.wikimedia.org/r/1289173 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui) [07:03:12] o/ [07:03:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet [07:03:58] (CR) A-pizzata: "@btullis@wikimedia.org can you help us out whenever is convenient?" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata) [07:04:24] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1058.eqiad.wmnet with OS bookworm [07:04:39] (CR) TrainBranchBot: [C:+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1288994 (owner: Matthias Mullie) [07:05:50] (Merged) jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1288994 (owner: Matthias Mullie) [07:07:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet [07:07:24] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1288994|Squashed diff to master]] [07:07:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [07:07:53] (PS1) Marostegui: pc1011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1289178 (https://phabricator.wikimedia.org/T418973) [07:08:29] (CR) Marostegui: [C:+2] pc1011: 
Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1289178 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui) [07:12:16] o/ [07:12:25] any deployers available? [07:13:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [07:13:24] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1288994|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:14:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1210: Repooling after switchover [07:14:23] !log mlitn@deploy1003 mlitn: Continuing with deployment [07:14:29] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1056.eqiad.wmnet with reason: host reimage [07:16:19] mlitn: I have a patch for increasing account threshold, can you deploy it if possible? [07:17:00] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2212: Repooling after switchover [07:17:28] Nemoralis: if it's a simple one, sure [07:18:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1056.eqiad.wmnet with reason: host reimage [07:20:42] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288994|Squashed diff to master]] (duration: 13m 17s) [07:23:42] SRE-swift-storage, Thumbor, Patch-For-Review: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11934634 (MatthewVernon) [07:24:19] (CR) MVernon: [C:+1] "Seems like a sensible temporary move; I've added a note to T379942 to remind us to undo this once we're done cleaning out thumbs (and pain" [puppet] - https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) (owner: Ladsgroup) [07:26:28] SRE, SRE-Access-Requests: Requesting access to deployment for dsantamaria - 
https://phabricator.wikimedia.org/T426561#11934646 (SLyngshede-WMF) [07:29:35] (PS1) NMW03: Increase account threshold for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1289180 [07:29:47] (CR) Muehlenhoff: [C:+2] Switch install2005 to nftables [puppet] - https://gerrit.wikimedia.org/r/1286942 (owner: Muehlenhoff) [07:30:23] (CR) CI reject: [V:-1] Increase account threshold for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1289180 (owner: NMW03) [07:30:25] (CR) Marostegui: [C:+1] mariadb::ferm: Switch to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1289171 (owner: Muehlenhoff) [07:32:14] (PS2) NMW03: Increase account threshold for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1289180 [07:32:39] (PS1) Slyngshede: data.yaml: add dsantamaria to deployment [puppet] - https://gerrit.wikimedia.org/r/1289182 (https://phabricator.wikimedia.org/T4266561) [07:33:05] (CR) CI reject: [V:-1] Increase account threshold for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1289180 (owner: NMW03) [07:33:57] (PS3) NMW03: Increase account threshold for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1289180 [07:33:58] !log add gnmic 0.46.0 to reprepro [07:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:27] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1056.eqiad.wmnet with OS bookworm [07:39:09] !log cwilliams@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2150.codfw.wmnet [07:39:28] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet [07:40:32] (PS2) CWilliams: mariadb: Decommission db2150 [puppet] - https://gerrit.wikimedia.org/r/1288874 (https://phabricator.wikimedia.org/T424342) [07:41:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on 
wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:42:21] (CR) Elukey: [C:+2] kafka-logging: set new hosts to raid10-4dev [puppet] - https://gerrit.wikimedia.org/r/1288891 (https://phabricator.wikimedia.org/T418929) (owner: Herron) [07:43:46] !log cwilliams@cumin1003 START - Cookbook sre.dns.netbox [07:44:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2005.wikimedia.org [07:44:48] (CR) CWilliams: [C:+2] mariadb: Decommission db2150 [puppet] - https://gerrit.wikimedia.org/r/1288874 (https://phabricator.wikimedia.org/T424342) (owner: CWilliams) [07:45:20] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet [07:46:11] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet [07:48:12] !log cwilliams@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2150.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [07:48:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2150.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [07:48:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:48:39] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2150.codfw.wmnet [07:50:03] !log Removing db2150 from zarcillo T424342 [07:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:06] T424342: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342 [07:50:51] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2005.wikimedia.org [07:52:15] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet [07:52:16] (PS2) Slyngshede: data.yaml: add dsantamaria to deployment [puppet] - https://gerrit.wikimedia.org/r/1289182 (https://phabricator.wikimedia.org/T426561) [07:53:28] (PS1) Muehlenhoff: Switch install1005 / the installserver role at large to nftables [puppet] - https://gerrit.wikimedia.org/r/1289187 [07:57:56] !log reboot apus eqiad frontends (May reboots) [07:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-cluster [07:58:34] !log Removing db2150 from orchestrator T424342 [07:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:38] T424342: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342 [08:00:05] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0800) [08:01:02] ops-codfw, DC-Ops, decommission-hardware, Patch-For-Review: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342#11934804 (CWilliams-WMF) a:CWilliams-WMF→wiki_willy [08:01:19] ops-codfw, DC-Ops, decommission-hardware, Patch-For-Review: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342#11934808 (CWilliams-WMF) This host is ready for DC-Ops to decommission [08:01:51] (PS3) CWilliams: mariadb: Decomission db2151 [puppet] - https://gerrit.wikimedia.org/r/1288875 (https://phabricator.wikimedia.org/T424343) [08:02:30] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2212: Repooling after switchover [08:06:05] (PS1) Elukey: confluent: fix command-config usage in kafka-leader-election [puppet] - 
https://gerrit.wikimedia.org/r/1289266 (https://phabricator.wikimedia.org/T426639) [08:07:12] o/ [08:07:19] I will run the mw train eventually [08:09:35] SRE: May 2026 SRE reboots - https://phabricator.wikimedia.org/T426720#11934849 (A_smart_kitten) [08:09:42] (CR) Elukey: "Tested on Kafka test!" [puppet] - https://gerrit.wikimedia.org/r/1289266 (https://phabricator.wikimedia.org/T426639) (owner: Elukey) [08:10:05] (PS2) Elukey: confluent: fix command-config usage in kafka-leader-election [puppet] - https://gerrit.wikimedia.org/r/1289266 (https://phabricator.wikimedia.org/T426639) [08:10:36] (CR) CI reject: [V:-1] confluent: fix command-config usage in kafka-leader-election [puppet] - https://gerrit.wikimedia.org/r/1289266 (https://phabricator.wikimedia.org/T426639) (owner: Elukey) [08:10:37] running the train [08:10:53] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2003-dev.codfw.wmnet [08:11:06] (PS3) Elukey: confluent: fix command-config usage in kafka-leader-election [puppet] - https://gerrit.wikimedia.org/r/1289266 (https://phabricator.wikimedia.org/T426639) [08:11:50] (PS1) TrainBranchBot: group0 to 1.47.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289267 (https://phabricator.wikimedia.org/T423912) [08:11:53] (CR) TrainBranchBot: [C:+2] "Initiated by hashar@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289267 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot) [08:13:00] (Merged) jenkins-bot: group0 to 1.47.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289267 (https://phabricator.wikimedia.org/T423912) (owner: TrainBranchBot) [08:13:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [08:15:03] (CR) Elukey: [C:+2] role::aux_k8s::master: setup IPIP encapsulation settings [puppet] - https://gerrit.wikimedia.org/r/1282298 
(https://phabricator.wikimedia.org/T420439) (owner: Elukey) [08:16:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [08:17:27] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet [08:17:33] (CR) Brouberol: [C:+1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1289266 (https://phabricator.wikimedia.org/T426639) (owner: Elukey) [08:17:35] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-gutter-eqiad [08:17:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [08:19:02] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2002-dev.codfw.wmnet [08:19:11] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.3 refs T423912 [08:19:15] T423912: 1.47.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T423912 [08:21:22] (CR) JMeybohm: k8s: add wikikube-worker2331 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) (owner: Jasmine) [08:21:48] (PS4) CWilliams: mariadb: Decomission db2151 [puppet] - https://gerrit.wikimedia.org/r/1288875 (https://phabricator.wikimedia.org/T424343) [08:22:52] !log cwilliams@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2151.codfw.wmnet [08:23:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [08:23:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [08:23:49] (CR) Elukey: [C:+2] confluent: fix command-config usage in kafka-leader-election [puppet] - https://gerrit.wikimedia.org/r/1289266 (https://phabricator.wikimedia.org/T426639) (owner: Elukey) [08:24:14] !log reboot apus codfw frontends (May reboots) [08:24:18] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-cluster [08:25:54] (PS2) Effie Mouzeli: mcrouter_wancache: add mc1070-mc1072 to production [puppet] - https://gerrit.wikimedia.org/r/1288500 (https://phabricator.wikimedia.org/T418263) [08:26:00] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet [08:26:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [08:26:50] ops-codfw, SRE, DBA, DC-Ops, decommission-hardware: decommission pc2014.codfw.wmnet - https://phabricator.wikimedia.org/T426595#11934903 (Marostegui) [08:27:25] !log cwilliams@cumin1003 START - Cookbook sre.dns.netbox [08:27:47] hashar: can I sync a config patch when the train is done? [08:28:21] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit2001-dev.codfw.wmnet [08:28:37] it looks quiet, so I guess yes? :] [08:28:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [08:29:17] (PS5) CWilliams: mariadb: Decomission db2151 [puppet] - https://gerrit.wikimedia.org/r/1288875 (https://phabricator.wikimedia.org/T424343) [08:29:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet [08:29:59] (CR) CWilliams: [C:+2] mariadb: Decomission db2151 [puppet] - https://gerrit.wikimedia.org/r/1288875 (https://phabricator.wikimedia.org/T424343) (owner: CWilliams) [08:30:50] kostajh: yes please go ahead! 
[08:31:26] !log cwilliams@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2151.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [08:31:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2151.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [08:31:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:31:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2151.codfw.wmnet [08:32:32] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1070.eqiad.wmnet [08:32:49] !log Removing db2151 from zarcillo T424343 [08:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:52] T424343: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343 [08:33:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet [08:33:31] !log Removing db2151 from orchestrator T424343 [08:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet [08:33:59] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286804 (https://phabricator.wikimedia.org/T421293) (owner: Kosta Harlan) [08:33:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet [08:35:00] ops-codfw, DC-Ops, decommission-hardware: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343#11934935 (CWilliams-WMF) a:CWilliams-WMF→wiki_willy [08:35:06] ops-codfw, DC-Ops, 
decommission-hardware: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343#11934939 (CWilliams-WMF) This host is ready for DC-Ops to decommission [08:35:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [08:35:13] (Merged) jenkins-bot: IPReputation: Route opensearch_ipoid through envoy service mesh [mediawiki-config] - https://gerrit.wikimedia.org/r/1286804 (https://phabricator.wikimedia.org/T421293) (owner: Kosta Harlan) [08:35:14] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet [08:35:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [08:35:41] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1286804|IPReputation: Route opensearch_ipoid through envoy service mesh (T421293)]] [08:35:45] T421293: Enable service mesh for OpenSearch on K8s clusters - https://phabricator.wikimedia.org/T421293 [08:35:46] (CR) Elukey: [C:+2] role::aux_k8s::worker: add IPIP encapsulation settings [puppet] - https://gerrit.wikimedia.org/r/1282299 (https://phabricator.wikimedia.org/T420439) (owner: Elukey) [08:35:54] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.eqiad.wmnet [08:36:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [08:37:05] (PS1) Muehlenhoff: Apply cluster::management role to cumin2003 [puppet] - https://gerrit.wikimedia.org/r/1289272 [08:37:19] (PS2) Muehlenhoff: Apply cluster::management role to cumin2003 [puppet] - https://gerrit.wikimedia.org/r/1289272 [08:37:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-gutter-eqiad [08:37:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host cuminunpriv1001.eqiad.wmnet [08:37:46] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1286804|IPReputation: Route opensearch_ipoid through envoy service mesh (T421293)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:37:47] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1070.eqiad.wmnet [08:37:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet [08:37:51] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1071.eqiad.wmnet [08:39:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [08:39:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [08:39:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2244: Upgrading db2244.codfw.wmnet [08:40:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2244: Upgrading db2244.codfw.wmnet [08:40:38] !log kharlan@deploy1003 kharlan: Continuing with deployment [08:41:28] (PS1) Elukey: services: move the aux k8s' kubemaster to IPIP load balancing [puppet] - https://gerrit.wikimedia.org/r/1289273 (https://phabricator.wikimedia.org/T420439) [08:41:30] (PS1) Elukey: service: move Aux k8s' ingress to IPIP load balancing [puppet] - https://gerrit.wikimedia.org/r/1289274 (https://phabricator.wikimedia.org/T420439) [08:41:52] !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.eqiad.wmnet [08:42:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet [08:42:40] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet [08:43:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1071.eqiad.wmnet [08:43:09] !log 
jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1072.eqiad.wmnet [08:43:48] (PS1) Dreamy Jazz: Remove unused $wgEnableUserEmailMuteList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1289275 (https://phabricator.wikimedia.org/T413867) [08:43:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [08:44:11] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS trixie [08:44:11] jouncebot: nowandnext [08:44:11] For the next 1 hour(s) and 15 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0800) [08:44:11] In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1000) [08:44:31] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-gutter-codfw [08:44:36] Looks like the train is done, so going to sync a no-op config change [08:44:49] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286804|IPReputation: Route opensearch_ipoid through envoy service mesh (T421293)]] (duration: 09m 08s) [08:44:53] T421293: Enable service mesh for OpenSearch on K8s clusters - https://phabricator.wikimedia.org/T421293 [08:45:15] kostajh: Are you done? 
[08:45:40] Dreamy_Jazz: +1 I have promoted group0 wikis and all seems quiet as far as I can tell [08:45:44] (CR) Ayounsi: [C:+1] Switch install1005 / the installserver role at large to nftables [puppet] - https://gerrit.wikimedia.org/r/1289187 (owner: Muehlenhoff) [08:45:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [08:45:52] Thanks [08:45:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [08:45:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet [08:46:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org [08:46:34] Dreamy_Jazz: yes I am done [08:46:50] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289275 (https://phabricator.wikimedia.org/T413867) (owner: Dreamy Jazz) [08:46:55] Thanks, proceeding [08:47:01] !log failover Ganeti cluster in ulsfo to ganeti4008 [08:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:43] (Merged) jenkins-bot: Remove unused $wgEnableUserEmailMuteList config [mediawiki-config] - https://gerrit.wikimedia.org/r/1289275 (https://phabricator.wikimedia.org/T413867) (owner: Dreamy Jazz) [08:48:10] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1289275|Remove unused $wgEnableUserEmailMuteList config (T413867)]] [08:48:13] T413867: Enable Special:Mute by default and remove $wgEnableSpecialMute/$wgEnableUserEmailMuteList feature flags as unnecessary - https://phabricator.wikimedia.org/T413867 [08:48:23] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1072.eqiad.wmnet [08:49:04] !log elukey@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on 
A:aux-worker-eqiad [08:49:21] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet [08:49:25] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2005-dev.codfw.wmnet [08:49:45] PROBLEM - ganeti-wconfd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:50:10] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1289275|Remove unused $wgEnableUserEmailMuteList config (T413867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:50:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org [08:51:05] (CR) Isabelle Hurbain-Palatin: "recheck" [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1289070 (https://phabricator.wikimedia.org/T423701) (owner: C. 
Scott Ananian) [08:51:15] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [08:52:29] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1002.eqiad.wmnet [08:55:25] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289275|Remove unused $wgEnableUserEmailMuteList config (T413867)]] (duration: 07m 15s) [08:55:29] T413867: Enable Special:Mute by default and remove $wgEnableSpecialMute/$wgEnableUserEmailMuteList feature flags as unnecessary - https://phabricator.wikimedia.org/T413867 [08:55:31] I'm done [08:57:38] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2005-dev.codfw.wmnet [08:57:42] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet2006-dev.codfw.wmnet [08:58:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org [08:58:28] !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1002.eqiad.wmnet [08:59:20] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1001.eqiad.wmnet [09:00:47] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage [09:02:17] (CR) Effie Mouzeli: [C:+2] mcrouter_wancache: add mc1070-mc1072 to production [puppet] - https://gerrit.wikimedia.org/r/1288500 (https://phabricator.wikimedia.org/T418263) (owner: Effie Mouzeli) [09:02:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org [09:03:05] (PS1) Marostegui: installserver: Do not format pc1021 [puppet] - https://gerrit.wikimedia.org/r/1289277 (https://phabricator.wikimedia.org/T418973) [09:03:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage [09:04:01] !log jiji@cumin1003 END 
(PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-gutter-codfw [09:05:31] !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1001.eqiad.wmnet [09:05:33] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2006-dev.codfw.wmnet [09:06:13] (PS1) JavierMonton: image: Flink 2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1289278 (https://phabricator.wikimedia.org/T412978) [09:06:27] (CR) Marostegui: "@cwilliams@wikimedia.org FYI" [puppet] - https://gerrit.wikimedia.org/r/1289277 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui) [09:06:30] (CR) Marostegui: [C:+2] installserver: Do not format pc1021 [puppet] - https://gerrit.wikimedia.org/r/1289277 (https://phabricator.wikimedia.org/T418973) (owner: Marostegui) [09:07:54] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1006.eqiad.wmnet [09:08:18] (PS1) Muehlenhoff: use_linux612_on_bookworm: Bump kernel to 6.12.88 [puppet] - https://gerrit.wikimedia.org/r/1289279 [09:08:26] SRE-tools, Infrastructure-Foundations, SRE Observability: sre.kafka.roll-restart-reboot-brokers: command-config is not a recognized option - https://phabricator.wikimedia.org/T426639#11935096 (elukey) Open→Resolved a:elukey [09:08:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1006.eqiad.wmnet [09:13:17] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet1006.eqiad.wmnet [09:13:19] !log filippo@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudnet1006.eqiad.wmnet [09:13:32] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet [09:13:37] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for 
host aux-k8s-worker1006.eqiad.wmnet [09:13:38] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1006.eqiad.wmnet [09:13:43] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1007.eqiad.wmnet [09:14:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1007.eqiad.wmnet [09:15:52] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1289182 (https://phabricator.wikimedia.org/T426561) (owner: 10Slyngshede) [09:17:25] (03PS1) 10Elukey: sre.k8s: fix host_has_l2_adjacency_to_lvs for ganeti VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1289281 (https://phabricator.wikimedia.org/T426601) [09:17:59] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [09:18:10] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [09:18:14] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [09:18:21] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [09:19:28] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1007.eqiad.wmnet [09:19:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1007.eqiad.wmnet [09:19:35] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1008.eqiad.wmnet [09:20:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [09:20:10] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1008.eqiad.wmnet [09:20:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS trixie [09:20:50] !log filippo@cumin1003 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cloudnet1005.eqiad.wmnet [09:22:24] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2151.codfw.wmnet - https://phabricator.wikimedia.org/T424343#11935176 (10Marostegui) a:05wiki_willy→03None [09:22:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2244: Migration of db2244.codfw.wmnet completed [09:23:21] (03CR) 10Brouberol: [C:03+2] image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289278 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [09:23:23] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudnet1006.eqiad.wmnet [09:23:25] (03CR) 10Brouberol: [V:03+2 C:03+2] image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289278 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [09:23:36] (03CR) 10Blake: [C:03+1] sre.k8s: fix host_has_l2_adjacency_to_lvs for ganeti VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1289281 (https://phabricator.wikimedia.org/T426601) (owner: 10Elukey) [09:47:06] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[09:48:57] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2012.codfw.wmnet [09:51:17] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet [09:51:21] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2010-dev.codfw.wmnet [09:53:17] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2011.codfw.wmnet [09:55:16] !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-codfw [09:57:08] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [09:58:21] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan1001.eqiad.wmnet [09:58:45] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2011.codfw.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1000) [10:00:25] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet [10:00:29] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1008-dev.eqiad.wmnet [10:05:27] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-rule in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:05:51] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1001.eqiad.wmnet [10:07:07] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet [10:07:45] (03PS1) 10Brouberol: 
global_config: replace dns->IP resolution for gerrit by PQL lookup [puppet] - 10https://gerrit.wikimedia.org/r/1289292 [10:08:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2244: Migration of db2244.codfw.wmnet completed [10:08:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:09:12] RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-rule in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:58] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet [10:11:16] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8562/co" [puppet] - 10https://gerrit.wikimedia.org/r/1289292 (owner: 10Brouberol) [10:14:41] (03PS1) 10Brouberol: global_config: ensure the gerrit external service contains all LB IPs [puppet] - 10https://gerrit.wikimedia.org/r/1289295 [10:15:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1266.eqiad.wmnet with OS trixie [10:16:54] (03CR) 10CI reject: [V:04-1] global_config: ensure the gerrit external service contains all LB IPs [puppet] - 10https://gerrit.wikimedia.org/r/1289295 (owner: 10Brouberol) [10:18:13] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8563/co" [puppet] - 10https://gerrit.wikimedia.org/r/1289292 (owner: 10Brouberol) [10:18:49] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet [10:19:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1267.eqiad.wmnet with OS trixie [10:19:39] (03PS2) 10Brouberol: 
global_config: ensure the gerrit external service contains all LB IPs [puppet] - 10https://gerrit.wikimedia.org/r/1289292 [10:19:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1268.eqiad.wmnet with OS trixie [10:20:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1269.eqiad.wmnet with OS trixie [10:20:05] (03Abandoned) 10Brouberol: global_config: ensure the gerrit external service contains all LB IPs [puppet] - 10https://gerrit.wikimedia.org/r/1289295 (owner: 10Brouberol) [10:21:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1270.eqiad.wmnet with OS trixie [10:22:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1271.eqiad.wmnet with OS trixie [10:22:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1272.eqiad.wmnet with OS trixie [10:23:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1273.eqiad.wmnet with OS trixie [10:23:31] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8565/co" [puppet] - 10https://gerrit.wikimedia.org/r/1289292 (owner: 10Brouberol) [10:23:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1274.eqiad.wmnet with OS trixie [10:24:04] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1275.eqiad.wmnet with OS trixie [10:24:12] FIRING: [5x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:26] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1276.eqiad.wmnet with OS trixie [10:24:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet 
[10:24:51] (03CR) 10Atsuko: [C:03+1] global_config: ensure the gerrit external service contains all LB IPs [puppet] - 10https://gerrit.wikimedia.org/r/1289292 (owner: 10Brouberol) [10:25:27] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:25:36] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: ensure the gerrit external service contains all LB IPs [puppet] - 10https://gerrit.wikimedia.org/r/1289292 (owner: 10Brouberol) [10:26:42] !log elukey@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on D{aux-k8s-worker100[2-5].eqiad.wmnet} and (A:aux-master-eqiad or A:aux-worker-eqiad) [10:26:45] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1002.eqiad.wmnet [10:28:59] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1266.eqiad.wmnet with reason: host reimage [10:30:10] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet [10:32:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be1006.eqiad.wmnet [10:32:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1002.eqiad.wmnet [10:32:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1267.eqiad.wmnet with reason: host reimage [10:32:56] (03CR) 10Slyngshede: [C:03+2] data.yaml: add dsantamaria to deployment [puppet] - 
10https://gerrit.wikimedia.org/r/1289182 (https://phabricator.wikimedia.org/T426561) (owner: 10Slyngshede) [10:33:14] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1268.eqiad.wmnet with reason: host reimage [10:33:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1269.eqiad.wmnet with reason: host reimage [10:34:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1266.eqiad.wmnet with reason: host reimage [10:34:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1270.eqiad.wmnet with reason: host reimage [10:35:27] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:35:39] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1271.eqiad.wmnet with reason: host reimage [10:36:03] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1002.eqiad.wmnet [10:36:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1002.eqiad.wmnet [10:36:10] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1003.eqiad.wmnet [10:36:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1274.eqiad.wmnet with reason: host reimage [10:36:19] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1272.eqiad.wmnet with reason: host reimage [10:36:22] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1273.eqiad.wmnet with reason: host reimage [10:36:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
aux-k8s-worker1003.eqiad.wmnet [10:36:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2021.codfw.wmnet,pc[1011,1021].eqiad.wmnet with reason: Maintenance on pc1 [10:37:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1275.eqiad.wmnet with reason: host reimage [10:37:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1267.eqiad.wmnet with reason: host reimage [10:37:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1276.eqiad.wmnet with reason: host reimage [10:38:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be1006.eqiad.wmnet [10:38:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be1004.eqiad.wmnet [10:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:40:45] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1003.eqiad.wmnet [10:40:46] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1003.eqiad.wmnet [10:40:52] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1004.eqiad.wmnet [10:41:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1271.eqiad.wmnet with reason: host reimage [10:41:01] (03PS1) 10Elukey: Add role pki::root to pki-root1002 [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) [10:41:23] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
aux-k8s-worker1004.eqiad.wmnet [10:41:32] (03PS2) 10Elukey: Add role pki::root to pki-root1002 [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) [10:42:08] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [10:43:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1268.eqiad.wmnet with reason: host reimage [10:44:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be1004.eqiad.wmnet [10:45:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be1005.eqiad.wmnet [10:45:23] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-codfw [10:45:25] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1004.eqiad.wmnet [10:45:27] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1004.eqiad.wmnet [10:45:32] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1005.eqiad.wmnet [10:46:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1005.eqiad.wmnet [10:46:25] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:47:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1274.eqiad.wmnet with reason: host reimage [10:47:51] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for dsantamaria - 
https://phabricator.wikimedia.org/T426561#11935520 (10SLyngshede-WMF) 05In progress→03Resolved [10:49:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1266.eqiad.wmnet with OS trixie [10:49:36] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [10:50:05] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1005.eqiad.wmnet [10:50:07] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1005.eqiad.wmnet [10:50:07] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on D{aux-k8s-worker100[2-5].eqiad.wmnet} and (A:aux-master-eqiad or A:aux-worker-eqiad) [10:50:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11935531 (10SLyngshede-WMF) @AnnieKim_WMDE Hi, did you get the access you needed? 
[10:50:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be1005.eqiad.wmnet [10:51:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1275.eqiad.wmnet with reason: host reimage [10:51:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [10:51:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1267.eqiad.wmnet with OS trixie [10:52:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet [10:53:19] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan1002.eqiad.wmnet [10:54:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [10:54:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [10:55:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1271.eqiad.wmnet with OS trixie [10:55:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1270.eqiad.wmnet with reason: host reimage [10:57:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1268.eqiad.wmnet with OS trixie [10:58:08] (03PS3) 10Elukey: Add role pki::root to pki-root1002 [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) [10:59:11] (03CR) 10Clément Goubert: [C:03+1] "Makes sense to me, but I would still like to have SRE traffic's opinion on possible impact on cache." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [10:59:30] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [10:59:34] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet [10:59:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1272.eqiad.wmnet with reason: host reimage [11:00:33] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1002.eqiad.wmnet [11:02:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1274.eqiad.wmnet with OS trixie [11:02:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1273.eqiad.wmnet with reason: host reimage [11:02:38] (03CR) 10Muehlenhoff: Add role pki::root to pki-root1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [11:03:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [11:03:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [11:03:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet [11:03:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [11:04:22] (03Abandoned) 10Bartosz Wójtowicz: inference-services: Deploy outlink-cache-adapter service. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1248399 (https://phabricator.wikimedia.org/T418493) (owner: 10Bartosz Wójtowicz) [11:05:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1275.eqiad.wmnet with OS trixie [11:05:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [11:05:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet [11:06:01] !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-eqiad [11:06:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1269.eqiad.wmnet with reason: host reimage [11:07:24] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet [11:07:29] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus3004.esams.wmnet [11:08:49] (03CR) 10Majavah: [C:03+1] designate: use zk backend in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1287822 (https://phabricator.wikimedia.org/T422646) (owner: 10Filippo Giunchedi) [11:09:39] jmm@cumin2002 drain-node (PID 2570751) is awaiting input [11:09:42] jmm@cumin2002 drain-node (PID 2570698) is awaiting input [11:09:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1270.eqiad.wmnet with OS trixie [11:10:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [11:10:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet [11:10:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [11:10:32] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [11:10:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db1276.eqiad.wmnet with reason: host reimage [11:13:32] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3004.esams.wmnet [11:13:37] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus4003.ulsfo.wmnet [11:14:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1272.eqiad.wmnet with OS trixie [11:14:28] (03PS1) 10Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1289315 [11:16:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1273.eqiad.wmnet with OS trixie [11:18:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [11:18:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [11:18:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [11:18:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet [11:19:32] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4003.ulsfo.wmnet [11:19:36] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [11:20:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [11:20:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet [11:21:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1269.eqiad.wmnet with OS trixie [11:24:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1276.eqiad.wmnet with OS trixie [11:24:57] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [11:24:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet [11:29:33] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [11:29:37] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus1008.eqiad.wmnet [11:31:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1277.eqiad.wmnet with OS trixie [11:31:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1278.eqiad.wmnet with OS trixie [11:31:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1279.eqiad.wmnet with OS trixie [11:32:09] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1280.eqiad.wmnet with OS trixie [11:32:30] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1281.eqiad.wmnet with OS trixie [11:33:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [11:33:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet [11:33:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet [11:33:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [11:34:19] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1282.eqiad.wmnet with OS trixie [11:34:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:34:48] !log 
marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1283.eqiad.wmnet with OS trixie [11:34:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [11:35:08] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1284.eqiad.wmnet with OS trixie [11:35:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1286.eqiad.wmnet with OS trixie [11:35:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1287.eqiad.wmnet with OS trixie [11:36:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1288.eqiad.wmnet with OS trixie [11:37:34] !log failover Ganeti cluster in magru to ganeti7001 [11:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:41] (03CR) 10AikoChou: "deployed!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281588 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [11:37:44] !log failover Ganeti cluster in eqsin to ganeti5004 [11:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:39] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on ms-backup[1003-1004].eqiad.wmnet with reason: restart [11:39:39] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Fix inconsistent log formatting [puppet] - 10https://gerrit.wikimedia.org/r/1289319 [11:39:43] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1008.eqiad.wmnet [11:39:49] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus5003.eqsin.wmnet [11:39:54] PROBLEM - ganeti-wconfd running on ganeti7004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:40:04] PROBLEM - ganeti-wconfd running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), 
command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:41:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet [11:44:11] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1277.eqiad.wmnet with reason: host reimage [11:44:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1278.eqiad.wmnet with reason: host reimage [11:44:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:44:53] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage [11:45:37] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1280.eqiad.wmnet with reason: host reimage [11:45:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1281.eqiad.wmnet with reason: host reimage [11:46:07] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5003.eqsin.wmnet [11:46:12] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [11:47:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1282.eqiad.wmnet with reason: host reimage [11:47:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
db1283.eqiad.wmnet with reason: host reimage [11:48:12] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1284.eqiad.wmnet with reason: host reimage [11:48:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1286.eqiad.wmnet with reason: host reimage [11:49:24] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1288.eqiad.wmnet with reason: host reimage [11:49:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1287.eqiad.wmnet with reason: host reimage [11:49:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1277.eqiad.wmnet with reason: host reimage [11:50:54] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 18 hosts with reason: restart [11:51:01] PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [11:52:34] RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [11:52:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1284.eqiad.wmnet with reason: host reimage [11:53:22] !log taavi@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudidp2001-dev.codfw.wmnet [11:53:35] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Fix inconsistent log formatting [puppet] - 10https://gerrit.wikimedia.org/r/1289319 (owner: 10Majavah) [11:54:10] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Fix inconsistent log formatting [puppet] - 10https://gerrit.wikimedia.org/r/1289319 (owner: 10Majavah) [11:55:13] (03PS2) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [11:55:44] (03CR) 10CI reject: [V:04-1] alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 
(https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [11:56:07] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [11:56:12] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet [11:56:18] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be2006.codfw.wmnet [11:56:35] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-eqiad [11:56:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1282.eqiad.wmnet with reason: host reimage [11:56:50] PROBLEM - Host cloudweb1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:12] !log taavi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudidp2001-dev.codfw.wmnet [11:57:23] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2150.codfw.wmnet - https://phabricator.wikimedia.org/T424342#11935872 (10Jhancock.wm) a:05wiki_willy→03Jhancock.wm [11:57:34] RECOVERY - Host cloudweb1004 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [11:58:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [11:58:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1200) [12:00:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1283.eqiad.wmnet with reason: host reimage [12:00:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be2006.codfw.wmnet [12:01:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be2005.codfw.wmnet [12:03:43] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1281.eqiad.wmnet with reason: host reimage [12:04:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1277.eqiad.wmnet with OS trixie [12:05:21] jmm@cumin2002 drain-node (PID 2605964) is awaiting input [12:05:37] jmm@cumin2002 drain-node (PID 2605957) is awaiting input [12:06:01] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet [12:06:06] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet [12:06:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1284.eqiad.wmnet with OS trixie [12:06:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [12:06:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet [12:07:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be2005.codfw.wmnet [12:07:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1288.eqiad.wmnet with reason: host reimage [12:07:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:08:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be2004.codfw.wmnet [12:08:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:08:45] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [12:09:37] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[12:10:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1282.eqiad.wmnet with OS trixie [12:10:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11935962 (10Jclark-ctr) i see this is assigned to @bking but @RKemper is listed as contact. The servers have arrived P... [12:11:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1278.eqiad.wmnet with reason: host reimage [12:12:01] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet [12:12:06] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus7002.magru.wmnet [12:12:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be2004.codfw.wmnet [12:14:09] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage [12:14:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1283.eqiad.wmnet with OS trixie [12:14:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet [12:14:53] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1280.eqiad.wmnet with reason: host reimage [12:14:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [12:15:10] FIRING: [4x] GanetiBGPDown: BGP session down between ganeti5007 and cr2-eqsin - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [12:15:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node 
ganeti7004.magru.wmnet [12:15:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [12:15:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1286.eqiad.wmnet with reason: host reimage [12:17:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1281.eqiad.wmnet with OS trixie [12:18:02] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7002.magru.wmnet [12:18:07] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [12:18:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [12:19:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [12:19:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet [12:20:10] RESOLVED: [3x] GanetiBGPDown: BGP session down between ganeti5007 and cr2-eqsin - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [12:20:54] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1279.eqiad.wmnet with OS trixie [12:21:12] (03PS1) 10JMeybohm: sre.k8s: Fix regex for VLANs trunked on LVS links [cookbooks] - 10https://gerrit.wikimedia.org/r/1289326 [12:21:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1288.eqiad.wmnet with OS trixie [12:22:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [12:23:37] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1279.eqiad.wmnet with OS trixie [12:23:46] (03CR) 10Ayounsi: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1289326 (owner: 10JMeybohm) [12:24:13] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1287.eqiad.wmnet with reason: host reimage [12:24:28] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog: Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11936034 (10Raine) p:05Triage→03Medium Changing priority to Medium as it is no longe... [12:24:50] (03CR) 10JMeybohm: [C:03+2] sre.k8s: Fix regex for VLANs trunked on LVS links [cookbooks] - 10https://gerrit.wikimedia.org/r/1289326 (owner: 10JMeybohm) [12:25:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1278.eqiad.wmnet with OS trixie [12:25:40] (03PS1) 10Dbrant: Add "get_login_creds" permission to Android app for auth domain. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289328 (https://phabricator.wikimedia.org/T426010) [12:26:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [12:26:05] (03PS1) 10Muehlenhoff: Start blacklisting unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1289329 [12:26:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [12:26:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [12:27:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [12:27:23] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet [12:27:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289328 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [12:28:02] !log 
tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [12:28:07] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet [12:28:09] (03CR) 10CI reject: [V:04-1] Start blacklisting unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1289329 (owner: 10Muehlenhoff) [12:28:18] FIRING: [3x] ProbeDown: Service ganeti5007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:26] (03Merged) 10jenkins-bot: sre.k8s: Fix regex for VLANs trunked on LVS links [cookbooks] - 10https://gerrit.wikimedia.org/r/1289326 (owner: 10JMeybohm) [12:29:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1280.eqiad.wmnet with OS trixie [12:30:15] (03PS2) 10Muehlenhoff: Start blacklisting unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1289329 [12:32:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1286.eqiad.wmnet with OS trixie [12:32:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet [12:32:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet [12:33:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [12:33:18] RESOLVED: [3x] ProbeDown: Service ganeti3005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
ms-be2064.codfw.wmnet [12:34:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet [12:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:34:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [12:35:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1289.eqiad.wmnet with OS trixie [12:35:27] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet [12:35:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1290.eqiad.wmnet with OS trixie [12:35:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289329 (owner: 10Muehlenhoff) [12:36:00] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet [12:36:28] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage [12:36:31] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1279.eqiad.wmnet with reason: host reimage [12:36:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1287.eqiad.wmnet with OS trixie [12:39:12] RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet 
[12:39:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [12:41:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [12:41:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [12:41:45] (03PS1) 10CWilliams: mariadb: Decommission db2149 [puppet] - 10https://gerrit.wikimedia.org/r/1289333 (https://phabricator.wikimedia.org/T424341) [12:43:53] (03CR) 10Marostegui: [C:03+1] mariadb: Decommission db2149 [puppet] - 10https://gerrit.wikimedia.org/r/1289333 (https://phabricator.wikimedia.org/T424341) (owner: 10CWilliams) [12:44:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet [12:45:06] (03PS1) 10CWilliams: mariadb: Decommission db2143 [puppet] - 10https://gerrit.wikimedia.org/r/1289336 (https://phabricator.wikimedia.org/T424171) [12:45:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [12:46:03] (03CR) 10Marostegui: mariadb: Decommission db2143 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289336 (https://phabricator.wikimedia.org/T424171) (owner: 10CWilliams) [12:46:15] (03PS1) 10Dreamy Jazz: Drop wgCheckUserDisplayClientHints definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289337 [12:46:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [12:47:09] jmm@cumin2002 drain-node (PID 2653285) is awaiting input [12:47:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2149.codfw.wmnet [12:48:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [12:48:09] PROBLEM - Host cloudgw1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:48:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
db1289.eqiad.wmnet with reason: host reimage [12:48:33] 07sre-alert-triage, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Alert in need of triage: ResourceQuotaMemoryLimitsWarning - https://phabricator.wikimedia.org/T426589#11936088 (10Gehel) [12:48:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [12:48:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1290.eqiad.wmnet with reason: host reimage [12:49:18] 07sre-alert-triage, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Alert in need of triage: ResourceQuotaMemoryLimitsWarning - https://phabricator.wikimedia.org/T426589#11936096 (10Gehel) We're going to disable opensearch/semantic search soon, we should silence this alert and wait for the removal of this names... [12:49:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [12:49:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [12:50:19] RECOVERY - Host cloudgw1003 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [12:50:36] (03PS1) 10Dreamy Jazz: Drop unused $wgWikimediaEventsIPoidUrl definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289339 [12:50:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [12:50:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1279.eqiad.wmnet with OS trixie [12:52:07] !log cwilliams@cumin1003 START - Cookbook sre.dns.netbox [12:52:35] PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:11] PROBLEM - Host cloudgw1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:15] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db[1204-1205].eqiad.wmnet with reason: restart/reimage [12:54:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db1289.eqiad.wmnet with reason: host reimage [12:55:33] RECOVERY - Host cloudgw1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [12:55:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [12:56:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [12:56:07] RECOVERY - Host kubestagemaster2005 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [12:56:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [12:56:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [12:56:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [12:56:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [12:56:45] (03PS3) 10CWilliams: mariadb: Decommission db2143 [puppet] - 10https://gerrit.wikimedia.org/r/1289336 (https://phabricator.wikimedia.org/T424171) [12:56:49] (03CR) 10CI reject: [V:04-1] alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [12:56:53] (03CR) 10CWilliams: mariadb: Decommission db2143 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289336 (https://phabricator.wikimedia.org/T424171) (owner: 10CWilliams) [12:56:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [12:56:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:57:15] 
PROBLEM - Host db1249 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:57:39] looking [12:57:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1290.eqiad.wmnet with reason: host reimage [12:58:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [12:58:03] !log cwilliams@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2149.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [12:58:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet [12:58:21] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2149.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [12:58:21] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:58:22] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2149.codfw.wmnet [12:58:57] (03PS4) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [12:59:00] !log Removing db2149 from zarcillo T424341 [12:59:24] (03CR) 10Marostegui: [C:03+1] mariadb: Decommission db2143 [puppet] - 10https://gerrit.wikimedia.org/r/1289336 (https://phabricator.wikimedia.org/T424171) (owner: 10CWilliams) [12:59:42] !log Removing db2149 from orchestrator T424341 [12:59:45] !ack [12:59:45] All incidents are already acked. 
[12:59:48] (03PS1) 10Michael Große: fix: simplify to show only one icon type for password reveal [extensions/WikimediaEvents] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289342 (https://phabricator.wikimedia.org/T419413) [13:00:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet [13:00:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [13:00:31] hm, no jouncebot? [13:01:14] quit with ping timeout at 12:55 UTC apparently [13:01:33] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2149.codfw.wmnet - https://phabricator.wikimedia.org/T424341#11936139 (10CWilliams-WMF) [13:01:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:02:01] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2149.codfw.wmnet - https://phabricator.wikimedia.org/T424341#11936142 (10CWilliams-WMF) This host is ready for DC-Ops to decommission [13:02:08] (03CR) 10CI reject: [V:04-1] alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [13:02:16] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2149.codfw.wmnet - https://phabricator.wikimedia.org/T424341#11936143 (10CWilliams-WMF) a:05CWilliams-WMF→03None [13:02:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [13:02:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [13:02:36] !log fceratto@cumin1003 dbctl 
commit (dc=all): 'Set db1249 depooled', diff saved to https://phabricator.wikimedia.org/P92601 and previous config saved to /var/cache/conftool/dbconfig/20260519-130235-fceratto.json [13:02:41] PROBLEM - Host ml-staging-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:02:55] (03CR) 10CWilliams: [C:03+2] mariadb: Decommission db2143 [puppet] - 10https://gerrit.wikimedia.org/r/1289336 (https://phabricator.wikimedia.org/T424171) (owner: 10CWilliams) [13:02:59] (03CR) 10CI reject: [V:04-1] fix: simplify to show only one icon type for password reveal [extensions/WikimediaEvents] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289342 (https://phabricator.wikimedia.org/T419413) (owner: 10Michael Große) [13:03:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [13:03:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [13:04:11] !log cwilliams@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2143.codfw.wmnet [13:04:23] Lucas_WMDE, Urbanecm, and TheresNoTime: #humanhumor When jouncebot is offline, a human deployer must needs take its place. Rise for Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1300). [13:04:32] codenamenoreste, cscott, dbrant: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:04:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 4:00:00 on db1249.eqiad.wmnet with reason: Unreachable [13:04:55] RESOLVED: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:01] o/ [13:05:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet [13:05:16] dbrant: want to self-service? [13:05:41] RECOVERY - Host ml-staging-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [13:05:45] “Rise for Rise for UTC afternoon backport window” good job Lucas [13:05:49] this is why bots are superior /j [13:05:51] Lucas_WMDE: sure, but looks like i'm last in order? [13:05:52] o/ [13:06:00] dbrant: well cscott hadn’t waved yet :P [13:06:04] hi cscott [13:06:13] you just needed to say my name three times [13:06:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [13:06:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [13:06:29] I would do the config change first (i.e. dbrant) [13:06:39] i can also self-service. i'm happy for dbrant to go first. [13:06:39] and then cscott’s backports can go through gate-and-submit while that deploy runs [13:06:49] (which I assume will take a few minutes) [13:06:55] alright! proceeding [13:06:57] and hopefully at some point codenamenoreste will show up [13:06:58] ok! [13:07:15] jenkins was being extremely cranky last night with spurious failures, which is why these patches didn't make the train to begin with. 
[13:07:20] hopefully it's feeling better this morning [13:07:21] :S [13:07:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dbrant@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289328 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [13:08:13] 10ops-eqiad, 06DC-Ops: db1249 is unreachable - https://phabricator.wikimedia.org/T426750 (10FCeratto-WMF) 03NEW [13:08:44] !log cwilliams@cumin1003 START - Cookbook sre.dns.netbox [13:08:45] cscott: would you deploy your changes separately or together? [13:08:49] (03Merged) 10jenkins-bot: Add "get_login_creds" permission to Android app for auth domain. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289328 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [13:08:52] Lucas_WMDE: is the idea just to manually C+2 the patches after dbrant's scap starts? i've seen that spiderpig handles things fine if the patches are already merged before it starts. [13:08:56] Lucas_WMDE: together [13:09:12] yeah, manual CR+2 as soon as spiderpig has made it past the “git pull” stage [13:09:23] !log dbrant@deploy1003 Started scap sync-world: Backport for [[gerrit:1289328|Add "get_login_creds" permission to Android app for auth domain. 
(T426010)]] [13:09:27] T426010: Enable integration with Credential Manager - https://phabricator.wikimedia.org/T426010 [13:09:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [13:09:36] (so, now should be fine, https://spiderpig.wikimedia.org/jobs/2031 is already past the point where the changes wouldn’t be included in the deploy anymore even if they merged immediately) [13:09:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [13:10:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1289.eqiad.wmnet with OS trixie [13:10:17] (03CR) 10Ladsgroup: [C:04-1] mariadb::ferm: Switch to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289171 (owner: 10Muehlenhoff) [13:10:20] 👍 [13:10:29] (03PS5) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [13:10:49] (03CR) 10C. Scott Ananian: [C:03+2] "manual c+2 to get a head start on spiderpig deploy" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289070 (https://phabricator.wikimedia.org/T423701) (owner: 10C. Scott Ananian) [13:10:56] (03CR) 10C. Scott Ananian: [C:03+2] "manual c+2 to get a head start on spiderpig deploy" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289071 (https://phabricator.wikimedia.org/T424773) (owner: 10C. Scott Ananian) [13:11:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet [13:11:33] !log dbrant@deploy1003 dbrant: Backport for [[gerrit:1289328|Add "get_login_creds" permission to Android app for auth domain. (T426010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:11:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet [13:11:57] !log failover Ganeti cluster in esams to ganeti3005 [13:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:09] !log dbrant@deploy1003 dbrant: Continuing with deployment [13:12:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet [13:13:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1290.eqiad.wmnet with OS trixie [13:13:05] (03PS1) 10Michael Große: Skip init.test.js test if VisualEditor not installed [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) [13:13:13] !log Removing db2143 from zarcillo T424171 [13:13:15] 10ops-eqiad, 06DBA, 06DC-Ops: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11936203 (10Marostegui) [13:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:18] T424171: decommission db2143.codfw.wmnet - https://phabricator.wikimedia.org/T424171 [13:13:25] !log cwilliams@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2143.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [13:13:28] (03CR) 10CI reject: [V:04-1] alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [13:13:36] 10ops-eqiad, 06DBA, 06DC-Ops: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11936212 (10Marostegui) p:05Triage→03Medium [13:13:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2143.codfw.wmnet 
decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [13:13:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:13:52] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2143.codfw.wmnet [13:14:13] (03PS4) 10Elukey: Add role pki::root to pki-root1002 [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) [13:14:20] !log Removing db2143 from orchestrator T424171 [13:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:23] (03CR) 10Elukey: Add role pki::root to pki-root1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [13:14:58] PROBLEM - ganeti-wconfd running on ganeti3008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:15:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [13:15:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [13:15:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet [13:15:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3008.esams.wmnet [13:15:39] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2143.codfw.wmnet - https://phabricator.wikimedia.org/T424171#11936229 (10CWilliams-WMF) a:05CWilliams-WMF→03None [13:15:43] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2143.codfw.wmnet - https://phabricator.wikimedia.org/T424171#11936233 (10CWilliams-WMF) This host is ready for DC-Ops to decommission [13:15:47] (03Merged) 10jenkins-bot: Forward-compatibility for serialization of ContentHolder in 
ParserOutput [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289070 (https://phabricator.wikimedia.org/T423701) (owner: 10C. Scott Ananian) [13:15:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet [13:16:24] !log dbrant@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289328|Add "get_login_creds" permission to Android app for auth domain. (T426010)]] (duration: 07m 00s) [13:16:25] (03PS4) 10Ladsgroup: swift: Insert the auth file on all frontend hosts [puppet] - 10https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) [13:16:27] T426010: Enable integration with Credential Manager - https://phabricator.wikimedia.org/T426010 [13:16:30] (03CR) 10Ladsgroup: [V:03+2 C:03+2] swift: Insert the auth file on all frontend hosts [puppet] - 10https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) (owner: 10Ladsgroup) [13:16:50] cscott: i'm all set [13:17:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289071 (https://phabricator.wikimedia.org/T424773) (owner: 10C. Scott Ananian) [13:17:07] (03CR) 10Elukey: [C:03+1] Start blacklisting unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1289329 (owner: 10Muehlenhoff) [13:17:17] ok, jumped in. patches are just about through jenkins. 
[13:17:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [13:17:46] PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:47] (03CR) 10Tiziano Fogli: alerts: Add optional pre-deploy transformations (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [13:17:54] PROBLEM - Host dse-k8s-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [13:18:38] PROBLEM - Host logstash2023 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:20] FIRING: [4x] ProbeDown: Service ganeti2028:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:30] (03PS6) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [13:19:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [13:20:14] (03PS2) 10Muehlenhoff: mariadb::ferm: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289171 [13:20:28] (03CR) 10Muehlenhoff: mariadb::ferm: Switch to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289171 (owner: 10Muehlenhoff) [13:20:46] RECOVERY - Host dse-k8s-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.63 ms [13:20:50] RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms [13:21:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289171 (owner: 10Muehlenhoff) [13:21:08] RECOVERY - Host logstash2023 is UP: PING OK - Packet loss = 0%, RTA = 
30.52 ms [13:21:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet [13:21:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet [13:21:55] !log root@cumin1003 START - Cookbook sre.hosts.reimage for host db1204.eqiad.wmnet with OS trixie [13:22:28] (03CR) 10CI reject: [V:04-1] alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [13:22:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet [13:22:50] (03Merged) 10jenkins-bot: ParsoidLanguageConverter: don't convert TOC if __NOCONTENTCONVERT__ [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289071 (https://phabricator.wikimedia.org/T424773) (owner: 10C. Scott Ananian) [13:22:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet [13:23:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [13:23:20] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1289070|Forward-compatibility for serialization of ContentHolder in ParserOutput (T423701)]], [[gerrit:1289071|ParsoidLanguageConverter: don't convert TOC if __NOCONTENTCONVERT__ (T424773)]] [13:23:24] T423701: Serialize ContentHolder (or at least its fragments) in ParserOutput - https://phabricator.wikimedia.org/T423701 [13:23:25] T424773: __NOCONTENTCONVERT__ is not honored in Parsoid - https://phabricator.wikimedia.org/T424773 [13:23:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet [13:24:20] RESOLVED: [3x] ProbeDown: Service ganeti3007:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:11] (03PS1) 10Kosta Harlan: hCaptcha: Enable for group1 wikis (except itwiki, metawiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289348 (https://phabricator.wikimedia.org/T425354) [13:25:24] !log cscott@deploy1003 cscott: Backport for [[gerrit:1289070|Forward-compatibility for serialization of ContentHolder in ParserOutput (T423701)]], [[gerrit:1289071|ParsoidLanguageConverter: don't convert TOC if __NOCONTENTCONVERT__ (T424773)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:25:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet [13:25:51] !log elukey@cumin1003 START - Cookbook sre.hosts.decommission for hosts pki1001.eqiad.wmnet [13:26:59] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [13:27:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet [13:27:13] (03CR) 10Elukey: [C:03+1] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1289315 (owner: 10Muehlenhoff) [13:27:23] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough [13:27:37] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and A:durum [13:28:04] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-restart-reboot-hcaptcha-proxy rolling reboot on A:hcaptcha-proxy and A:hcaptcha-proxy [13:28:26] (03PS3) 10Muehlenhoff: Start blacklisting unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1289329 [13:28:47] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-reboot rolling 
reboot on A:dnsbox and A:magru and (A:dnsbox) [13:28:47] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org [13:29:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet [13:29:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet [13:30:01] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289349 [13:30:44] (03PS7) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [13:30:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet [13:31:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet [13:31:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:31:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [13:31:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3008.esams.wmnet [13:31:27] !log elukey@cumin1003 START - Cookbook sre.dns.netbox [13:32:03] !log cscott@deploy1003 cscott: Continuing with deployment [13:32:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet [13:32:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2046.codfw.wmnet [13:32:19] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [13:32:43] (03CR) 10Muehlenhoff: [C:03+2] Start blacklisting unused packet 
mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1289329 (owner: 10Muehlenhoff) [13:33:02] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:33:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [13:34:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [13:34:34] (03CR) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [13:34:37] (03PS8) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [13:35:26] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1003" [13:36:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pki1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1003" [13:36:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pki1001.eqiad.wmnet [13:36:10] FIRING: [16x] BFDdown: BFD session down between asw1-b3-magru and 195.200.68.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:36:16] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289070|Forward-compatibility for serialization of ContentHolder in ParserOutput 
(T423701)]], [[gerrit:1289071|ParsoidLanguageConverter: don't convert TOC if __NOCONTENTCONVERT__ (T424773)]] (duration: 12m 56s) [13:36:21] T423701: Serialize ContentHolder (or at least its fragments) in ParserOutput - https://phabricator.wikimedia.org/T423701 [13:36:21] T424773: __NOCONTENTCONVERT__ is not honored in Parsoid - https://phabricator.wikimedia.org/T424773 [13:36:38] !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage [13:37:39] !log UTC afternoon backport+config window done [13:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:49] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission pki1001.eqiad.wmnet - https://phabricator.wikimedia.org/T426739#11936325 (10elukey) a:05elukey→03None [13:37:58] Lucas_WMDE: ok, i'm done. [13:38:14] did codenamenoreste show up? (cool username btw) [13:38:24] (03CR) 10Elukey: [C:03+2] Add role pki::root to pki-root1002 [puppet] - 10https://gerrit.wikimedia.org/r/1289308 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [13:38:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet [13:38:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2073.codfw.wmnet [13:38:42] not AFAICT [13:38:50] jmm@cumin2002 drain-node (PID 2687790) is awaiting input [13:38:56] if they do show up I’d be happy to deploy the changetags mw.o change [13:39:19] whereas for “Create Wikinews-based namespaces” I’m not sure if that could be construed as running counter to the BoT Wikinews closure decision [13:39:36] so I’m happy if it’s not me who deploys that tbh [13:39:37] (03PS1) 10Elukey: Remove pki1001 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1289351 (https://phabricator.wikimedia.org/T426739) [13:39:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [13:40:46] !log 
mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet [13:40:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet [13:41:00] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:41:10] FIRING: [18x] BFDdown: BFD session down between asw1-b3-magru and 195.200.68.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:41:16] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:42:46] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org [13:43:03] (03CR) 10Muehlenhoff: [C:03+1] "Farewell and thanks for all those certs we didn't have to generate in cergen" [puppet] - 10https://gerrit.wikimedia.org/r/1289351 (https://phabricator.wikimedia.org/T426739) (owner: 10Elukey) [13:44:14] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:44:16] (03PS1) 10C. Scott Ananian: Remove unused ParsoidFragmentInput and ParsoidFragmentSupport [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289352 [13:44:48] (03PS2) 10C. 
Scott Ananian: Remove unused ParsoidFragmentInput and ParsoidFragmentSupport [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289352 [13:45:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet [13:45:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2074.codfw.wmnet [13:46:10] RESOLVED: [26x] BFDdown: BFD session down between asw1-b12-drmrs and 10.136.0.21 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:46:38] !log root@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage [13:48:14] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:48:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet [13:48:35] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet [13:50:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [13:50:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [13:50:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet [13:51:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [13:51:14] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:25] FIRING: [33x] BFDdown: BFD session down between asw1-b12-drmrs and 10.136.0.21 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:53:43] !log jmm@cumin2002 
START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [13:53:52] PROBLEM - mysqld processes #page on db1204 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:54:19] (03CR) 10Elukey: [C:03+2] Remove pki1001 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1289351 (https://phabricator.wikimedia.org/T426739) (owner: 10Elukey) [13:54:41] another one? [13:54:52] this is a master [13:54:57] ah it is backups one [13:54:58] jynus: ^ [13:55:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2074.codfw.wmnet [13:55:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2075.codfw.wmnet [13:55:45] !ack [13:55:46] All incidents are already acked. [13:55:54] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [13:56:25] RESOLVED: [27x] BFDdown: BFD session down between asw1-b12-drmrs and 10.136.0.21 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:57:20] !log elukey@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker-codfw [13:57:23] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2002.codfw.wmnet [13:57:46] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org [13:57:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet [13:57:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1076.eqiad.wmnet [13:57:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2002.codfw.wmnet [13:59:08] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db[1204-1205].eqiad.wmnet with reason: restart/reimage [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1400) [14:00:25] (03PS1) 10Muehlenhoff: pki:multirootca: Switch to nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1289355 (https://phabricator.wikimedia.org/T416664) [14:01:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289355 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [14:01:25] FIRING: [23x] BFDdown: BFD session down between asw1-b12-drmrs and 10.136.0.21 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:01:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [14:01:53] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [14:02:00] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:02:04] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2002.codfw.wmnet [14:02:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2002.codfw.wmnet [14:02:12] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2003.codfw.wmnet [14:02:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [14:02:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2075.codfw.wmnet [14:02:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2003.codfw.wmnet [14:02:47] !log mvernon@cumin2002 START 
- Cookbook sre.hosts.reboot-single for host ms-be2076.codfw.wmnet [14:03:53] RECOVERY - mysqld processes #page on db1204 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:03:54] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [14:04:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [14:04:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet [14:05:23] (03CR) 10Ladsgroup: mariadb::ferm: Switch to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289171 (owner: 10Muehlenhoff) [14:05:29] (03PS3) 10Muehlenhoff: mariadb::ferm: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289171 [14:05:32] (03CR) 10Ladsgroup: [C:03+2] mariadb::ferm: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289171 (owner: 10Muehlenhoff) [14:05:35] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb::ferm: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289171 (owner: 10Muehlenhoff) [14:05:44] (03CR) 10Jforrester: [C:03+1] "TY!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289352 (owner: 10C. 
Scott Ananian) [14:06:02] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:06:25] FIRING: [17x] BFDdown: BFD session down between asw1-b13-drmrs and 10.136.1.23 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:38] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2003.codfw.wmnet [14:06:39] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2003.codfw.wmnet [14:06:45] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2004.codfw.wmnet [14:06:51] (03CR) 10Hnowlan: [C:03+1] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1289315 (owner: 10Muehlenhoff) [14:07:13] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org [14:07:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) [14:07:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2004.codfw.wmnet [14:07:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1076.eqiad.wmnet [14:07:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1077.eqiad.wmnet [14:07:33] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Skip init.test.js test if VisualEditor not installed [extensions/ConfirmEdit] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289347 (https://phabricator.wikimedia.org/T426740) (owner: 10Michael Große) [14:10:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2076.codfw.wmnet [14:10:28] !log mvernon@cumin2002 
START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet [14:11:13] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2004.codfw.wmnet [14:11:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2004.codfw.wmnet [14:11:20] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2005.codfw.wmnet [14:11:25] RESOLVED: [10x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:11:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2005.codfw.wmnet [14:12:23] (03CR) 10Majavah: "Ooooooh, that's a nice find. Can you run a PCC?" [puppet] - 10https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [14:14:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1077.eqiad.wmnet [14:14:24] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1078.eqiad.wmnet [14:16:08] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2005.codfw.wmnet [14:16:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2005.codfw.wmnet [14:16:16] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2006.codfw.wmnet [14:16:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:16:40] (03CR) 10Muehlenhoff: [C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1289315 (owner: 10Muehlenhoff) [14:16:46] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2243: Upgrading 
db2243.codfw.wmnet [14:16:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:17:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2243: Upgrading db2243.codfw.wmnet [14:18:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet [14:18:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2078.codfw.wmnet [14:19:45] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1204.eqiad.wmnet with OS trixie [14:20:53] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2243.codfw.wmnet with OS trixie [14:21:20] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2006.codfw.wmnet [14:21:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:22:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1078.eqiad.wmnet [14:22:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1079.eqiad.wmnet [14:22:42] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog: Investigate Code 414 error when selecting zh-classical (lzh) language from 
article toolbar - https://phabricator.wikimedia.org/T425545#11936489 (10Raine) IIUC, the PR merged above fixes this for the Android app. With that f... [14:22:58] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog: Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11936491 (10Raine) [14:24:14] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:25:10] FIRING: BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b12-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:25:15] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:25:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2078.codfw.wmnet [14:25:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2079.codfw.wmnet [14:26:41] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2006.codfw.wmnet [14:26:42] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2006.codfw.wmnet [14:26:48] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2007.codfw.wmnet [14:27:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2007.codfw.wmnet [14:28:15] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:29:14] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ms-be1079.eqiad.wmnet [14:29:15] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:29:15] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:29:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1080.eqiad.wmnet [14:29:29] jouncebot: nowandnext [14:29:29] For the next 0 hour(s) and 0 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1400) [14:29:29] In 0 hour(s) and 0 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1430) [14:29:55] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [14:30:00] (03PS1) 10Marostegui: pc1021: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1289360 (https://phabricator.wikimedia.org/T418973) [14:30:02] Going to use scap [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1430) [14:30:07] (03PS2) 10JHathaway: Rename scap::ferm to scap::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) [14:30:10] FIRING: [2x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:30:15] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:30:39] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [14:30:40] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply 
[14:30:46] (03CR) 10Marostegui: [C:03+2] pc1021: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1289360 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [14:31:24] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [14:31:27] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Enable for group1 wikis (except itwiki, metawiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289348 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan) [14:31:36] (03PS1) 10JHathaway: puppet-lint disable top_scopes_facts check [puppet] - 10https://gerrit.wikimedia.org/r/1289361 [14:31:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289348 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan) [14:31:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289340 (owner: 10Dreamy Jazz) [14:31:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289339 (owner: 10Dreamy Jazz) [14:31:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289337 (owner: 10Dreamy Jazz) [14:32:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288953 (https://phabricator.wikimedia.org/T426614) (owner: 10Gergő Tisza) [14:32:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2079.codfw.wmnet [14:32:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2080.codfw.wmnet [14:32:35] !log 
elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2007.codfw.wmnet [14:32:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2007.codfw.wmnet [14:32:37] (03CR) 10JHathaway: Rename scap::ferm to scap::firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [14:32:42] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2008.codfw.wmnet [14:32:51] (03PS1) 10Marostegui: instances.yaml: Add pc1021 [puppet] - 10https://gerrit.wikimedia.org/r/1289362 (https://phabricator.wikimedia.org/T418973) [14:32:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [14:32:55] (03Merged) 10jenkins-bot: hCaptcha: Enable for group1 wikis (except itwiki, metawiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289348 (https://phabricator.wikimedia.org/T425354) (owner: 10Kosta Harlan) [14:32:59] (03Merged) 10jenkins-bot: Drop wgCheckUserDisplayClientHints definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289337 (owner: 10Dreamy Jazz) [14:33:06] (03Merged) 10jenkins-bot: Drop unused $wgWikimediaEventsIPoidUrl definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289339 (owner: 10Dreamy Jazz) [14:33:09] (03Merged) 10jenkins-bot: Drop unused $wgIPInfoIpoidUrl definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289340 (owner: 10Dreamy Jazz) [14:33:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2008.codfw.wmnet [14:33:17] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:33:32] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host 
deploy2002.codfw.wmnet [14:33:35] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1289348|hCaptcha: Enable for group1 wikis (except itwiki, metawiki) (T425354)]], [[gerrit:1289340|Drop unused $wgIPInfoIpoidUrl definition]], [[gerrit:1289339|Drop unused $wgWikimediaEventsIPoidUrl definition]], [[gerrit:1289337|Drop wgCheckUserDisplayClientHints definition]] [14:33:38] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [14:33:45] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add pc1021 [puppet] - 10https://gerrit.wikimedia.org/r/1289362 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [14:33:47] !log kamila@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host deploy2002.codfw.wmnet [14:34:48] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-restart-reboot-tcp-proxy rolling reboot on A:tcpproxy and A:tcpproxy [14:35:09] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet [14:35:10] FIRING: [4x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:35:17] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:35:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Make pc1021 master of pc1 in eqiad T418973', diff saved to https://phabricator.wikimedia.org/P92605 and previous config saved to /var/cache/conftool/dbconfig/20260519-143549-marostegui.json [14:35:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet 
- https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:35:54] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [14:35:57] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir and A:ncredir [14:36:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc1 T418973', diff saved to https://phabricator.wikimedia.org/P92606 and previous config saved to /var/cache/conftool/dbconfig/20260519-143632-marostegui.json [14:36:36] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2243.codfw.wmnet with reason: host reimage [14:37:00] Saw a warning `14:35:14 ['/usr/bin/scap', 'pull-master', 'deploy1003.eqiad.wmnet'] (ran as mwdeploy@deploy2002.codfw.wmnet) returned [255]: 14:35:01 Copying from deploy1003.eqiad.wmnet to deploy2002.codfw.wmnet:/srv/mediawiki-staging` [14:37:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1080.eqiad.wmnet [14:37:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1081.eqiad.wmnet [14:37:25] Though scap still proceeded [14:37:43] !log dreamyjazz@deploy1003 dreamyjazz, kharlan: Backport for [[gerrit:1289348|hCaptcha: Enable for group1 wikis (except itwiki, metawiki) (T425354)]], [[gerrit:1289340|Drop unused $wgIPInfoIpoidUrl definition]], [[gerrit:1289339|Drop unused $wgWikimediaEventsIPoidUrl definition]], [[gerrit:1289337|Drop wgCheckUserDisplayClientHints definition]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:38:04] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "(gradually rolled out, obviously it was noop)" [puppet] - 10https://gerrit.wikimedia.org/r/1289171 (owner: 10Muehlenhoff) [14:38:12] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003.magru.wmnet} and A:liberica [14:38:37] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2008.codfw.wmnet [14:38:39] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2008.codfw.wmnet [14:38:44] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2009.codfw.wmnet [14:39:18] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2009.codfw.wmnet [14:39:23] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet [14:39:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2080.codfw.wmnet [14:39:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2081.codfw.wmnet [14:40:10] RESOLVED: [4x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:40:16] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-hcaptcha-proxy (exit_code=0) rolling reboot on A:hcaptcha-proxy and A:hcaptcha-proxy [14:40:17] !log dreamyjazz@deploy1003 dreamyjazz, kharlan: Rolling back deployment [14:40:28] Changes didn't seem to be synced, so rolling back [14:40:32] Will try again [14:40:51] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289348|hCaptcha: Enable for group1 wikis (except itwiki, metawiki) (T425354)]], [[gerrit:1289340|Drop unused $wgIPInfoIpoidUrl definition]], [[gerrit:1289339|Drop unused 
$wgWikimediaEventsIPoidUrl definition]], [[gerrit:1289337|Drop wgCheckUserDisplayClientHints definition]] (duration: 07m 16s) [14:40:59] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [14:41:14] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [14:41:40] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1289348|hCaptcha: Enable for group1 wikis (except itwiki, metawiki) (T425354)]], [[gerrit:1289340|Drop unused $wgIPInfoIpoidUrl definition]], [[gerrit:1289339|Drop unused $wgWikimediaEventsIPoidUrl definition]], [[gerrit:1289337|Drop wgCheckUserDisplayClientHints definition]] [14:42:01] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003.magru.wmnet} and A:liberica [14:42:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS trixie [14:43:03] sync-masters worked this time, so presumably just a temporary failure [14:43:08] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Backport for [[gerrit:1289348|hCaptcha: Enable for group1 wikis (except itwiki, metawiki) (T425354)]], [[gerrit:1289340|Drop unused $wgIPInfoIpoidUrl definition]], [[gerrit:1289339|Drop unused $wgWikimediaEventsIPoidUrl definition]], [[gerrit:1289337|Drop wgCheckUserDisplayClientHints definition]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). C [14:43:08] hanges can now be verified there. 
[14:43:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759 (10MoritzMuehlenhoff) 03NEW [14:43:28] (03PS1) 10Santiago Faci: growtbook: New release that supports status as a filter for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289366 (https://phabricator.wikimedia.org/T421800) [14:43:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and A:durum [14:44:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2243.codfw.wmnet with reason: host reimage [14:44:25] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11936656 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:44:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough [14:44:39] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2009.codfw.wmnet [14:44:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2009.codfw.wmnet [14:44:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker-codfw [14:44:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1081.eqiad.wmnet [14:44:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1082.eqiad.wmnet [14:45:10] !log elukey@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-master-codfw [14:45:14] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-ctrl2002.codfw.wmnet [14:45:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-ctrl2002.codfw.wmnet [14:45:51] RESOLVED: 
ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:46:11] (03CR) 10Elukey: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1289272 (owner: 10Muehlenhoff) [14:46:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:47:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2081.codfw.wmnet [14:47:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2082.codfw.wmnet [14:47:09] !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Continuing with deployment [14:48:14] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-05-12-211330 to 2026-05-18-230044 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289367 (https://phabricator.wikimedia.org/T420426) [14:48:34] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-05-12-211330 to 2026-05-18-230044 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289367 (https://phabricator.wikimedia.org/T420426) (owner: 10Jforrester) [14:50:00] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-ctrl2002.codfw.wmnet [14:50:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-ctrl2002.codfw.wmnet [14:50:07] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-ctrl2003.codfw.wmnet 
[14:50:08] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-ctrl2003.codfw.wmnet [14:50:47] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-05-12-211330 to 2026-05-18-230044 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289367 (https://phabricator.wikimedia.org/T420426) (owner: 10Jforrester) [14:51:05] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:51:18] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289348|hCaptcha: Enable for group1 wikis (except itwiki, metawiki) (T425354)]], [[gerrit:1289340|Drop unused $wgIPInfoIpoidUrl definition]], [[gerrit:1289339|Drop unused $wgWikimediaEventsIPoidUrl definition]], [[gerrit:1289337|Drop wgCheckUserDisplayClientHints definition]] (duration: 09m 37s) [14:51:21] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [14:51:29] Want to use scap again shortly [14:51:40] (03CR) 10Majavah: [C:03+1] puppet-lint disable top_scopes_facts check [puppet] - 10https://gerrit.wikimedia.org/r/1289361 (owner: 10JHathaway) [14:51:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1082.eqiad.wmnet [14:51:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1083.eqiad.wmnet [14:51:54] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:52:22] (03CR) 10JHathaway: [C:03+2] puppet-lint disable top_scopes_facts check [puppet] - 10https://gerrit.wikimedia.org/r/1289361 (owner: 10JHathaway) [14:54:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2082.codfw.wmnet [14:54:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2083.codfw.wmnet [14:54:36] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:55:05] 
!log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-ctrl2003.codfw.wmnet [14:55:07] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-ctrl2003.codfw.wmnet [14:55:07] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-master-codfw [14:55:20] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:55:24] (03CR) 10Majavah: [C:03+1] "thanks!!" [puppet] - 10https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [14:55:37] (03PS1) 10Dreamy Jazz: Enable hCaptcha for wikitext editor on group1 minus meta and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289368 (https://phabricator.wikimedia.org/T425354) [14:56:26] (03CR) 10JHathaway: [C:03+2] Rename scap::ferm to scap::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289089 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [14:56:32] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:56:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289368 (https://phabricator.wikimedia.org/T425354) (owner: 10Dreamy Jazz) [14:57:17] (03PS18) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [14:57:17] (03PS18) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [14:57:27] (03Merged) 10jenkins-bot: Enable hCaptcha for wikitext editor on group1 minus meta and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289368 (https://phabricator.wikimedia.org/T425354) (owner: 10Dreamy Jazz) [14:57:29] (03CR) 10Majavah: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [14:57:39] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:57:48] (03CR) 10CI reject: [V:04-1] firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [14:57:52] bah [14:57:55] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1289368|Enable hCaptcha for wikitext editor on group1 minus meta and itwiki (T425354)]] [14:57:59] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [14:58:11] jhathaway: 17:57:42 invalid option: --no-top_scope_facts-check [14:58:31] on https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/28252/console [14:58:41] ugh, thanks taavi [14:58:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1083.eqiad.wmnet [14:59:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1084.eqiad.wmnet [14:59:03] (03PS3) 10Slyngshede: Geo-maps: Update Meta PoPs [dns] - 10https://gerrit.wikimedia.org/r/1282956 [14:59:22] !log elukey@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-master-eqiad [14:59:25] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-ctrl1002.eqiad.wmnet [14:59:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-ctrl1002.eqiad.wmnet [14:59:27] probably the puppet-lint is too old [14:59:45] gemfile says 2.4.2 [14:59:50] let's revert, upgrade and then try again? 
[14:59:57] although i wonder why it didn't fail in the first place [14:59:57] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1289368|Enable hCaptcha for wikitext editor on group1 minus meta and itwiki (T425354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:00:00] (03PS1) 10Ladsgroup: mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: That opportune time for a SRE Collaboration Services office hours deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1500). [15:00:23] (03PS1) 10Majavah: Revert "puppet-lint disable top_scopes_facts check" [puppet] - 10https://gerrit.wikimedia.org/r/1289370 [15:00:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289361 [15:00:29] (03PS1) 10JHathaway: Revert "puppet-lint disable top_scopes_facts check" [puppet] - 10https://gerrit.wikimedia.org/r/1289371 [15:00:32] (03CR) 10CI reject: [V:04-1] mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:00:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2083.codfw.wmnet [15:00:50] !log failover Ganeti cluster in drmrs01 to ganeti6001 [15:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:52] (03CR) 10Majavah: [C:03+1] Revert "puppet-lint disable top_scopes_facts check" [puppet] - 10https://gerrit.wikimedia.org/r/1289371 (owner: 10JHathaway) [15:00:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2084.codfw.wmnet [15:01:08] (03Abandoned) 10Majavah: Revert "puppet-lint disable top_scopes_facts check" [puppet] - 10https://gerrit.wikimedia.org/r/1289370 (owner: 
10Majavah) [15:01:29] (03CR) 10JHathaway: [C:03+2] Revert "puppet-lint disable top_scopes_facts check" [puppet] - 10https://gerrit.wikimedia.org/r/1289371 (owner: 10JHathaway) [15:01:41] (03CR) 10Aleksandar Mastilovic: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [15:01:47] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [15:02:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2243.codfw.wmnet with OS trixie [15:02:09] (03PS19) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [15:02:09] (03PS19) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [15:02:21] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet [15:02:48] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:02:48] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [15:03:06] (03PS2) 10Ladsgroup: mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) [15:03:15] PROBLEM - ganeti-wconfd running on ganeti6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [15:03:16] (03CR) 10Majavah: "rebase to fix CI after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289371" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:04:01] (03CR) 10Ladsgroup: "oh 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:04:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2243: Migration of db2243.codfw.wmnet completed [15:04:17] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-ctrl1002.eqiad.wmnet [15:04:18] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-ctrl1002.eqiad.wmnet [15:04:23] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-ctrl1003.eqiad.wmnet [15:04:24] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-ctrl1003.eqiad.wmnet [15:04:33] (03PS1) 10Gkyziridis: ml-services: Deploy qwen3-14b model in experimental ns. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) [15:05:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1084.eqiad.wmnet [15:05:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1085.eqiad.wmnet [15:05:59] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289368|Enable hCaptcha for wikitext editor on group1 minus meta and itwiki (T425354)]] (duration: 08m 03s) [15:06:02] T425354: hCaptcha: Rollout to all projects - https://phabricator.wikimedia.org/T425354 [15:06:14] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host phab1004.eqiad.wmnet [15:06:18] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet [15:06:23] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:06:23] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet [15:07:52] !log 
mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2084.codfw.wmnet [15:07:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2085.codfw.wmnet [15:08:10] (03CR) 10Gkyziridis: "I've set `amd.com/gpu: "1"` since the model fits in one gpu." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289372 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [15:08:44] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1002.eqiad.wmnet [15:08:50] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet [15:09:14] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-ctrl1003.eqiad.wmnet [15:09:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-ctrl1003.eqiad.wmnet [15:09:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-master-eqiad [15:11:12] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-05-12-210548 to 2026-05-19-145724 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289373 (https://phabricator.wikimedia.org/T282922) [15:11:13] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1003.eqiad.wmnet [15:11:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:32] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:11:42] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 
(https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:11:58] (03CR) 10Ladsgroup: "(the recheck was a brain fart, ignore)" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:12:19] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [15:13:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1085.eqiad.wmnet [15:13:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1086.eqiad.wmnet [15:13:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet [15:13:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [15:14:00] !log brennen@deploy1003 Started deploy [phabricator/deployment@463a948]: deploy phab2002 for T426754 [15:14:04] T426754: Deploy Phab/Phorge 2026-05-19 - https://phabricator.wikimedia.org/T426754 [15:14:04] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Deploy [15:14:17] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: restart [15:14:29] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab1004.eqiad.wmnet [15:14:49] !log brennen@deploy1003 Finished deploy [phabricator/deployment@463a948]: deploy phab2002 for T426754 (duration: 00m 49s) [15:14:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2085.codfw.wmnet [15:14:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2086.codfw.wmnet [15:16:02] !log brennen@deploy1003 Started deploy [phabricator/deployment@463a948]: deploy phab1004 for T426754 [15:16:11] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Deploy [15:16:16] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [15:16:48] !log brennen@deploy1003 Finished deploy [phabricator/deployment@463a948]: deploy phab1004 for T426754 (duration: 00m 46s) [15:17:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet [15:17:11] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [15:17:42] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-05-12-210548 to 2026-05-19-145724 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289373 (https://phabricator.wikimedia.org/T282922) (owner: 10Jforrester) [15:17:59] jmm@cumin2002 drain-node (PID 2825364) is awaiting input [15:18:23] (03PS1) 10Majavah: hiddenparma: Add cwilliams [labs/private] - 10https://gerrit.wikimedia.org/r/1289376 [15:18:46] (03CR) 10Brouberol: [C:03+1] use_linux612_on_bookworm: Bump kernel to 6.12.88 [puppet] - 10https://gerrit.wikimedia.org/r/1289279 (owner: 10Muehlenhoff) [15:19:27] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1055.eqiad.wmnet [15:19:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1055.eqiad.wmnet [15:19:37] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [15:19:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1086.eqiad.wmnet [15:19:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1087.eqiad.wmnet [15:20:01] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-05-12-210548 to 2026-05-19-145724 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289373 (https://phabricator.wikimedia.org/T282922) 
(owner: 10Jforrester) [15:20:26] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [15:20:39] !log reprepro include php8.3_8.3.31-1+wmf11u2 into component/php83 [15:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:14] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:21:32] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:21:44] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:21:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2086.codfw.wmnet [15:22:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2087.codfw.wmnet [15:22:14] 10ops-codfw, 06SRE, 06DC-Ops: Too low optic power on - pfw1-codfw:xe-7/2/0 (Core: cr2-codfw:xe-0/0/1:0 {#122503}) - https://phabricator.wikimedia.org/T426671#11937032 (10cmooney) To clarify what link this is: pfw1b-codfw xe-7/2/0 (“port 18” on the front) <-> cr2-codfw xe-1/0/1:3 It goes via our patch panel... 
[15:22:18] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:22:30] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:22:49] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [15:23:11] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:23:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [15:23:28] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [15:25:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [15:25:21] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1070.eqiad.wmnet [15:25:22] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1070.eqiad.wmnet [15:26:22] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:26:54] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [15:26:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1087.eqiad.wmnet [15:27:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet [15:27:03] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [15:27:09] !log jiji@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1006-1007,1015-1016,1021,1034-1057,1064-1081,1084-1087,1093-1095,1113-1165,1240-1289,1291-1327,1375-1384].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [15:27:18] !log jiji@cumin1003 START - Cookbook 
sre.k8s.pool-depool-node depool for host wikikube-worker[1006-1007,1015-1016].eqiad.wmnet [15:27:32] (03CR) 10Ladsgroup: "sigh, I noticed the stack trace and thought PCC is broken but I now actually see the actual error" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:27:46] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11937068 (10Jclark-ctr) a:03Jclark-ctr [15:28:41] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw [15:28:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [15:29:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2087.codfw.wmnet [15:29:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [15:29:36] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1006-1007,1015-1016].eqiad.wmnet [15:29:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet [15:30:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet [15:30:03] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11937081 (10Jclark-ctr) 05Open→03Resolved Power cables were not inserted all the way into the PDU. 
[15:30:27] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [15:30:54] (03PS3) 10Ladsgroup: mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) [15:31:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet [15:31:57] (03PS4) 10Ladsgroup: mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) [15:32:05] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:32:23] RECOVERY - Host db1249 #page is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [15:32:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission pki1001.eqiad.wmnet - https://phabricator.wikimedia.org/T426739#11937109 (10Jclark-ctr) a:03Jclark-ctr [15:33:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission pki1001.eqiad.wmnet - https://phabricator.wikimedia.org/T426739#11937112 (10Jclark-ctr) 05Open→03Resolved [15:33:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1088.eqiad.wmnet [15:33:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1089.eqiad.wmnet [15:35:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2088.codfw.wmnet [15:35:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2089.codfw.wmnet [15:36:29] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1006-1007,1015-1016].eqiad.wmnet [15:36:31] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1006-1007,1015-1016].eqiad.wmnet [15:36:33] FIRING: 
KubernetesCalicoDown: wikikube-worker1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1015.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:36:41] btullis@cumin1003 reimage (PID 1569612) is awaiting input [15:36:41] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1021,1034-1036].eqiad.wmnet [15:36:51] (03PS1) 10JHathaway: Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) [15:37:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [15:37:23] (03CR) 10CI reject: [V:04-1] Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:39:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1021,1034-1036].eqiad.wmnet [15:39:58] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-tcp-proxy (exit_code=0) rolling reboot on A:tcpproxy and A:tcpproxy [15:40:36] (03PS2) 10JHathaway: Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) [15:40:40] (03PS3) 10Jforrester: Provide abstractwiki-rust, using Trixie-backports [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) [15:40:41] (03CR) 10Ladsgroup: "From the PCC this is interesting:" [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [15:41:03] (03CR) 10JHathaway: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:41:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1089.eqiad.wmnet [15:41:07] (03CR) 10CI reject: [V:04-1] Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:41:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1090.eqiad.wmnet [15:41:28] (03CR) 10Jforrester: "Yeah, it was a bit tricky to fix, but it's now building correctly:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: 10Jforrester) [15:41:32] RESOLVED: KubernetesCalicoDown: wikikube-worker1015.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1015.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:43:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2089.codfw.wmnet [15:43:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2090.codfw.wmnet [15:43:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [15:43:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet [15:45:05] (03PS1) 10Muehlenhoff: Record LDAP access for fromeowmf [puppet] - 10https://gerrit.wikimedia.org/r/1289380 [15:45:44] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1021,1034-1036].eqiad.wmnet [15:45:45] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host 
wikikube-worker[1021,1034-1036].eqiad.wmnet [15:45:55] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1037-1040].eqiad.wmnet [15:46:28] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for fromeowmf [puppet] - 10https://gerrit.wikimedia.org/r/1289380 (owner: 10Muehlenhoff) [15:46:34] (03PS3) 10JHathaway: Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) [15:47:10] (03CR) 10CI reject: [V:04-1] Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:48:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1090.eqiad.wmnet [15:48:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1091.eqiad.wmnet [15:49:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir and A:ncredir [15:49:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2243: Migration of db2243.codfw.wmnet completed [15:49:37] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:49:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2090.codfw.wmnet [15:49:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2091.codfw.wmnet [15:50:13] (03CR) 10CDanis: [C:03+1] profile::cache::haproxy: add webrequest-based ip reputation data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:50:26] (03CR) 10Muehlenhoff: mariadb: Migrate ferm_misc to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 
10Ladsgroup) [15:51:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [15:52:02] (03PS1) 10JHathaway: mariadb: Migrate mariadb internal ferm rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) [15:52:36] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1037-1040].eqiad.wmnet [15:53:07] !log temporarily drop ganeti2029 from the codfw cluster T426199 [15:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:11] T426199: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199 [15:53:15] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [15:55:01] (03PS4) 10JHathaway: Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) [15:55:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [15:55:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1091.eqiad.wmnet [15:55:50] (03PS2) 10JHathaway: mariadb: Migrate mariadb internal ferm rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) [15:55:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2091.codfw.wmnet [15:55:59] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [15:56:05] (03PS1) 10Brouberol: idp: restrict growthbook UI login to users belonging to the growthbook LDAP groups [puppet] - 
10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) [15:56:10] PROBLEM - ganeti-noded running on ganeti2029 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:56:12] PROBLEM - ganeti-confd running on ganeti2029 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:56:43] (03CR) 10CI reject: [V:04-1] idp: restrict growthbook UI login to users belonging to the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [15:56:44] (03PS1) 10Btullis: [airflow-wikidata]: Add a connection for the wikidata-platform S3 user [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289385 (https://phabricator.wikimedia.org/T426764) [15:56:50] FIRING: ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:48] (03PS2) 10Brouberol: idp: restrict growthbook UI login to the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) [15:59:33] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1037-1040].eqiad.wmnet [15:59:36] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1037-1040].eqiad.wmnet [15:59:46] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1041-1044].eqiad.wmnet [16:00:05] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:23] (03CR) 10Btullis: "The corresponding secrets have already been added to the private puppet repo." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289385 (https://phabricator.wikimedia.org/T426764) (owner: 10Btullis) [16:01:52] (03PS2) 10Jasmine: k8s: add wikikube-worker2331 [puppet] - 10https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) [16:02:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1041-1044].eqiad.wmnet [16:02:43] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS trixie [16:04:37] (03CR) 10Bearloga: [C:04-1] idp: restrict growthbook UI login to the growthbook LDAP groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [16:06:32] (03PS1) 10JHathaway: mariadb: Rename profile::mariadb::ferm to profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) [16:06:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289386 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [16:07:13] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs-test1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:08:13] (03CR) 10Jasmine: k8s: add wikikube-worker2331 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) (owner: 10Jasmine) [16:09:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11937325 (10FCeratto-WMF) 05Resolved→03Open a:05Jclark-ctr→03None 
[16:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:15] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1041-1044].eqiad.wmnet [16:09:17] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1041-1044].eqiad.wmnet [16:09:18] 06SRE, 06DBA: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11937329 (10FCeratto-WMF) [16:09:28] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1045-1048].eqiad.wmnet [16:11:38] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1045-1048].eqiad.wmnet [16:12:55] (03CR) 10Clément Goubert: "Actually I'm not sure it does, these routes have `cache-control: no-cache`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [16:13:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-wdqs-test1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:15:34] (03CR) 10Dreamy Jazz: [C:03+1] "Looks like we missed a definition in `InitialiseSettings-labs.php` cc @mszwarc@wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182798 (https://phabricator.wikimedia.org/T280532) (owner: 10Mszwarc) [16:15:38] 06SRE, 06DBA: db1249 is unreachable - https://phabricator.wikimedia.org/T426750#11937397 (10FCeratto-WMF) moving the task back to DBA: the host is up (and with an updated kernel) but before pooling in we should decide if we want to clone it or trust the crash recovery 
https://phabricator.wikimedia.org/P92613... [16:16:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:52] (03PS1) 10Elukey: wikifunctions: raise orchestrator's container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289389 [16:21:04] (03CR) 10Kosta Harlan: "I don't know. If I run `curl https://api.wikimedia.org/service/lw/recommendation/api/v1/translation/page-collection-groups` I see that ACA" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [16:21:24] (03CR) 10Krinkle: 404.php: Force a redirect to /wiki/ in very obvious cases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [16:21:27] (03CR) 10Krinkle: [C:03+1] 404.php: Force a redirect to /wiki/ in very obvious cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [16:21:42] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1045-1048].eqiad.wmnet [16:21:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1045-1048].eqiad.wmnet [16:21:50] RESOLVED: ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:53] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1049-1052].eqiad.wmnet [16:22:28] (03CR) 10CDanis: 
[C:03+1] hiddenparma: Add cwilliams [labs/private] - 10https://gerrit.wikimedia.org/r/1289376 (owner: 10Majavah) [16:22:56] (03CR) 10RLazarus: "Yep, this builds! Let's still wait for Moritz's +1 to make sure it's a sound way of going about this, but if he's happy I'll merge and pub" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: 10Jforrester) [16:23:21] (03CR) 10Majavah: [V:03+2 C:03+2] hiddenparma: Add cwilliams [labs/private] - 10https://gerrit.wikimedia.org/r/1289376 (owner: 10Majavah) [16:24:11] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1049-1052].eqiad.wmnet [16:24:16] (03PS3) 10Brouberol: idp: restrict growthbook UI login to the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) [16:24:30] (03CR) 10Brouberol: idp: restrict growthbook UI login to the growthbook LDAP groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [16:26:13] (03CR) 10Clément Goubert: "Yeah, `no-cache` doesn't mean "don't store in cache", it means "revalidate every time". 
What I'm trying to figure out right now is why the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287731 (https://phabricator.wikimedia.org/T426323) (owner: 10Kosta Harlan) [16:27:54] (03PS4) 10Brouberol: idp: restrict growthbook UI login to the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) [16:28:08] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbprov[1004-1007].eqiad.wmnet with reason: restart [16:28:32] (03PS2) 10Elukey: wikifunctions: raise orchestrator's container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289389 [16:28:42] (03CR) 10Jforrester: "Definitely (on both counts)!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1289012 (https://phabricator.wikimedia.org/T425340) (owner: 10Jforrester) [16:31:42] (03CR) 10Jforrester: "Hmm, yeah, might help." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289389 (owner: 10Elukey) [16:32:19] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1049-1052].eqiad.wmnet [16:32:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1049-1052].eqiad.wmnet [16:32:35] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1053-1056].eqiad.wmnet [16:32:58] (03CR) 10Elukey: "The pods are few, so it is fine resource-wise to test. Worst case we rollback, it is a test worth doing imho, and very cheap as well :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289389 (owner: 10Elukey) [16:33:38] (03CR) 10Elukey: "Feel free to deploy it! 
And/or I can do it tomorrow :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289389 (owner: 10Elukey) [16:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:47] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1053-1056].eqiad.wmnet [16:36:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:37:11] FIRING: PfwCoreBGPDown: ... [16:37:11] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [16:37:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:43:07] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1053-1056].eqiad.wmnet [16:43:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for 
host wikikube-worker[1053-1056].eqiad.wmnet [16:43:19] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1057,1064-1066].eqiad.wmnet [16:46:08] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1057,1064-1066].eqiad.wmnet [16:53:13] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1057,1064-1066].eqiad.wmnet [16:53:15] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1057,1064-1066].eqiad.wmnet [16:53:25] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1067-1070].eqiad.wmnet [16:54:17] (03CR) 10Ladsgroup: mariadb: Migrate ferm_misc to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) (owner: 10Ladsgroup) [16:54:17] (03PS5) 10Ladsgroup: mariadb: Migrate ferm_misc to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289369 (https://phabricator.wikimedia.org/T421705) [16:54:24] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on backupmon1001.eqiad.wmnet with reason: restart [16:55:27] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:55:42] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1067-1070].eqiad.wmnet [16:56:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:57:11] RESOLVED: PfwCoreBGPDown: ... [16:57:11] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [16:57:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:57:52] (03CR) 10Ssingh: [C:03+1] Geo-maps: Update Meta PoPs [dns] - 10https://gerrit.wikimedia.org/r/1282956 (owner: 10Slyngshede) [16:59:12] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T1700) [17:02:16] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbprov[2004-2007].codfw.wmnet with reason: restart [17:02:56] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1067-1070].eqiad.wmnet [17:02:58] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1067-1070].eqiad.wmnet 
[17:03:08] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1071-1074].eqiad.wmnet [17:03:14] !log kamila@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy1003.eqiad.wmnet [17:03:38] (03PS1) 10Jforrester: wmnet: Add new CNAMEs for Wikifunctions replacement evaluators [dns] - 10https://gerrit.wikimedia.org/r/1289393 (https://phabricator.wikimedia.org/T417870) [17:05:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1071-1074].eqiad.wmnet [17:12:30] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1071-1074].eqiad.wmnet [17:12:31] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1071-1074].eqiad.wmnet [17:12:46] (03CR) 10Ssingh: "In the previous round with Gerrit, we were advised that changing any configs on the user end was not an option. In this case, this would m" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [17:12:46] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1075-1078].eqiad.wmnet [17:13:03] !log kamila@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1003.eqiad.wmnet [17:14:58] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1075-1078].eqiad.wmnet [17:15:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:16:12] (03PS1) 
10Jforrester: services: Add Wikifunctions's Rust-based evaluator ingress endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1289395 (https://phabricator.wikimedia.org/T417870) [17:16:14] (03PS1) 10Jforrester: services: Turn Wikifunctions's Rust-based evaluator endpoints to production state [puppet] - 10https://gerrit.wikimedia.org/r/1289396 (https://phabricator.wikimedia.org/T417870) [17:16:17] (03PS1) 10Jforrester: profile::services_proxy::envoy: Add Wikifunctions's Rust-based eval endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1289397 (https://phabricator.wikimedia.org/T417870) [17:17:09] (03CR) 10CI reject: [V:04-1] services: Turn Wikifunctions's Rust-based evaluator endpoints to production state [puppet] - 10https://gerrit.wikimedia.org/r/1289396 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [17:19:23] (03PS1) 10Jforrester: wikifunctions: Add extraFQDNs for the Rust-based evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289399 (https://phabricator.wikimedia.org/T417870) [17:19:54] (03PS2) 10Jforrester: services: Turn Wikifunctions's Rust-based evaluator endpoints to prod state [puppet] - 10https://gerrit.wikimedia.org/r/1289396 (https://phabricator.wikimedia.org/T417870) [17:20:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:21:48] (03PS9) 10Herron: grafana-dashboard-reporter: initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1286507 (https://phabricator.wikimedia.org/T425795) [17:21:56] (03PS6) 10Herron: grafana: add dashboard reporter plugin [puppet] - 10https://gerrit.wikimedia.org/r/1286986 [17:21:58] 
(03CR) 10Jforrester: [C:03+2] wikifunctions: raise orchestrator's container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289389 (owner: 10Elukey) [17:22:12] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet [17:23:26] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1075-1078].eqiad.wmnet [17:23:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1075-1078].eqiad.wmnet [17:23:38] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1079-1081,1084].eqiad.wmnet [17:24:38] (03Merged) 10jenkins-bot: wikifunctions: raise orchestrator's container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289389 (owner: 10Elukey) [17:26:02] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1079-1081,1084].eqiad.wmnet [17:26:56] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:27:02] (03CR) 10Bearloga: [C:03+1] idp: restrict growthbook UI login to the growthbook LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1289384 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [17:27:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288983 (owner: 10Ebernhardson) [17:27:06] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:27:18] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [17:27:51] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [17:27:59] !log jforrester@deploy1003 helmfile [eqiad] START 
helmfile.d/services/wikifunctions: apply [17:28:34] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [17:28:51] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet [17:30:03] !log tchin@deploy1003 Started deploy [analytics/refinery@eeef7f3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eeef7f3d] [17:30:50] !log tchin@deploy1003 Finished deploy [analytics/refinery@eeef7f3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eeef7f3d] (duration: 00m 46s) [17:31:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:32:11] FIRING: PfwCoreBGPDown: ... [17:32:11] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:32:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:33:56] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1079-1081,1084].eqiad.wmnet [17:33:57] !log jiji@cumin1003 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1079-1081,1084].eqiad.wmnet [17:34:08] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1085-1087,1093].eqiad.wmnet [17:35:04] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 14 hosts with reason: restart [17:36:27] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1085-1087,1093].eqiad.wmnet [17:36:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:37:11] RESOLVED: PfwCoreBGPDown: ... [17:37:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:37:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:39:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:40:11] FIRING: PfwCoreBGPDown: ... [17:40:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:40:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:43:42] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1085-1087,1093].eqiad.wmnet [17:43:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1085-1087,1093].eqiad.wmnet [17:43:55] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1094-1095,1113-1114].eqiad.wmnet [17:44:23] (03CR) 10Dzahn: "I was under the impression that might just not be a way around this because port 22 is being used which we can't use externally. 
But if t" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [17:46:12] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1094-1095,1113-1114].eqiad.wmnet [17:49:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:50:11] RESOLVED: PfwCoreBGPDown: ... [17:50:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:50:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-codfw and pfw1b-codfw (208.80.153.203) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=fundraising&var-bgp_neighbor=pfw1b-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:51:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [17:51:02] !log joal@deploy1003 Started deploy [analytics/refinery@eeef7f3] (hadoop-test): Hotfix Hadoop-test [analytics/refinery@eeef7f3d] [17:53:16] !log jiji@cumin1003 START - 
Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1094-1095,1113-1114].eqiad.wmnet [17:53:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1094-1095,1113-1114].eqiad.wmnet [17:53:28] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1115-1118].eqiad.wmnet [17:56:16] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1115-1118].eqiad.wmnet [17:59:37] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncmonitor1001.eqiad.wmnet [18:03:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncmonitor1001.eqiad.wmnet [18:05:34] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1115-1118].eqiad.wmnet [18:05:36] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1115-1118].eqiad.wmnet [18:05:51] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1119-1122].eqiad.wmnet [18:05:59] (03PS2) 10AOkoth: phabricator: replace phab2002 with phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) [18:07:34] !log import gdnsd_3.99.0-alpha3~deb13u1 into trixie-wikimedia-T401832 [18:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:38] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [18:08:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1119-1122].eqiad.wmnet [18:12:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7001.magru.wmnet,cp7009.magru.wmnet} and A:cp [18:17:41] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker[1119-1122].eqiad.wmnet [18:17:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1119-1122].eqiad.wmnet [18:17:53] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1123-1126].eqiad.wmnet [18:20:08] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1123-1126].eqiad.wmnet [18:24:00] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7001.magru.wmnet [18:26:49] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1123-1126].eqiad.wmnet [18:26:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1123-1126].eqiad.wmnet [18:27:00] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1127-1130].eqiad.wmnet [18:29:01] (03CR) 10Dzahn: "Let's not remove the role from the existing failover server until after the new server is setup and ready or we lose the redundancy for th" [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [18:33:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1127-1130].eqiad.wmnet [18:34:56] (03PS2) 10Kimberly Sarabia: Make image browsing available in Beta and TestWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) [18:40:24] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1127-1130].eqiad.wmnet [18:40:26] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1127-1130].eqiad.wmnet [18:40:36] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1131-1134].eqiad.wmnet [18:43:29] !log 
jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1131-1134].eqiad.wmnet [18:45:05] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-magru and A:ncredir [18:45:55] (03Abandoned) 10NMW03: Increase account threshold for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289180 (owner: 10NMW03) [18:46:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:18] apologies, deployments should be fixed now. [18:50:24] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1131-1134].eqiad.wmnet [18:50:26] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1131-1134].eqiad.wmnet [18:50:35] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1135-1138].eqiad.wmnet [18:50:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-codfw [18:52:24] (03CR) 10Ladsgroup: "I'm not really versed in iptables config so correct me if I'm wrong but the output seems wrong" [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [18:52:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1135-1138].eqiad.wmnet [18:54:41] (03CR) 10Majavah: [C:04-1] mariadb: Migrate mariadb internal ferm rule to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [18:55:00] (03CR) 10JHathaway: "you are 
right, that definitely looks broken, I should have looked at the PCC output before tagging you, new patch incoming..." [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [18:55:39] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-magru and A:ncredir [18:56:11] (03CR) 10JHathaway: mariadb: Migrate mariadb internal ferm rule to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [18:58:55] (03CR) 10Ssingh: [C:03+1] sre.cdn.roll-restart-reboot-ncredir: Fix aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1269402 (owner: 10Muehlenhoff) [18:59:37] (03CR) 10Muehlenhoff: mariadb: Migrate mariadb internal ferm rule to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [19:00:17] (03PS3) 10JHathaway: mariadb: Migrate mariadb internal ferm rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) [19:00:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [19:00:41] (03CR) 10JHathaway: mariadb: Migrate mariadb internal ferm rule to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [19:01:15] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [19:02:22] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1135-1138].eqiad.wmnet [19:02:23] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host 
wikikube-worker[1135-1138].eqiad.wmnet [19:02:33] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1139-1142].eqiad.wmnet [19:02:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-magru and A:ncredir [19:02:54] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=97) rolling reboot on A:ncredir-magru and A:ncredir [19:03:37] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir and not A:ncredir-magru and A:ncredir [19:04:45] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1139-1142].eqiad.wmnet [19:06:03] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7009.magru.wmnet [19:06:03] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7001.magru.wmnet,cp7009.magru.wmnet} and A:cp [19:06:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:10:06] (03PS3) 10AOkoth: phabricator: replace phab2002 with phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) [19:11:25] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1139-1142].eqiad.wmnet [19:11:26] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1139-1142].eqiad.wmnet [19:11:36] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1143-1146].eqiad.wmnet [19:11:49] (03CR) 10BCornwall: "The linked Phab task doesn't mention anything related to 
replacement evaluators... is there a reason we're adding two now?" [dns] - 10https://gerrit.wikimedia.org/r/1289393 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [19:11:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:14:27] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1143-1146].eqiad.wmnet [19:16:20] Hi, I need to deploy an unbreak-now EventBus extension patch. I could wait for the window in 45 mins but I may have to be on childcare then. I haven't used spiderpig to deploy anything but config changes. Is it okay if I deploy my thing before the next window? [19:16:26] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/1289407 [19:16:36] (03CR) 10Dzahn: [C:03+1] "yea, we can try that - I kind of expect the same problem with scap and puppet we saw in cloud - but let's verify if that's true or not." [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [19:16:55] jouncebot now [19:17:04] haha [19:17:07] :p [19:17:23] ottomata: Go ahead. [19:17:55] thanks, waiting for jenkins... [19:18:16] can I go ahead and do the cherry pick? jenkins takes a while these days [19:18:34] Yeah, optimistic cherry-pick is reasonable. [19:18:55] (03PS1) 10Ottomata: BugFix: Emit page_change at version 1.6.0 to pick up user wiki_id [extensions/EventBus] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289408 (https://phabricator.wikimedia.org/T426198) [19:20:59] do I need to wait before starting spiderpig deploy? 
[19:21:05] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1143-1146].eqiad.wmnet [19:21:07] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1143-1146].eqiad.wmnet [19:21:08] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7010.magru.wmnet} and A:cp [19:21:13] it is a one character change ;) [19:21:15] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7002.magru.wmnet} and A:cp [19:21:17] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1147-1150].eqiad.wmnet [19:21:26] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:11] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:23:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1147-1150].eqiad.wmnet [19:24:59] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab1003.wikimedia.org with reason: T426563 [19:25:13] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [19:25:22] ottomata: I would wait until the first round of CI passes. 
[19:25:55] !log rebooting gitlab-replica-a.wikimedia.org [19:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:26] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:37] !log tchin@deploy1003 Started deploy [analytics/refinery@eeef7f3] (hadoop-test): ReHotfix Hadoop-test analytics/refinery@eeef7f3d] [19:26:52] dancy: okay ty [19:27:36] https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/35492/console failed but i don't know why [19:27:45] 15:14:05 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/IPInfo/': GnuTLS recv error (-54): Error in the pull function.' [19:27:45] ? [19:28:01] ottomata: unfortunate but known issue. fix for now: retry it [19:28:15] Painful [19:28:17] PROBLEM - Host gitlab-replica-a.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:28:25] okay...with "recheck", yes? [19:28:29] yeah [19:28:31] yes [19:28:38] !log tchin@deploy1003 Finished deploy [analytics/refinery@eeef7f3] (hadoop-test): ReHotfix Hadoop-test analytics/refinery@eeef7f3d] (duration: 02m 00s) [19:28:40] okay... 
[19:30:11] (03CR) 10CI reject: [V:04-1] BugFix: Emit page_change at version 1.6.0 to pick up user wiki_id [extensions/EventBus] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289408 (https://phabricator.wikimedia.org/T426198) (owner: 10Ottomata) [19:30:14] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1147-1150].eqiad.wmnet [19:30:15] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1147-1150].eqiad.wmnet [19:30:26] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1151-1154].eqiad.wmnet [19:30:33] FIRING: KubernetesCalicoDown: wikikube-worker1148.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1148.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:31:35] now that's a different issue it actually found, right? [19:32:42] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1151-1154].eqiad.wmnet [19:32:48] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7002.magru.wmnet [19:32:48] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7002.magru.wmnet} and A:cp [19:32:53] in the cherry pick? 
yeah...but that is unrelated too hm [19:32:54] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7010.magru.wmnet [19:32:55] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7010.magru.wmnet} and A:cp [19:33:01] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [19:33:10] yea, I meant the reason it said V-1 is all different now [19:33:19] RECOVERY - Host gitlab-replica-a.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [19:34:32] !log tchin@deploy1003 Started deploy [analytics/refinery@eeef7f3]: Redeploy v0.3.14 [analytics/refinery@eeef7f3d] [19:34:40] yeah but i think it is also an annoying recheck problem [19:34:44] happened on this change too [19:34:44] https://integration.wikimedia.org/ci/job/quibble-apitests-only-vendor-php83/23599/console [19:34:46] last week [19:34:58] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/1286365/comments/b2c70622_a785d839 [19:35:05] (03CR) 10Ottomata: "recheck" [extensions/EventBus] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289408 (https://phabricator.wikimedia.org/T426198) (owner: 10Ottomata) [19:35:20] jenkins is blocking my unbreak now ;) [19:35:22] ACK, but this known I don't know unlike the other [19:35:28] this one [19:35:33] RESOLVED: KubernetesCalicoDown: wikikube-worker1148.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1148.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:35:45] mutante: which one is known? [19:36:14] GNU TLs error when doing git pull [19:36:17] "The rollback action should undo the last edit" or "unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/IPInfo" ? 
[19:36:20] ah [19:36:20] but not "some tests about API fail" [19:36:20] (03CR) 10LWatson: Make image browsing available in Beta and TestWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) (owner: 10Kimberly Sarabia) [19:36:20] k [19:37:26] okay i got a V+2 from jenkins on the cherry pick [19:37:29] after recheck [19:37:32] proceeding [19:37:47] ottomata: if you want to be informed that known one is https://phabricator.wikimedia.org/T420865 [19:37:56] ack [19:37:58] the other one I was about to say maybe take to developer-experience [19:38:02] but that sounds good :) [19:38:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [extensions/EventBus] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289408 (https://phabricator.wikimedia.org/T426198) (owner: 10Ottomata) [19:39:07] !log tchin@deploy1003 Finished deploy [analytics/refinery@eeef7f3]: Redeploy v0.3.14 [analytics/refinery@eeef7f3d] (duration: 04m 35s) [19:39:48] !log tchin@deploy1003 Started deploy [analytics/refinery@eeef7f3] (thin): Redeploy v0.3.14 THIN [analytics/refinery@eeef7f3d] [19:40:45] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet [19:40:50] (03Merged) 10jenkins-bot: BugFix: Emit page_change at version 1.6.0 to pick up user wiki_id [extensions/EventBus] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1289408 (https://phabricator.wikimedia.org/T426198) (owner: 10Ottomata) [19:41:20] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1289408|BugFix: Emit page_change at version 1.6.0 to pick up user wiki_id (T426198)]] [19:41:24] T426198: Event schemas - mediawiki user entity should be wiki aware - https://phabricator.wikimedia.org/T426198 [19:41:35] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp701[1-2].magru.wmnet} and A:cp [19:41:56] !log tchin@deploy1003 Finished deploy 
[analytics/refinery@eeef7f3] (thin): Redeploy v0.3.14 THIN [analytics/refinery@eeef7f3d] (duration: 02m 07s) [19:42:09] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1151-1154].eqiad.wmnet [19:42:11] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1151-1154].eqiad.wmnet [19:42:17] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp700[3-4].magru.wmnet} and A:cp [19:42:21] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1155-1158].eqiad.wmnet [19:43:58] !log otto@deploy1003 otto: Backport for [[gerrit:1289408|BugFix: Emit page_change at version 1.6.0 to pick up user wiki_id (T426198)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:44:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet [19:44:38] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1155-1158].eqiad.wmnet [19:47:04] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [19:50:17] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:50:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289352 (owner: 10C. 
Scott Ananian) [19:50:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet [19:50:57] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:51:10] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:51:16] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1155-1158].eqiad.wmnet [19:51:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1155-1158].eqiad.wmnet [19:51:29] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1159-1162].eqiad.wmnet [19:51:32] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [19:51:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:52:57] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:53:04] !log otto@deploy1003 otto: Continuing with deployment [19:53:25] looks good. i was freaking out because i couldn't see any events with a grep. 
needed grep --line-buffered GAH [19:53:28] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7011.magru.wmnet [19:53:50] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7003.magru.wmnet [19:54:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1159-1162].eqiad.wmnet [19:55:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [19:56:10] FIRING: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:56:17] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:56:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:56:57] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:57:35] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289408|BugFix: Emit page_change at version 1.6.0 to pick up user wiki_id (T426198)]] (duration: 16m 15s) [19:57:39] T426198: Event schemas - mediawiki user entity should be wiki aware - https://phabricator.wikimedia.org/T426198 [19:58:49] (03CR) 10Jforrester: "The replacement ones are the Rust-based ones; the Node-based ones are already there (and are being replaced by the Rust-based ones)." 
[dns] - 10https://gerrit.wikimedia.org/r/1289393 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [19:59:08] thanks dancy mutante , looking better now [19:59:18] Excellent [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T2000) [20:00:04] tgr, ebernhardson, sbassett, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] o/ [20:00:38] o/ [20:00:49] o/ [20:00:57] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:01:02] my patch should do nothing :) so can be deployed along with any other config patch [20:01:10] RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:01:11] cool. Any others? [20:01:19] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1159-1162].eqiad.wmnet [20:01:21] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1159-1162].eqiad.wmnet [20:01:24] sbassett has a config patch? 
[20:01:32] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1163-1165,1240].eqiad.wmnet [20:01:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:01:56] sbassett (that wmg/wg is a mistake I've often made!) [20:02:31] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief1002.eqiad.wmnet [20:03:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1163-1165,1240].eqiad.wmnet [20:03:51] \o [20:05:44] "Change '1143131' has 1 Depends-On relationship(s) (1138466) but none were deemed relevant by the dependency analysis rules. This may be unexpected." [20:06:00] I guess this is scap's slightly complex way of saying everything is fine? [20:06:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1002.eqiad.wmnet [20:06:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288953 (https://phabricator.wikimedia.org/T426614) (owner: 10Gergő Tisza) [20:06:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289352 (owner: 10C. Scott Ananian) [20:07:05] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief2002.codfw.wmnet [20:07:29] 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#11938205 (10wiki_willy) [20:07:36] tgr_: i'd expect that if (eg) the depends-on patch was already included in wmf.3 and so didn't need to be backported. 
[20:07:44] (03Merged) 10jenkins-bot: Add CommonsFinder to $wgUrlProtocols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288953 (https://phabricator.wikimedia.org/T426614) (owner: 10Gergő Tisza) [20:07:47] cscott: yes, I hope that’s all this is. [20:07:48] (03Merged) 10jenkins-bot: Remove unused ParsoidFragmentInput and ParsoidFragmentSupport [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289352 (owner: 10C. Scott Ananian) [20:08:14] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1288953|Add CommonsFinder to $wgUrlProtocols (T426614)]], [[gerrit:1289352|Remove unused ParsoidFragmentInput and ParsoidFragmentSupport]] [20:08:18] T426614: add "CommonsFinder://" custom scheme to $wgUrlProtocols for native app OAuth2 support - https://phabricator.wikimedia.org/T426614 [20:10:11] !log tgr@deploy1003 cscott, tgr: Backport for [[gerrit:1288953|Add CommonsFinder to $wgUrlProtocols (T426614)]], [[gerrit:1289352|Remove unused ParsoidFragmentInput and ParsoidFragmentSupport]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:10:48] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1163-1165,1240].eqiad.wmnet [20:10:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1163-1165,1240].eqiad.wmnet [20:10:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2002.codfw.wmnet [20:11:00] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1241-1244].eqiad.wmnet [20:12:25] !log tgr@deploy1003 cscott, tgr: Continuing with deployment [20:13:11] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1241-1244].eqiad.wmnet [20:14:39] (03CR) 10Kimberly Sarabia: Make image browsing available in Beta and TestWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) (owner: 10Kimberly Sarabia) [20:16:39] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288953|Add CommonsFinder to $wgUrlProtocols (T426614)]], [[gerrit:1289352|Remove unused ParsoidFragmentInput and ParsoidFragmentSupport]] (duration: 08m 25s) [20:16:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:43] T426614: add "CommonsFinder://" custom scheme to $wgUrlProtocols for native app OAuth2 support - https://phabricator.wikimedia.org/T426614 [20:19:21] ebernhardson: over to you [20:19:35] tgr_: thanks! 
[20:20:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288983 (owner: 10Ebernhardson) [20:20:46] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1241-1244].eqiad.wmnet [20:20:48] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1241-1244].eqiad.wmnet [20:20:59] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1245-1248].eqiad.wmnet [20:21:11] (03CR) 10Ladsgroup: [C:03+1] "LGTM, I can try to deploy it tomorrow when more people from my team are around." [puppet] - 10https://gerrit.wikimedia.org/r/1289382 (https://phabricator.wikimedia.org/T421705) (owner: 10JHathaway) [20:23:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1245-1248].eqiad.wmnet [20:23:59] (03PS2) 10Ladsgroup: Limit $wgThumbLimits to three options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [20:24:09] (03CR) 10CI reject: [V:04-1] Limit $wgThumbLimits to three options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [20:29:26] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir and not A:ncredir-magru and A:ncredir [20:30:45] (03CR) 10Dzahn: [C:04-1] "user also needs to be in the docker group" [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (owner: 10Dzahn) [20:31:52] (03Merged) 10jenkins-bot: Revert^2 "Include xff in search logs" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288983 (owner: 10Ebernhardson) [20:32:20] !log ebernhardson@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1288983|Revert^2 "Include xff in search logs"]] [20:34:13] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1288983|Revert^2 "Include xff in search logs"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:35:15] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7004.magru.wmnet [20:35:15] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp700[3-4].magru.wmnet} and A:cp [20:35:20] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp7012.magru.wmnet [20:35:21] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp701[1-2].magru.wmnet} and A:cp [20:36:23] !log ebernhardson@deploy1003 ebernhardson: Continuing with deployment [20:38:20] (03PS6) 10CDanis: puppetserver: install cidergrinder, run daily grind on primary [puppet] - 10https://gerrit.wikimedia.org/r/1270971 [20:40:32] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288983|Revert^2 "Include xff in search logs"]] (duration: 08m 12s) [20:41:03] (03PS7) 10CDanis: puppetserver: cidergrinder daily grind on primary [puppet] - 10https://gerrit.wikimedia.org/r/1270971 [20:41:18] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1270971 (owner: 10CDanis) [20:41:45] sbassett: all done, you're up (or i can help if needed) [20:42:40] (03PS3) 10Dzahn: zuul: replace user/group setup with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (https://phabricator.wikimedia.org/T395938) [20:43:25] PROBLEM - Host wikikube-worker1246 is DOWN: PING CRITICAL - Packet loss = 100% [20:43:31] (03CR) 10JHathaway: profile::postfix::mx: Mark the SMTP port as intentionally open (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [20:45:03] FIRING: [2x] 
KubernetesCalicoDown: wikikube-worker1245.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:45:51] (03CR) 10Dzahn: [V:03+1 C:03+1] "User[zuul]" [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:46:23] ebernhardson: Ok, thanks. I’m ready to spiderpig... [20:46:30] (happy to run that myself) [20:47:05] +1 [20:47:15] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003.magru.wmnet} and A:liberica [20:47:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [20:47:50] (03PS8) 10CDanis: puppetserver: cidergrinder daily grind on primary [puppet] - 10https://gerrit.wikimedia.org/r/1270971 [20:48:00] (03CR) 10Dzahn: [V:03+1 C:03+1] "@Moritz,good? zuul (923) is reserved in data.yaml and just switching to systemd" [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:48:38] (03CR) 10CDanis: [V:03+1 C:03+2] "Done, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1270971 (owner: 10CDanis) [20:48:40] (03Merged) 10jenkins-bot: Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [20:49:09] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1288999|Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true (T424058)]] [20:49:12] T424058: Properly set the Reporting-Endpoints header and the report-to directive via MediaWiki's CSP implementation - https://phabricator.wikimedia.org/T424058 [20:51:03] !log sbassett@deploy1003 sbassett: Backport for [[gerrit:1288999|Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true (T424058)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:51:09] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003.magru.wmnet} and A:liberica [20:51:37] !log sbassett@deploy1003 sbassett: Continuing with deployment [20:55:49] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288999|Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true (T424058)]] (duration: 06m 40s) [20:55:53] T424058: Properly set the Reporting-Endpoints header and the report-to directive via MediaWiki's CSP implementation - https://phabricator.wikimedia.org/T424058 [20:58:26] (03PS1) 10CDanis: apt: cidergrinder: add bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1289420 [20:59:12] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:00:04] Deploy window Readers 
deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T2100) [21:01:40] (03CR) 10Herron: [C:03+1] apt: cidergrinder: add bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1289420 (owner: 10CDanis) [21:01:52] (03CR) 10CDanis: [C:03+2] apt: cidergrinder: add bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1289420 (owner: 10CDanis) [21:04:36] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [21:08:28] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wdqs1036 to eqiad - jclark@cumin1003" [21:08:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wdqs1036 to eqiad - jclark@cumin1003" [21:08:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:09:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:09:08] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:09:12] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:09:45] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:09:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:10:07] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:10:56] !log 
💔cdanis@apt1002.wikimedia.org ~ 🕔🍺 sudo -i reprepro -C main --ignore=wrongdistribution copy bookworm-wikimedia trixie-wikimedia cidergrinder [21:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:06] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1038.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:12:58] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:13:01] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:15:20] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [21:16:06] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1036 [21:16:10] !log jclark@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wdqs1036 [21:18:33] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:18:36] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1037 [21:18:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1038.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:19:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1037 [21:20:40] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2014.codfw.wmnet [21:20:49] (03PS1) 10Cwhite: logstash: restore sampling of webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/1289423 (https://phabricator.wikimedia.org/T390215) [21:22:07] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART [21:23:19] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:24:54] (03PS1) 10CDanis: puppetserver: cidergrinder: use webproxy for fetch [puppet] - 10https://gerrit.wikimedia.org/r/1289424 [21:25:01] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289424 (owner: 10CDanis) [21:25:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11938332 (10Jclark-ctr) [21:26:30] (03PS2) 10CDanis: puppetserver: cidergrinder: use webproxy for fetch [puppet] - 10https://gerrit.wikimedia.org/r/1289424 [21:26:31] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1289424 (owner: 10CDanis) [21:27:28] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:31:19] (03CR) 10CDanis: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1289424/6799/puppetserver1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1289424 (owner: 10CDanis) [21:31:30] (03CR) 10Scott French: [C:03+1] puppetserver: cidergrinder: use webproxy for fetch [puppet] - 10https://gerrit.wikimedia.org/r/1289424 (owner: 10CDanis) [21:31:55] (03PS9) 10Cwhite: opensearch: move pki::get_cert call into profile module [puppet] - 10https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) [21:33:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1036.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:34:02] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:38:06] (03CR) 10Cwhite: [C:03+2] opensearch: move pki::get_cert call into profile module 
[puppet] - 10https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) (owner: 10Cwhite) [21:38:29] (03PS5) 10Cwhite: beta-logs: rename pki intermediate parameter [puppet] - 10https://gerrit.wikimedia.org/r/1284024 [21:39:43] (03CR) 10Dzahn: [V:03+1 C:03+2] "being bold: I can't compile it and it needs a little while for the cloud puppetmaster to sync with prod. So I will just try it and merge a" [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [21:41:13] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1037 [21:41:30] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1037.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:42:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1037 [21:42:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11938366 (10Jclark-ctr) [21:42:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11938367 (10Jclark-ctr) wdqs1037 is failing to provision will check cabling next time on site [21:46:19] !log jiji@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1006-1007,1015-1016,1021,1034-1057,1064-1081,1084-1087,1093-1095,1113-1165,1240-1289,1291-1327,1375-1384].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [21:48:45] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:56:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE 
(2026-04-24 - 2026-05-15): Q4:rack/setup/install wdqs103[6-8] - https://phabricator.wikimedia.org/T423314#11938382 (10bking) Moving to in progress per IRC conversation with @Jclark-ctr . These are our first servers with the hardw... [21:59:10] (03CR) 10Cwhite: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1286507 (https://phabricator.wikimedia.org/T425795) (owner: 10Herron) [22:04:14] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy2005 - https://phabricator.wikimedia.org/T426791 (10ops-monitoring-bot) 03NEW [22:07:17] (03PS9) 10Cwhite: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [22:07:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11938423 (10Jclark-ctr) a:05Jclark-ctr→03Effib [22:08:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11938427 (10Jclark-ctr) a:05Effib→03jijiki [22:09:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11938429 (10Jclark-ctr) a:05jijiki→03Clement_Goubert [22:09:21] 10ops-codfw, 06SRE, 06DC-Ops: Too low optic power on - pfw1-codfw:xe-7/2/0 (Core: cr2-codfw:xe-0/0/1:0 {#122503}) - https://phabricator.wikimedia.org/T426671#11938432 (10Jhancock.wm) found issue was the xcon between DH7 and DH5. attempted to clean the ports but was unsuccessful. changed the port on the patch... 
[22:10:02] (03PS1) 10Bking: wdqs: Add config for net-new wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1289428 (https://phabricator.wikimedia.org/T423314) [22:16:41] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs2012.codfw.wmnet [22:16:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2012.codfw.wmnet [22:18:02] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on lvs2012.codfw.wmnet with reason: MD RAID failure [22:26:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11938571 (10BCornwall) 05Resolved→03Open p:05Triage→03High Sadly, it appears that the replacement drive is still problematic. From the latest boot log: ` May 11 15:21:16 lvs2012 kernel: md... [22:36:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11938592 (10BCornwall) I see the new disk as "Ready" instead of "Online" in iDRAC. I'm also noticing a discrepancy: lvs2014 has the virtual disks set to the "Write Back" policy while lvs2012 has "W... 
[22:39:17] !log disabling pybal/puppet on lvs2012 due to hardware misconfiguration/failure - T425890 [22:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:21] T425890: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890 [22:46:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-prometheus-check-discovery-certificate-expiry.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:47:17] (03PS1) 10Dzahn: codesearch: let the lock file cleanup run only once an hour [puppet] - 10https://gerrit.wikimedia.org/r/1289433 (https://phabricator.wikimedia.org/T421147) [22:48:45] (03CR) 10Dzahn: [C:03+2] codesearch: let the lock file cleanup run only once an hour [puppet] - 10https://gerrit.wikimedia.org/r/1289433 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [22:54:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:55:12] (03CR) 10Clare Ming: [C:03+2] growtbook: New release that supports status as a filter for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289366 (https://phabricator.wikimedia.org/T421800) (owner: 10Santiago Faci) [22:57:06] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287039 (https://phabricator.wikimedia.org/T393434) (owner: 10Santiago Faci) [22:57:10] (03Merged) 10jenkins-bot: growtbook: New release that supports status as a filter for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289366 
(https://phabricator.wikimedia.org/T421800) (owner: 10Santiago Faci) [22:59:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:59:26] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287039 (https://phabricator.wikimedia.org/T393434) (owner: 10Santiago Faci) [23:05:27] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:08:37] (03PS1) 10Pwangai: WIP: Isolate extreme duration PHPUnit classes in split groups [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1289437 [23:10:27] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:15:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:19:31] (03PS1) 10DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289440 (https://phabricator.wikimedia.org/T344471) [23:19:51] (03CR) 10Bartosz Dziewoński: 404.php: Force a redirect to /wiki/ in very obvious cases (032 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [23:20:47] !log reprepro include php8.3_8.3.31-1+wmf11u2+icu72u1 into component/php83-icu72 [23:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:03] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1289003 (owner: 10TrainBranchBot) [23:40:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1289443 [23:40:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1289443 (owner: 10TrainBranchBot) [23:51:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1289443 (owner: 10TrainBranchBot)