[00:02:02] jouncebot: nowandnext
[00:02:02] No deployments scheduled for the next 1 hour(s) and 57 minute(s)
[00:02:02] In 1 hour(s) and 57 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T0200)
[00:02:08] (CR) Zabe: [C:+2] Start reading from the new file tables on more large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1268291 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[00:02:59] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1003.eqiad.wmnet with OS bookworm
[00:03:03] (Merged) jenkins-bot: Start reading from the new file tables on more large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1268291 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[00:03:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1268291|Start reading from the new file tables on more large wikis (T416548)]]
[00:03:49] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[00:05:30] !log zabe@deploy1003 zabe: Backport for [[gerrit:1268291|Start reading from the new file tables on more large wikis (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:54] !log zabe@deploy1003 zabe: Continuing with sync
[00:10:09] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268291|Start reading from the new file tables on more large wikis (T416548)]] (duration: 06m 22s)
[00:10:12] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[00:33:50] (PS5) Dzahn: zuul: break out mTLS setup into separate class [puppet] - https://gerrit.wikimedia.org/r/1261690 (https://phabricator.wikimedia.org/T421398)
[00:35:32] (CR) Dzahn: [C:-1] "modules/profile/manifests/zuul/base.pp:19" [puppet] - https://gerrit.wikimedia.org/r/1261690 (https://phabricator.wikimedia.org/T421398) (owner: Dzahn)
[00:40:04] (PS1) Stang: Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268293
[00:40:14] (CR) CI reject: [V:-1] Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268293 (owner: Stang)
[00:40:58] (PS2) Stang: Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299)
[00:41:07] (CR) CI reject: [V:-1] Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: Stang)
[00:48:48] (PS3) Stang: Revert "zhwiki: Temporary Logo Change for WP25" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299)
[00:49:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:51:41] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: Stang)
[00:54:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:02:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:03:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:05:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:06:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:06:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:08:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:09:07] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.23 [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1268294 (https://phabricator.wikimedia.org/T420481)
[01:09:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.23 [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1268294 (https://phabricator.wikimedia.org/T420481) (owner: TrainBranchBot)
[01:09:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268295
[01:09:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268295 (owner: TrainBranchBot)
[01:23:23] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.23 [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1268294 (https://phabricator.wikimedia.org/T420481) (owner: TrainBranchBot)
[01:23:29] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1268295 (owner: TrainBranchBot)
[01:44:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:44:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:47:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:50:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:52:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:53:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T0200)
[02:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:29] (PS1) Pppery: Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054)
[02:20:16] (CR) CI reject: [V:-1] Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: Pppery)
[02:23:57] (PS2) Pppery: Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054)
[02:24:44] (CR) CI reject: [V:-1] Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: Pppery)
[02:25:11] (PS3) Pppery: Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054)
[02:26:02] (CR) CI reject: [V:-1] Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054) (owner: Pppery)
[02:27:44] (PS4) Pppery: Drop 1.5x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1268300 (https://phabricator.wikimedia.org/T246054)
[02:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:13] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
[02:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T0300)
[03:01:52] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1268301 (https://phabricator.wikimedia.org/T420481)
[03:01:54] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268301 (https://phabricator.wikimedia.org/T420481) (owner: TrainBranchBot)
[03:02:49] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1268301 (https://phabricator.wikimedia.org/T420481) (owner: TrainBranchBot)
[03:03:11] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.46.0-wmf.23 refs T420481
[03:03:14] T420481: 1.46.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T420481
[03:14:18] (CR) SD0001: [C:+1] Move createwithcontentmodel to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) (owner: Pppery)
[03:39:06] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.46.0-wmf.23 refs T420481 (duration: 35m 55s)
[03:39:09] T420481: 1.46.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T420481
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T0400)
[04:02:29] !log mwpresync@deploy1003 Pruned MediaWiki: 1.46.0-wmf.20 (duration: 02m 27s)
[04:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:49:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:01:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:07:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:10:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:10:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:17:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:20:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:20:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:20:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:21:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Upgrade to 10.11.16.v3
[05:22:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:23:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:26:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:27:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:27:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:29:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2142,2248-2249].codfw.wmnet with reason: Upgrade to 10.11.16.v3
[05:29:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2142,2248-2249].codfw.wmnet,db1169.eqiad.wmnet with reason: Upgrade to 10.11.16.v3
[05:30:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2249: Upgrade package
[05:30:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:30:43] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2249: Upgrade package
[05:31:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:31:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2249: Upgrade package
[05:31:13] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2249: Upgrade package
[05:31:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2249.codfw.wmnet: Upgrade package
[05:31:59] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2249.codfw.wmnet: Upgrade package
[05:33:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2248: Upgrade package
[05:33:58] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2248: Upgrade package
[05:35:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:35:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2248: Upgrade package
[05:35:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2248: Upgrade package
[05:36:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:36:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2142: Upgrade package
[05:36:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[05:36:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[05:36:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2142: Upgrade package
[05:37:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1169: Upgrade package
[05:37:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1169: Upgrade package
[05:39:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:39:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:41:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:41:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:42:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:45:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:45:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1169: Upgrade package
[05:46:02] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:50:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:52:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:52:43] (PS1) Marostegui: control-mariadb-10.11-trixie: Update package version [software] - https://gerrit.wikimedia.org/r/1268434 (https://phabricator.wikimedia.org/T420177)
[05:52:48] (PS1) Muehlenhoff: Update SSH key [puppet] - https://gerrit.wikimedia.org/r/1268435 (https://phabricator.wikimedia.org/T420053)
[05:55:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:55:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:56:49] (CR) Muehlenhoff: [C:+2] Update SSH key [puppet] - https://gerrit.wikimedia.org/r/1268435 (https://phabricator.wikimedia.org/T420053) (owner: Muehlenhoff)
[05:57:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:58:44] SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11792317 (MoritzMuehlenhoff) Open→Resolved p:Triage→Medium a:BTullis→Non...
[05:59:55] (PS1) Muehlenhoff: Add andreawest to wdqs-admins [puppet] - https://gerrit.wikimedia.org/r/1268436 (https://phabricator.wikimedia.org/T422141)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T0600)
[06:00:05] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T0600).
[06:00:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:01:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1169: Upgrade package
[06:01:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2249: Upgrade package
[06:01:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:02:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:02:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc1011: after upgrade
[06:02:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:02:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:02:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1011: after upgrade
[06:02:39] (CR) Marostegui: [C:+2] control-mariadb-10.11-trixie: Update package version [software] - https://gerrit.wikimedia.org/r/1268434 (https://phabricator.wikimedia.org/T420177) (owner: Marostegui)
[06:03:11] (Merged) jenkins-bot: control-mariadb-10.11-trixie: Update package version [software] - https://gerrit.wikimedia.org/r/1268434 (https://phabricator.wikimedia.org/T420177) (owner: Marostegui)
[06:05:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:08:46] (CR) Muehlenhoff: [C:+2] Add andreawest to wdqs-admins [puppet] - https://gerrit.wikimedia.org/r/1268436 (https://phabricator.wikimedia.org/T422141) (owner: Muehlenhoff)
[06:09:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:11:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:14:24] (Restored) Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: Arnaudb)
[06:14:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:16:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2249: Upgrade package
[06:16:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2248: Upgrade package
[06:17:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:17:55] (CR) Filippo Giunchedi: [C:+1] P:toolforge::prometheus: Filter out noisy Kyverno metric [puppet] - https://gerrit.wikimedia.org/r/1268059 (https://phabricator.wikimedia.org/T422287) (owner: Majavah)
[06:18:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:21:03] (PS1) Marostegui: control-mariadb-10.11-bookworm: Update package version [software] - https://gerrit.wikimedia.org/r/1268438 (https://phabricator.wikimedia.org/T420177)
[06:21:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:22:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:26:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:26:20] (CR) Marostegui: "recheck" [software] - https://gerrit.wikimedia.org/r/1268438 (https://phabricator.wikimedia.org/T420177) (owner: Marostegui)
[06:28:40] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1159: after upgrade
[06:29:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2157: after upgrade
[06:32:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2248: Upgrade package
[06:32:10] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2142: Upgrade package
[06:32:10] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:32:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:32:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2142: Upgrade package
[06:32:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:34:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:35:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:35:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:36:43] SRE, SRE-Access-Requests, Wikidata Platform Team, Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11792349 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @AWesterinen-WMF You have been...
[06:37:21] SRE, SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11792352 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[06:37:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:38:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:40:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:42:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:42:13] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
[06:42:55] (CR) Marostegui: [C:+2] control-mariadb-10.11-bookworm: Update package version [software] - https://gerrit.wikimedia.org/r/1268438 (https://phabricator.wikimedia.org/T420177) (owner: Marostegui)
[06:43:43] (Merged) jenkins-bot: control-mariadb-10.11-bookworm: Update package version [software] - https://gerrit.wikimedia.org/r/1268438 (https://phabricator.wikimedia.org/T420177) (owner: Marostegui)
[06:45:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:46:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:47:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:49:27] ops-eqiad, SRE, DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11792410 (jcrespo) backup1012 hosts gerrit backups hourly. As long as we put it down just before maintenance, it could be done any time. If we stop it and do...
[06:49:32] (03CR) 10Muehlenhoff: [C:03+2] Mark WDQS spec tests to run on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1239596 (owner: 10Muehlenhoff) [06:51:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:51:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:52:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:52:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:55:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:56:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:57:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - 
All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:57:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T0700). [07:00:05] kipfel: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:02:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:07:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:08:48] (03CR) 10Arnaudb: gerrit: update sshd timeouts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1266149 (https://phabricator.wikimedia.org/T417996) (owner: 10Arnaudb) [07:08:54] (03Abandoned) 10Arnaudb: gerrit: update sshd timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1266149 (https://phabricator.wikimedia.org/T417996) (owner: 10Arnaudb) [07:09:06] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::prometheus: Filter out noisy Kyverno metric [puppet] - 10https://gerrit.wikimedia.org/r/1268059 (https://phabricator.wikimedia.org/T422287) (owner: 10Majavah) [07:09:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers 
wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:10:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:12:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:14:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1159: after upgrade [07:14:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2157: after upgrade [07:26:03] (03PS1) 10Kevin Bazira: ml-services: set NCCL/RCCL env vars for stable SHM multi-GPU communication in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268445 (https://phabricator.wikimedia.org/T418350) [07:33:49] 06SRE: Another blob upload invalid error when pushing to docker-registry - https://phabricator.wikimedia.org/T422424#11792486 (10A_smart_kitten) [07:34:10] (03CR) 10Jaime Nuche: "🎉 I can see Java 21 on the releases machines. Thank you Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1267301 (owner: 10Dzahn) [07:39:34] (03CR) 10Ozge: [C:03+1] ml-services: set NCCL/RCCL env vars for stable SHM multi-GPU communication in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268445 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [07:39:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:40:42] (03CR) 10Kevin Bazira: [C:03+2] ml-services: set NCCL/RCCL env vars for stable SHM multi-GPU communication in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268445 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [07:43:00] (03Merged) 10jenkins-bot: ml-services: set NCCL/RCCL env vars for stable SHM multi-GPU communication in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268445 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [07:44:11] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [07:45:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:54:29] !log Moved `operations-puppet-tests-bullseye` job from a Jenkins agent running Bullseye to one running Bookworm. The image is still on Bullseye! 
| T421114 [07:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:32] T421114: Rebuild all Jenkins agents VM to Bookworm to support Java 21 - https://phabricator.wikimedia.org/T421114 [07:57:52] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [07:59:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Maintenance [07:59:58] !log push pfw policies - T422204 [07:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:08] !log Upgrade clouddb1017 to mariadb 10.11.16 (v3) T420177 [08:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:11] T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177 [08:05:01] !log Moved Debian Glue jobs to Jenkins agents running Bookworm (integration-agent-pkgbuilder-1005 and integration-agent-pkgbuilder-1006)| T421114 [08:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:04] T421114: Rebuild all Jenkins agents VM to Bookworm to support Java 21 - https://phabricator.wikimedia.org/T421114 [08:05:54] (03Merged) 10jenkins-bot: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:07:59] (03PS1) 10Majavah: P:toolforge::prometheus: Do not store pod info for generic metrics [puppet] - 10https://gerrit.wikimedia.org/r/1268494 (https://phabricator.wikimedia.org/T422287) [08:09:02] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8380/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1268494 (https://phabricator.wikimedia.org/T422287) (owner: 10Majavah) [08:11:53] (03PS1) 10Muehlenhoff: Extend access for olliekryva [puppet] - 10https://gerrit.wikimedia.org/r/1268495 [08:18:07] !log update pfw1-eqiad NAT - T422380 [08:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:50] (03CR) 10Elukey: [C:03+1] ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:22:52] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 42 [08:25:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 42 [08:25:51] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, as per task there doesn't seem anything automated consuming these logs ATM" [puppet] - 10https://gerrit.wikimedia.org/r/1266291 (https://phabricator.wikimedia.org/T422042) (owner: 10Majavah) [08:26:39] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution::web: Rsync logs from all servers [puppet] - 10https://gerrit.wikimedia.org/r/1266291 (https://phabricator.wikimedia.org/T422042) (owner: 10Majavah) [08:28:06] (03CR) 10Filippo Giunchedi: [C:03+1] "Good idea, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1268494 (https://phabricator.wikimedia.org/T422287) (owner: 10Majavah) [08:28:45] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::prometheus: Do not store pod info for generic metrics [puppet] - 10https://gerrit.wikimedia.org/r/1268494 (https://phabricator.wikimedia.org/T422287) (owner: 10Majavah) [08:33:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11792651 (10VRiley-WMF) Under @Papaul guidence, I have doubled checked the PERC controller, I found that one of the cables became unseated. Could you please try this again? 
[08:35:16] (03PS1) 10Majavah: conftool-data: Add dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268503 (https://phabricator.wikimedia.org/T422040) [08:35:18] (03PS1) 10Majavah: hieradata: service: Add dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) [08:35:21] (03PS1) 10Majavah: O:dumps::distribution::server: Configure as LVS realserver [puppet] - 10https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) [08:35:23] (03PS1) 10Majavah: hieradata: Move dumps to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040) [08:35:25] (03PS1) 10Majavah: hieradata: Move dumps to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) [08:37:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [08:38:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti5007.eqsin.wmnet [08:38:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [08:39:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11792688 (10ops-monitoring-bot) Draining ganeti5007.eqsin.wmnet of running VMs [08:40:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [08:41:37] (03CR) 10Filippo Giunchedi: [C:03+1] conftool-data: Add dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268503 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [08:42:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [08:42:37] (03CR) 10Majavah: [C:03+2] conftool-data: Add dumps services [puppet] - 
10https://gerrit.wikimedia.org/r/1268503 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [08:43:47] (03CR) 10Ayounsi: [C:03+2] eqsin: add routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1265456 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:45:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11792735 (10ops-monitoring-bot) Draining ganeti5007.eqsin.wmnet of running VMs [08:45:11] (03Merged) 10jenkins-bot: eqsin: add routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1265456 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [08:46:22] (03PS2) 10Majavah: hieradata: Move dumps to production [puppet] - 10https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) [08:49:54] (03Abandoned) 10Muehlenhoff: pybal-eval-check.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670952 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:50:43] (03Abandoned) 10Muehlenhoff: check_pybal_ipvs_diff.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670938 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:51:16] (03CR) 10Slyngshede: [C:03+1] Extend access for olliekryva [puppet] - 10https://gerrit.wikimedia.org/r/1268495 (owner: 10Muehlenhoff) [08:51:27] (03Abandoned) 10Muehlenhoff: get-raid-status-megacli.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670973 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:52:03] !log tightening the rate limit for non-standard thumbnails (T402792 T414805) [08:52:04] (03Abandoned) 10Muehlenhoff: base/monitoring/check-fresh-files-in-dir.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/630693 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:52:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:08] T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792 [08:52:08] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [08:52:41] (03Abandoned) 10Muehlenhoff: update-ocsp.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670977 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:53:31] (03Abandoned) 10Muehlenhoff: profile::auto_restarts: allow the systemd timer to not be installed [puppet] - 10https://gerrit.wikimedia.org/r/920648 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [09:06:12] (03PS1) 10Arnaudb: gerrit: update service state [puppet] - 10https://gerrit.wikimedia.org/r/1268512 (https://phabricator.wikimedia.org/T422468) [09:10:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:48] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11792820 (10VRiley-WMF) Checked BIOS and iDRAC firmware, everything is up to date. 
[09:12:59] (03CR) 10Muehlenhoff: [C:03+2] Switch our servers to use deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1256371 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:15:39] (03PS1) 10WMDE-Fisch: Enable sub-references on Czech and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) [09:17:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) (owner: 10WMDE-Fisch) [09:21:53] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: add lock_path for trove-guestagent [puppet] - 10https://gerrit.wikimedia.org/r/1265425 (https://phabricator.wikimedia.org/T421857) (owner: 10Filippo Giunchedi) [09:22:09] 06SRE, 06Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11792871 (10hnowlan) 05Open→03Resolved a:03hnowlan Glad that it's sorted out! 
[09:23:26] (03PS1) 10Muehlenhoff: Switch three test systems to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1268515 [09:30:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey) [09:33:58] (03PS2) 10Muehlenhoff: Switch three test systems to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1268515 (https://phabricator.wikimedia.org/T416707) [09:34:56] 06SRE, 10DNS, 06Infrastructure-Foundations, 10netbox, and 2 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11792933 (10ayounsi) What would be a good day to alert about those ? Or even better, not even need an alert ? [09:34:59] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new: Another blob upload invalid error when pushing to docker-registry - https://phabricator.wikimedia.org/T422424#11792934 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:39:33] (03CR) 10Muehlenhoff: [C:03+2] Switch three test systems to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1268515 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [09:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:52] (03PS1) 10Majavah: cr-cloud-vrf: Remove clouddumps NAT exemption rule [homer/public] - 10https://gerrit.wikimedia.org/r/1268516 [09:43:57] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Enable sub-references on Czech and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) (owner: 10WMDE-Fisch) [09:48:53] (03CR) 
10Clément Goubert: "I think @hnowlan@wikimedia.org can give more context, but iiuc it was never actually used/publicized." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942 (owner: 10Clément Goubert) [09:49:02] (03PS1) 10STran: Fix blockConnectedTempAccounts existence error [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268518 (https://phabricator.wikimedia.org/T422388) [09:49:16] (03PS1) 10STran: Fix blockConnectedTempAccounts existence error [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268519 (https://phabricator.wikimedia.org/T422388) [09:49:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268518 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [09:49:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268519 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [09:50:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:55:42] (03CR) 10Hnowlan: [C:03+1] "This wasn't documented as deprecated as it was never documented as launched :) The context behind this is that under the API gateway's rat" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942 (owner: 10Clément Goubert) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1000) [10:00:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251#11793017 (10MoritzMuehlenhoff) @AnneT Can you please clarify which access you need following https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#What_access_should_I_re... [10:00:53] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251#11793018 (10MoritzMuehlenhoff) [10:02:17] (03PS1) 10Daniel Kinzler: rest gateway: use IP as rate limit key for compliant bots [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) [10:04:43] (03PS2) 10Daniel Kinzler: rest gateway: use IP as rate limit key for compliant bots [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) [10:04:50] (03PS1) 10Muehlenhoff: Switch our servers to use deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1268522 (https://phabricator.wikimedia.org/T416707) [10:05:26] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:55] (03CR) 10Dreamy Jazz: [C:03+1] Fix blockConnectedTempAccounts existence error [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268519 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [10:11:00] (03CR) 10Dreamy Jazz: [C:03+1] Fix blockConnectedTempAccounts existence error [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268518 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [10:11:07] (03CR) 10JMeybohm: [C:03+1] wikifeeds: Add request 
definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [10:11:49] !log shift inter-site traffic from exsiting 10G to new 100G transport circuit between eqiad<->codfw T395878 [10:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:57] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11793059 (10MatthewVernon) @Jhancock.wm that should be fine, thanks! [10:13:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251#11793060 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:13:40] (03CR) 10Lucas Werkmeister (WMDE): search: add space-discount for wikidata custom prefix search profiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [10:25:28] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793088 (10gmodena) Hi, Thanks for reaching out. Roughly speaking, we start to throttle connections (for bots that respect maxlag) when the change propagation lag betw... 
[10:46:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [10:47:52] (03CR) 10Muehlenhoff: [C:03+2] Extend access for olliekryva [puppet] - 10https://gerrit.wikimedia.org/r/1268495 (owner: 10Muehlenhoff) [10:50:12] (03PS1) 10Muehlenhoff: Remove ganeti5007 from Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1268531 (https://phabricator.wikimedia.org/T421863) [10:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:20:38] (03CR) 10Ayounsi: [C:03+1] Remove ganeti5007 from Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1268531 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [11:24:07] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti5007 from Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1268531 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [11:26:24] PROBLEM - ganeti-confd running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:26:24] PROBLEM - ganeti-noded running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:26:50] FIRING: ProbeDown: Service ganeti5007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:20] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti5007.eqsin.wmnet [11:36:10] !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS 
admin: depool esams [reason: network maintenance, T416450] [11:36:13] !log depool esams for network maintenance - T416450 [11:36:13] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [11:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:18] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool esams [reason: network maintenance, T416450] [11:39:11] (03PS1) 10Ayounsi: Temporarily geodns GB and IE to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1268538 (https://phabricator.wikimedia.org/T416450) [11:41:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11793372 (10Jclark-ctr) Slot 3 slot 3 has already been removed by system for failure. Only showing 3 drives ` Drive 3 in disk drive bay 1 is operating normally. Mon Apr 06 2026 14:04:14 A predictive fa... [11:45:58] jmm@cumin2002 upgrade-firmware (PID 3496294) is awaiting input [11:47:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11793381 (10Jclark-ctr) @wiki_willy this server is out of warranty and we do not have any 2tb HHD sata drives on hand we would need to order this if it needs to be replaced [11:51:50] RESOLVED: ProbeDown: Service ganeti5007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:55:12] jmm@cumin2002 upgrade-firmware (PID 3496294) is awaiting input [11:56:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [12:00:03] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793405 (10Ladsgroup) FWIW, if the maxlag 
is consistently high but some bots are still editing so fast that are keeping wdqs under pressure, it is a clear violation of... [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1200) [12:01:24] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade [12:03:54] !log ayounsi@cumin1003 conftool action : set/pooled=no; selector: cluster=dnsbox,dc=esams [reason: esams network maintenance] [12:04:47] !log reboot re1.cr1-esams (backup RE) for upgrade - T416450 [12:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:50] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [12:05:27] (03CR) 10Cathal Mooney: [C:03+1] Temporarily geodns GB and IE to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1268538 (https://phabricator.wikimedia.org/T416450) (owner: 10Ayounsi) [12:06:43] I am going to freeze CI operations for some minutes, I am migrating the cache saving system to a new instance [12:08:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [12:08:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti5007.eqsin.wmnet [12:10:17] 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11793443 (10FCeratto-WMF) It's been a month and there has been no warnings from it the replication lag monitoring. We can discuss the next steps in the next team m... 
[12:11:45] !log re0.cr1-esams> request chassis routing-engine master switch - that will cause router's short unavailability - T416450 [12:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:48] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [12:11:59] 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11793447 (10Marostegui) >>! In T396816#11793443, @FCeratto-WMF wrote: > It's been a month and there has been no warnings from it the replication lag monitoring. We... [12:14:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:15:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:15:39] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:15:48] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793455 (10Ladsgroup) The top editor yesterday and the day before was Mahir256 with 40K edits each day. The day before that was @Epidosis with 203K edits(!), the day b... 
[12:15:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:15:57] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti5007.eqsin.wmnet [12:16:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti5007.eqsin.wmnet [12:17:37] I have finished the CI operation [12:19:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:20:29] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793465 (10Mahir256) @Ladsgroup both Epìdosis and I were using QuickStatements (he version 3.0 and I version 2.0); your complaint about tools not respecting maxlag shou... 
[12:20:30] jmm@cumin2002 reimage (PID 3517681) is awaiting input [12:20:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:20:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:21:19] (03PS1) 10Hashar: cloudbackup: do not back integration-castor* instances [puppet] - 10https://gerrit.wikimedia.org/r/1268540 (https://phabricator.wikimedia.org/T421114) [12:21:33] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/1268512 (https://phabricator.wikimedia.org/T422468) (owner: 10Arnaudb) [12:21:43] re1 is back up, working on re0 now [12:22:27] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793473 (10magnusmanske) Oops, I'll fix it in V2 [12:23:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1268257 (https://phabricator.wikimedia.org/T418993) (owner: 10JHathaway) [12:23:43] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793480 (10Ladsgroup) Thanks! 
[12:24:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:25:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:25:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:27:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5007.eqsin.wmnet with OS bookworm [12:27:25] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11793507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5007.eqsin.wmnet with OS bookworm [12:28:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268293 (https://phabricator.wikimedia.org/T414299) (owner: 10Stang) [12:30:19] (03CR) 10DCausse: search: add space-discount for wikidata custom prefix search profiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [12:31:15] (03CR) 10David Caro: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1268540 (https://phabricator.wikimedia.org/T421114) (owner: 10Hashar) [12:36:10] FIRING: [3x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:36:39] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:36:53] (03PS1) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1268546 [12:39:41] !log re1.cr1-esams> request chassis routing-engine master switch - that will cause router's short unavailability - T416450 [12:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:44] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [12:41:10] RESOLVED: [3x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:41:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:42:40] FIRING: [3x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:43:15] (03PS1) 10Majavah: P:toolforge::prometheus: Drop metrics about zero quotas [puppet] - 10https://gerrit.wikimedia.org/r/1268548 (https://phabricator.wikimedia.org/T422287) [12:43:39] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:43:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:45:22] (03PS1) 10Ladsgroup: ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) [12:45:57] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:57] PROBLEM - Host hcaptcha-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:57] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:08] (03CR) 10David Caro: [C:03+2] cloudbackup: do not back integration-castor* instances [puppet] - 10https://gerrit.wikimedia.org/r/1268540 (https://phabricator.wikimedia.org/T421114) (owner: 10Hashar) [12:46:31] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:32] PROBLEM - Host doh3006 
is DOWN: PING CRITICAL - Packet loss = 100% [12:46:32] PROBLEM - Host durum3005 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:32] (03CR) 10Svantje Lilienthal: [C:03+1] Enable sub-references on Czech and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268514 (https://phabricator.wikimedia.org/T420938) (owner: 10WMDE-Fisch) [12:46:32] PROBLEM - Host ps1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:46:34] I am getting '500, Cannot find server' ATS errors [12:46:35] PROBLEM - Host durum3006 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:35] PROBLEM - Host ps1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:46:38] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:46:40] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:46:40] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:46:48] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:48] PROBLEM - Host ncredir3006 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:48] PROBLEM - Host netflow3004 is DOWN: PING CRITICAL - Packet loss = 100% [12:46:48] PROBLEM - Host prometheus3004 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:08] taavi: where? 
[12:47:14] PROBLEM - Host tcp-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:16] trying to load phabricator [12:47:18] (03CR) 10CI reject: [V:04-1] ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) (owner: 10Ladsgroup) [12:47:18] PROBLEM - Host tcp-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:23] cp3071 [12:47:23] taavi: esams is depooled [12:47:24] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:39] taavi: phab works for me [12:47:40] RESOLVED: [3x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:47:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:48:34] routing convergence is taking a bit longer than expected on cr1-esams, but everything is coming up as expected on the cli [12:48:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:48:44] (03PS2) 10Ladsgroup: ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) [12:48:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:48:54] PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:13] hmm I wonder where that is being cached then [12:49:18] (03CR) 10Ssingh: [C:03+1] Temporarily geodns GB and IE to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1268538 (https://phabricator.wikimedia.org/T416450) (owner: 10Ayounsi) [12:49:51] ;; HTTP session (HTTP/2-POST)-([2001:67c:930::1]/dns-query)-(status: 502) [12:49:51] ;; WARNING: can't receive reply from 2001:67c:930::1@443(HTTPS) [12:49:54] (03CR) 10CI reject: [V:04-1] ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) (owner: 10Ladsgroup) [12:50:04] PROBLEM - Host asw1-bw27-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:04] PROBLEM - Host asw1-by27-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:24] RECOVERY - Host ncredir3005 is UP: PING OK - Packet loss = 0%, RTA = 80.72 ms [12:51:24] RECOVERY - Host ncredir3006 is UP: PING OK - Packet loss = 0%, RTA = 80.74 ms [12:51:28] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:32] RECOVERY - Host ps1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.42 ms [12:51:32] RECOVERY - Host ps1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.96 ms [12:51:34] RECOVERY - Host durum3005 is UP: PING OK - Packet loss = 0%, RTA = 80.87 ms [12:51:36] RECOVERY - Host durum3006 is UP: PING OK - Packet loss = 0%, RTA = 81.03 ms [12:51:40] RECOVERY - Host netflow3004 is UP: PING OK - Packet loss = 0%, RTA = 80.69 
ms [12:51:46] RECOVERY - Host tcp-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.56 ms [12:51:46] RECOVERY - Host prometheus3004 is UP: PING OK - Packet loss = 0%, RTA = 84.49 ms [12:51:49] uh.. network glitch in esams? [12:51:50] RECOVERY - Host tcp-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.77 ms [12:51:52] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:51:59] vgutierrez: network maintenance, site is depooled [12:52:03] ack :D [12:52:10] yeah my wikidough requests are still going to esams, and seemingly that's having some problems? and firefox is using stale records because it can't resolve the name? [12:52:24] sukhe: ^^ [12:52:38] I've depooled the dnsbox [12:52:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:52:56] XioNoX: we just got paged for text/esams reachability too [12:53:00] FIRING: [8x] LibericaEtcdErrors: Liberica instance lvs3008:3003 is experiencing etcd issues - https://wikitech.wikimedia.org/wiki/Liberica#LibericaEtcdErrors - https://alerts.wikimedia.org/?q=alertname%3DLibericaEtcdErrors [12:53:04] yep, saw it above [12:53:04] I assume that’s you I will resolve? [12:53:13] XioNoX: but this is Wikimedia DNS -- I presumed that the routers being down also meant that we are not advertising the IPs for everything else as well? [12:53:26] yeah, cr1-esams blackholed some traffic [12:53:26] taavi: I will manually depool [12:53:32] is there something else in esams we should proactively downtime or nothing else expected to page?
[12:53:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:53:56] RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 80.75 ms [12:54:32] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.83 ms [12:54:34] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 83.91 ms [12:54:42] RECOVERY - Host doh3005 is UP: PING OK - Packet loss = 0%, RTA = 80.84 ms [12:54:42] RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.86 ms [12:54:44] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 83.86 ms [12:54:46] RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 80.74 ms [12:54:52] RECOVERY - Host hcaptcha-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.88 ms [12:54:52] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms [12:54:54] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.85 ms [12:55:06] RECOVERY - Host asw1-bw27-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 80.78 ms [12:55:06] RECOVERY - Host asw1-by27-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 80.65 ms [12:55:13] (03PS3) 10Ladsgroup: ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) [12:56:09] (03CR) 10CI reject: [V:04-1] ExternalStore: Start reading and writing from clusters 32 and 33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268549 (https://phabricator.wikimedia.org/T421729) (owner: 10Ladsgroup) [12:56:27] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:56:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-esams and cr1-esams (185.15.59.152) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:56:52] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:57:31] PROBLEM - Host hcaptcha-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:31] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:31] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:41] there is something wrong with cr1-esams [12:57:47] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:51] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:51] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:55] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:58:00] RESOLVED: [8x] LibericaEtcdErrors: Liberica instance lvs3008:3003 is experiencing etcd issues - https://wikitech.wikimedia.org/wiki/Liberica#LibericaEtcdErrors - https://alerts.wikimedia.org/?q=alertname%3DLibericaEtcdErrors [12:58:01] since the upgrade, BGP keeps flapping (cc topranks) [12:58:01] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:58:15] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:58:22] * topranks looking 
[12:58:28] what exactly is flapping? [12:58:40] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:300:fe09::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:58:54] RESOLVED: [6x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (2a02:ec80:300:fe04::1) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:59:25] topranks: all BGP sessions at least [12:59:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage [12:59:48] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::prometheus: Drop metrics about zero quotas [puppet] - 10https://gerrit.wikimedia.org/r/1268548 (https://phabricator.wikimedia.org/T422287) (owner: 10Majavah) [12:59:51] (03PS1) 10MVernon: mvernon: add FIDO ssh key from spare Yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1268554 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1300) [13:00:05] HouseOfM, Tran, and kipfel: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] they all went down at once at least twice [13:00:14] o/ [13:00:17] o/ [13:00:25] * Lucas_WMDE reads backscroll [13:00:32] o/ [13:00:55] o/ [13:01:25] and now my ssh connections via esams are failing due to dns issues, had to switch to the drmrs bastion [13:01:28] not sure if it’s okay to deploy at the moment (sounds like there are errors but not really at the MediaWiki level?) 
[13:01:31] does this mean another cancelled window? [13:01:31] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Drop metrics about zero quotas [puppet] - 10https://gerrit.wikimedia.org/r/1268548 (https://phabricator.wikimedia.org/T422287) (owner: 10Majavah) [13:02:33] HouseOfM: it might :/ we’ll see [13:03:07] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793667 (10Epidosis) Hi, this may be related to my import of data from GND into Wikidata via QS 3.0 which ran from April 1 to April 5 (https://w.wiki/KdP6). I thought Q... [13:03:31] topranks: now `cr1-esams# run show bgp summary` hangs, maybe we're asking too much? [13:03:31] Ok, I'll just hover. thanks [13:03:40] RESOLVED: BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:03:40] (I ran into the same SSH issue as taavi, switching to bast6003 helped) [13:03:54] XioNoX: yes that's as far as I got so far.... it hangs :) [13:04:06] XioNoX: any thoughts if we’re okay to deploy at the moment or should hold? 
[13:04:07] RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 80.98 ms [13:04:07] RECOVERY - Host hcaptcha-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.79 ms [13:04:07] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.95 ms [13:04:11] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 85.68 ms [13:04:11] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 86.04 ms [13:04:15] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.91 ms [13:04:15] `error: the routing subsystem is not responding to management requests` [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:41] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793682 (10magnusmanske) V2 should be fixed now [13:04:41] RECOVERY - Host doh3005 is UP: PING OK - Packet loss = 0%, RTA = 80.88 ms [13:04:41] RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms [13:04:47] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:04:47] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:04:53] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.84 ms [13:05:00] topranks: and going down again.. 
[13:05:19] 07sre-alert-triage, 07Essential-Work, 06Machine-Learning-Team (Q4 FY2025-26): Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T414971#11793683 (10isarantopoulos) [13:05:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage [13:05:47] topranks: forcing the transit and IX sessions down to see if it helps [13:06:01] HouseOfM: in the meantime I’d be interested if you have any thoughts on https://phabricator.wikimedia.org/T421749#11793090 ^^ [13:06:14] 07sre-alert-triage, 06Machine-Learning-Team (Q4 FY2025-26): Alert in need of triage: SmartNotHealthy (instance ml-serve1001:9100) - https://phabricator.wikimedia.org/T414969#11793701 (10isarantopoulos) [13:06:51] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [13:06:51] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:09] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:07:17] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:25] PROBLEM - Host hcaptcha-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:25] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:25] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:29] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:07:29] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:07:44] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251#11793709 (10Jdrewniak) As @AnneT 's manager I approve this request and verify that it is required to confirm 
experiment data for logged-in users. [13:07:47] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:08:42] topranks: should I go with a RE switchover? [13:08:54] XioNoX: What's up with these messages? [13:08:54] (03PS1) 10Muehlenhoff: Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1268555 [13:08:54] RESOLVED: [7x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:08:55] Apr 7 13:07:41 re0.cr1-esams rpd[48830]: bgp_pp_recv:5733: NOTIFICATION sent to 80.249.210.223+50398 (proto): code 6 (Cease) subcode 5 (Connection Rejected), Reason: no group for 80.249.210.223+50398 (proto) from AS 37100 found (peer idled) in master(ae1.380), dropping him [13:09:08] peer is configured [13:09:11] set protocols bgp group IX4 neighbor 80.249.210.223 description "SEACOM Limited" [13:09:11] set protocols bgp group IX4 neighbor 80.249.210.223 peer-as 37100 [13:09:22] topranks: I've set the group to shutdown maybe? [13:09:26] or was it before? 
[13:09:33] ah yes probably that ok [13:09:50] (03CR) 10Vgutierrez: hieradata: service: Add dumps services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:09:53] RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 80.85 ms [13:09:53] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.93 ms [13:09:55] RECOVERY - Host hcaptcha-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.79 ms [13:09:55] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.71 ms [13:09:55] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 82.03 ms [13:09:55] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 87.11 ms [13:09:59] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.94 ms [13:10:38] topranks: should I go with a RE switchover? then reboot RE0 ? [13:10:51] XioNoX: yeah it's worth a shot [13:10:56] FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (62.115.179.162) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit4&var-bgp_neighbor=Arelion - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:11:06] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:09] !log re0.cr1-esams> request chassis routing-engine master switch [13:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:15] I'm connected to the console of re1, console of re0 didn't work when I tried are you on 
that? [13:11:18] (03CR) 10Majavah: hieradata: service: Add dumps services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:11:42] I'm on re0's console [13:11:47] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 75%, RTA = 80.82 ms [13:11:47] RECOVERY - Host doh3005 is UP: PING WARNING - Packet loss = 75%, RTA = 80.81 ms [13:11:47] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:11:47] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:12:38] topranks: fyi you can do `cr1-esams> request routing-engine login re0` to go from one to the other [13:12:44] XioNoX: ok very good [13:12:51] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [13:12:51] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [13:12:51] yup [13:13:43] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:13:43] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:13:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:13:47] I'm rebooting re0 (now backup) [13:14:01] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:14:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1268554 (owner: 10MVernon) [13:14:21] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:25] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% 
[13:14:25] PROBLEM - Host hcaptcha-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:25] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:14:49] hmm, with re1 as master the problem persists [13:15:34] rpd is maxing the cpu [13:15:37] root@re1:~ # ps aux | egrep "CPU|rpd" [13:15:37] USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND [13:15:37] root 53190 97.4 7.3 4456036 3649916 - R 13:11 3:44.35 /usr/libexec64/rpd -N [13:15:39] FIRING: [5x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:16:07] (03CR) 10MVernon: [C:03+2] mvernon: add FIDO ssh key from spare Yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1268554 (owner: 10MVernon) [13:16:08] rebooting re0 [13:16:26] it's not logging much other than the "no group" errors, which I think are ok [13:16:41] RECOVERY - Host doh3005 is UP: PING OK - Packet loss = 0%, RTA = 80.95 ms [13:16:41] RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.81 ms [13:16:43] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 82.00 ms [13:16:43] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 91.45 ms [13:16:49] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:16:49] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:16:49] 
RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 80.87 ms [13:16:53] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.99 ms [13:16:55] RECOVERY - Host hcaptcha-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.88 ms [13:16:55] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.85 ms [13:17:05] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.86 ms [13:17:21] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:17:28] (03CR) 10Vgutierrez: [C:03+1] hieradata: service: Add dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:17:33] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: No response from remote host 185.15.59.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:55] PROBLEM - NTP peers and stratum check on dns3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [13:17:55] PROBLEM - NTP peers and stratum check on dns3004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [13:18:21] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:25] PROBLEM - Host re0.cr1-esams.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:18:51] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:51] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:54] FIRING: [9x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:18:55] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:18:55] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:18:55] I still have no idea how severe these issues currently being worked on are… [13:18:57] (03PS1) 10Arnaudb: gerrit: disable connection re-use [puppet] - 10https://gerrit.wikimedia.org/r/1268557 (https://phabricator.wikimedia.org/T421827) [13:19:10] FIRING: BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:19:13] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:19:16] Lucas_WMDE: not user impacting, but a big loss of redundancy [13:19:16] XioNoX, topranks: is it okay to do the deployment window at the moment or not? [13:19:18] !log taavi@cumin1003 conftool action : set/weight=100; selector: cluster=dumps [13:19:21] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:25] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:25] PROBLEM - Host hcaptcha-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:25] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:29] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11793758 (10Xqt) For the record: The problems began on March 25th or 26th (see Grafana control panel), and it is still an issue currently because the minimum maxlag is 9... 
[13:19:39] RESOLVED: [9x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:19:46] !log taavi@cumin1003 conftool action : set/pooled=no; selector: name=clouddumps1001.wikimedia.org [13:19:49] Lucas_WMDE: is there a risk that the deploy causes an outage? [13:19:51] !log taavi@cumin1003 conftool action : set/pooled=yes; selector: name=clouddumps1002.wikimedia.org [13:20:07] Lucas_WMDE: the main issue is having to work on two incidents at once [13:20:08] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [13:20:15] one of the backports (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1268518) looks pretty safe to me and ought to eliminate a source of logspam IIUC [13:20:27] but I don’t think any of the deploys are urgent either [13:21:13] topranks: re0 is back up [13:21:25] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 185.15.59.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:21:27] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms [13:21:29] RECOVERY - Host re0.cr1-esams.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.82 ms [13:21:31] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 85.67 ms [13:21:41] RECOVERY - Host doh3005 is UP: PING OK - Packet loss = 0%, RTA = 80.82 ms [13:21:41] RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.88 ms [13:21:43] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, 
RTA = 84.84 ms [13:21:51] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:21:51] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:21:51] RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 80.99 ms [13:21:53] XioNoX: I guess we can flip back, re1 isn't happy [13:21:53] topranks: let's go for another switchover? [13:21:55] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.76 ms [13:21:55] RECOVERY - Host hcaptcha-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.99 ms [13:21:55] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.97 ms [13:22:52] (03PS1) 10Majavah: P:dumps::distribution::rsync: Allow LVS health checks in firewall [puppet] - 10https://gerrit.wikimedia.org/r/1268558 (https://phabricator.wikimedia.org/T422040) [13:23:48] topranks: done [13:23:51] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:23:51] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:23:51] re0.cr1-esams> show bgp summary [13:23:51] error: the routing subsystem is not running [13:23:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8382/co" [puppet] - 10https://gerrit.wikimedia.org/r/1268558 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:24:08] so either is still unhappy or it haven't initialized yet [13:24:09] PROBLEM - OSPF status on 
cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:24:10] RESOLVED: BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:24:11] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:24:17] (03PS1) 10Dreamy Jazz: Set $wgGlobalBlockingWikisWhereGlobalBlocksDoNotApply [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268560 (https://phabricator.wikimedia.org/T422220) [13:25:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:25:39] FIRING: [5x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:25:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:25:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5007.eqsin.wmnet with OS bookworm [13:25:56] ok, rpd is back [13:25:59] for how long? 
[13:26:09] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:26:09] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:26:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11793783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5007.eqsin.wmnet with OS bookworm completed: - ganeti5... [13:26:46] XioNoX: FWIW I disabled puppet on netflow3004, manually removed cr1-esams from stats collection and restarted gnmic [13:26:51] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:51] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:57] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:27:07] I seen a lot of crashes / restarts of the telemetry stuff, might just be noise [13:27:12] ok [13:27:19] `18856 root 103 0 4184M 3303M CPU0 0 2:12 100.00% rpd{rpd}` RPD is still at 100% [13:27:23] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:25] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:27:25] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:27:25] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:25] PROBLEM - Host hcaptcha-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:25] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:28:20] we should probably do a cold reboot of the whole box if we can, rather than switching REs. 
[13:28:22] the lack of rpd logs is annoying [13:28:54] FIRING: [12x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:29:18] (03CR) 10Andrew Bogott: "very useful, thank you hashar!" [puppet] - 10https://gerrit.wikimedia.org/r/1268540 (https://phabricator.wikimedia.org/T421114) (owner: 10Hashar) [13:29:25] RESOLVED: [3x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:29:38] topranks: sounds good, let me extend the downtime first [13:29:49] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 80%, RTA = 80.81 ms [13:29:49] RECOVERY - Host doh3005 is UP: PING WARNING - Packet loss = 80%, RTA = 80.88 ms [13:29:51] XioNoX: yeah I'd expect to see more given it's seemingly locking up [13:29:51] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:29:51] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:29:51] RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 80.91 ms [13:29:55] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms [13:29:55] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.86 ms [13:29:55] RECOVERY - Host hcaptcha-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.87 ms [13:29:57] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 84.57 ms 
[13:29:57] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 92.08 ms [13:30:08] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [13:30:18] I guess we got to consider a downgrade to an older JunOS? Are we using this version on any other MX480 ? [13:30:33] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade [13:30:38] topranks: nah it's the first MX480 we upgrade to that, no issues on the MX204s [13:30:49] PROBLEM - SSH on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:30:49] PROBLEM - SSH on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:30:49] PROBLEM - Wikidough DoT Check -IPv4- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:30:49] PROBLEM - Wikidough DoT Check -IPv4- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:30:51] RESOLVED: [5x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:30:52] yeah if the reboot doesn't help downgrade is best [13:31:00] !log reboot cr1-esams [13:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:07] XioNoX: ok, well to an extent better in esams than codfw or 
eqiad [13:31:26] haha yeah [13:31:31] topranks: `cr1-esams> request system reboot both-routing-engines ` ? [13:31:40] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:31:44] (03PS2) 10Majavah: hieradata: service: Add dumps services [puppet] - 10https://gerrit.wikimedia.org/r/1268504 (https://phabricator.wikimedia.org/T422040) [13:31:44] (03PS2) 10Majavah: O:dumps::distribution::server: Configure as LVS realserver [puppet] - 10https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) [13:31:44] (03PS2) 10Majavah: hieradata: Move dumps to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1268506 (https://phabricator.wikimedia.org/T422040) [13:31:45] (03PS3) 10Majavah: hieradata: Move dumps to production [puppet] - 10https://gerrit.wikimedia.org/r/1268507 (https://phabricator.wikimedia.org/T422040) [13:31:51] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [13:31:51] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [13:31:55] XioNoX: +1 yeah [13:32:21] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [13:32:25] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [13:32:25] PROBLEM - Host hcaptcha-proxy3001 is DOWN: PING CRITICAL - Packet loss = 100% [13:32:25] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [13:32:31] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:32:31] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:32:33] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8384/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:32:39] RECOVERY - Wikidough DoT Check -IPv4- on doh3006 is OK: TCP OK - 0.170 second response time on 185.15.59.100 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:32:39] RECOVERY - SSH on doh3006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:32:39] RECOVERY - Wikidough DoT Check -IPv4- on doh3005 is OK: TCP OK - 0.169 second response time on 185.15.59.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:32:39] RECOVERY - SSH on doh3005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:32:41] RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.89 ms [13:32:41] RECOVERY - Host doh3005 is UP: PING OK - Packet loss = 0%, RTA = 80.95 ms [13:32:43] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 86.19 ms [13:32:43] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 85.83 ms [13:32:51] PROBLEM - Bird Internet Routing Daemon on doh3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:32:51] PROBLEM - Bird Internet Routing Daemon on doh3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:32:51] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.88 ms [13:32:51] RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 82.23 ms [13:32:55] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.66 ms [13:32:55] RECOVERY - Host hcaptcha-proxy3001 is UP: PING OK - Packet loss = 0%, RTA = 80.89 ms [13:32:55] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK 
- Packet loss = 0%, RTA = 80.78 ms [13:34:19] (03CR) 10Clément Goubert: [C:03+1] rest gateway: use IP as rate limit key for compliant bots [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) (owner: 10Daniel Kinzler) [13:35:09] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:35:10] (03CR) 10Arnaudb: [C:03+2] gerrit: update service state [puppet] - 10https://gerrit.wikimedia.org/r/1268512 (https://phabricator.wikimedia.org/T422468) (owner: 10Arnaudb) [13:35:13] topranks: it's also quite a minor upgrade -> 23.4R2-S3.9 to 23.4R2-S7.4. [13:35:55] wow yeah ok [13:35:57] (03CR) 10Eevans: [C:03+2] sessionstore: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266389 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [13:36:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-esams and cr1-esams (185.15.59.152) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:36:40] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:36:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:ae0 (cr1-esams:ae0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:37:09] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:37:12] (03CR) 10Filippo Giunchedi: [C:03+1] P:dumps::distribution::rsync: Allow LVS health checks in firewall [puppet] - 10https://gerrit.wikimedia.org/r/1268558 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:37:31] re0.cr1-esams> show system processes extensive | grep rpd [13:37:31] 18543 root 20 0 1139M 263M kqread 0 0:03 0.68% rpd{rpd} [13:37:37] but interfaces are not up yet [13:37:43] so no BGP sessions yet [13:37:45] RECOVERY - NTP peers and stratum check on dns3003 is OK: NTP OK: Offset 0.001748767 secs, stratum=1 https://wikitech.wikimedia.org/wiki/NTP [13:37:45] RECOVERY - NTP peers and stratum check on dns3004 is OK: NTP OK: Offset 5.7805e-05 secs, stratum=1 https://wikitech.wikimedia.org/wiki/NTP [13:38:23] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution::rsync: Allow LVS health checks in firewall [puppet] - 10https://gerrit.wikimedia.org/r/1268558 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:38:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:39:14] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:40:28] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [13:40:31] T418417: Upgrade Cassandra clusters to 4.1.11 - 
https://phabricator.wikimedia.org/T418417 [13:40:42] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1268555 (owner: 10Muehlenhoff) [13:40:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:41:19] !log installed cumin v6.0.0 on cumin2002 [13:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] interfaces are up [13:41:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:41:53] host replies to ping [13:42:27] RPD looks fine - 25.10% rpd{rpd} [13:42:31] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-bw27-esams.mgmt.esams.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:42:46] (03PS2) 10Muehlenhoff: eqsin routed ganeti: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [13:42:51] and right when I say that... 
100.00% rpd{rpd} [13:43:17] PROBLEM - Host 2a02:ec80:300:1:185:15:59:2 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:17] PROBLEM - Host 2a02:ec80:300:2:185:15:59:34 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:21] (03CR) 10Muehlenhoff: [C:03+1] "I made an edit to change the initial node, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [13:43:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:43:46] FIRING: GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [13:43:49] PROBLEM - Wikidough DoT Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:43:49] PROBLEM - Wikidough DoT Check -IPv6- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:43:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:44:01] PROBLEM - Wikidough DoH Check -IPv6- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:44:01] PROBLEM - Wikidough DoH Check -IPv6- on doh3006 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:44:22] (03PS1) 10AikoChou: ml-services: add EVENTGATE env vars for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268562 [13:44:39] RECOVERY - Wikidough DoT Check -IPv6- on doh3006 is OK: TCP OK - 0.168 second response time on 2a02:ec80:300:3:185:15:59:100 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:44:39] RECOVERY - Wikidough DoT Check -IPv6- on doh3005 is OK: TCP OK - 0.169 second response time on 2a02:ec80:300:3:185:15:59:98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:44:44] (03CR) 10Slyngshede: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1268546 (owner: 10Muehlenhoff) [13:44:46] FIRING: GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in esams - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [13:44:53] RECOVERY - Wikidough DoH Check -IPv6- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:44:53] RECOVERY - Wikidough DoH Check -IPv6- on doh3005 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:44:55] RECOVERY - Host 2a02:ec80:300:1:185:15:59:2 is UP: PING OK - Packet loss = 0%, RTA = 80.62 ms [13:44:55] RECOVERY - Host 2a02:ec80:300:2:185:15:59:34 is UP: PING OK - Packet loss = 0%, RTA = 80.17 ms [13:45:07] (03PS1) 10Arnaudb: Revert "gerrit: update service state" [puppet] - 
10https://gerrit.wikimedia.org/r/1268563 [13:45:09] topranks: next step is downgrade, unless you see something else? [13:45:11] XioNoX: same thing, once interfaces settled down rpd starts to seem to work, some sessions come up but it's hogging CPU. "show bgp summary" worked at first but then stops responding [13:45:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:45:51] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:45:58] (03CR) 10Arnaudb: [V:03+2 C:03+2] Revert "gerrit: update service state" [puppet] - 10https://gerrit.wikimedia.org/r/1268563 (owner: 10Arnaudb) [13:46:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS trixie [13:46:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:46:40] (03CR) 10JHathaway: [C:03+2] bastions: add bast4006 [puppet] - 10https://gerrit.wikimedia.org/r/1268257 (https://phabricator.wikimedia.org/T418993) (owner: 10JHathaway) [13:46:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:47:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11793876 
(10Jclark-ctr) [13:47:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11793882 (10Jclark-ctr) Ran provision script on 1056. [13:47:43] XioNoX: yeah nothing is jumping out at me here [13:47:54] rpd is getting restarted, but probably just because it's not responsive to mgd [13:48:24] (03CR) 10Jelto: "-1 I could not explain why this would case an outage" [puppet] - 10https://gerrit.wikimedia.org/r/1268563 (owner: 10Arnaudb) [13:48:46] RESOLVED: [2x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [13:48:57] (03CR) 10Lucas Werkmeister (WMDE): search: add space-discount for wikidata custom prefix search profiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [13:49:06] not sure what's going on but I can't login to juniper.net to download files, and I can't ssh to apt1002.wikimedia.org [13:49:19] (03PS1) 10Arnaudb: Revert^2 "gerrit: update service state" [puppet] - 10https://gerrit.wikimedia.org/r/1268564 [13:49:46] RESOLVED: GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in esams - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [13:50:07] XioNoX: let me see [13:50:26] (03PS4) 10Majavah: nftables: Fix issues around virtual resource dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1260721 [13:50:26] (03PS3) 10Majavah: 
P:base: Make nftables::set resources always defined [puppet] - 10https://gerrit.wikimedia.org/r/1266205 [13:50:26] (03PS16) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [13:50:28] (03PS16) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [13:50:54] (03CR) 10Majavah: nftables: Fix issues around virtual resource dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260721 (owner: 10Majavah) [13:51:02] (03CR) 10Majavah: P:base: Make nftables::set resources always defined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266205 (owner: 10Majavah) [13:51:53] (03CR) 10Majavah: "I think you're just seeing issues from the current network maintenance/outage?" [puppet] - 10https://gerrit.wikimedia.org/r/1268563 (owner: 10Arnaudb) [13:51:55] !log jmm@dns1004 START - running authdns-update [13:52:05] (03CR) 10Dpogorzelski: [C:03+1] ml-services: add EVENTGATE env vars for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268562 (owner: 10AikoChou) [13:52:07] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1268564 (owner: 10Arnaudb) [13:52:20] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [13:52:31] topranks: apt1002 only have too old junos versions [13:53:04] !log jmm@dns1004 END - running authdns-update [13:53:05] (03CR) 10Arnaudb: [V:03+2 C:03+2] "indeed, I was mistaken by the timing. 
this has been reverted with 1268564" [puppet] - 10https://gerrit.wikimedia.org/r/1268563 (owner: 10Arnaudb) [13:53:09] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade [13:53:15] (03CR) 10Arnaudb: [V:03+2 C:03+2] Revert^2 "gerrit: update service state" [puppet] - 10https://gerrit.wikimedia.org/r/1268564 (owner: 10Arnaudb) [13:53:59] XioNoX: I'm downloading 23.4R2-S3.9 to bast3007 now, I'll scp over when it is done [13:54:06] thx [13:54:45] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1268546 (owner: 10Muehlenhoff) [13:54:51] !log jmm@dns1004 START - running authdns-update [13:56:04] !log jmm@dns1004 END - running authdns-update [13:56:53] (03CR) 10Vgutierrez: [C:03+1] "after these two get merged, IPIP can be checked using `sudo cookbook sre.loadbalancer.check-ipip --dc eqiad --query "P{clouddumps.*}" dump" [puppet] - 10https://gerrit.wikimedia.org/r/1268505 (https://phabricator.wikimedia.org/T422040) (owner: 10Majavah) [13:57:04] !log jmm@dns1004 START - running authdns-update [13:57:44] (03Merged) 10jenkins-bot: wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [13:58:15] !log jmm@dns1004 END - running authdns-update [13:58:21] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [13:58:23] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [13:58:46] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [13:59:10] FIRING: BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:00:04] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1400) [14:00:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:01:27] !log cgoubert@cumin1003 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker1273.eqiad.wmnet [14:01:28] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker1273.eqiad.wmnet [14:02:02] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251#11794024 (10Jdrewniak) [14:02:19] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [14:03:22] XioNoX: scp started, another 5 mins [14:03:31] topranks: rgr [14:03:34] !log cgoubert@cumin1003 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker1273.eqiad.wmnet [14:03:35] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker1273.eqiad.wmnet [14:04:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:04:11] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1262033 (owner: 10PipelineBot) [14:04:11] (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264583 (owner: 10PipelineBot) [14:04:12] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1262032 (owner: 10PipelineBot) [14:05:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:07:29] XioNoX: ok done, /var/tmp/junos-vmhost-install-mx-x86-64-23.4R2-S3.9.tgz [14:07:52] alright, disabling graceful-switchover then will push to re1 [14:09:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:11:00] XioNoX: before we do the software downgrade might it be worth doing a "request tech-support" or something? 
[14:11:15] topranks: good idea [14:11:25] (03PS1) 10Clément Goubert: sre.k8s.renumber-node: Fix pool-depool coobook call [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 [14:11:42] doing it [14:11:49] cool thanks [14:12:47] (03PS2) 10Clément Goubert: sre.k8s.renumber-node: Fix pool-depool coobook call [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 [14:13:15] topranks: Wrote 158743 lines of output to '/var/log/RSI_NODE_0-rpd-issue.txt' [14:14:06] and /var/tmp/NODE_0_LOGS_RSI.tgz for the whole thing [14:14:40] o/ I'm investigating the Undefined array key errors coming from the TestKitchen extension (re. T422112) [14:14:41] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [14:15:09] I'm going to drop into a shell on the deployment host to inspect a cached value on a couple of wikis [14:16:39] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [14:16:42] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [14:16:43] (03PS3) 10Clément Goubert: sre.k8s.renumber-node: Fix pool-depool coobook call [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 [14:19:25] FIRING: [3x] BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:19:49] (03CR) 10Mszwarc: [C:03+1] Set $wgGlobalBlockingWikisWhereGlobalBlocksDoNotApply [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268560 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [14:20:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:23:38] !log cgoubert@cumin1003 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker1273.eqiad.wmnet [14:23:42] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1273.eqiad.wmnet [14:23:44] rebooting RE1 [14:24:15] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1273.eqiad.wmnet [14:24:25] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and fe80::7ee2:caff:fede:4a67 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:24:32] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1273.eqiad.wmnet with OS bookworm [14:25:00] !log cgoubert@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1273 [14:25:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:25:31] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [14:27:09] (03PS4) 10Clément Goubert: sre.k8s.renumber-node: Fix pool-depool coobook call [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 (https://phabricator.wikimedia.org/T421711) [14:29:52] (03PS1) 10JMeybohm: wikikube: Request 1 CPU and 500M memory per replica [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1268569 (https://phabricator.wikimedia.org/T422455) [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1430) [14:30:15] topranks: re1 is back up [14:30:25] (03CR) 10LSobanski: "Approved in the IF meeting." [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [14:30:42] !log cgoubert@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1273 - cgoubert@cumin1003" [14:30:47] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1273 - cgoubert@cumin1003" [14:30:47] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:30:47] !log cgoubert@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1273.eqiad.wmnet 128.48.64.10.in-addr.arpa 8.2.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:30:51] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1273.eqiad.wmnet 128.48.64.10.in-addr.arpa 8.2.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:30:52] !log cgoubert@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1273 [14:30:57] !log re0.cr1-esams> request chassis routing-engine master switch [14:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:36] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11794192 (10RobH) Mainboard swap will occur on Wednesday, April 8th @ 10:00Singapore time which is Tuesday, Tuesday April 7th 18:00 Pacific. 
I'll be online for the duration of the work and to ensure... [14:32:48] 06SRE, 06Infrastructure-Foundations, 10netops: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11794201 (10LSobanski) p:05Triage→03Low [14:33:23] (03CR) 10Scott French: [C:03+1] wikikube: Request 1 CPU and 500M memory per replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268569 (https://phabricator.wikimedia.org/T422455) (owner: 10JMeybohm) [14:33:45] (03CR) 10Clément Goubert: [C:03+1] wikikube: Request 1 CPU and 500M memory per replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268569 (https://phabricator.wikimedia.org/T422455) (owner: 10JMeybohm) [14:34:07] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:34:41] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1273 [14:34:41] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1273 [14:35:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:35:40] FIRING: [3x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown 
[14:35:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:36:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2006.codfw.wmnet with OS trixie [14:37:17] (03PS1) 10Scott French: wikikube: Temporarily double coredns replicas (12) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268573 (https://phabricator.wikimedia.org/T422455) [14:38:14] (03CR) 10Clément Goubert: [C:03+1] wikikube: Temporarily double coredns replicas (12) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268573 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French) [14:38:30] (03CR) 10JMeybohm: [C:03+1] wikikube: Temporarily double coredns replicas (12) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268573 (https://phabricator.wikimedia.org/T422455) (owner: 10Scott French) [14:38:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:40:07] topranks: re1 is now live on the new code version, but rpd is at 100% again... 
[14:40:25] wait, no [14:40:28] now it's down [14:40:29] good [14:40:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:40:40] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:40:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:41:36] XioNoX: yeah it shot up but that's expected on start [14:41:40] (03PS1) 10Zabe: Start reading from new file tables everwhere except enwiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268574 (https://phabricator.wikimedia.org/T416548) [14:41:42] seems to have calmed down.... 
let's see [14:41:58] pushing the same code version to re0 [14:43:56] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:49:13] (03CR) 10JHathaway: [C:03+1] nftables: Fix issues around virtual resource dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260721 (owner: 10Majavah) [14:52:50] (03PS1) 10Muehlenhoff: Add annet to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1268575 (https://phabricator.wikimedia.org/T422251) [14:53:24] (03CR) 10ArielGlenn: [C:03+1] "With the caveat that I'd like to understand the practical impacts (which groups of users are liable to see issues after having complied wi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) (owner: 10Daniel Kinzler) [14:54:59] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1347.eqiad.wmnet [14:55:00] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1347.eqiad.wmnet [14:55:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11794412 (10Jgreen) >>! In T414374#11792638, @VRiley-WMF wrote: > Under @Papaul guidence, I have doubled checked the PERC controller, I found that one of the cables became unseat... 
[14:56:08] (03CR) 10Muehlenhoff: [C:03+2] Add annet to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1268575 (https://phabricator.wikimedia.org/T422251) (owner: 10Muehlenhoff) [14:56:13] FIRING: CertAlmostExpired: Certificate for service opensearch-test:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-test:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:56:15] rebooting re0 [14:56:33] (03CR) 10Scott French: [C:03+2] wikikube: Request 1 CPU and 500M memory per replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268569 (https://phabricator.wikimedia.org/T422455) (owner: 10JMeybohm) [14:57:28] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr1-esams,cr1-esams IPv6,re0.cr1-esams.mgmt with reason: router upgrade [14:57:28] !log cgoubert@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage [14:57:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for annet - https://phabricator.wikimedia.org/T422251#11794426 (10MoritzMuehlenhoff) 05Open→03Resolved @AnneT Your access is enabled and being rolled out by Puppet over the next 30 minutes. I'm closing... [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1500). [15:00:27] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#11794461 (10herron) >>! In T196336#11723288, @gerritbot wrote: > Change #1253576 **merged** by Herron: > %%%[operations/puppet@production] icinga:... [15:00:41] (03CR) 10JHathaway: "taavi, do have a link to the pcc output?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1266205 (owner: 10Majavah) [15:01:45] (03CR) 10JHathaway: [C:03+1] Switch our servers to use deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1268522 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [15:03:12] (03PS1) 10Muehlenhoff: Update SSH key for sg912 [puppet] - 10https://gerrit.wikimedia.org/r/1268578 (https://phabricator.wikimedia.org/T422363) [15:04:28] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage [15:04:33] (03Merged) 10jenkins-bot: wikikube: Request 1 CPU and 500M memory per replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268569 (https://phabricator.wikimedia.org/T422455) (owner: 10JMeybohm) [15:04:38] hopefully last `request chassis routing-engine master switch` for cr1-esams [15:05:14] (03CR) 10Muehlenhoff: [C:03+2] Update SSH key for sg912 [puppet] - 10https://gerrit.wikimedia.org/r/1268578 (https://phabricator.wikimedia.org/T422363) (owner: 10Muehlenhoff) [15:07:46] (03CR) 10AikoChou: [C:03+2] ml-services: add EVENTGATE env vars for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268562 (owner: 10AikoChou) [15:07:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:08:00] (03PS1) 10JMeybohm: coredns: Add a switch to enable autopath [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268579 (https://phabricator.wikimedia.org/T422455) [15:08:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:08:38] (03CR) 10Scott French: sre.k8s.renumber-node: Fix pool-depool coobook call (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 (https://phabricator.wikimedia.org/T421711) (owner: 10Clément Goubert) [15:08:55] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11794508 (10MoritzMuehlenhoff) 05Open→03Resolved @SGupta-WMF Your key has been updated and will be rolled out by Puppet over the next 30 minutes. I'm resol... [15:09:24] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new: Another blob upload invalid error when pushing to docker-registry - https://phabricator.wikimedia.org/T422424#11794510 (10dancy) >>! In T422424#11791512, @Scott_French wrote: > Yes, this looks like the "classic" read-your-writes issue in the swift driver... 
[15:09:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:09:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:10:10] (03CR) 10Clément Goubert: sre.k8s.renumber-node: Fix pool-depool coobook call (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 (https://phabricator.wikimedia.org/T421711) (owner: 10Clément Goubert) [15:10:11] (03Merged) 10jenkins-bot: ml-services: add EVENTGATE env vars for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268562 (owner: 10AikoChou) [15:10:37] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:11:51] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11794525 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:12:26] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:12:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:13:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:13:13] (03CR) 10Scott French: [C:03+1] sre.k8s.renumber-node: Fix pool-depool coobook call (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 (https://phabricator.wikimedia.org/T421711) (owner: 10Clément Goubert) [15:13:16] topranks: all good, re0 is live and rpd CPU is down [15:13:24] I mean, is *normal* [15:14:11] (03PS1) 10Muehlenhoff: Record LDAP access for kmusiolek [puppet] - 10https://gerrit.wikimedia.org/r/1268582 (https://phabricator.wikimedia.org/T420459) [15:14:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:14:47] !log cr1-esams - re-enabling external peers [15:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:15:17] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.renumber-node: Fix pool-depool coobook call [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 
(https://phabricator.wikimedia.org/T421711) (owner: 10Clément Goubert) [15:17:57] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: Fix pool-depool coobook call [cookbooks] - 10https://gerrit.wikimedia.org/r/1268568 (https://phabricator.wikimedia.org/T421711) (owner: 10Clément Goubert) [15:20:15] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:20:37] !log restart swift object/container replicaton services on ms-be1069 [15:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:14] if Tran is still around, I wouldn’t be opposed to deploying the CheckUser backport now [15:21:43] assuming the router issues are sorted out (any objections XioNoX?), and jelto arnoldokoth mutante arnaudb are okay with this happening during SRE Collaboration Services office hours [15:21:55] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:21:58] Lucas_WMDE: lgtm [15:22:03] thx for waiting [15:22:15] SRE Collaboration Services office hours are done [15:22:21] also kipfel [15:22:23] ok thanks jelto! [15:23:04] (I think I saw someone get tripped up by the logspam during another deploy – “did I cause those warnings?” – that’s why I’m kinda keen to deploy that backport ^^) [15:24:10] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1273.eqiad.wmnet with OS bookworm [15:24:25] I'd also like to backport [15:24:32] Tran is definitely away, but I can do it for them [15:24:48] (or test for them as you want Lucas) [15:24:48] !log homer cr*eqiad* commit '' [15:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:57] either works for me, do you want to deploy or shall I? 
[15:25:06] !log homer lsw1-d1-eqiad* commit '' [15:25:06] as long as we have someone who knows how to test it ^^ [15:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:15] I could always bundle in my other changes at the same time [15:25:18] So I can handle it [15:25:22] sure [15:25:23] go ahead [15:25:52] (03PS1) 10Dreamy Jazz: GlobalBlockLocalStatusLookup: Remove unused constructor param [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268584 (https://phabricator.wikimedia.org/T422220) [15:26:50] (03PS3) 10Dreamy Jazz: GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268585 (https://phabricator.wikimedia.org/T422220) [15:27:29] jouncebot: nowandnext [15:27:29] For the next 0 hour(s) and 32 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1500) [15:27:29] In 0 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1600) [15:27:45] (03CR) 10Dreamy Jazz: [C:03+2] GlobalBlockLocalStatusLookup: Remove unused constructor param [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268584 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:27:49] (03CR) 10Dreamy Jazz: [C:03+2] GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268585 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:27:53] topranks, sukhe, we should be good to repool esams [15:27:58] XioNoX: cr1-esams seems to be ok again [15:28:00] heh yeah [15:28:06] (03CR) 10CI reject: [V:04-1] GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1268585 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:28:08] (03PS1) 10Dreamy Jazz: GlobalBlockLocalStatusLookup: Remove unused constructor param [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268586 (https://phabricator.wikimedia.org/T422220) [15:28:15] (03CR) 10Dreamy Jazz: [C:03+2] GlobalBlockLocalStatusLookup: Remove unused constructor param [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268586 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:28:27] (03PS2) 10Dreamy Jazz: GlobalBlockLocalStatusLookup: Remove unused constructor param [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268586 (https://phabricator.wikimedia.org/T422220) [15:28:44] topranks: that was a lot of steps to go back to square 1 [15:28:57] !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: network maintenance over, T416450] [15:29:00] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool esams [reason: network maintenance over, T416450] [15:29:00] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450 [15:29:02] XioNoX: ok! 
gl [15:29:03] (03PS4) 10Dreamy Jazz: GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268585 (https://phabricator.wikimedia.org/T422220) [15:29:09] (03CR) 10Dreamy Jazz: [C:03+2] GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268585 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:29:17] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for kmusiolek [puppet] - 10https://gerrit.wikimedia.org/r/1268582 (https://phabricator.wikimedia.org/T420459) (owner: 10Muehlenhoff) [15:30:12] XioNoX: heh yeah.... kind of surprising tbh, even if it did act like that and we had a load of weird log messages from rpd it would make more sense [15:30:25] I don't really have any sense of what was happening [15:30:39] I guess we open a JTAC on it, perhaps they'll have seen it [15:30:40] (03PS3) 10Dreamy Jazz: GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268587 (https://phabricator.wikimedia.org/T422220) [15:30:43] !log installing postgresql-15 security updates [15:30:44] (03CR) 10Dreamy Jazz: [C:03+2] GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268587 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:30:46] (03Merged) 10jenkins-bot: GlobalBlockLocalStatusLookup: Remove 
unused constructor param [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268584 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:31:12] !log ayounsi@cumin1003 conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=esams [reason: esams maintenance over] [15:31:34] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [15:31:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:31:51] RECOVERY - Bird Internet Routing Daemon on doh3005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:31:52] (03Merged) 10jenkins-bot: GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268585 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:31:55] RECOVERY - Bird Internet Routing Daemon on doh3006 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:31:56] sukhe: dnsbox and `sre.dns.admin` repooled [15:32:18] thanks! 
[15:32:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268560 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:32:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268586 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:32:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268587 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:32:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268518 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [15:32:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268519 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [15:32:27] haven't followed up with the backlog yet but will :) [15:32:59] looks good on the DNS boxes, requests coming in [15:33:24] (03Merged) 10jenkins-bot: Set $wgGlobalBlockingWikisWhereGlobalBlocksDoNotApply [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268560 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:33:33] (03PS1) 10Muehlenhoff: Add kmusiolek to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1268590 (https://phabricator.wikimedia.org/T420459) [15:33:43] (03Merged) 10jenkins-bot: GlobalBlockLocalStatusLookup: Remove unused constructor param [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268586 (https://phabricator.wikimedia.org/T422220) 
(owner: 10Dreamy Jazz) [15:33:44] This scap will be slower due to i18n changes, but needed because Special:CentralAuth is broken for users who have an account on any closed wiki [15:33:46] (03Merged) 10jenkins-bot: GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks [extensions/GlobalBlocking] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268587 (https://phabricator.wikimedia.org/T422220) (owner: 10Dreamy Jazz) [15:33:53] So may go into the puppet window [15:33:54] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11794645 (10MoritzMuehlenhoff) 05Stalled→03Open a:05katiamusiolekwmde→03MoritzMuehlenhoff [15:34:00] But there doesn't seem to be anything listed so should be fine [15:34:15] cgoubert@cumin1003 renumber-node (PID 3577324) is awaiting input [15:34:15] (03CR) 10Ayounsi: [C:03+1] "thx!" [homer/public] - 10https://gerrit.wikimedia.org/r/1268516 (owner: 10Majavah) [15:35:09] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:35:14] (03CR) 10Ayounsi: [C:04-1] "thx for the reviews, keeping it around just in case for the next esams maintenance." 
[dns] - 10https://gerrit.wikimedia.org/r/1268538 (https://phabricator.wikimedia.org/T416450) (owner: 10Ayounsi) [15:35:36] Dreamy_Jazz: yep, nothing planned today afaik for the puppet window [15:35:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:35:47] Thanks for confirming! [15:37:09] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:37:31] RESOLVED: [2x] Not accepting/receiving prefixes from anycast BGP peer: Device asw1-bw27-esams.mgmt.esams.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:37:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:38:04] (03CR) 10Muehlenhoff: [C:03+2] Add kmusiolek to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1268590 (https://phabricator.wikimedia.org/T420459) (owner: 10Muehlenhoff) [15:38:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Degraded RAID on ml-serve1001 - https://phabricator.wikimedia.org/T422382#11794682 (10wiki_willy) Thanks @Jclark-ctr. Hi @isarantopoulos - since we're looking to refresh this soon, do you still need us to purchase a replacement drive? Thanks, Willy... 
[15:41:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:42:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:42:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is c93fecee2fcc8420e509eebc224f4336347b5a34, dns.git is 24bbe785f6d1c77f29b5ddbabc22f526960b2bb1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:42:46] (03PS4) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [15:42:47] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is c93fecee2fcc8420e509eebc224f4336347b5a34, dns.git is 24bbe785f6d1c77f29b5ddbabc22f526960b2bb1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:42:50] huh ok [15:42:58] !log sukhe@dns1004 START - running authdns-update [15:43:29] (03Merged) 10jenkins-bot: Fix blockConnectedTempAccounts existence error [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268518 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [15:44:06] (03Merged) 10jenkins-bot: Fix blockConnectedTempAccounts existence error [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268519 (https://phabricator.wikimedia.org/T422388) (owner: 10STran) [15:44:08] !log sukhe@dns1004 END - running authdns-update [15:44:51] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1268560|Set 
$wgGlobalBlockingWikisWhereGlobalBlocksDoNotApply (T422220)]], [[gerrit:1268584|GlobalBlockLocalStatusLookup: Remove unused constructor param (T422220)]], [[gerrit:1268586|GlobalBlockLocalStatusLookup: Remove unused constructor param (T422220)]], [[gerrit:1268585|GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks (T42222 [15:44:51] 0)]], [[gerrit:1268587|GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks (T422220)]], [[gerrit:1268518|Fix blockConnectedTempAccounts existence error (T422388)]], [[gerrit:1268519|Fix blockConnectedTempAccounts existence error (T422388)]] [15:44:55] T422220: Access of global_block_whitelist table on closed wikis through Special:CentralAuth causes exceptions - https://phabricator.wikimedia.org/T422220 [15:44:56] T42222: Corrected logo for Sanskrit Wikipedia - https://phabricator.wikimedia.org/T42222 [15:44:56] T422388: PHP Warning: Undefined array key "blockConnectedTempAccounts" - https://phabricator.wikimedia.org/T422388 [15:45:10] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1273.eqiad.wmnet [15:45:11] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1273.eqiad.wmnet [15:45:11] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker1273.eqiad.wmnet [15:45:21] That moment when the scap message is too long for IRC :D [15:45:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:45:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:45:45] PROBLEM - PyBal backends health check on 
lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:46:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:47:56] Dreamy_Jazz: poor T42222 [15:48:38] ikr :D [15:53:34] (03PS7) 10Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [15:53:34] (03PS5) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [15:55:21] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:55:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:55:51] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525 (10cmooney) 03NEW p:05Triage→03Medium [15:56:01] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11794809 (10cmooney) [15:56:16] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:56:53] (03CR) 10CI reject: [V:04-1] tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [15:59:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11794832 (10MoritzMuehlenhoff) 05Open→03Resolved @katiamusiolekwmde I've added you 
to the cn=nda LDAP group and the analytics-privatedata-us... [15:59:29] (03PS6) 10Elukey: tox: rework venvs to speed up local and CI timings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) [16:00:04] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1600) [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:03:35] !log dreamyjazz@deploy1003 stran, dreamyjazz: Backport for [[gerrit:1268560|Set $wgGlobalBlockingWikisWhereGlobalBlocksDoNotApply (T422220)]], [[gerrit:1268584|GlobalBlockLocalStatusLookup: Remove unused constructor param (T422220)]], [[gerrit:1268586|GlobalBlockLocalStatusLookup: Remove unused constructor param (T422220)]], [[gerrit:1268585|GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks (T422220)]], [16:03:35] [[gerrit:1268587|GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks (T422220)]], [[gerrit:1268518|Fix blockConnectedTempAccounts existence error (T422388)]], [[gerrit:1268519|Fix blockConnectedTempAccounts existence error (T422388)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:03:38] T422220: Access of global_block_whitelist table on closed wikis through Special:CentralAuth causes exceptions - https://phabricator.wikimedia.org/T422220 [16:03:38] T422388: PHP Warning: Undefined array key "blockConnectedTempAccounts" - https://phabricator.wikimedia.org/T422388 [16:03:45] at least this time it truncated at a space ^^ [16:03:49] s/truncated/split/ [16:05:22] :D [16:05:25] Testing... 
[16:07:50] !log dreamyjazz@deploy1003 stran, dreamyjazz: Continuing with sync [16:07:54] Proceeding [16:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:58] (03CR) 10Dzahn: [C:03+2] "Yes, I verified that Java 21 is selected as the active alternative, the version being used and behind the java binary." [puppet] - 10https://gerrit.wikimedia.org/r/1267301 (owner: 10Dzahn) [16:20:04] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268560|Set $wgGlobalBlockingWikisWhereGlobalBlocksDoNotApply (T422220)]], [[gerrit:1268584|GlobalBlockLocalStatusLookup: Remove unused constructor param (T422220)]], [[gerrit:1268586|GlobalBlockLocalStatusLookup: Remove unused constructor param (T422220)]], [[gerrit:1268585|GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks (T4222 [16:20:04] 20)]], [[gerrit:1268587|GlobalBlockLocalStatusLookup: Support wikis that don't apply blocks (T422220)]], [[gerrit:1268518|Fix blockConnectedTempAccounts existence error (T422388)]], [[gerrit:1268519|Fix blockConnectedTempAccounts existence error (T422388)]] (duration: 35m 13s) [16:20:09] T422220: Access of global_block_whitelist table on closed wikis through Special:CentralAuth causes exceptions - https://phabricator.wikimedia.org/T422220 [16:20:10] T4222: Unexpected search results with AND and OR - https://phabricator.wikimedia.org/T4222 [16:20:10] T422388: PHP Warning: Undefined array key "blockConnectedTempAccounts" - https://phabricator.wikimedia.org/T422388 [16:21:00] (03CR) 10Dzahn: [C:03+1] "Please let patch owners decide when to abandon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [16:21:40] I'm done with scap [16:22:04] !log UTC afternoon backport+config window (belatedly) done [16:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:06] just for fun [16:22:08] thanks Dreamy_Jazz! [16:22:21] :D, np [16:22:31] kipfel: please reschedule your zhwiki config change, sorry it didn’t work out today [16:22:56] (ditto HouseOfM but she’s no longer online) [16:27:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:27:10] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:27:48] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:27:51] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:28:04] (03CR) 10Dzahn: [C:03+1] "thanks! 
that was also an oversight on my part back in early January or so, I think" [puppet] - 10https://gerrit.wikimedia.org/r/1268564 (owner: 10Arnaudb) [16:28:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:28:57] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:29:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:29:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:30:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:30:09] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:30:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:30:53] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:31:26] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:45] 
PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:37:42] (03CR) 10Dzahn: [C:03+1] "yea, seems like we have to just test it" [puppet] - 10https://gerrit.wikimedia.org/r/1268557 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [16:37:43] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2015 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:37:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:38:32] (03CR) 10JHathaway: "In my testing on sretest1005, the whitespace change will result in a full config reload, `ExecReload=/usr/sbin/nft -f /etc/nftables/main.n" [puppet] - 10https://gerrit.wikimedia.org/r/1261497 (owner: 10JHathaway) [16:39:09] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:39:39] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:39:44] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [16:40:39] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:41:10] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [16:47:16] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11795258 (10cmooney) [16:49:01] (03CR) 
10Ssingh: [C:03+1] "Yeah it's a fair question. I ignored my own question and gave a +1 because we haven't done Probenet for a lot of stuff in geo-maps already" [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [16:49:19] jouncebot: nowandnext [16:49:19] For the next 0 hour(s) and 10 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1600) [16:49:19] In 0 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1700) [16:49:26] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11795284 (10cmooney) [16:50:25] FIRING: [2x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:50:39] (03CR) 10Volans: tox: rework venvs to speed up local and CI timings (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267678 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [16:55:25] FIRING: [2x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: 10Reedy) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1700) 
[17:00:42] (03PS1) 10C. Scott Ananian: Ensure RevisionOutputCache uses post-processing options where appropriate [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268600 (https://phabricator.wikimedia.org/T421629) [17:00:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268600 (https://phabricator.wikimedia.org/T421629) (owner: 10C. Scott Ananian) [17:02:27] (03CR) 10C. Scott Ananian: [C:03+1] "I'd like to wait until" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265467 (https://phabricator.wikimedia.org/T376183) (owner: 10Isabelle Hurbain-Palatin) [17:02:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265467 (https://phabricator.wikimedia.org/T376183) (owner: 10Isabelle Hurbain-Palatin) [17:04:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265468 (owner: 10Isabelle Hurbain-Palatin) [17:05:25] (03PS2) 10Daniel Kinzler: rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 [17:07:32] (03PS1) 10Bking: Revert "dse-k8s: reduce readahead script timer frequency" [puppet] - 10https://gerrit.wikimedia.org/r/1268602 [17:07:38] (03CR) 10Bking: [V:03+2 C:03+2] Revert "dse-k8s: reduce readahead script timer frequency" [puppet] - 10https://gerrit.wikimedia.org/r/1268602 (owner: 10Bking) [17:08:23] (03CR) 10CDobbins: "Yep, I'll take care of it." 
[dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [17:09:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) (owner: 10Pppery) [17:12:16] (03CR) 10CI reject: [V:04-1] Ensure RevisionOutputCache uses post-processing options where appropriate [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268600 (https://phabricator.wikimedia.org/T421629) (owner: 10C. Scott Ananian) [17:17:34] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for migurski-wme - https://phabricator.wikimedia.org/T422537 (10Migurski-WME) 03NEW [17:18:06] (03PS5) 10Majavah: nftables: Fix issues around virtual resource dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1260721 [17:18:06] (03PS4) 10Majavah: P:base: Make nftables::set resources always defined [puppet] - 10https://gerrit.wikimedia.org/r/1266205 [17:18:06] (03PS17) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [17:18:07] (03PS17) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [17:18:46] (03CR) 10Majavah: nftables: Fix issues around virtual resource dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260721 (owner: 10Majavah) [17:19:54] (03CR) 10Majavah: [C:03+2] nftables: Fix issues around virtual resource dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1260721 (owner: 10Majavah) [17:20:19] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for migurski-wme - https://phabricator.wikimedia.org/T422537#11795606 (10Migurski-WME) [17:23:25] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: 
backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11795620 (10Jhancock.wm) @jcrespo would loading the disks from a foreign config be acceptable for you? or will that cause issues with recovery? [17:48:02] jouncebot: nowandnext [17:48:02] For the next 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1700) [17:48:02] In 0 hour(s) and 11 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1800) [17:48:21] (03PS1) 10Dreamy Jazz: ClientHints: Don't collect header only on null edit [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268612 (https://phabricator.wikimedia.org/T418989) [17:48:31] Want to backport a fix that is a train blocker [17:48:37] Anyone using the current window? [17:49:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268612 (https://phabricator.wikimedia.org/T418989) (owner: 10Dreamy Jazz) [18:00:05] dancy and jnuche: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T1800) [18:00:38] Hello, currently using scap to fix a train blocker (CU data not being stored correctly) [18:00:57] (03Merged) 10jenkins-bot: ClientHints: Don't collect header only on null edit [extensions/CheckUser] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268612 (https://phabricator.wikimedia.org/T418989) (owner: 10Dreamy Jazz) [18:01:09] Can ping when done if wanted [18:01:26] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1268612|ClientHints: Don't collect header only on null edit (T418989)]] [18:01:29] T418989: CheckUser: Store x-is-browser, x-ja3n and x-ja4h CDN header values - https://phabricator.wikimedia.org/T418989 [18:03:51] (03CR) 10Dillon: [C:03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) (owner: 10Kgraessle) [18:05:00] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1268612|ClientHints: Don't collect header only on null edit (T418989)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:05:13] (03CR) 10Cathal Mooney: "Thanks Chris, if you could also do it for Reunion Island (RE) that would be appreciated." [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [18:06:05] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [18:06:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [18:07:23] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [18:07:27] o/ [18:07:40] dreamyjazz: ok. 
Lemme know when you're clear [18:08:04] Sure will ping [18:09:40] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for migurski-wme - https://phabricator.wikimedia.org/T422537#11795835 (10Aklapper) Hi, currently the Phabricator account @Migurski-WME is linked to some personal, self-created MediaWiki (SUL) account. So there is **risk of impersonation** in t... [18:13:41] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268612|ClientHints: Don't collect header only on null edit (T418989)]] (duration: 12m 14s) [18:13:42] (03PS1) 10Eevans: cassandra: promote 4.1.11 to '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1268621 (https://phabricator.wikimedia.org/T418417) [18:13:44] T418989: CheckUser: Store x-is-browser, x-ja3n and x-ja4h CDN header values - https://phabricator.wikimedia.org/T418989 [18:13:53] dancy: I'm done [18:14:11] Thanks! Pressing the train button! [18:14:34] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1268621 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [18:16:27] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268622 (https://phabricator.wikimedia.org/T420481) [18:16:30] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268622 (https://phabricator.wikimedia.org/T420481) (owner: 10TrainBranchBot) [18:18:05] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268622 (https://phabricator.wikimedia.org/T420481) (owner: 10TrainBranchBot) [18:21:02] (03PS1) 10Kgraessle: Remove Navigation Menu Link Instrumentation on Personal Dashboard [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268625 (https://phabricator.wikimedia.org/T422512) [18:21:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268625 (https://phabricator.wikimedia.org/T422512) (owner: 10Kgraessle) [18:23:48] (03CR) 10Dillon: [C:03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [18:24:01] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.23 refs T420481 [18:24:04] T420481: 1.46.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T420481 [18:30:15] (03PS1) 10Dzahn: zuul: move creation of full chain from main to base [puppet] - 10https://gerrit.wikimedia.org/r/1268627 [18:30:48] (03CR) 10CI reject: [V:04-1] zuul: move creation of full chain from main to base [puppet] - 10https://gerrit.wikimedia.org/r/1268627 (owner: 10Dzahn) [18:30:53] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11795910 (10Jclark-ctr) a:03Jclark-ctr [18:43:32] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1080.eqiad.wmnet with OS bullseye [18:44:04] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1080 [18:44:09] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:45:26] (03PS1) 10Andrew Bogott: Remove references to cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1268632 (https://phabricator.wikimedia.org/T422437) [18:45:32] (03PS1) 10Bartosz Dziewoński: ForeignWikiRequest: Pass session to internal 'centralauthtoken' request [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268633 (https://phabricator.wikimedia.org/T422218) [18:45:41] (03PS1) 10Bartosz Dziewoński: ForeignWikiRequest: Pass session to internal 'centralauthtoken' request [extensions/Echo] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268634 
(https://phabricator.wikimedia.org/T422218) [18:45:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:45:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268633 (https://phabricator.wikimedia.org/T422218) (owner: 10Bartosz Dziewoński) [18:46:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Echo] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268634 (https://phabricator.wikimedia.org/T422218) (owner: 10Bartosz Dziewoński) [18:46:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:46:48] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for migurski-wme - https://phabricator.wikimedia.org/T422537#11796003 (10Migurski-WME) Thanks Aklapper! I’m under the impression I had already linked this account to Phabricator correctly. Do you mean that there are two separate Migurski-WME a... [18:47:09] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268600 (https://phabricator.wikimedia.org/T421629) (owner: 10C. 
Scott Ananian) [18:47:42] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1080 - bking@cumin2002" [18:47:47] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1080 - bking@cumin2002" [18:47:47] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:47:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1080.eqiad.wmnet 29.32.64.10.in-addr.arpa 9.2.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:47:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1080.eqiad.wmnet 29.32.64.10.in-addr.arpa 9.2.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:47:52] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1080 [18:47:56] (03CR) 10Andrew Bogott: [C:03+2] Remove references to cloudcephmon2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1268632 (https://phabricator.wikimedia.org/T422437) (owner: 10Andrew Bogott) [18:49:30] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1080 [18:49:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1080 [18:49:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:50:09] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, 
wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:51:07] (03CR) 10CDobbins: "As promised..." [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [18:51:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:52:09] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:52:49] jouncebot: next [18:52:50] In 1 hour(s) and 7 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T2000) [18:53:01] (03CR) 10Eevans: [C:03+2] cassandra: promote 4.1.11 to '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1268621 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [18:53:10] any deployers around? we seem to have a very full window coming up. maybe we could ship some patches early [18:54:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:55:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:02] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for migurski-wme - https://phabricator.wikimedia.org/T422537#11796020 (10Migurski-WME) Oh I see, WMF vs. 
WME. I have been able to log in and claim MMigurski-WMF, I’ll update this issue to reflect this one. [18:56:19] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796021 (10Migurski-WME) [19:00:00] (03PS1) 10C. Scott Ananian: Bump jawiki to 100% Parsoid Read Views (from 10%) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268635 (https://phabricator.wikimedia.org/T420273) [19:00:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:01:09] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:01:11] (03PS2) 10C. Scott Ananian: ParserMigration: transition to new configuration variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267439 (https://phabricator.wikimedia.org/T422543) [19:03:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:04:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1080.eqiad.wmnet with reason: host reimage [19:05:09] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:06:24] inflatador: ryankemper: ^ known? 
[19:07:03] * ryankemper looks [19:07:09] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:07:55] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1082.eqiad.wmnet with OS bullseye [19:08:21] sukhe: known in general. wdqs codfw gets slammed and we're essentially unable to keep up with volume. the only thing i can do is shrink the auto-restart frequency, let me run the numbers on if it would help. i think rn we're at a 5 minute frequency with a splay of 2 mins so it can take up to 7 mins for auto-restart once the backend is detected as deadlocked [19:08:27] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1082 [19:08:49] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:08:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1080.eqiad.wmnet with reason: host reimage [19:09:17] ryankemper: no worries! was just checking in the context of the pybal alerts [19:09:24] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796059 (10MMigurski-WMF) I reconnected my Phabricator account to the -WMF one. [19:09:34] sukhe I removed the bot protections yesterday, thinking they weren't doing much good. But I guess they were helping somewhat after all [19:10:10] eh, it's the same behavior we've seen for days. 
it's too hard to correlate it to adding/removing blocks [19:10:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:10:57] If there's a way to silence these alerts LMK, I think this is going to be the new normal for WDQS until we get rid of blazegraph [19:11:53] Yeah, if we could tell pybal to not care about wdqs backends that would be excellent. We've already hit the limits of our ability to mitigate. sukhe: is that level of granularity possible or is it all-or-nothing wrt the backend check alerts? [19:12:37] ryankemper: it's all or nothing unfortunately. and it's not a big deal as this alert is not paging, so it's just adding to the noise but not actually bothering anyone I guess [19:13:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267439 (https://phabricator.wikimedia.org/T422543) (owner: 10C. Scott Ananian) [19:13:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268635 (https://phabricator.wikimedia.org/T420273) (owner: 10C. Scott Ananian) [19:14:04] ack. i'll fiddle with the restart cooldown and thresholds a bit. i might be able to just barely get it where it's restoring health before the alert fires. 
it's just tricky since the jvm/blazegraph are gonna take like a minute to start up, so as restart cooldowns get shorter it gets a bit dicier [19:14:30] bking@cumin2002 reimage (PID 3731080) is awaiting input [19:16:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:16:53] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1082 - bking@cumin2002" [19:16:58] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1082 - bking@cumin2002" [19:16:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:59] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1082.eqiad.wmnet 167.32.64.10.in-addr.arpa 7.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:17:03] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1082.eqiad.wmnet 167.32.64.10.in-addr.arpa 7.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:17:03] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1082 [19:17:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:17:37] (03CR) 10Eevans: [C:03+2] aqs1023: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264800 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [19:19:48] (03CR) 10Arlolra: [C:03+1] Bump jawiki to 100% Parsoid Read Views (from 10%) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268635 (https://phabricator.wikimedia.org/T420273) (owner: 10C. 
Scott Ananian) [19:20:11] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:20:47] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:20:57] ^ working on a patch. i had retuned the numbers during the dc switchover, so i think the threshholds are a little suboptimal now. tightening them more aggressively [19:21:11] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:21:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:22:21] (03PS1) 10Ryan Kemper: wdqs: Reduce deadlock remediation cooldown and threshold [puppet] - 10https://gerrit.wikimedia.org/r/1268640 (https://phabricator.wikimedia.org/T242453) [19:22:55] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1082 [19:22:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1082 [19:23:34] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] "deploying to address outage" [puppet] - 10https://gerrit.wikimedia.org/r/1268640 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [19:26:23] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:26:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1080.eqiad.wmnet with OS bullseye [19:27:21] 
PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:27:34] !log eevans@cumin1003 START - Cookbook sre.dns.netbox [19:31:27] any deployers around? we seem to have a very full window coming up. maybe we could ship some patches early [19:32:03] !log eevans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for aqs1023 - eevans@cumin1003" [19:32:09] !log eevans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for aqs1023 - eevans@cumin1003" [19:32:09] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:32:21] FIRING: [6x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:32:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [19:34:57] (03PS2) 10Eevans: aqs1024: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) [19:34:57] (03PS2) 10Eevans: aqs1025: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) [19:34:57] (03PS2) 10Eevans: aqs1026: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [19:34:58] (03PS2) 10Eevans: aqs1027: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [19:34:59] (03PS1) 10Eevans: aqs1023: add secondary IPs [puppet] - 
10https://gerrit.wikimedia.org/r/1268642 (https://phabricator.wikimedia.org/T412830) [19:37:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [19:38:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1082.eqiad.wmnet with reason: host reimage [19:40:04] (03CR) 10Eevans: [C:03+2] aqs1023: add secondary IPs [puppet] - 10https://gerrit.wikimedia.org/r/1268642 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [19:41:19] (03CR) 10Arlolra: [C:03+1] ParserMigration: transition to new configuration variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267439 (https://phabricator.wikimedia.org/T422543) (owner: 10C. Scott Ananian) [19:41:45] (03CR) 10Ssingh: [C:03+1] "Thanks Chris. Can we run it for Mayotte as well? YT country code." [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [19:42:21] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:43:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1082.eqiad.wmnet with reason: host reimage [19:48:30] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a26 [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268647 (https://phabricator.wikimedia.org/T251506) [19:49:11] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11796330 (10Jclark-ctr) [19:49:22] (03PS1) 10C. 
Scott Ananian: Bump wikimedia/parsoid to 0.23.0-a26 [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268648 (https://phabricator.wikimedia.org/T422394) [19:50:25] FIRING: [3x] ProbeDown: Service aqs1023-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:50:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268647 (https://phabricator.wikimedia.org/T251506) (owner: 10C. Scott Ananian) [19:50:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268648 (https://phabricator.wikimedia.org/T422394) (owner: 10C. 
Scott Ananian) [19:50:36] (03PS2) 10Dzahn: zuul: move creation of full chain from main to base [puppet] - 10https://gerrit.wikimedia.org/r/1268627 [19:51:11] (03CR) 10CI reject: [V:04-1] zuul: move creation of full chain from main to base [puppet] - 10https://gerrit.wikimedia.org/r/1268627 (owner: 10Dzahn) [19:52:21] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:55:25] FIRING: [4x] ProbeDown: Service aqs1023-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:26] (03PS3) 10Dzahn: zuul: move creation of full chain from main to base [puppet] - 10https://gerrit.wikimedia.org/r/1268627 [19:55:55] (03CR) 10CI reject: [V:04-1] zuul: move creation of full chain from main to base [puppet] - 10https://gerrit.wikimedia.org/r/1268627 (owner: 10Dzahn) [19:56:22] MatmaRex: we can combine our wmf.22 and wmf.23 backports perhaps? [19:56:35] cscott: sure. thanks [19:56:43] maybe throw katherine_g 's wmf.22 backport in there as well. [19:57:21] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:57:28] sure, ty! [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T2000). nyaa~ [20:00:05] hyang, Reedy, cscott, Pppery, katherine_g, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] here [20:00:16] here o/ [20:00:17] here [20:00:19] here [20:00:49] hyang, Reedy do you want to kick it off, and then I'll start by deploying all of the wmf.22 patches together (mine, MatmaRex, and katherine_g ) [20:00:58] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.23.0-a26 [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268648 (https://phabricator.wikimedia.org/T422394) (owner: 10C. Scott Ananian) [20:01:09] yes [20:01:34] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268648 (https://phabricator.wikimedia.org/T422394) (owner: 10C. Scott Ananian) [20:02:15] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host aqs1023.eqiad.wmnet [20:02:21] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:02:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1082.eqiad.wmnet with OS bullseye [20:04:30] (03PS1) 10Santiago Faci: PHP SDK: Handle experiment config missing or malformed [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268651 (https://phabricator.wikimedia.org/T422112) [20:05:00] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796382 (10Aklapper) No, you created a new Phabricator account instead. But that also works; I boldly disabled the Phabricator account @Migurski-WME to avoid more confusion. 
[20:06:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268651 (https://phabricator.wikimedia.org/T422112) (owner: 10Santiago Faci) [20:07:17] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1023.eqiad.wmnet [20:07:21] RESOLVED: [9x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:07:58] (03CR) 10Reedy: [C:03+2] Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: 10Reedy) [20:08:34] I just scheduled a patch to be deployed in this window. I'm not a deployer so I would need someone to deploy it. Could anyone do it? Thanks! [20:08:53] (03Merged) 10jenkins-bot: Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: 10Reedy) [20:09:27] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1267157|Undeploy Extension:StopForumSpam (T422185)]] [20:09:31] T422185: Undeploy StopForumSpam extension from Wikimedia production - https://phabricator.wikimedia.org/T422185 [20:09:47] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796416 (10Aklapper) > * The **username** of your existing LDAP account on https://idm.wikimedia.org or https://gerrit.wikimedia.org: > `MMigurski-WMF` I don't think that LDAP ac... [20:09:52] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796417 (10MMigurski-WMF) Thank you! I was looking for a way to disable it. 
[20:11:01] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796419 (10MMigurski-WMF) >> `MMigurski-WMF` > > I don't think that LDAP account exists? https://ldap.toolforge.org/user/migurski-wme exists though... Surprising, let’s see if I... [20:11:49] sfaci: maybe cscott could include that one as well when deploying [20:12:11] sfaci: you might want to backport to 1.46.0-wmf.23 as well, unless the bug only affects wmf.22 and not the latest version [20:12:23] see https://versions.toolforge.org/ for currently live versions [20:12:34] (03PS1) 10MusikAnimal: CommonSettings: use CodeMirror instead of CodeEditor in AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268652 (https://phabricator.wikimedia.org/T399673) [20:13:49] (03CR) 10Dzahn: [C:04-2] "back to moving all this code to a separate class that only sets up certs and does not have any relation to zookeeper. then do dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/1268627 (owner: 10Dzahn) [20:13:49] sfaci: in the top right hand three-dots drop down in gerrit is an "included in" option which can show you which mediawiki versions include a patch [20:14:26] !log eevans@cumin1003 START - Cookbook sre.dns.netbox [20:14:31] [PHP SDK: Handle experiment config missing or malformed (1268526) · Gerrit Code Review](https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1268526) is not included in wmf.23 (this week's deploy) [20:14:57] you might want to cherry-pick it *just* to wmf.23 and let it roll out normally with the train tomorrow/thursday? [20:16:38] cscott: oops! I didn't notice wmf.23 is already here. I'll cherry pick for that branch as well. 
Thanks [20:17:31] same, let me do that quick [20:17:44] (03PS1) 10Kgraessle: Remove Navigation Menu Link Instrumentation on Personal Dashboard [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268653 (https://phabricator.wikimedia.org/T422512) [20:17:49] well, i can do the backport to wmf.22 while you are all getting wmf.23 ready :) [20:17:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268653 (https://phabricator.wikimedia.org/T422512) (owner: 10Kgraessle) [20:17:57] (just waiting for reedy to finish, i think) [20:18:03] (03PS1) 10Santiago Faci: PHP SDK: Handle experiment config missing or malformed [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268654 (https://phabricator.wikimedia.org/T422112) [20:18:13] cscott: Cool! Thanks! 
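Editor's aside on the "included in" check discussed above: Gerrit also exposes this through its REST API as `GET /changes/{change-id}/in`, which returns a JSON object with `branches` and `tags`, preceded by the `)]}'` guard line. The following is a minimal sketch of reading such a response, not the tooling anyone in the channel used. The helper names `parse_included_in` and `wmf_branches` and the sample payload are hypothetical, and no network request is made.

```python
import json

# Gerrit prefixes JSON responses with this guard line to prevent XSSI.
GERRIT_XSSI_PREFIX = ")]}'"


def parse_included_in(raw: str) -> dict:
    """Strip the guard prefix (if present) and decode the JSON body."""
    body = raw.lstrip()
    if body.startswith(GERRIT_XSSI_PREFIX):
        body = body[len(GERRIT_XSSI_PREFIX):]
    return json.loads(body)


def wmf_branches(info: dict) -> list:
    """Return the wmf/* deployment branches that already contain the change."""
    return sorted(b for b in info.get("branches", []) if b.startswith("wmf/"))


# Hypothetical response body, shaped like Gerrit's IncludedInInfo entity.
sample = GERRIT_XSSI_PREFIX + '\n{"branches": ["master", "wmf/1.46.0-wmf.22"], "tags": []}'

info = parse_included_in(sample)
print(wmf_branches(info))
```

If the branch for this week's train is missing from the returned list, the change would still need a cherry-pick to that branch, which is the situation described in the conversation.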
[20:18:14] (03PS3) 10Eevans: aqs1024: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) [20:18:14] (03PS3) 10Eevans: aqs1025: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) [20:18:14] (03PS3) 10Eevans: aqs1026: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [20:18:15] (03PS3) 10Eevans: aqs1027: assign aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [20:18:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268654 (https://phabricator.wikimedia.org/T422112) (owner: 10Santiago Faci) [20:18:43] cherry picked for both wmf.22 and wmf.23 ty for reminder [20:18:51] cscott: cherry pick for wmf.23 is already scheduled. Just waiting for the pipeline to finish [20:19:22] !log eevans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for aqs1024 - eevans@cumin1003" [20:19:27] !log eevans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for aqs1024 - eevans@cumin1003" [20:19:27] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:22:28] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796471 (10MMigurski-WMF) My onboarder has suggested that Developer accounts don’t traditionally include the "-WMF" extension, so I’m instead creating one called "MMigurski" and wa... [20:25:17] I'm not sure if my shell has frozen or... 
[20:25:30] yeah i was going to say it seemed to be taking a long time to sync [20:25:48] I think deploying/undeploying an extension requires an i18n update so is slower than you think [20:25:49] Took 11 mins just to build the containers [20:25:57] Oh, ffs yeah [20:26:03] Removing it from extension-list will [20:26:09] oh, yeah, if there's a i18n update we'll be waiting for a while. [20:26:14] 20:26:12 K8s deployment progress: 66% (ok: 8; fail: 0; left: 4) / [20:26:16] it's moving again [20:26:46] * cscott waits patiently [20:26:51] !log eevans@cumin1003 START - Cookbook sre.dns.netbox [20:27:40] i'm still here too [20:28:07] hyang: i can throw your config patch in with a set of others if that would help speed things up [20:28:08] (03PS4) 10Eevans: aqs1025: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) [20:28:08] (03PS4) 10Eevans: aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [20:28:08] (03PS4) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [20:28:15] !log reedy@deploy1003 reedy: Backport for [[gerrit:1267157|Undeploy Extension:StopForumSpam (T422185)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:28:18] T422185: Undeploy StopForumSpam extension from Wikimedia production - https://phabricator.wikimedia.org/T422185 [20:28:25] can just go straight out anyway [20:28:38] !log reedy@deploy1003 reedy: Continuing with sync [20:28:40] yes please cscott [20:29:05] Pppery: does your config patch need its own deploy, or can i group it with others? 
[20:29:17] I think it can be grouped with others [20:30:27] ok, i'm going to do the wmf.22 backports, since i've got them queued up, then the next batch will be hyang's config patch (1264856) Pppery's config patch (1268225) and one of mine (1268635). [20:30:34] !log eevans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for aqs hosts - eevans@cumin1003" [20:30:40] !log eevans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for aqs hosts - eevans@cumin1003" [20:30:41] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:30:48] then i'll do all the wmf.23 backports, which should be straightforward if the wmf.22 ones didn't cause problems. [20:31:17] and if there's time remaining i'll do my remaining config patches, which i might have to test individually. [20:32:01] it doesn't seem like anything involves i18n so all the remaining syncs should be faster than reedy's :) [20:33:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T421939) (owner: 10LorenMora) [20:33:41] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1023.eqiad.wmnet with reason: Bootstrapping — T412830 [20:33:44] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [20:34:02] (actually, i'm going to do hyang and Pppery's config patches first, since they should be fast) [20:35:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:36:16] Pppery: your permission isn't present in wmf.22, are you sure it's safe to deploy before the train runs to completion? [20:37:08] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796503 (10Aklapper) Sounds fine, however that would be your third developer account, assuming https://ldap.toolforge.org/user/migurski is also you? Not sure that's needed. [20:37:23] It would result in broken strings on Special:ListGroupRights I guess, but shouldn't cause any other harm [20:38:13] But you're probably right that I should have scheduled this later; I was trying to thread a needle with the fact that core gives this to all users but WMF config should only give it to autoconfirmed users [20:38:31] so I wanted that part done ASAP (and unfortunately I won't be able to make the Thursday UTC late window) [20:39:05] yeah i was looking at that as well. i think it makes sense to do now, just test it well to make sure it doesn't cause any issues other than the broken i18n on Special:ListGroupRights. [20:39:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:39:52] it's on sync-apaches, so should be done momentarily [20:39:59] i probably would have sequenced these patches slightly differently, temporarily giving the right to autoconfirmed in core, letting that ride the train, then doing the config change and patching core to move the permission from autoconfirmed to user. 
[20:40:26] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:40:44] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267157|Undeploy Extension:StopForumSpam (T422185)]] (duration: 31m 17s) [20:40:46] You have a point there; that's the sort of thing one learns through experience (and I don't have any experience doing something like this before) [20:40:47] T422185: Undeploy StopForumSpam extension from Wikimedia production - https://phabricator.wikimedia.org/T422185 [20:40:49] (03PS4) 10Eevans: aqs1024: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) [20:40:49] (03PS5) 10Eevans: aqs1025: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) [20:40:49] (03PS5) 10Eevans: aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [20:40:49] (03PS5) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [20:40:49] cscott: All yours [20:40:50] (03PS1) 10Eevans: aqs1023: configure data file directories [puppet] - 10https://gerrit.wikimedia.org/r/1268660 (https://phabricator.wikimedia.org/T412830) [20:40:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:41:00] jouncebot: next [20:41:00] In 0 hour(s) and 18 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T2100) [20:41:10] You can probably overrun... 
[20:41:13] ok, doing the config patches, hopefully fast [20:41:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268635 (https://phabricator.wikimedia.org/T420273) (owner: 10C. Scott Ananian) [20:41:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [20:41:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) (owner: 10Pppery) [20:42:41] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:43:28] (03Merged) 10jenkins-bot: Bump jawiki to 100% Parsoid Read Views (from 10%) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268635 (https://phabricator.wikimedia.org/T420273) (owner: 10C. 
Scott Ananian) [20:43:31] (03Merged) 10jenkins-bot: REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [20:43:35] (03Merged) 10jenkins-bot: Move createwithcontentmodel to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268225 (https://phabricator.wikimedia.org/T248294) (owner: 10Pppery) [20:44:01] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1268635|Bump jawiki to 100% Parsoid Read Views (from 10%) (T420273)]], [[gerrit:1264856|REST: Publish ReadingLists v0 module in REST Sandbox (T419619)]], [[gerrit:1268225|Move createwithcontentmodel to autoconfirmed (T248294)]] [20:44:08] T420273: Parsoid Read Views to deploy ~2026-03-19 - https://phabricator.wikimedia.org/T420273 [20:44:08] T419619: REST: publish ReadingLists v0 module in the REST Sandbox - https://phabricator.wikimedia.org/T419619 [20:44:09] T248294: Separate permission for creating a page with a custom content model - https://phabricator.wikimedia.org/T248294 [20:44:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:44:47] " 0 languages rebuilt out of 546" (the magic words that mean "this will be fast") [20:46:49] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:46:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:47:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [20:47:20] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:47:49] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2011 is OK: TCP OK - 0.001 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:47:51] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:48:01] !log cscott@deploy1003 cscott, pppery, kineticpelagic: Backport for [[gerrit:1268635|Bump jawiki to 100% Parsoid Read Views (from 10%) (T420273)]], [[gerrit:1264856|REST: Publish ReadingLists v0 module in REST Sandbox (T419619)]], [[gerrit:1268225|Move createwithcontentmodel to autoconfirmed (T248294)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:48:07] hyang, Pppery: ok, config is on test servers, please test [20:48:15] Does anyone need a refresher on https://wikitech.wikimedia.org/wiki/WikimediaDebug ? [20:48:16] looking [20:48:31] looking [20:48:51] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2015 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:49:21] confirmed [20:49:26] Okay Special:ListGroupRights updated as I expected it to (aside from the broken i18n appearing in a slightly different way than I thought it would) [20:49:28] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796558 (10MMigurski-WMF) Ah, I had forgotten I had that one during today’s account confusion, it’s quite old. I was able to log in and it’s a fine one to use. I’m updating this is... 
[20:49:53] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11796561 (10MMigurski-WMF) [20:50:55] (i apparently need a refresher on the WikimediaDebug extension, as I couldn't figure out why my config change didn't seem to have had an effect, then I realized I hadn't turned on WikimediaDebug. sigh.) [20:51:10] ok, continuing sync [20:51:16] !log cscott@deploy1003 cscott, pppery, kineticpelagic: Continuing with sync [20:51:19] vriley@cumin1003 provision (PID 3623028) is awaiting input [20:53:19] Pppery: https://test.wikipedia.org/wiki/Special:ListGroupRights looks fine (wmf.23) [20:53:34] Yeah, I had tested the same thing on MediaWiki.org [20:53:43] (03PS2) 10Eevans: aqs1023: configure data file directories [puppet] - 10https://gerrit.wikimedia.org/r/1268660 (https://phabricator.wikimedia.org/T412830) [20:53:43] (03PS5) 10Eevans: aqs1024: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264801 (https://phabricator.wikimedia.org/T412830) [20:53:43] (03PS6) 10Eevans: aqs1025: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264802 (https://phabricator.wikimedia.org/T412830) [20:53:43] (03PS6) 10Eevans: aqs1026: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264803 (https://phabricator.wikimedia.org/T412830) [20:53:44] (03PS6) 10Eevans: aqs1027: assign aqs role & configure [puppet] - 10https://gerrit.wikimedia.org/r/1264804 (https://phabricator.wikimedia.org/T412830) [20:54:52] And it's hard to tell but I think Special:ChangeContentModel is working as I expected it to [20:55:10] (03CR) 10Eevans: [C:03+2] aqs1023: configure data file directories [puppet] - 10https://gerrit.wikimedia.org/r/1268660 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [20:55:25] FIRING: [2x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:57:06] Pppery: i wrote up https://phabricator.wikimedia.org/T420481#11796588 just to record this info for the train drivers. feel free to update/amend it if I've gotten anything wrong. [20:57:28] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268635|Bump jawiki to 100% Parsoid Read Views (from 10%) (T420273)]], [[gerrit:1264856|REST: Publish ReadingLists v0 module in REST Sandbox (T419619)]], [[gerrit:1268225|Move createwithcontentmodel to autoconfirmed (T248294)]] (duration: 13m 27s) [20:57:33] T420273: Parsoid Read Views to deploy ~2026-03-19 - https://phabricator.wikimedia.org/T420273 [20:57:34] T419619: REST: publish ReadingLists v0 module in the REST Sandbox - https://phabricator.wikimedia.org/T419619 [20:57:34] T248294: Separate permission for creating a page with a custom content model - https://phabricator.wikimedia.org/T248294 [20:57:41] ok, now the wmf.22 patches [20:57:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:57:52] Remember the Web Team window is starting soon [20:57:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268600 (https://phabricator.wikimedia.org/T421629) (owner: 10C. 
Scott Ananian) [20:57:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268625 (https://phabricator.wikimedia.org/T422512) (owner: 10Kgraessle) [20:57:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268633 (https://phabricator.wikimedia.org/T422218) (owner: 10Bartosz Dziewoński) [20:57:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268651 (https://phabricator.wikimedia.org/T422112) (owner: 10Santiago Faci) [20:58:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260407T2100) [21:00:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1083.eqiad.wmnet with OS bullseye [21:00:55] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1083 [21:01:07] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:01:53] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:02:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:02:36] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy 
FORCED [21:02:48] (03Merged) 10jenkins-bot: Ensure RevisionOutputCache uses post-processing options where appropriate [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268600 (https://phabricator.wikimedia.org/T421629) (owner: 10C. Scott Ananian) [21:02:53] (03Merged) 10jenkins-bot: Remove Navigation Menu Link Instrumentation on Personal Dashboard [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268625 (https://phabricator.wikimedia.org/T422512) (owner: 10Kgraessle) [21:03:49] (03CR) 10Bartosz Dziewoński: rest gateway: prevent abuse of exempt api modules (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (owner: 10Daniel Kinzler) [21:03:53] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:05:10] !log [WDQS] codfw is getting slammed hard enough that hosts are falling immediately back into deadlock post-restart and largely failing to report metrics. 
not much we can do atm, there will be some noise [21:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:25] FIRING: [4x] ProbeDown: Service aqs1023-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:47] bking@cumin2002 reimage (PID 3808066) is awaiting input [21:08:30] (03Merged) 10jenkins-bot: ForeignWikiRequest: Pass session to internal 'centralauthtoken' request [extensions/Echo] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268633 (https://phabricator.wikimedia.org/T422218) (owner: 10Bartosz Dziewoński) [21:08:32] (03Merged) 10jenkins-bot: PHP SDK: Handle experiment config missing or malformed [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1268651 (https://phabricator.wikimedia.org/T422112) (owner: 10Santiago Faci) [21:08:53] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:08:58] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1058.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:09:01] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1268600|Ensure RevisionOutputCache uses post-processing options where appropriate (T421629)]], [[gerrit:1268625|Remove Navigation Menu Link Instrumentation on Personal Dashboard (T422512)]], [[gerrit:1268633|ForeignWikiRequest: Pass session to internal 'centralauthtoken' request (T422218)]], [[gerrit:1268651|PHP SDK: Handle experiment config missing or [21:09:02] malformed (T422112)]] [21:09:08] T421629: TOC missing with Parsoid on some wikis (except for Vector 2022) - https://phabricator.wikimedia.org/T421629 [21:09:08] T422512: Remove Navigation Menu Link 
Instrumentation - https://phabricator.wikimedia.org/T422512 [21:09:09] T422218: Marking cross-wiki notifications as read doesn't work - https://phabricator.wikimedia.org/T422218 [21:09:09] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [21:10:51] !log cscott@deploy1003 cscott, kgraessle, sfaci, matmarex: Backport for [[gerrit:1268600|Ensure RevisionOutputCache uses post-processing options where appropriate (T421629)]], [[gerrit:1268625|Remove Navigation Menu Link Instrumentation on Personal Dashboard (T422512)]], [[gerrit:1268633|ForeignWikiRequest: Pass session to internal 'centralauthtoken' request (T422218)]], [[gerrit:1268651|PHP SDK: Handle experiment config [21:10:51] missing or malformed (T422112)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:10:57] MatmaRex, sfaci, katherine_g can you test? [21:11:32] cscott: yup. my change looks good on mwdebug [21:11:42] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1083 - bking@cumin2002" [21:11:47] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch1083 - bking@cumin2002" [21:11:48] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:11:48] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1083.eqiad.wmnet 168.32.64.10.in-addr.arpa 8.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:11:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1083.eqiad.wmnet 168.32.64.10.in-addr.arpa 8.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:11:52] !log bking@cumin2002 START - 
Cookbook sre.network.configure-switch-interfaces for host cirrussearch1083 [21:12:19] cscott: my changes look good [21:12:23] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1083 [21:12:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1083 [21:12:47] cscott: all good here [21:13:36] my backport looks good too [21:13:38] continuing sync [21:13:42] !log cscott@deploy1003 cscott, kgraessle, sfaci, matmarex: Continuing with sync [21:13:57] vriley@cumin1003 provision (PID 3625886) is awaiting input [21:15:25] FIRING: [2x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:35] (03CR) 10Bartosz Dziewoński: rest gateway: prevent abuse of exempt api modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (owner: 10Daniel Kinzler) [21:17:21] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:17:22] is anyone from web team here? we're running over, but i'm going to continue so long as web team doesn't need the window. 
[21:17:54] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268600|Ensure RevisionOutputCache uses post-processing options where appropriate (T421629)]], [[gerrit:1268625|Remove Navigation Menu Link Instrumentation on Personal Dashboard (T422512)]], [[gerrit:1268633|ForeignWikiRequest: Pass session to internal 'centralauthtoken' request (T422218)]], [[gerrit:1268651|PHP SDK: Handle experiment config missing or [21:17:54] malformed (T422112)]] (duration: 08m 52s) [21:18:00] T421629: TOC missing with Parsoid on some wikis (except for Vector 2022) - https://phabricator.wikimedia.org/T421629 [21:18:00] T422512: Remove Navigation Menu Link Instrumentation - https://phabricator.wikimedia.org/T422512 [21:18:01] T422218: Marking cross-wiki notifications as read doesn't work - https://phabricator.wikimedia.org/T422218 [21:18:01] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [21:18:45] ok, going to move on to the wmf.23 backports [21:19:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268654 (https://phabricator.wikimedia.org/T422112) (owner: 10Santiago Faci) [21:19:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Echo] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268634 (https://phabricator.wikimedia.org/T422218) (owner: 10Bartosz Dziewoński) [21:19:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268653 (https://phabricator.wikimedia.org/T422512) (owner: 10Kgraessle) [21:19:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268648 (https://phabricator.wikimedia.org/T422394) (owner: 10C. 
Scott Ananian) [21:19:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268647 (https://phabricator.wikimedia.org/T251506) (owner: 10C. Scott Ananian) [21:21:18] (03Merged) 10jenkins-bot: PHP SDK: Handle experiment config missing or malformed [extensions/TestKitchen] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268654 (https://phabricator.wikimedia.org/T422112) (owner: 10Santiago Faci) [21:21:37] (03Merged) 10jenkins-bot: ForeignWikiRequest: Pass session to internal 'centralauthtoken' request [extensions/Echo] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268634 (https://phabricator.wikimedia.org/T422218) (owner: 10Bartosz Dziewoński) [21:23:30] (03Merged) 10jenkins-bot: Remove Navigation Menu Link Instrumentation on Personal Dashboard [extensions/WikimediaEvents] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268653 (https://phabricator.wikimedia.org/T422512) (owner: 10Kgraessle) [21:27:32] Reedy, bd808 : so this is fun: spiderpig is telling me that this deploy failed, but jenkins is still happily merging the patches. [21:27:47] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1083.eqiad.wmnet with reason: host reimage [21:28:18] I'm going to restart spiderpig on the same patches once zuul/jenkins is done merging everything, and i hope (?) that will skip the merge step and jump right to sync? [21:29:54] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a26 [vendor] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268647 (https://phabricator.wikimedia.org/T251506) (owner: 10C. Scott Ananian) [21:30:25] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a26 [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1268648 (https://phabricator.wikimedia.org/T422394) (owner: 10C. 
Scott Ananian) [21:31:00] (seems like that worked; i just used the handy 'retry this job' button once the merges completed) [21:31:06] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1268654|PHP SDK: Handle experiment config missing or malformed (T422112)]], [[gerrit:1268634|ForeignWikiRequest: Pass session to internal 'centralauthtoken' request (T422218)]], [[gerrit:1268653|Remove Navigation Menu Link Instrumentation on Personal Dashboard (T422512)]], [[gerrit:1268648|Bump wikimedia/parsoid to 0.23.0-a26 (T422394)]], [[gerrit:12686 [21:31:06] 47|Bump wikimedia/parsoid to 0.23.0-a26 (T251506 T422394)]] [21:31:16] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [21:31:16] T422218: Marking cross-wiki notifications as read doesn't work - https://phabricator.wikimedia.org/T422218 [21:31:17] T422512: Remove Navigation Menu Link Instrumentation - https://phabricator.wikimedia.org/T422512 [21:31:17] T422394: CTT tasks week of 2026-04-03 - https://phabricator.wikimedia.org/T422394 [21:31:17] T251506: Long headings result in long table of contents links that cause "PHP Warning: DOMDocument::saveHTML(): Memory allocation failed : escaping URI value" - https://phabricator.wikimedia.org/T251506 [21:32:55] !log cscott@deploy1003 matmarex, sfaci, cscott, kgraessle: Backport for [[gerrit:1268654|PHP SDK: Handle experiment config missing or malformed (T422112)]], [[gerrit:1268634|ForeignWikiRequest: Pass session to internal 'centralauthtoken' request (T422218)]], [[gerrit:1268653|Remove Navigation Menu Link Instrumentation on Personal Dashboard (T422512)]], [[gerrit:1268648|Bump wikimedia/parsoid to 0.23.0-a26 (T422394)]], [[g [21:32:55] errit:1268647|Bump wikimedia/parsoid to 0.23.0-a26 (T251506 T422394)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:33:11] MatmaRex, sfaci, katherine_g time to test, although you'll have to do so on a group0 wiki (mediawiki.org works) [21:33:34] looks good on wmf.23 too [21:33:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1083.eqiad.wmnet with reason: host reimage [21:34:30] cscott: my changes look good on wmf.23 [21:35:04] all good here! [21:35:05] sfaci? [21:35:10] ok, continuing [21:35:16] !log cscott@deploy1003 matmarex, sfaci, cscott, kgraessle: Continuing with sync [21:39:25] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268654|PHP SDK: Handle experiment config missing or malformed (T422112)]], [[gerrit:1268634|ForeignWikiRequest: Pass session to internal 'centralauthtoken' request (T422218)]], [[gerrit:1268653|Remove Navigation Menu Link Instrumentation on Personal Dashboard (T422512)]], [[gerrit:1268648|Bump wikimedia/parsoid to 0.23.0-a26 (T422394)]], [[gerrit:1268 [21:39:25] 647|Bump wikimedia/parsoid to 0.23.0-a26 (T251506 T422394)]] (duration: 08m 19s) [21:39:33] T422112: PHP Warning: Trying to access array offset on null - https://phabricator.wikimedia.org/T422112 [21:39:33] T422218: Marking cross-wiki notifications as read doesn't work - https://phabricator.wikimedia.org/T422218 [21:39:33] ok, now i'm by myself for two quick rounds of config patches. [21:39:34] T422512: Remove Navigation Menu Link Instrumentation - https://phabricator.wikimedia.org/T422512 [21:39:34] T422394: CTT tasks week of 2026-04-03 - https://phabricator.wikimedia.org/T422394 [21:39:34] T251506: Long headings result in long table of contents links that cause "PHP Warning: DOMDocument::saveHTML(): Memory allocation failed : escaping URI value" - https://phabricator.wikimedia.org/T251506 [21:40:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267439 (https://phabricator.wikimedia.org/T422543) (owner: 10C. 
Scott Ananian)
[21:40:39] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265467 (https://phabricator.wikimedia.org/T376183) (owner: Isabelle Hurbain-Palatin)
[21:41:52] thanks for deploying cscott :)
[21:42:09] MatmaRex: no problem
[21:42:19] cscott: Thank you very much!!!!!
[21:42:36] (Merged) jenkins-bot: ParserMigration: transition to new configuration variables [mediawiki-config] - https://gerrit.wikimedia.org/r/1267439 (https://phabricator.wikimedia.org/T422543) (owner: C. Scott Ananian)
[21:42:49] (Merged) jenkins-bot: Enable legacy post-processing cache for DiscussionTools [mediawiki-config] - https://gerrit.wikimedia.org/r/1265467 (https://phabricator.wikimedia.org/T376183) (owner: Isabelle Hurbain-Palatin)
[21:43:13] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1267439|ParserMigration: transition to new configuration variables (T422543)]], [[gerrit:1265467|Enable legacy post-processing cache for DiscussionTools (T376183)]]
[21:43:18] T422543: Deploy Parsoid Read Views to MobileFrontEnd readers on enwiki - https://phabricator.wikimedia.org/T422543
[21:43:18] T376183: Use postprocessing cache for Discussion Tools - https://phabricator.wikimedia.org/T376183
[21:45:03] !log cscott@deploy1003 ihurbain, cscott: Backport for [[gerrit:1267439|ParserMigration: transition to new configuration variables (T422543)]], [[gerrit:1265467|Enable legacy post-processing cache for DiscussionTools (T376183)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:46:40] !log cscott@deploy1003 ihurbain, cscott: Continuing with sync
[21:50:54] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267439|ParserMigration: transition to new configuration variables (T422543)]], [[gerrit:1265467|Enable legacy post-processing cache for DiscussionTools (T376183)]] (duration: 07m 40s)
[21:50:58] T422543: Deploy Parsoid Read Views to MobileFrontEnd readers on enwiki - https://phabricator.wikimedia.org/T422543
[21:50:58] T376183: Use postprocessing cache for Discussion Tools - https://phabricator.wikimedia.org/T376183
[21:51:51] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265468 (owner: Isabelle Hurbain-Palatin)
[21:52:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1083.eqiad.wmnet with OS bullseye
[21:53:03] (Merged) jenkins-bot: Actually enable parsoid postproc for all wikis (except enwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/1265468 (owner: Isabelle Hurbain-Palatin)
[21:53:28] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1265468|Actually enable parsoid postproc for all wikis (except enwiki)]]
[21:55:19] !log cscott@deploy1003 cscott, ihurbain: Backport for [[gerrit:1265468|Actually enable parsoid postproc for all wikis (except enwiki)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:57:23] !log cscott@deploy1003 cscott, ihurbain: Continuing with sync
[21:58:44] (PS1) Bking: bking: add some helpers to dotfiles [puppet] - https://gerrit.wikimedia.org/r/1268672
[21:58:47] SRE, Wikimedia-Mailing-lists: Close mailing list editing-team@lists.wikimedia.org - https://phabricator.wikimedia.org/T422562 (VPuffetMichel) NEW
[22:01:33] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265468|Actually enable parsoid postproc for all wikis (except enwiki)]] (duration: 08m 05s)
[22:02:02] ok, that's it for me, i'm done!
[22:02:07] web team (if you're out there) it's all yours.
[22:08:00] SRE, Wikimedia-Mailing-lists: Close mailing list editing-team@lists.wikimedia.org - https://phabricator.wikimedia.org/T422562#11796869 (VPuffetMichel)
[22:23:34] (PS1) C. Scott Ananian: Turn on Parsoid Read Views for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1268679 (https://phabricator.wikimedia.org/T422524)
[22:23:36] (PS1) C. Scott Ananian: Turn on Parsoid Read Views for dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1268680 (https://phabricator.wikimedia.org/T422524)
[22:24:55] (CR) C. Scott Ananian: [C:-2] "Shouldn't be deployed before wmf.23 rolls out to group 2, because of T421629." [mediawiki-config] - https://gerrit.wikimedia.org/r/1268680 (https://phabricator.wikimedia.org/T422524) (owner: C. Scott Ananian)
[22:26:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1268679 (https://phabricator.wikimedia.org/T422524) (owner: C.
Scott Ananian)
[22:28:33] (PS1) Cwhite: initial pki config for beta-logs env [puppet] - https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516)
[22:29:22] (PS1) Cwhite: add beta-logs pki key [labs/private] - https://gerrit.wikimedia.org/r/1268683 (https://phabricator.wikimedia.org/T350516)
[22:31:04] (PS2) Cwhite: initial pki config for beta-logs env [puppet] - https://gerrit.wikimedia.org/r/1268682 (https://phabricator.wikimedia.org/T350516)
[23:03:40] SRE, collaboration-services, Infrastructure-Foundations, Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11797030 (jeremyb) Open→Stalled
[23:39:38] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1268685
[23:39:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1268685 (owner: TrainBranchBot)
[23:49:58] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1268685 (owner: TrainBranchBot)