[00:21:47] (PS9) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:22:10] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:22:35] (CR) CI reject: [V:-1] designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:29:35] (PS10) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:30:05] (CR) CI reject: [V:-1] designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:31:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T419635)', diff saved to https://phabricator.wikimedia.org/P90850 and previous config saved to /var/cache/conftool/dbconfig/20260416-003130-fceratto.json
[00:31:34] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[00:36:04] (PS11) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:36:34] (CR) CI reject: [V:-1] designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:38:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:41] (PS12) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:39:12] (CR) CI reject: [V:-1] designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:39:50] (PS13) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:40:19] (CR) CI reject: [V:-1] designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:41:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P90851 and previous config saved to /var/cache/conftool/dbconfig/20260416-004138-fceratto.json
[00:43:19] (PS14) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:46:48] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:50:29] (PS15) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:51:00] (CR) CI reject: [V:-1] designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[00:51:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P90852 and previous config saved to /var/cache/conftool/dbconfig/20260416-005146-fceratto.json
[00:53:19] (PS16) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[00:57:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T410589)', diff saved to https://phabricator.wikimedia.org/P90853 and previous config saved to /var/cache/conftool/dbconfig/20260416-005706-ladsgroup.json
[00:57:10] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:00:18] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[01:00:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:01:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:01:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T419635)', diff saved to https://phabricator.wikimedia.org/P90854 and previous config saved to /var/cache/conftool/dbconfig/20260416-010154-fceratto.json
[01:01:58] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[01:02:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[01:02:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T419635)', diff saved to https://phabricator.wikimedia.org/P90855 and previous config saved to /var/cache/conftool/dbconfig/20260416-010218-fceratto.json
[01:07:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P90856 and previous config saved to /var/cache/conftool/dbconfig/20260416-010714-ladsgroup.json
[01:09:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1272020
[01:09:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1272020 (owner: TrainBranchBot)
[01:10:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:11:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:17:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P90857 and previous config saved to /var/cache/conftool/dbconfig/20260416-011722-ladsgroup.json
[01:20:43] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1272020 (owner: TrainBranchBot)
[01:27:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T410589)', diff saved to https://phabricator.wikimedia.org/P90858 and previous config saved to /var/cache/conftool/dbconfig/20260416-012730-ladsgroup.json
[01:27:35] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[01:27:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2222.codfw.wmnet with reason: Maintenance
[01:27:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T410589)', diff saved to https://phabricator.wikimedia.org/P90859 and previous config saved to /var/cache/conftool/dbconfig/20260416-012755-ladsgroup.json
[02:01:20] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:04:32] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:59] SRE, China-Judgments-Online-Preservation-Program, Wikimedia-Mailing-lists, Chinese-Sites: Request creation of mailing list for zhwikisource sysops - https://phabricator.wikimedia.org/T423520#11827920 (Bugreporter) Did we decide that it should be a mailing list instead of VRT queue? If the latter...
[02:07:37] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 16s)
[02:07:43] SRE, China-Judgments-Online-Preservation-Program, Wikimedia-Mailing-lists, Chinese-Sites: Request creation of mailing list for zhwikisource sysops - https://phabricator.wikimedia.org/T423520#11827921 (Supergrey1) >>! In T423520#11827920, @Bugreporter wrote: > Did we decide that it should be a mai...
[02:09:16] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T410589)', diff saved to https://phabricator.wikimedia.org/P90860 and previous config saved to /var/cache/conftool/dbconfig/20260416-022223-ladsgroup.json
[02:22:27] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:32:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P90861 and previous config saved to /var/cache/conftool/dbconfig/20260416-023231-ladsgroup.json
[02:34:16] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:42] (PS17) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[02:36:58] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[02:42:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P90862 and previous config saved to /var/cache/conftool/dbconfig/20260416-024239-ladsgroup.json
[02:42:52] (CR) Ignacio Rodríguez: Restore PageImages functionality to Wikisources and Wikibooks (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: Ignacio Rodríguez)
[02:44:11] (PS18) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[02:44:19] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[02:48:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T419635)', diff saved to https://phabricator.wikimedia.org/P90863 and previous config saved to /var/cache/conftool/dbconfig/20260416-024845-fceratto.json
[02:48:50] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[02:50:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:51:06] (PS19) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[02:51:17] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[02:52:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:52:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T410589)', diff saved to https://phabricator.wikimedia.org/P90864 and previous config saved to /var/cache/conftool/dbconfig/20260416-025247-ladsgroup.json
[02:52:57] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:53:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:55:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:56:08] (PS20) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[02:58:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:58:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P90865 and previous config saved to /var/cache/conftool/dbconfig/20260416-025853-fceratto.json
[02:59:35] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:01:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:01:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:01:24] (CR) Krinkle: Limit and standardize thumbnail options (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) (owner: Jdlrobson)
[03:04:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - dse-k8s-ctrl_6443: Servers dse-k8s-ctrl2002.codfw.wmnet are marked down but pooled: wdqs-main_443: Servers wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:04:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - dse-k8s-ctrl_6443: Servers dse-k8s-ctrl2002.codfw.wmnet are marked down but pooled: wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:07:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:07:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:09:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P90866 and previous config saved to /var/cache/conftool/dbconfig/20260416-030902-fceratto.json
[03:10:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:10:21] (PS21) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[03:10:27] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:11:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:14:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:14:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:15:37] (PS22) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646)
[03:15:48] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:17:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:17:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:19:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T419635)', diff saved to https://phabricator.wikimedia.org/P90867 and previous config saved to /var/cache/conftool/dbconfig/20260416-031910-fceratto.json
[03:19:15] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[03:19:22] (PS1) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1272121 (https://phabricator.wikimedia.org/T422646)
[03:19:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[03:19:31] (Abandoned) Andrew Bogott: designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1271944 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:19:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T419635)', diff saved to https://phabricator.wikimedia.org/P90868 and previous config saved to /var/cache/conftool/dbconfig/20260416-031934-fceratto.json
[03:19:40] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272121 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:20:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:20:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:25:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:31:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:32:14] (PS4) Ignacio Rodríguez: Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538)
[03:32:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:37:18] (CR) Ignacio Rodríguez: [C:+1] "If I understand this correctly, the error is caused because of a comment indented with spaces instead of tabs. I did it unconsciously beca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: Ignacio Rodríguez)
[03:44:55] (CR) Andrew Bogott: [C:+2] designate: fix up zookeeper host lists to use private network [puppet] - https://gerrit.wikimedia.org/r/1272121 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:45:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:46:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:49:06] (PS1) Andrew Bogott: designate: fix zookeeper cluster name in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1272143 (https://phabricator.wikimedia.org/T422646)
[03:49:16] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272143 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:49:36] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272143 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:50:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:51:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:51:57] (CR) Andrew Bogott: [C:+2] designate: fix zookeeper cluster name in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1272143 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:54:18] (PS1) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[03:54:46] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[03:55:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:55:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:56:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:56:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:01:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:01:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:02:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:02:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:05:57] (PS2) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:06:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:06:23] (PS3) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:06:42] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:10:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:10:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:12:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:13:33] (PS4) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:13:44] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:15:20] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:16:20] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:27:10] (PS5) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:27:22] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:27:40] (CR) CI reject: [V:-1] designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:28:37] (PS6) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:29:38] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:35:26] (PS7) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:35:32] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:38:42] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:03] (PS8) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:39:06] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:44:30] (PS9) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:44:39] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:49:33] (PS10) Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646)
[04:49:48] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: Andrew Bogott)
[04:57:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270567 (owner: Bodhisattwa)
[05:00:35] (CR) Nathillard: [C:+1] "Added a few vars, otherwise +1 from me - thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1271580 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[05:06:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T419635)', diff saved to https://phabricator.wikimedia.org/P90869 and previous config saved to /var/cache/conftool/dbconfig/20260416-050609-fceratto.json
[05:06:14] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[05:12:25] (PS1) Marostegui: clouddb1019.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/1272216 (https://phabricator.wikimedia.org/T422813)
[05:13:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:14:16] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P90870 and previous config saved to /var/cache/conftool/dbconfig/20260416-051618-fceratto.json
[05:19:24] (CR) Marostegui: [C:+2] clouddb1019.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/1272216 (https://phabricator.wikimedia.org/T422813) (owner: Marostegui)
[05:22:28] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts clouddb1019.eqiad.wmnet
[05:23:43] (PS1) Marostegui: site.pp: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1272222 (https://phabricator.wikimedia.org/T423151)
[05:24:29] (CR) Marostegui: [C:+2] site.pp: Remove clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/1272222 (https://phabricator.wikimedia.org/T423151) (owner: Marostegui)
[05:26:10] (CR) Anzx: [C:-1] "please regenerate logos, support for 1.5x size logo was removed in Ie9284ca06eda39407dc5ea865dc95e31dbe6b7f9" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: Robertsky)
[05:26:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P90871 and previous config saved to /var/cache/conftool/dbconfig/20260416-052626-fceratto.json
[05:27:04] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[05:30:46] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb1019.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:31:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb1019.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[05:31:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:31:02] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts clouddb1019.eqiad.wmnet
[05:33:54] ops-eqiad, cloud-services-team, Data-Services, DBA, and 2 others: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11828108 (Marostegui) a:Marostegui→Jclark-ctr
[05:34:07] ops-eqiad, cloud-services-team, Data-Services, DBA, and 2 others: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11828114 (Marostegui) This is ready for DC-Ops
[05:35:20] ops-eqiad, cloud-services-team, Data-Services, DBA, and 2 others: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11828120 (Marostegui) >>! In T423151#11828091, @ops-monitoring-bot wrote: > cookbooks.sre.hosts.decommission executed by marostegui@cumin1003 for host...
[05:36:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T419635)', diff saved to https://phabricator.wikimedia.org/P90872 and previous config saved to /var/cache/conftool/dbconfig/20260416-053635-fceratto.json
[05:36:39] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[05:36:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[05:37:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T419635)', diff saved to https://phabricator.wikimedia.org/P90873 and previous config saved to /var/cache/conftool/dbconfig/20260416-053659-fceratto.json
[05:43:51] (PS1) Marostegui: db2193: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1272372 (https://phabricator.wikimedia.org/T422777)
[05:47:16] (CR) Marostegui: [C:+2] db2193: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1272372 (https://phabricator.wikimedia.org/T422777) (owner: Marostegui)
[05:47:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2193.codfw.wmnet with reason: Reimage to Trixie
[05:47:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2193: Reimage to Trixie
[05:48:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2193: Reimage to Trixie
[05:49:02] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2193.codfw.wmnet with OS trixie
[05:59:38] (PS1) Marostegui: Revert "db2193: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1272378
[06:00:05] Deploy window
MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T0600). [06:08:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2193.codfw.wmnet with reason: host reimage [06:14:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2193.codfw.wmnet with reason: host reimage [06:26:26] (03CR) 10Marostegui: [C:03+2] Revert "db2193: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1272378 (owner: 10Marostegui) [06:37:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2193.codfw.wmnet with OS trixie [06:38:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2193: after reimage to trixie [06:38:24] FIRING: [8x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_cassandra_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:38:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:39:57] (03CR) 10Muehlenhoff: "Yes, both KDCs can be rebooted relatively freely, but krb2002 is the better choice still: All Kerberos clients have both krb1002 and krb20" [puppet] - 10https://gerrit.wikimedia.org/r/1271794 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [06:40:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1271794 
(https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [06:40:23] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2280.codfw.wmnet [06:40:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2280.codfw.wmnet [06:40:32] !log jayme@cumin2002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2280.codfw.wmnet [06:40:33] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker2280.codfw.wmnet [06:43:24] RESOLVED: [9x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_cassandra_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:17] (03CR) 10JMeybohm: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [06:46:48] (03CR) 10Anzx: [C:04-1] "I have suggested update to README.md since support for 1.5x size logo was removed in Ie9284ca06eda39407dc5ea865dc95e31dbe6b7f9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [06:47:54] (03CR) 10JMeybohm: [C:03+1] "Maybe add a comment to the configmap_nochecksum.yaml explaining it's name so the people from the future are less confused. 
Other than that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271736 (https://phabricator.wikimedia.org/T421504) (owner: 10Effie Mouzeli) [06:51:11] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271745 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [06:55:17] !log imported opensearch-madvise 0.2+deb13u1 to component/opensearch2 of trixie-wikimedia T422860 [06:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:20] T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860 [06:57:04] (03PS2) 10Anzx: sahwikisource: add Ааптар (author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272424 (https://phabricator.wikimedia.org/T423374) [06:57:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272424 (https://phabricator.wikimedia.org/T423374) (owner: 10Anzx) [06:59:21] !log jmm@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker-eqiad [07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T0700). [07:00:05] bodhisattwa and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:24] o/ [07:04:20] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1271776 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [07:05:24] (03CR) 10Muehlenhoff: [C:03+2] Add missing record for new group [puppet] - 10https://gerrit.wikimedia.org/r/1271776 (https://phabricator.wikimedia.org/T420688) (owner: 10Muehlenhoff) [07:13:45] 06SRE, 10China-Judgments-Online-Preservation-Program, 10Wikimedia-Mailing-lists, 07Chinese-Sites: Request creation of mailing list for zhwikisource sysops - https://phabricator.wikimedia.org/T423520#11828229 (10Ericliu1912) You could add me as list admin, ericliu.roc [at] gmail.com. [07:15:17] 06SRE, 10China-Judgments-Online-Preservation-Program, 10Wikimedia-Mailing-lists, 07Chinese-Sites: Request creation of mailing list for zhwikisource sysops - https://phabricator.wikimedia.org/T423520#11828231 (10Supergrey1) [07:16:03] 06SRE, 10China-Judgments-Online-Preservation-Program, 10Wikimedia-Mailing-lists, 07Chinese-Sites: Request creation of mailing list for zhwikisource sysops - https://phabricator.wikimedia.org/T423520#11828234 (10Supergrey1) >>! In T423520#11828229, @Ericliu1912 wrote: > You could add me as list admin, ericl... 
[07:16:35] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1006.eqiad.wmnet [07:21:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1006.eqiad.wmnet [07:22:32] (03PS2) 10Jelto: miscweb: add config environment variables to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271580 (https://phabricator.wikimedia.org/T414405) [07:23:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2193: after reimage to trixie [07:24:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T419635)', diff saved to https://phabricator.wikimedia.org/P90879 and previous config saved to /var/cache/conftool/dbconfig/20260416-072432-fceratto.json [07:24:37] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:26:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [07:26:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb1015.eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:26:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T419961)', diff saved to https://phabricator.wikimedia.org/P90880 and previous config saved to /var/cache/conftool/dbconfig/20260416-072650-fceratto.json [07:26:54] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1006.eqiad.wmnet [07:26:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1006.eqiad.wmnet [07:27:02] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1007.eqiad.wmnet [07:27:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node 
(exit_code=0) depool for host aux-k8s-worker1007.eqiad.wmnet [07:31:26] (03PS1) 10JMeybohm: wikikube: Allow access to typha from DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1272448 (https://phabricator.wikimedia.org/T365687) [07:32:53] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1007.eqiad.wmnet [07:32:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1007.eqiad.wmnet [07:33:01] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1008.eqiad.wmnet [07:33:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [07:33:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1008.eqiad.wmnet [07:33:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T419961)', diff saved to https://phabricator.wikimedia.org/P90881 and previous config saved to /var/cache/conftool/dbconfig/20260416-073354-fceratto.json [07:34:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P90882 and previous config saved to /var/cache/conftool/dbconfig/20260416-073440-fceratto.json [07:34:49] (03CR) 10Jelto: [C:03+2] miscweb: add config environment variables to wmf-navigator (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271580 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [07:34:52] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272448 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [07:35:01] (03CR) 10Muehlenhoff: wikikube: Allow access to typha from DOMAIN_NETWORKS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1272448 
(https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [07:36:37] (03PS2) 10Anzx: etwikiquote: delete unused logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272447 (https://phabricator.wikimedia.org/T313698) [07:37:10] (03Merged) 10jenkins-bot: miscweb: add config environment variables to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271580 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [07:38:14] (03PS1) 10Marostegui: db2169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1272455 (https://phabricator.wikimedia.org/T422777) [07:38:52] (03PS3) 10Anzx: etwikiquote: delete unused temporary logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272447 (https://phabricator.wikimedia.org/T313698) [07:39:06] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1008.eqiad.wmnet [07:39:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1008.eqiad.wmnet [07:39:14] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker1009.eqiad.wmnet [07:39:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker1009.eqiad.wmnet [07:40:10] (03CR) 10Marostegui: [C:03+2] db2169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1272455 (https://phabricator.wikimedia.org/T422777) (owner: 10Marostegui) [07:40:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2169.codfw.wmnet with reason: Reimage to Trixie [07:40:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2169: Reimage to Trixie [07:40:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2169: Reimage to Trixie [07:41:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2169.codfw.wmnet with OS trixie 
[07:42:15] (03PS2) 10JMeybohm: wikikube: Allow access to typha from DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1272448 (https://phabricator.wikimedia.org/T365687) [07:43:09] (03CR) 10JMeybohm: wikikube: Allow access to typha from DOMAIN_NETWORKS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1272448 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [07:44:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P90884 and previous config saved to /var/cache/conftool/dbconfig/20260416-074402-fceratto.json [07:44:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P90885 and previous config saved to /var/cache/conftool/dbconfig/20260416-074448-fceratto.json [07:45:15] !log jmm@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker1009.eqiad.wmnet [07:45:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker1009.eqiad.wmnet [07:45:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker-eqiad [07:45:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1272448 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [07:51:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 365522208 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:52:07] (03PS1) 10Marostegui: Revert "db2169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1272467 [07:52:50] (03CR) 10JMeybohm: [C:03+2] wikikube: Allow access to typha from DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1272448 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [07:54:11] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P90886 and previous config saved to /var/cache/conftool/dbconfig/20260416-075410-fceratto.json [07:54:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T419635)', diff saved to https://phabricator.wikimedia.org/P90887 and previous config saved to /var/cache/conftool/dbconfig/20260416-075457-fceratto.json [07:55:02] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:55:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [07:55:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T419635)', diff saved to https://phabricator.wikimedia.org/P90888 and previous config saved to /var/cache/conftool/dbconfig/20260416-075522-fceratto.json [07:58:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 91608 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:59:22] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272527 [07:59:24] (03CR) 10Elukey: role::cluster::management: add profile to sync firmwares (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [08:00:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272527 (owner: 10Matthias Mullie) [08:00:48] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2169.codfw.wmnet with reason: host reimage 
[08:03:34] (03PS2) 10Elukey: role::cluster::management: add profile to sync firmwares [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) [08:04:00] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [08:04:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T419961)', diff saved to https://phabricator.wikimedia.org/P90889 and previous config saved to /var/cache/conftool/dbconfig/20260416-080420-fceratto.json [08:04:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [08:04:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1201 (T419961)', diff saved to https://phabricator.wikimedia.org/P90890 and previous config saved to /var/cache/conftool/dbconfig/20260416-080445-fceratto.json [08:05:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2169.codfw.wmnet with reason: host reimage [08:06:11] (03CR) 10Elukey: "Added a suggestion from Jesse and fixed a silly hiera mistake, pcc now works!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [08:10:51] (03PS1) 10Muehlenhoff: Migrate calico-bird ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1272537 (https://phabricator.wikimedia.org/T365687) [08:12:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [08:13:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T419961)', diff saved to https://phabricator.wikimedia.org/P90891 and previous config saved to /var/cache/conftool/dbconfig/20260416-081305-fceratto.json [08:14:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272537 (https://phabricator.wikimedia.org/T365687) (owner: 10Muehlenhoff) [08:17:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:22:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:22:52] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 362839504 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:23:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P90892 and previous config saved to /var/cache/conftool/dbconfig/20260416-082314-fceratto.json [08:24:32] (03CR) 10Marostegui: [C:03+2] Revert "db2169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1272467 (owner: 
10Marostegui) [08:27:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:27:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2169.codfw.wmnet with OS trixie [08:28:54] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 60888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:29:32] (03CR) 10Jcrespo: [C:03+2] dbbackups: Perform a ro backup & start backing up only the latest 2 clusters [puppet] - 10https://gerrit.wikimedia.org/r/1271728 (https://phabricator.wikimedia.org/T421729) (owner: 10Jcrespo) [08:31:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2169: repool after maintenance [08:33:00] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11828470 (10DPogorzelski-WMF) the issue can be reproduced locally with a simple kserve "hello world" [08:33:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P90894 and previous config saved to /var/cache/conftool/dbconfig/20260416-083323-fceratto.json [08:36:48] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup[1007,1014].eqiad.wmnet with reason: maintenance [08:36:50] (03PS1) 10Muehlenhoff: Revert "Depool puppetserver1002" [dns] - 10https://gerrit.wikimedia.org/r/1272559 (https://phabricator.wikimedia.org/T423282) [08:36:54] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11828477 (10ops-monitoring-bot) Icinga 
downtime and Alertmanager silence (ID=acda330c-af7e-43eb-ab9e-f17a3dfaee68) set by j... [08:40:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:43:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T419961)', diff saved to https://phabricator.wikimedia.org/P90895 and previous config saved to /var/cache/conftool/dbconfig/20260416-084331-fceratto.json [08:43:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [08:45:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:50:06] (03CR) 10Muehlenhoff: [C:03+2] Revert "Depool puppetserver1002" [dns] - 10https://gerrit.wikimedia.org/r/1272559 (https://phabricator.wikimedia.org/T423282) (owner: 10Muehlenhoff) [08:50:12] !log jmm@dns1004 START - running authdns-update [08:50:36] (03CR) 10Elukey: role::cluster::management: add profile to sync firmwares (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [08:51:39] !log jmm@dns1004 END - running authdns-update [08:54:19] 10SRE-SLO, 10observability, 10Wikidata, 06Wikidata Platform Team, and 4 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11828533 (10gmodena) [08:54:55] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 500083984 and 
119 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:55:22] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [08:55:45] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [08:56:00] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [08:56:53] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [08:57:28] (03PS1) 10Elukey: debian: replace commas with spaces [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1272578 [08:57:44] (03CR) 10Elukey: [V:03+2 C:03+2] debian: replace commas with spaces [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1272578 (owner: 10Elukey) [08:57:55] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 11920 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:00:44] (03CR) 10Muehlenhoff: [C:03+1] role::cluster::management: add profile to sync firmwares (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [09:03:12] (03PS3) 10Ayounsi: move-vlan cookbook: add "inplace" support [cookbooks] - 10https://gerrit.wikimedia.org/r/1270965 [09:03:31] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host backup1007 [09:03:46] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host backup1007 [09:04:55] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 129492432 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:05:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... 
[09:05:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [09:06:09] !ack [09:06:10] 7843 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [09:08:48] (03PS1) 10Elukey: debian: add explicit ordering between node labeller and gpu plugin [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1272580 (https://phabricator.wikimedia.org/T420507) [09:08:49] marostegui: is it correct that this is likely some kind of scraping? [09:09:00] bjensen: yeah, I believe so [09:09:08] I am checking graphs at the moment [09:10:27] (03CR) 10Klausman: amg-gpu: Set up explicit GPU partitioning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [09:10:32] (03PS4) 10Ayounsi: move-vlan cookbook: add "inplace" support [cookbooks] - 10https://gerrit.wikimedia.org/r/1270965 [09:10:50] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host backup1007 [09:11:05] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:11:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:11:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419961)', diff saved to https://phabricator.wikimedia.org/P90898 and previous config saved to /var/cache/conftool/dbconfig/20260416-091115-fceratto.json [09:11:40] (03CR) 10Klausman: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271699 (https://phabricator.wikimedia.org/T421461) (owner: 10Klausman) [09:11:46] (03CR) 
10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271718 (owner: 10Muehlenhoff) [09:12:45] (03PS2) 10Btullis: airflow: Only mount geoip volumes for certain instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) [09:13:10] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [09:13:14] (03PS2) 10Elukey: debian: add explicit ordering between node labeller and gpu plugin [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1272580 (https://phabricator.wikimedia.org/T420507) [09:13:17] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [09:13:42] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:51] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1007 - ayounsi@cumin1003" [09:14:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1007 - ayounsi@cumin1003" [09:14:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:14:57] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache backup1007.eqiad.wmnet 88.48.64.10.in-addr.arpa 8.8.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:15:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) backup1007.eqiad.wmnet 88.48.64.10.in-addr.arpa 8.8.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa 
on all recursors [09:15:01] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host backup1007 [09:15:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1007 [09:15:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host backup1007 [09:16:20] (03PS1) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [09:16:27] (03CR) 10Elukey: amg-gpu: Set up explicit GPU partitioning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [09:16:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2169: repool after maintenance [09:16:52] (03CR) 10CI reject: [V:04-1] kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [09:16:55] (03PS2) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [09:17:26] (03CR) 10CI reject: [V:04-1] kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [09:18:18] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [09:18:56] (03PS3) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [09:19:27] (03CR) 10CI reject: [V:04-1] kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 
(https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [09:20:07] (03PS4) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [09:20:26] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [09:20:26] (03PS5) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [09:20:34] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [09:21:05] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11828641 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff >>! In T423282#11823753, @MoritzMuehlenhoff wrote: > I'm in... [09:21:07] (03CR) 10CI reject: [V:04-1] kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [09:21:55] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 55512 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:22:00] (03PS6) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [09:24:33] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [09:24:50] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[09:25:42] (03PS1) 10MVernon: apus: move controller node [puppet] - 10https://gerrit.wikimedia.org/r/1272589 (https://phabricator.wikimedia.org/T418902) [09:27:12] (03PS2) 10MVernon: apus: move controller node [puppet] - 10https://gerrit.wikimedia.org/r/1272589 (https://phabricator.wikimedia.org/T418902) [09:28:04] (03CR) 10Brouberol: [C:03+1] Switch the dse-k8s-ctrl service from Weighted Round Robin to Maglev [puppet] - 10https://gerrit.wikimedia.org/r/1271745 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [09:29:22] !log setting backup1014 in maintenance, no backup or recovery will run while it is in maintenance T421719 [09:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:26] T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719 [09:29:47] (03CR) 10MVernon: [C:03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/1271985 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [09:29:56] (03PS5) 10Ayounsi: move-vlan cookbook: add "inplace" support [cookbooks] - 10https://gerrit.wikimedia.org/r/1270965 [09:30:56] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 333140400 and 40 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:31:02] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [09:35:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:36:20] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:36:27] (03CR) 10Brouberol: airflow: Only mount geoip volumes for certain instances (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [09:37:20] (03PS3) 10Btullis: airflow: Only mount geoip volumes for certain instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) [09:37:41] !log installing imagemagick security updates [09:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:54] (03CR) 10Btullis: airflow: Only mount geoip volumes for certain instances (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [09:38:45] (03CR) 10Btullis: [C:03+2] Switch the dse-k8s-ctrl service from Weighted Round Robin to Maglev [puppet] - 10https://gerrit.wikimedia.org/r/1271745 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [09:39:16] FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:39:56] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 42296 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:40:18] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:40:35] !log jynus@cumin1003 START - Cookbook sre.hosts.move-vlan for host backup1014 [09:41:16] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [09:42:14] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cp2041.codfw.wmnet [09:42:20] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [09:42:25] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cp2042.codfw.wmnet [09:44:20] !log jmm@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-master-eqiad [09:44:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T419635)', diff saved to https://phabricator.wikimedia.org/P90900 and previous config saved to /var/cache/conftool/dbconfig/20260416-094436-fceratto.json [09:44:43] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:45:43] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1014 - jynus@cumin1003" [09:45:49] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1014 - jynus@cumin1003" [09:45:49] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:45:49] !log jynus@cumin1003 START - Cookbook sre.dns.wipe-cache backup1014.eqiad.wmnet 20.48.64.10.in-addr.arpa 0.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:45:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... 
[09:45:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [09:45:53] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) backup1014.eqiad.wmnet 20.48.64.10.in-addr.arpa 0.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:45:53] !log jynus@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host backup1014 [09:47:36] !log jynus@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1014 [09:47:36] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host backup1014 [09:50:52] (03PS5) 10Arnaudb: gerrit: migrate gerrit_site away from root partition [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) [09:51:01] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11828738 (10MoritzMuehlenhoff) [09:51:31] (03CR) 10Ayounsi: [C:03+2] move-vlan cookbook: add "inplace" support [cookbooks] - 10https://gerrit.wikimedia.org/r/1270965 (owner: 10Ayounsi) [09:52:07] !log installing qemu security updates [09:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:08] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:53:58] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Swift [09:54:16] (03Merged) 10jenkins-bot: move-vlan cookbook: add "inplace" support [cookbooks] - 
10https://gerrit.wikimedia.org/r/1270965 (owner: 10Ayounsi) [09:54:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-master-eqiad [09:54:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P90901 and previous config saved to /var/cache/conftool/dbconfig/20260416-095455-fceratto.json [09:55:02] PROBLEM - Swift https frontend on ms-fe1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:55:02] PROBLEM - Swift https backend on ms-fe1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:55:52] RECOVERY - Swift https frontend on ms-fe1021 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Swift [09:55:52] RECOVERY - Swift https backend on ms-fe1021 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Swift [09:56:56] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1050870416 and 92 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:59:08] (03CR) 10Federico Ceratto: [C:03+1] "I see moss-be2001 is being replaced with apus-be2005 which has been set up in the related task." 
[puppet] - 10https://gerrit.wikimedia.org/r/1272589 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [09:59:16] RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1000) [10:00:38] (03CR) 10Elukey: [C:03+1] kafka: deploy a new kafka script wrapper when installing kafka 3.7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [10:01:56] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 25984 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:02:07] (03PS7) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [10:02:32] (03PS8) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) [10:02:37] (03CR) 10CI reject: [V:04-1] kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [10:02:41] (03CR) 10Brouberol: kafka: deploy a new kafka script wrapper when installing kafka 3.7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1272588 (https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [10:03:46] (03CR) 10Brouberol: [C:03+2] kafka: deploy a new kafka script wrapper when installing kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1272588 
(https://phabricator.wikimedia.org/T422842) (owner: 10Brouberol) [10:04:54] (03CR) 10Brouberol: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [10:04:56] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 88450864 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:05:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P90902 and previous config saved to /var/cache/conftool/dbconfig/20260416-100505-fceratto.json [10:05:25] (03CR) 10Btullis: [C:03+2] airflow: Only mount geoip volumes for certain instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [10:05:56] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 84848 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:08:16] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [10:09:20] !log backup1014 returns from maintenance, backups and recovery can flow as usual T421719 [10:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:24] T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719 [10:09:48] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11828779 (10jcrespo) [10:11:36] !log fceratto@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419961)', diff saved to https://phabricator.wikimedia.org/P90903 and previous config saved to /var/cache/conftool/dbconfig/20260416-101135-fceratto.json [10:12:16] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [10:12:59] ^ gerrit2002 alerts are expected, I'm doing some tests on the standby-host [10:13:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:13:49] (03CR) 10Jcrespo: [C:03+1] "Tested, works as intended." [cookbooks] - 10https://gerrit.wikimedia.org/r/1270965 (owner: 10Ayounsi) [10:13:52] (03CR) 10Klausman: [V:03+1 C:03+2] manifests: Enable iommu=pt kernel parameter for MI300 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1271699 (https://phabricator.wikimedia.org/T421461) (owner: 10Klausman) [10:13:53] (03Merged) 10jenkins-bot: airflow: Only mount geoip volumes for certain instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271811 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [10:14:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:14:56] (03CR) 10Klausman: [V:03+1 C:03+2] manifests: Enable iommu=pt kernel parameter for MI300 hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271699 (https://phabricator.wikimedia.org/T421461) (owner: 10Klausman) [10:15:01] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[10:15:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T419635)', diff saved to https://phabricator.wikimedia.org/P90904 and previous config saved to /var/cache/conftool/dbconfig/20260416-101514-fceratto.json [10:15:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:15:28] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11828796 (10MoritzMuehlenhoff) [10:15:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [10:15:51] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [10:17:16] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [10:20:16] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [10:20:51] (03CR) 10MVernon: [C:03+2] apus: move controller node [puppet] - 10https://gerrit.wikimedia.org/r/1272589 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [10:21:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P90905 and previous config saved to /var/cache/conftool/dbconfig/20260416-102143-fceratto.json [10:30:39] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] debdeploy: Bump changelog for new release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1270883 (owner: 10Muehlenhoff) [10:31:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db1159', diff saved to https://phabricator.wikimedia.org/P90906 and previous config saved to /var/cache/conftool/dbconfig/20260416-103152-fceratto.json [10:35:13] (03PS1) 10MVernon: hiera: adjust cephadm roles for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) [10:35:43] (03CR) 10CI reject: [V:04-1] hiera: adjust cephadm roles for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [10:35:47] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [10:38:13] (03PS2) 10MVernon: hiera: adjust cephadm roles for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) [10:38:25] (03PS1) 10Jelto: gerrit: migrate data ways from /var/lib/gerrit on gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1272612 (https://phabricator.wikimedia.org/T333143) [10:38:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:38:42] (03CR) 10CI reject: [V:04-1] hiera: adjust cephadm roles for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [10:39:32] (03PS3) 10MVernon: hiera: adjust cephadm roles for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) [10:41:14] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1272612 (https://phabricator.wikimedia.org/T333143) (owner: 
10Jelto) [10:42:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T419961)', diff saved to https://phabricator.wikimedia.org/P90907 and previous config saved to /var/cache/conftool/dbconfig/20260416-104201-fceratto.json [10:42:11] (03PS1) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 [10:42:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:42:23] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11828883 (10brouberol) 05Open→03In progress [10:42:24] 10SRE-tools, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): debmonitor-client crashes for growthbook image - https://phabricator.wikimedia.org/T423413#11828882 (10brouberol) [10:42:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:42:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T419961)', diff saved to https://phabricator.wikimedia.org/P90908 and previous config saved to /var/cache/conftool/dbconfig/20260416-104240-fceratto.json [10:42:58] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 277361280 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:44:16] (03CR) 10CI reject: [V:04-1] Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [10:45:12] (03CR) 10Jelto: [V:03+1] "see T333143#11828854 for a bit 
more context." [puppet] - 10https://gerrit.wikimedia.org/r/1272612 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [10:45:38] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:47:20] 06SRE: wiki.opentreetmap.org wikicommons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570 (10Firefishy) 03NEW [10:47:31] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: wikikube-ctrl2004.codfw.wmnet [10:47:39] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: wikikube-ctrl2005.codfw.wmnet [10:48:04] ^ This pybal not restarted alert is related to work that I'm doing on T420437 - I've mentioned in #wikimedia-traffic that I'd welcome a hand on restarting pybal, if possible. [10:48:05] T420437: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437 [10:50:08] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11828905 (10MoritzMuehlenhoff) [10:50:11] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:50:20] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:50:20] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:50:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419961)', diff saved to https://phabricator.wikimedia.org/P90909 and previous config saved to /var/cache/conftool/dbconfig/20260416-105040-fceratto.json [10:50:55] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:51:54] (03CR) 10Jcrespo: [C:03+1] hiera: adjust cephadm roles for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [10:52:26] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: wikikube-ctrl2006.codfw.wmnet [10:52:39] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#11828916 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: wikikube-ctrl2006.codfw.... [10:53:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272447 (https://phabricator.wikimedia.org/T313698) (owner: 10Anzx) [10:53:44] (03CR) 10MVernon: [C:03+2] hiera: adjust cephadm roles for codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [10:55:20] (03PS1) 10Clare Ming: Add script to get constructive edits for all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1272633 (https://phabricator.wikimedia.org/T422736) [10:55:38] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:55:51] !log imported debdeploy 0.0.99.15 for bookworm-wikimedia (compat release for Cumin 6) [10:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:56:45] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:57:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [10:57:46] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [10:59:29] (03PS2) 10Clare Ming: Add script to get constructive edits for all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1272633 (https://phabricator.wikimedia.org/T422736) [11:00:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P90910 and previous config saved to /var/cache/conftool/dbconfig/20260416-110049-fceratto.json [11:03:38] 07Puppet: Add PATCH method to Wmflib::HTTP::Method - https://phabricator.wikimedia.org/T392096#11828941 (10Fabfur) 05Open→03Resolved Thanks, this should've been closed long long time ago... 
[11:04:20] PROBLEM - MariaDB Replica Lag: pc1 on pc2011 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:07:36] !log updating debdeploy on bookworm to 0.0.99.15 [11:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:20] PROBLEM - MariaDB Replica Lag: pc5 on pc2015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:10:19] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [11:10:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P90911 and previous config saved to /var/cache/conftool/dbconfig/20260416-111058-fceratto.json [11:12:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... [11:12:51] Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F3%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:16:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:16:17] (03CR) 10Federico Ceratto: [C:03+1] "I see the moss to apus switch; moss-be2001 being added as storage, apus-be200 from 4 to 9 being storage." 
[puppet] - 10https://gerrit.wikimedia.org/r/1272609 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [11:16:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:17:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... [11:17:51] Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F3%2F2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:19:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [11:19:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [11:19:58] (03PS1) 10Klausman: home/klausman: fix c&p error on tmuxp config [puppet] - 10https://gerrit.wikimedia.org/r/1272658 [11:20:29] (03CR) 10Klausman: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272658 (owner: 10Klausman) [11:20:41] 06SRE: wiki.opentreetmap.org wikicommons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570#11829008 (10jcrespo) Full referrer is: ` QuickInstantCommons/1.5.2-REL1_43 MediaWiki/1.43.8 OpenStreetMap%20Wiki (https://osm.wiki/) ` And I see it has been throttled quite a bit lately. I wonder if... 
[11:21:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T419961)', diff saved to https://phabricator.wikimedia.org/P90913 and previous config saved to /var/cache/conftool/dbconfig/20260416-112105-fceratto.json [11:21:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [11:21:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T419961)', diff saved to https://phabricator.wikimedia.org/P90914 and previous config saved to /var/cache/conftool/dbconfig/20260416-112136-fceratto.json [11:21:56] FIRING: MaxConntrack: Elevated conntrack usage on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [11:22:59] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{ml-serve1012.eqiad.wmnet} and (A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad) [11:23:03] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1012.eqiad.wmnet [11:26:56] RESOLVED: MaxConntrack: Elevated conntrack usage on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [11:26:56] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:28:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [11:28:35] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [11:28:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:29:11] !log 
btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:29:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:29:23] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] sahwikisource: add Ааптар (author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272424 (https://phabricator.wikimedia.org/T423374) (owner: 10Anzx) [11:29:57] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] etwikiquote: delete unused temporary logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272447 (https://phabricator.wikimedia.org/T313698) (owner: 10Anzx) [11:29:58] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:30:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [11:30:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419961)', diff saved to https://phabricator.wikimedia.org/P90915 and previous config saved to /var/cache/conftool/dbconfig/20260416-113005-fceratto.json [11:30:56] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [11:31:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [11:31:20] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:31:31] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [11:31:48] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [11:31:54] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [11:32:02] (03CR) 10Clément Goubert: [C:03+1] Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [11:32:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [11:32:37] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [11:33:13] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1012.eqiad.wmnet [11:33:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [11:33:45] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [11:33:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/1/6 (Transit: ... 
[11:33:51] NTT (234630) {#3475}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F1%2F6 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:34:04] !ack [11:34:04] 7845 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:34:26] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:35:22] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 03 Jun 2026 06:56:12 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:38:43] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1012.eqiad.wmnet [11:38:45] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1012.eqiad.wmnet [11:38:45] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{ml-serve1012.eqiad.wmnet} and (A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad) [11:40:15] !incidents [11:40:15] 7845 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:40:15] 7844 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [11:40:15] 7843 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [11:40:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P90916 and previous config saved 
to /var/cache/conftool/dbconfig/20260416-114014-fceratto.json [11:41:52] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:42:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:43:20] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:43:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/1/6 (Transit: ... [11:43:51] NTT (234630) {#3475}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F1%2F6 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:44:04] !incidents [11:44:05] 7845 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:44:05] 7844 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [11:44:05] 7843 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [11:50:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P90918 and previous config saved to /var/cache/conftool/dbconfig/20260416-115024-fceratto.json [11:50:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance [11:50:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T419635)', diff saved to https://phabricator.wikimedia.org/P90919 and previous config saved to 
/var/cache/conftool/dbconfig/20260416-115055-fceratto.json [11:51:01] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:53:03] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:53:08] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{ml-serve1013.eqiad.wmnet} and (A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad) [11:53:10] (03CR) 10Effie Mouzeli: [C:03+1] prometheus, thanos: move recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1270480 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [11:53:12] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1013.eqiad.wmnet [11:56:03] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:57:10] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:58:48] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:59:20] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1200) [12:00:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T419961)', diff saved to https://phabricator.wikimedia.org/P90920 and previous config saved to /var/cache/conftool/dbconfig/20260416-120033-fceratto.json [12:00:55] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:01:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T419961)', diff saved to https://phabricator.wikimedia.org/P90921 and 
previous config saved to /var/cache/conftool/dbconfig/20260416-120104-fceratto.json [12:02:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2004.codfw.wmnet with OS trixie [12:03:10] (03PS1) 10MVernon: hiera: remove two old apus backends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1272676 (https://phabricator.wikimedia.org/T418902) [12:03:24] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1013.eqiad.wmnet [12:05:36] (03CR) 10Elukey: "Added a couple of comments, I think that black needs to run to fix the current formatting issues. They don't seem to be related to your ch" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [12:05:54] (03CR) 10Jcrespo: [C:03+1] hiera: remove two old apus backends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1272676 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [12:06:16] PROBLEM - ganeti-noded running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:06:58] (03CR) 10MVernon: [C:03+2] hiera: remove two old apus backends for decom [puppet] - 10https://gerrit.wikimedia.org/r/1272676 (https://phabricator.wikimedia.org/T418902) (owner: 10MVernon) [12:08:53] (03CR) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image (032 comments) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [12:09:04] !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1013.eqiad.wmnet [12:09:06] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1013.eqiad.wmnet [12:09:06] !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{ml-serve1013.eqiad.wmnet} and (A:ml-serve-master-eqiad or 
A:ml-serve-worker-eqiad) [12:09:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419961)', diff saved to https://phabricator.wikimedia.org/P90922 and previous config saved to /var/cache/conftool/dbconfig/20260416-120935-fceratto.json [12:11:09] (03CR) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [12:11:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts moss-be[2001-2002].codfw.wmnet [12:14:17] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:16:16] RECOVERY - ganeti-noded running on ganeti-test2001 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:19:34] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [12:19:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P90923 and previous config saved to /var/cache/conftool/dbconfig/20260416-121945-fceratto.json [12:21:10] (03CR) 10Elukey: Ensure the system python is used to execute debmonitor-client in the image (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [12:22:31] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272689 [12:25:16] mvernon@cumin2002 decommission (PID 867176) is awaiting input [12:25:20] RECOVERY - MariaDB Replica Lag: pc5 on pc2015 is OK: OK slave_sql_lag Replication lag: 43.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:26:05] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:26:13] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[12:27:01] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-be[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [12:28:01] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:28:33] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:29:00] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:29:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P90924 and previous config saved to /var/cache/conftool/dbconfig/20260416-122953-fceratto.json [12:30:06] mvernon@cumin2002 decommission (PID 867176) is awaiting input [12:30:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moss-be[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [12:30:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:30:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts moss-be[2001-2002].codfw.wmnet [12:30:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11829146 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `moss-be[2001-2002].codfw.wmnet` - moss-b... [12:33:20] RECOVERY - MariaDB Replica Lag: pc1 on pc2011 is OK: OK slave_sql_lag Replication lag: 0.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:34:46] (03CR) 10Arnaudb: [C:03+1] "looks good to me, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1272612 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [12:34:58] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 90320 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:16] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit/review_site/bin/gerrit.war daemon -d /var/lib/gerrit/review_site https://wikitech.wikimedia.org/wiki/Gerrit [12:36:26] ^ gerrit2002 is me [12:36:35] (03PS1) 10Muehlenhoff: Avoid false positive alerts after Ganeti master failover [puppet] - 10https://gerrit.wikimedia.org/r/1272701 [12:36:57] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:37:02] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:37:46] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: migrate data ways from /var/lib/gerrit on gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1272612 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [12:38:32] (03PS3) 10Robertsky: siwikitionary: update logo to localised svg version. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) [12:38:52] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:38:59] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:39:13] (03CR) 10Robertsky: siwikitionary: update logo to localised svg version. 
(034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [12:40:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T419961)', diff saved to https://phabricator.wikimedia.org/P90925 and previous config saved to /var/cache/conftool/dbconfig/20260416-124001-fceratto.json [12:40:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:40:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T419961)', diff saved to https://phabricator.wikimedia.org/P90926 and previous config saved to /var/cache/conftool/dbconfig/20260416-124032-fceratto.json [12:41:23] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:43:35] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:45:57] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829192 (10DPogorzelski-WMF) the issue seems to be solved locally by simply appending the securityContext to the container, but the same doesn't seem to work on... [12:47:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419961)', diff saved to https://phabricator.wikimedia.org/P90927 and previous config saved to /var/cache/conftool/dbconfig/20260416-124742-fceratto.json [12:51:50] (03PS4) 10Robertsky: siwikitionary: update logo to localised svg version. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) [12:52:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272701 (owner: 10Muehlenhoff) [12:52:35] (03CR) 10Robertsky: "regenerated." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [12:54:04] (03CR) 10Anzx: [C:03+1] siwikitionary: update logo to localised svg version. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [12:54:21] 10ops-codfw, 10SRE-swift-storage, 10Ceph, 06DC-Ops, 10decommission-hardware: decommission moss-be200[1-2].codfw.wmnet - https://phabricator.wikimedia.org/T423584 (10MatthewVernon) 03NEW [12:57:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P90928 and previous config saved to /var/cache/conftool/dbconfig/20260416-125750-fceratto.json [12:58:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [12:58:35] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [12:58:39] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [12:59:23] (03PS2) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 [12:59:23] (03PS1) 10Brouberol: Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 [12:59:23] (03PS1) 10Brouberol: Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1300). 
[13:00:05] Robertsky, matthiasmullie, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] o/ [13:00:16] o/ [13:00:45] o/ [13:01:22] o/ [13:01:35] (03CR) 10CI reject: [V:04-1] Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [13:01:36] let’s start with anzx’ changes, I think they can be done together [13:01:57] ok [13:01:58] i can be the last one! [13:01:58] (03PS3) 10Effie Mouzeli: mcrouter: do not checksum configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271736 (https://phabricator.wikimedia.org/T421504) [13:02:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272447 (https://phabricator.wikimedia.org/T313698) (owner: 10Anzx) [13:02:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272424 (https://phabricator.wikimedia.org/T423374) (owner: 10Anzx) [13:02:18] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829398 (10elukey) Is there a diff in this call between local and staging? ` kubectl get mutatingwebhookconfiguration -n kserve -o json | \ jq -r '.items[] |... 
[13:03:02] (03Merged) 10jenkins-bot: etwikiquote: delete unused temporary logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272447 (https://phabricator.wikimedia.org/T313698) (owner: 10Anzx) [13:03:05] (03Merged) 10jenkins-bot: sahwikisource: add Ааптар (author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272424 (https://phabricator.wikimedia.org/T423374) (owner: 10Anzx) [13:03:17] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:06] (03CR) 10Lucas Werkmeister (WMDE): siwikitionary: update logo to localised svg version. (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [13:05:22] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1272447|etwikiquote: delete unused temporary logo files (T313698)]], [[gerrit:1272424|sahwikisource: add Ааптар (author) namespace (T423374)]] [13:05:27] T313698: Requesting temporary logo change for et.wikiquote.org - https://phabricator.wikimedia.org/T313698 [13:05:28] T423374: Author namespace in sahwikisource - https://phabricator.wikimedia.org/T423374 [13:05:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T410589)', diff saved to https://phabricator.wikimedia.org/P90929 and previous config saved to /var/cache/conftool/dbconfig/20260416-130535-ladsgroup.json [13:05:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:06:12] “late 2025” wow [13:06:24] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: do not checksum configmaps 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1271736 (https://phabricator.wikimedia.org/T421504) (owner: 10Effie Mouzeli) [13:06:41] (03PS3) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 [13:07:10] (03PS2) 10Effie Mouzeli: mw-debug: use new mcrouter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271771 (https://phabricator.wikimedia.org/T420223) [13:07:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P90930 and previous config saved to /var/cache/conftool/dbconfig/20260416-130758-fceratto.json [13:08:44] (03Merged) 10jenkins-bot: mcrouter: do not checksum configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271736 (https://phabricator.wikimedia.org/T421504) (owner: 10Effie Mouzeli) [13:09:22] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1272447|etwikiquote: delete unused temporary logo files (T313698)]], [[gerrit:1272424|sahwikisource: add Ааптар (author) namespace (T423374)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:09:37] checking [13:09:41] (03CR) 10Lucas Werkmeister (WMDE): Squashed diff to master (031 comment) [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272527 (owner: 10Matthias Mullie) [13:09:44] (03PS1) 10Brouberol: Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 [13:10:17] Lucas_WMDE: namespace appears, ok to sync [13:10:26] etwikiquote also still looks fine [13:10:27] thanks [13:10:35] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync [13:10:52] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:12:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host apus-be1005.eqiad.wmnet [13:12:41] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:13:42] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:49] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:14:43] 06SRE, 10Cloud-Services, 06Infrastructure-Foundations: Adjust WMCS Gitlab CI/CD repo to stop using mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423596 (10Jdforrester-WMF) 03NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wiki... [13:15:04] calendar sprung a surprise meeting on me (I forgot, my bad) [13:15:25] I can still do the namespaceDupes run for sahwikisource (cc anzx) but after that it would be great if someone else could take over the deployment window [13:15:38] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:15:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P90932 and previous config saved to /var/cache/conftool/dbconfig/20260416-131543-ladsgroup.json [13:16:00] I can handle my own deploy [13:16:04] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[13:16:14] (03PS2) 10Brouberol: Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 [13:16:14] (03PS2) 10Brouberol: Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 [13:16:14] (03PS4) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 [13:16:15] (03PS2) 10Brouberol: Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 [13:16:21] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1272447|etwikiquote: delete unused temporary logo files (T313698)]], [[gerrit:1272424|sahwikisource: add Ааптар (author) namespace (T423374)]] (duration: 10m 59s) [13:16:26] T313698: Requesting temporary logo change for et.wikiquote.org - https://phabricator.wikimedia.org/T313698 [13:16:26] T423374: Author namespace in sahwikisource - https://phabricator.wikimedia.org/T423374 [13:16:38] * Lucas_WMDE runs maintenance scripts [13:16:39] mattiasmullie: can you help to deploy mine too? 
[13:17:04] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes sahwikisource --fix # T423273 [13:17:08] T423273: Restructure frontend code to facilitate re-use across experiments - https://phabricator.wikimedia.org/T423273 [13:17:25] oooops [13:17:26] one sec [13:17:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apus-be1005.eqiad.wmnet [13:17:56] !log correction, namespaceDupes sahwikisource run was for T423374, my bad [13:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:03] matthiasmullie: over to you [13:18:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T419961)', diff saved to https://phabricator.wikimedia.org/P90933 and previous config saved to /var/cache/conftool/dbconfig/20260416-131806-fceratto.json [13:18:07] (03CR) 10CI reject: [V:04-1] Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 (owner: 10Brouberol) [13:18:14] (03CR) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [13:18:14] (03CR) 10CI reject: [V:04-1] Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [13:18:18] (03CR) 10CI reject: [V:04-1] Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (owner: 10Brouberol) [13:18:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [13:18:32] (03CR) 10CI reject: [V:04-1] Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 (owner: 10Brouberol) [13:18:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1210 
(T419961)', diff saved to https://phabricator.wikimedia.org/P90934 and previous config saved to /var/cache/conftool/dbconfig/20260416-131836-fceratto.json [13:19:54] (03CR) 10CDanis: [C:03+1] cache::haproxy: small fix in contact info regex [puppet] - 10https://gerrit.wikimedia.org/r/1271593 (owner: 10Fabfur) [13:19:54] (03PS1) 10Michael Große: fix: add missing hook registration for create account stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272712 (https://phabricator.wikimedia.org/T422283) [13:20:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272527 (owner: 10Matthias Mullie) [13:20:30] Heyy, late addition, could we backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1272712 as well? [13:20:42] I can add it to the deployment calendar in a second [13:20:59] (03CR) 10Eevans: [C:03+2] installserver: configure new aqs hosts for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1271985 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [13:21:04] robertsky: sure [13:21:08] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272527 (owner: 10Matthias Mullie) [13:21:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11829519 (10MatthewVernon) 05Open→03Resolved I've had a poke through the web iDRAC, and I think I've found the offending disk on apus-be1005; apu... 
[13:21:19] (03Abandoned) 10Fabfur: cache::haproxy: small fix in contact info regex [puppet] - 10https://gerrit.wikimedia.org/r/1271593 (owner: 10Fabfur) [13:21:33] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1272527|Squashed diff to master]] [13:22:33] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829523 (10DPogorzelski-WMF) adding seccompProfile: type: RuntimeDefault to the chart values and handling it in the configmap patch seems to sol... [13:22:33] (03CR) 10JHathaway: [C:03+2] kdc: ensure net.netfilter.nf_conntrack_max is updated [puppet] - 10https://gerrit.wikimedia.org/r/1271794 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [13:23:51] (03PS3) 10Brouberol: Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 [13:23:51] (03PS3) 10Brouberol: Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 [13:23:51] (03PS5) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 [13:23:52] (03PS3) 10Brouberol: Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 [13:23:55] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on krb2002.codfw.wmnet with reason: T407726 [13:23:59] T407726: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726 [13:25:13] (03PS1) 10MVernon: apus: add two new storage nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1272713 (https://phabricator.wikimedia.org/T418901) [13:25:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T419961)', diff saved to https://phabricator.wikimedia.org/P90935 and previous config saved to 
/var/cache/conftool/dbconfig/20260416-132525-fceratto.json [13:25:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P90936 and previous config saved to /var/cache/conftool/dbconfig/20260416-132551-ladsgroup.json [13:26:04] (03CR) 10CI reject: [V:04-1] Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (owner: 10Brouberol) [13:26:13] (03CR) 10CI reject: [V:04-1] Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [13:26:18] (03CR) 10CI reject: [V:04-1] Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 (owner: 10Brouberol) [13:26:34] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:27:22] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Remove old ingress nodes [puppet] - 10https://gerrit.wikimedia.org/r/1272714 (https://phabricator.wikimedia.org/T392356) [13:27:49] (03PS4) 10Brouberol: Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 [13:27:49] (03PS6) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 [13:27:49] (03PS4) 10Brouberol: Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 [13:28:33] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271926 (owner: 10CDanis) [13:28:40] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8427/co" [puppet] - 10https://gerrit.wikimedia.org/r/1272714 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [13:28:47] !log 
dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:29:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2004.codfw.wmnet with OS trixie [13:31:57] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:32:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS trixie [13:33:16] (03CR) 10Elukey: [C:03+1] Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 (owner: 10Brouberol) [13:33:21] 06SRE, 06Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11829592 (10jhathaway) 05Open→03Resolved a:03jhathaway Fix has been applied and rolled out to both krb hosts. After rebooting `krb2002` the `sysctl` for `net.netf... [13:33:44] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:34:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:34:34] (03CR) 10Elukey: [C:03+1] "Thanks a lot for the refactor <3" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 (owner: 10Brouberol) [13:34:55] (03CR) 10Elukey: [C:03+1] Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (owner: 10Brouberol) [13:35:23] 06SRE: wiki.openstreetmap.org Commons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570#11829601 (10Aklapper) [13:35:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P90937 and previous config saved to /var/cache/conftool/dbconfig/20260416-133533-fceratto.json [13:35:44] (03PS4) 10Brouberol: Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 (https://phabricator.wikimedia.org/T423413) [13:35:46] (03PS5) 10Brouberol: Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 (https://phabricator.wikimedia.org/T423413) [13:35:50] (03PS7) 10Brouberol: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (https://phabricator.wikimedia.org/T423413) [13:35:53] (03PS5) 10Brouberol: Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (https://phabricator.wikimedia.org/T423413) [13:35:53] (03CR) 10Elukey: [C:03+1] "Left a nit, nothing big, feel free to go ahead after it (if any)." 
[docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:36:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T410589)', diff saved to https://phabricator.wikimedia.org/P90938 and previous config saved to /var/cache/conftool/dbconfig/20260416-133600-ladsgroup.json [13:36:05] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:36:17] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:36:28] (03PS6) 10Brouberol: Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (https://phabricator.wikimedia.org/T423413) [13:36:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:36:58] (03CR) 10Brouberol: Bump version (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:37:34] (03CR) 10Elukey: [C:03+2] role::cluster::management: add profile to sync firmwares [puppet] - 10https://gerrit.wikimedia.org/r/1271564 (https://phabricator.wikimedia.org/T418873) (owner: 10Elukey) [13:38:04] 06SRE, 10Cloud-Services: Migrate our use of osbpo away from mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423598 (10MoritzMuehlenhoff) 03NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and repl... 
[13:38:10] (03CR) 10Brouberol: [C:03+2] Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:38:20] (03CR) 10Brouberol: [C:03+2] Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:38:26] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:38:30] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:38:46] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1272527|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:39:01] (03CR) 10Brouberol: [C:03+2] Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:39:19] !log mlitn@deploy1003 mlitn: Continuing with sync [13:40:23] robertsky: Lucas_WMDE left some CR feedback on that patch - are you working on that, or will you address it later on? 
[13:40:25] (03Merged) 10jenkins-bot: Automatic formatting [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272705 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:40:26] (03CR) 10Effie Mouzeli: [C:03+2] mw-debug: use new mcrouter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271771 (https://phabricator.wikimedia.org/T420223) (owner: 10Effie Mouzeli) [13:40:29] (03Merged) 10jenkins-bot: Replace python 3.9 by 3.13 in CI [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272706 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:40:45] (03Merged) 10jenkins-bot: Ensure the system python is used to execute debmonitor-client in the image [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272619 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:41:14] (03Merged) 10jenkins-bot: Bump version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1272710 (https://phabricator.wikimedia.org/T423413) (owner: 10Brouberol) [13:41:28] matthiasmullie: if there is still time at the end, would it also be possible to back-port the change I still added late to this window? [13:41:32] !log decommissioning Cassandra [a,b] on aqs1010 — T412830 [13:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:36] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [13:41:42] (it is already in the calendar) [13:41:46] (03PS1) 10Jelto: gerrit: make daemon_user_dir configurable and set it to /srv for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) [13:42:55] MichaelG_WMF: yeah should work! 
[13:43:02] (03Merged) 10jenkins-bot: mw-debug: use new mcrouter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271771 (https://phabricator.wikimedia.org/T420223) (owner: 10Effie Mouzeli) [13:43:04] (03PS5) 10Robertsky: siwikitionary: update logo to localised svg version. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) [13:43:05] Yay, thank you! [13:43:17] (03CR) 10Robertsky: siwikitionary: update logo to localised svg version. (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [13:44:24] matthiasmullie: resolved. [13:44:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T419635)', diff saved to https://phabricator.wikimedia.org/P90939 and previous config saved to /var/cache/conftool/dbconfig/20260416-134426-fceratto.json [13:44:27] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8428/co" [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [13:44:31] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:44:31] 06SRE, 10Cloud-Services: Migrate our use of osbpo away from mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423598#11829648 (10Andrew) a:03Andrew [13:45:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P90940 and previous config saved to /var/cache/conftool/dbconfig/20260416-134541-fceratto.json [13:46:47] (03PS2) 10Jelto: gerrit: make daemon_user_dir configurable and set it to /srv for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) [13:47:28] (03CR) 10CI reject: [V:04-1] gerrit: 
make daemon_user_dir configurable and set it to /srv for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [13:47:42] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829657 (10DPogorzelski-WMF) nvm, i had a typo, it doesn't actually solve anything. i'll keep looking [13:48:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [13:48:42] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [13:48:58] (03PS3) 10Jelto: gerrit: make daemon_user_dir configurable and set it to /srv for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) [13:49:02] jouncebot: now [13:49:02] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1300) [13:49:28] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:49:36] 06SRE, 10envoy, 06ServiceOps new, 10ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11829664 (10Eevans) [13:50:14] 06SRE, 10envoy, 06ServiceOps new, 10ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11829666 (10Eevans) [13:50:38] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829670 (10DPogorzelski-WMF) >>! In T423149#11829398, @elukey wrote: > Is there a diff in this call between local and staging? > > ` > kubectl get mutatingwebh... 
[13:50:52] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [13:51:29] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:51:54] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1272527|Squashed diff to master]] (duration: 30m 21s) [13:52:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [13:53:04] robertsky: begun your patch [13:53:10] ok [13:53:50] (03Merged) 10jenkins-bot: siwikitionary: update logo to localised svg version. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140748 (https://phabricator.wikimedia.org/T342173) (owner: 10Robertsky) [13:54:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [13:54:18] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1140748|siwikitionary: update logo to localised svg version. (T342173)]] [13:54:22] T342173: Icons: siwiktionary logo icon should be localized to language - https://phabricator.wikimedia.org/T342173 [13:54:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P90941 and previous config saved to /var/cache/conftool/dbconfig/20260416-135434-fceratto.json [13:54:58] (03CR) 10Jelto: [V:03+1] "Unfortunately another patch is needed to also address the `daemon_user_dir` in the Puppet code. 
Without this patch puppet creates the fold" [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [13:55:24] (03PS1) 10Majavah: P:openstack: Remove remains of openstack_control_node_interface options [puppet] - 10https://gerrit.wikimedia.org/r/1272722 [13:55:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T419961)', diff saved to https://phabricator.wikimedia.org/P90942 and previous config saved to /var/cache/conftool/dbconfig/20260416-135549-fceratto.json [13:56:05] (03PS11) 10Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - 10https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) [13:56:15] !log mlitn@deploy1003 mlitn, robertsky: Backport for [[gerrit:1140748|siwikitionary: update logo to localised svg version. (T342173)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:56:40] (03CR) 10CI reject: [V:04-1] designate: derive zookeeper cluster rather than hardcoding [puppet] - 10https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [13:56:44] robertsky: it's on testservers - can you check & confirm? [13:57:37] matthiasmullie: verified. [13:57:44] !log mlitn@deploy1003 mlitn, robertsky: Continuing with sync [13:58:03] (03PS12) 10Andrew Bogott: designate: derive zookeeper cluster rather than hardcoding [puppet] - 10https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) [13:58:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [14:00:30] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[14:01:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:01:27] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:01:29] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140748|siwikitionary: update logo to localised svg version. (T342173)]] (duration: 07m 11s) [14:01:33] T342173: Icons: siwiktionary logo icon should be localized to language - https://phabricator.wikimedia.org/T342173 [14:01:36] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:01:44] robertsky: done; MichaelG_WMF starting your patch [14:01:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272712 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [14:02:11] matthiasmullie: thank you! [14:03:31] FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [14:04:22] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:04:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P90943 and previous config saved to /var/cache/conftool/dbconfig/20260416-140442-fceratto.json [14:05:48] (03CR) 10Filippo Giunchedi: [C:03+1] P:openstack: Remove remains of openstack_control_node_interface options [puppet] - 10https://gerrit.wikimedia.org/r/1272722 (owner: 10Majavah) [14:05:57] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[14:06:28] (03CR) 10MVernon: [C:03+1] "A couple of optional nits; I think it'd be worth testing on a single frontend before rolling out everywhere?" [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (owner: 10CDanis) [14:06:54] I've got a prod config fix to deploy. [14:06:59] (03PS3) 10Jforrester: mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) [14:07:02] (03CR) 10Majavah: [C:03+2] P:openstack: Remove remains of openstack_control_node_interface options [puppet] - 10https://gerrit.wikimedia.org/r/1272722 (owner: 10Majavah) [14:07:10] (03CR) 10Jforrester: [C:03+2] mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) (owner: 10Jforrester) [14:07:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS trixie [14:07:18] (03PS1) 10Atsuko: Install flink in blubber-compatible venv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) [14:07:48] (03CR) 10Jforrester: [C:03+1] mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) (owner: 10Jforrester) [14:08:05] matthiasmullie: Ah, you're still deploying? [14:08:09] James_F: we (= matthiasmullie) are doing one last backport, 5 min to be merged. [14:08:10] matthiasmullie: thanks for deploying! [14:08:17] (03CR) 10MVernon: [C:03+1] "Oh, one other question: are we confident that x-request-id is appropriately UTF-8 and URL-encoded (or such that such encoding it unnecessa" [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (owner: 10CDanis) [14:08:18] Ack. [14:10:13] The windows are time-limited for a reason. 
:-( [14:10:44] I know, I'm sorry. [14:11:03] We really need to make CI run faster for these kinds of things, so deploys aren't so slow. [14:11:23] See my complains about the new GrowthExperiments CI job that's mis-applied to the wmf/ branch. [14:11:26] +t [14:11:56] (03Merged) 10jenkins-bot: fix: add missing hook registration for create account stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272712 (https://phabricator.wikimedia.org/T422283) (owner: 10Michael Große) [14:12:00] Given that it is still merging, and your fix might be urgent, would it make sense to actually _deploy_ your fix first and then mine afterward, or does that not make sense? [14:12:15] No, it'd just make scap very sad. [14:12:23] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1272712|fix: add missing hook registration for create account stats (T422283)]] [14:12:24] Also yours is now merged. :-) [14:12:27] T422283: [V1 experiment changes] Enable reliable measurement of account creation for mobile registration experiment on auth.wikimedia.org domain and support broader rollout - https://phabricator.wikimedia.org/T422283 [14:12:57] (03CR) 10Andrew Bogott: [C:03+2] designate: derive zookeeper cluster rather than hardcoding [puppet] - 10https://gerrit.wikimedia.org/r/1272146 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:13:11] James_F: I saw your note on that change. I kinda was wondering if RelEng would look into that, but since they probably missed that, I'll follow up on it. [14:13:19] gah; I didn't watch the time & CI delay close enough - sorry James_F, my bad! [14:13:38] there is nothing to test for my change [14:13:41] MichaelG_WMF: I think the problem is just that hasharAway is, indeed, away, and no-one else looks at those comments. Maybe when he's back. 
[14:13:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270518 (https://phabricator.wikimedia.org/T423188) (owner: 10Arlolra) [14:13:48] it will only affect enwiki and that does not have .24 yet [14:13:50] matthiasmullie: <3 [14:14:21] !log mlitn@deploy1003 mlitn, migr: Backport for [[gerrit:1272712|fix: add missing hook registration for create account stats (T422283)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:14:41] !log mlitn@deploy1003 mlitn, migr: Continuing with sync [14:14:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T419635)', diff saved to https://phabricator.wikimedia.org/P90944 and previous config saved to /var/cache/conftool/dbconfig/20260416-141450-fceratto.json [14:14:55] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:14:56] (03PS2) 10CDanis: envoyproxy::tls_terminator: request header rewriting [puppet] - 10https://gerrit.wikimedia.org/r/1271926 [14:14:56] (03PS3) 10CDanis: swift::proxy: attempt some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1271927 [14:15:07] skipping test; continuing with sync [14:15:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [14:15:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T419635)', diff saved to https://phabricator.wikimedia.org/P90945 and previous config saved to /var/cache/conftool/dbconfig/20260416-141515-fceratto.json [14:17:21] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (owner: 10CDanis) [14:17:25] (03CR) 10CDanis: "My rollout plan was:" [puppet] - 
10https://gerrit.wikimedia.org/r/1271927 (owner: 10CDanis) [14:17:48] (03PS1) 10Andrew Bogott: designate codfw1dev: catch up with openstack_control_node_interface removal [puppet] - 10https://gerrit.wikimedia.org/r/1272732 (https://phabricator.wikimedia.org/T422646) [14:17:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11829798 (10AnnieKim_WMDE) @Martyn.ranyard I need to use jupyter notebooks to conduct data analysis of mobile-first communities for... [14:17:52] (03CR) 10Arnaudb: [C:03+1] "looks good to me, minor comments inline, +1!" [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [14:17:55] (03CR) 10Brouberol: Install flink in blubber-compatible venv (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [14:18:10] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:18:30] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1272712|fix: add missing hook registration for create account stats (T422283)]] (duration: 06m 07s) [14:18:33] MichaelG_WMF: done; James_F: all yours! [14:18:34] T422283: [V1 experiment changes] Enable reliable measurement of account creation for mobile registration experiment on auth.wikimedia.org domain and support broader rollout - https://phabricator.wikimedia.org/T422283 [14:18:36] Ack. [14:18:41] (03CR) 10Jforrester: [C:03+2] mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) (owner: 10Jforrester) [14:18:43] Yay! 
[14:18:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) (owner: 10Jforrester) [14:19:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272732 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:19:45] (03Merged) 10jenkins-bot: mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271895 (https://phabricator.wikimedia.org/T423311) (owner: 10Jforrester) [14:19:49] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1272714 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [14:20:09] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1271895|mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache (T423311)]] [14:20:13] T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?) - https://phabricator.wikimedia.org/T423311 [14:21:56] (03CR) 10Jelto: "I can rebase this change so it also cleans up the code which I added for the migration" [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [14:21:58] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1271895|mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache (T423311)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:22:29] (03CR) 10Andrew Bogott: [C:03+2] designate codfw1dev: catch up with openstack_control_node_interface removal [puppet] - 10https://gerrit.wikimedia.org/r/1272732 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:25:37] (03CR) 10Atsuko: Install flink in blubber-compatible venv (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [14:25:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:25:56] !log jforrester@deploy1003 jforrester: Continuing with sync [14:27:02] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: make daemon_user_dir configurable and set it to /srv for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [14:28:34] (03CR) 10Arnaudb: "feel free to rebase! the conflict resolving might be tedious. Given this change is mostly deletion (-28 +6) it's also OK to drop or rebase" [puppet] - 10https://gerrit.wikimedia.org/r/1270774 (https://phabricator.wikimedia.org/T423027) (owner: 10Arnaudb) [14:29:18] (03PS2) 10Atsuko: flink: Install flink in blubber-compatible venv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) [14:29:45] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271895|mc: Use MCROUTER_SERVER values rather than local sidepod for WF cache (T423311)]] (duration: 09m 36s) [14:29:49] T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?) - https://phabricator.wikimedia.org/T423311 [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1430) [14:30:34] (Done deploying.) 
[14:31:52] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11829949 (10thcipriani) Checking my understanding of "sunsetting" here: - We're no longer hosting a mirror? vs. - the `mirrors.wikimedia.org` url will cease t... [14:38:25] (03CR) 10JMeybohm: [C:03+1] rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:38:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [14:38:56] (03CR) 10MVernon: [C:03+1] "Sounds good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (owner: 10CDanis) [14:42:56] (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1272713 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [14:44:02] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:44:31] (03CR) 10Kamila Součková: [C:03+2] deployment_server: add dse-k8s-codfw to ::general [puppet] - 10https://gerrit.wikimedia.org/r/1271729 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:44:42] (03CR) 10Ottomata: "Thank you! Some comments and questions." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [14:45:00] (03CR) 10JMeybohm: [C:04-1] rest-gateway: Add liftwing inference routes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:45:34] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gerrit2002.wikimedia.org [14:45:38] (03PS9) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [14:45:50] (03CR) 10MVernon: [C:03+2] apus: add two new storage nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1272713 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [14:45:50] (03CR) 10Clément Goubert: "Good catch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:51:45] (03PS4) 10Elukey: Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) [14:52:16] (03CR) 10Elukey: [C:03+2] Improve tox and setup's configuration (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:52:18] (03CR) 10Elukey: [V:03+2 C:03+2] Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:52:27] (03CR) 10Elukey: [C:03+2] Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:52:54] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:53:32] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::k8s::haproxy: Remove old ingress nodes [puppet] - 10https://gerrit.wikimedia.org/r/1272714 (https://phabricator.wikimedia.org/T392356) (owner: 10Majavah) [14:53:49] jouncebot: nowandnext [14:53:49] For the next 0 hour(s) and 6 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1430) [14:53:49] In 0 hour(s) and 6 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1500) [14:54:42] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:54:57] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:55:11] (03Merged) 10jenkins-bot: Improve tox and setup's configuration [cookbooks] - 10https://gerrit.wikimedia.org/r/1271594 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:55:15] (03CR) 10Klausman: [C:03+1] rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:55:22] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:55:25] (03CR) 10Brouberol: flink: Install flink in blubber-compatible venv (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [14:55:41] (03CR) 10Klausman: [C:03+1] debian: add explicit ordering between node labeller and gpu plugin [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1272580 (https://phabricator.wikimedia.org/T420507) (owner: 10Elukey) [14:56:26] (03CR) 10Kamila Součková: [C:03+1] API rate limits: add highlimits-user class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270765 
(https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [14:56:46] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:56:52] !log jelto@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host gerrit2002.wikimedia.org [14:56:52] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:56:56] !log root@cumin2002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on mr1-codfw IPv6,mr1-codfw.oob,mr-codfw with reason: router upgrade [14:57:23] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /srv/gerrit/site_path/review_site/bin/gerrit.war daemon -d /srv/gerrit/site_path/review_site https://wikitech.wikimedia.org/wiki/Gerrit [14:57:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [14:58:17] !log ongoing maintenace on mr1-codfw [14:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:44] !log root@cumin2002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on mr1-codfw IPv6,mr-codfw with reason: router upgrade [14:58:55] (03Merged) 10jenkins-bot: API rate limits: add highlimits-user class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270765 (https://phabricator.wikimedia.org/T419796) (owner: 10Daniel Kinzler) [14:59:20] !log daniel@deploy1003 Started scap sync-world: Backport for [[gerrit:1270765|API rate limits: add highlimits-user class (T419796)]] [14:59:24] T419796: API rate limits: define tiers for logged-in (browser) users - https://phabricator.wikimedia.org/T419796 [15:00:05] dduvall and dancy: OwO what's this, a deployment window?? Train log triage. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1500). nyaa~ [15:00:35] !log root@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-codfw,mr1-codfw IPv6,mr1-codfw.oob with reason: router upgrade [15:00:57] (03PS1) 10C. Scott Ananian: ParsoidCachePrewarmJob: Define the title in the req context [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272750 (https://phabricator.wikimedia.org/T422780) [15:01:44] !log daniel@deploy1003 daniel: Backport for [[gerrit:1270765|API rate limits: add highlimits-user class (T419796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:03:10] (03CR) 10Bartosz Dziewoński: "I'm not sure who to ask for reviews here, I hope this is relevant to some of y'all's interests." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) (owner: 10Bartosz Dziewoński) [15:03:39] !log daniel@deploy1003 daniel: Continuing with sync [15:03:51] PROBLEM - Host wikikube-worker2280 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[15:04:44] (03CR) 10JMeybohm: rest-gateway: Add liftwing inference routes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:06:15] (03PS10) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [15:06:53] (03CR) 10JMeybohm: [C:03+1] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:06:53] RECOVERY - Host wikikube-worker2280 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [15:07:15] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11830231 (10MoritzMuehlenhoff) >>! In T416707#11829949, @thcipriani wrote: > Checking my understanding of "sunsetting" here: > > - We're no longer hosting a m... 
[15:08:47] (03CR) 10JMeybohm: [C:03+1] rest-gateway: Add liftwing inference routes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:10:08] !log daniel@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270765|API rate limits: add highlimits-user class (T419796)]] (duration: 10m 47s) [15:10:15] T419796: API rate limits: define tiers for logged-in (browser) users - https://phabricator.wikimedia.org/T419796 [15:12:19] (03PS6) 10Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) [15:12:19] (03PS11) 10Clément Goubert: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) [15:12:19] (03PS9) 10Clément Goubert: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) [15:13:07] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:14:52] !log installing sequoia-sqv security updates [15:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:15:02] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:15:08] (03PS8) 10Jasmine: service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) [15:15:57] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:16:05] (03PS9) 10Jasmine: service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) [15:16:15] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:16:18] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:16:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:16:59] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [15:17:08] (03CR) 10CI reject: [V:04-1] ParsoidCachePrewarmJob: Define the title in the req context [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272750 (https://phabricator.wikimedia.org/T422780) (owner: 10C. 
Scott Ananian) [15:17:48] (03CR) 10Elukey: [V:03+2 C:03+2] debian: add explicit ordering between node labeller and gpu plugin [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1272580 (https://phabricator.wikimedia.org/T420507) (owner: 10Elukey) [15:18:29] (03Merged) 10jenkins-bot: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:19:02] (03Merged) 10jenkins-bot: rest-gateway: Add liftwing inference routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269403 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:19:05] (03Merged) 10jenkins-bot: rest-gateway: Add liftwing recommendation-api-ng routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270434 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [15:19:47] (03CR) 10Atsuko: [C:04-1] "doesn't work in blubber because of permissions to venv" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [15:21:07] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11830310 (10DPogorzelski-WMF) ok it works, this missing bit was workloadType: initContainer in ` apiVersion: serving.kserve.io/v1alpha1 kind: ClusterStorageC... 
[15:22:31] PROBLEM - Host ps1-b6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:31] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:31] PROBLEM - Host ps1-c5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:31] PROBLEM - Host ps1-c7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:31] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:31] PROBLEM - Host ps1-f3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:43] PROBLEM - Host ps1-e2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:43] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:43] PROBLEM - Host ps1-f2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:43] PROBLEM - Host ps1-e3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:43] PROBLEM - Host ps1-f4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:43] PROBLEM - Host ps1-e5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:43] PROBLEM - Host ps1-e4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:59] PROBLEM - Host ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:59] PROBLEM - Host ps1-f1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:59] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:59] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:59] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:59] PROBLEM - Host ps1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:22:59] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:01] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:01] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:05] PROBLEM - Host ps1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:05] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% 
[15:23:05] PROBLEM - Host ps1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:05] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:07] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:15] PROBLEM - Host ps1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:15] PROBLEM - Host ps1-f5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:15] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:15] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:17] PROBLEM - Host ps1-d4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:17] PROBLEM - Host ps1-d6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:17] PROBLEM - Host ps1-c8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:19] PROBLEM - Host ps1-d5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:19] PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:19] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:19] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:19] PROBLEM - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:21] PROBLEM - Host ps1-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:21] PROBLEM - Host ps1-e1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:21] PROBLEM - Host ps1-d7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:23:52] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11830337 (10elukey) Very weird, I recall that we removed the workloadType because it wasn't in the CRD spec, I am very confused. 
[15:26:26] (03CR) 10Jasmine: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [15:26:38] (03CR) 10Jasmine: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [15:28:23] FIRING: [7x] GnmiTargetDown: lsw1-a4-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:28:24] (03PS3) 10CDanis: envoyproxy::tls_terminator: request header rewriting [puppet] - 10https://gerrit.wikimedia.org/r/1271926 (https://phabricator.wikimedia.org/T328872) [15:28:26] (03PS4) 10CDanis: swift::proxy: attempt some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (https://phabricator.wikimedia.org/T328872) [15:28:31] RECOVERY - Host ps1-d7-codfw is UP: PING WARNING - Packet loss = 80%, RTA = 36.62 ms [15:28:31] RECOVERY - Host ps1-e1-codfw is UP: PING WARNING - Packet loss = 80%, RTA = 44.31 ms [15:28:31] RECOVERY - Host ps1-d8-codfw is UP: PING WARNING - Packet loss = 80%, RTA = 37.02 ms [15:28:33] RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.06 ms [15:28:33] RECOVERY - Host ps1-b2-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [15:28:33] RECOVERY - Host ps1-a4-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.15 ms [15:28:33] RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.49 ms [15:28:33] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.01 ms [15:28:33] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.29 ms [15:28:33] RECOVERY - Host ps1-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.14 ms 
[15:28:34] RECOVERY - Host ps1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.49 ms [15:28:34] RECOVERY - Host ps1-f2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.99 ms [15:28:35] RECOVERY - Host ps1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.08 ms [15:28:35] RECOVERY - Host ps1-c6-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms [15:28:36] RECOVERY - Host ps1-b8-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.25 ms [15:28:36] RECOVERY - Host ps1-c7-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.27 ms [15:28:37] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [15:28:37] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [15:28:38] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.56 ms [15:28:38] RECOVERY - Host ps1-b6-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.99 ms [15:28:39] RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [15:28:39] RECOVERY - Host ps1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.63 ms [15:28:40] RECOVERY - Host ps1-c5-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [15:28:40] RECOVERY - Host ps1-f3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.31 ms [15:28:41] RECOVERY - Host ps1-c8-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.04 ms [15:28:41] RECOVERY - Host ps1-c4-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [15:28:42] RECOVERY - Host ps1-d6-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.25 ms [15:28:42] RECOVERY - Host ps1-e5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.51 ms [15:28:43] RECOVERY - Host ps1-d4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms [15:28:43] RECOVERY - Host ps1-b4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms [15:28:44] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.08 ms [15:28:44] RECOVERY - Host ps1-d5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.18 ms [15:28:45] RECOVERY 
- Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.23 ms [15:28:45] RECOVERY - Host ps1-e2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.20 ms [15:28:46] RECOVERY - Host ps1-e4-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.24 ms [15:28:46] RECOVERY - Host ps1-f1-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [15:28:47] RECOVERY - Host ps1-a6-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.46 ms [15:28:47] RECOVERY - Host ps1-f4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:28:48] RECOVERY - Host ps1-f5-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.57 ms [15:28:48] RECOVERY - Host ps1-e3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.36 ms [15:28:49] RECOVERY - Host ps1-d1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.49 ms [15:28:49] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.96 ms [15:29:01] !log 💔cdanis@cumin1003.eqiad.wmnet ~ 🕦☕ sudo cumin 'A:swift-fe' 'disable-puppet "cdanis deploy I3aaec0ca T328872"' [15:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:06] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [15:29:41] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11830369 (10MoritzMuehlenhoff) [15:29:53] (03CR) 10CDanis: [C:03+2] envoyproxy::tls_terminator: request header rewriting [puppet] - 10https://gerrit.wikimedia.org/r/1271926 (https://phabricator.wikimedia.org/T328872) (owner: 10CDanis) [15:29:57] (03CR) 10CDanis: [C:03+2] swift::proxy: attempt some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1271927 (https://phabricator.wikimedia.org/T328872) (owner: 10CDanis) [15:30:54] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:31:01] !log cgoubert@deploy1003 helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply [15:33:23] RESOLVED: [23x] GnmiTargetDown: fasw1-f5b-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:34:35] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:34:57] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:35:07] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:35:26] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:35:30] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on krb2002.codfw.wmnet with reason: T407726 [15:35:31] 06SRE, 10Lift-Wing, 06Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11830431 (10DPogorzelski-WMF) i think the difference lies in the fact that without initContainer field the ClusterStorageContainer is not used at all to construc... [15:35:34] T407726: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726 [15:35:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:35:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T419961)', diff saved to https://phabricator.wikimedia.org/P90946 and previous config saved to /var/cache/conftool/dbconfig/20260416-153547-fceratto.json [15:40:52] 06SRE, 06DC-Ops: Should we skip some directories from deploy backups? - https://phabricator.wikimedia.org/T423619 (10jcrespo) 03NEW [15:42:26] 06SRE, 06DC-Ops: Should we skip some directories from deploy backups? 
- https://phabricator.wikimedia.org/T423619#11830457 (10jcrespo) CC @hashar in case it is interesting for releng, but this looks more infra and serviceops related (k8s). [15:44:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T419961)', diff saved to https://phabricator.wikimedia.org/P90947 and previous config saved to /var/cache/conftool/dbconfig/20260416-154408-fceratto.json [15:49:47] (03PS3) 10Muehlenhoff: Remove Puppet 5 support from Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240877 (https://phabricator.wikimedia.org/T365798) [15:50:03] (03PS1) 10Muehlenhoff: profile::zookeeper::firewall: Also allow to pass hosts (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1272766 [15:50:21] (03CR) 10JMeybohm: [C:03+1] service::catalog: add sophroid service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [15:54:07] (03PS2) 10Muehlenhoff: profile::zookeeper::firewall: Also allow to pass hosts (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1272766 [15:54:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P90948 and previous config saved to /var/cache/conftool/dbconfig/20260416-155416-fceratto.json [15:56:39] 06SRE, 06DC-Ops, 06ServiceOps new: Should we skip some directories from deploy backups? - https://phabricator.wikimedia.org/T423619#11830502 (10Scott_French) Thanks for raising this. So, `/srv/docker/` is the docker data-root, and will generally contain temporary files which I believe we generally do not wa... 
[15:56:41] (03PS1) 10Papaul: Cahnge OOB interface to ge-0/0/0 [homer/public] - 10https://gerrit.wikimedia.org/r/1272767 (https://phabricator.wikimedia.org/T421674) [15:58:50] (03PS2) 10Papaul: Cahnge OOB interface to ge-0/0/7 [homer/public] - 10https://gerrit.wikimedia.org/r/1272767 (https://phabricator.wikimedia.org/T421674) [16:00:04] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:35] (03CR) 10Papaul: [C:03+2] Cahnge OOB interface to ge-0/0/7 [homer/public] - 10https://gerrit.wikimedia.org/r/1272767 (https://phabricator.wikimedia.org/T421674) (owner: 10Papaul) [16:02:16] (03CR) 10Elukey: "@jhathaway@wikimedia.org when you have a moment lemme know what you think about the change, it should hopefully unblock the provisioning o" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:03:39] (03CR) 10Atsuko: [C:04-1] "-1 because images that consumes this base image has been broken + addressing the comments" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [16:04:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P90949 and previous config saved to /var/cache/conftool/dbconfig/20260416-160424-fceratto.json [16:06:55] (03PS1) 10CDanis: envoy: fix YAML quoting snafu [puppet] - 10https://gerrit.wikimedia.org/r/1272769 [16:07:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T419635)', diff saved to https://phabricator.wikimedia.org/P90950 and previous config saved to /var/cache/conftool/dbconfig/20260416-160710-fceratto.json [16:07:15] T419635: Drop il_to column from imagelinks table in wmf 
production - https://phabricator.wikimedia.org/T419635 [16:07:30] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272769 (owner: 10CDanis) [16:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:48] (03CR) 10CI reject: [V:04-1] envoy: fix YAML quoting snafu [puppet] - 10https://gerrit.wikimedia.org/r/1272769 (owner: 10CDanis) [16:10:07] (03PS2) 10CDanis: envoy: fix YAML quoting snafu [puppet] - 10https://gerrit.wikimedia.org/r/1272769 [16:10:12] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272769 (owner: 10CDanis) [16:10:17] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622 (10thcipriani) 03NEW [16:11:22] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: Bootstrapping — T412830 [16:11:26] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830 [16:12:30] (03PS1) 10Aaron Schulz: Enable attribution.v0-beta in RestSandboxSpecs for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272770 (https://phabricator.wikimedia.org/T419545) [16:12:44] (03CR) 10CDanis: [C:03+2] envoy: fix YAML quoting snafu [puppet] - 10https://gerrit.wikimedia.org/r/1272769 (owner: 10CDanis) [16:13:13] (03PS1) 10C. 
Scott Ananian: Convert language to internal code in tests [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272771 [16:13:27] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:13:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272770 (https://phabricator.wikimedia.org/T419545) (owner: 10Aaron Schulz) [16:14:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T419961)', diff saved to https://phabricator.wikimedia.org/P90951 and previous config saved to /var/cache/conftool/dbconfig/20260416-161432-fceratto.json [16:14:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [16:15:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T419961)', diff saved to https://phabricator.wikimedia.org/P90952 and previous config saved to /var/cache/conftool/dbconfig/20260416-161504-fceratto.json [16:15:10] (03PS2) 10Ryan Kemper: opensearch: allowlist upstream-only plugins, add force-overwrite [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) [16:17:02] (03PS1) 10CDanis: Revert "swift::proxy: attempt some tracing context propagation" [puppet] - 10https://gerrit.wikimedia.org/r/1272773 [16:17:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P90953 and previous config saved to /var/cache/conftool/dbconfig/20260416-161719-fceratto.json [16:17:35] (03CR) 10Ladsgroup: [C:03+1] "😞" [puppet] - 10https://gerrit.wikimedia.org/r/1272773 (owner: 10CDanis) [16:17:37] (03PS2) 10CDanis: Revert "swift::proxy: attempt some tracing context propagation" [puppet] - 
10https://gerrit.wikimedia.org/r/1272773 (https://phabricator.wikimedia.org/T328872) [16:18:16] (03CR) 10CDanis: [C:03+2] Revert "swift::proxy: attempt some tracing context propagation" [puppet] - 10https://gerrit.wikimedia.org/r/1272773 (https://phabricator.wikimedia.org/T328872) (owner: 10CDanis) [16:18:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:21:13] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11830659 (10Jdforrester-WMF) To confirm, specifically is this task talking about SRE's base Debian distro images, whose sourceslist is configured... [16:22:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T419961)', diff saved to https://phabricator.wikimedia.org/P90955 and previous config saved to /var/cache/conftool/dbconfig/20260416-162229-fceratto.json [16:23:25] (03PS1) 10JHathaway: nf_conntrack_buckets: use default value [puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) [16:24:03] 06SRE, 10China-Judgments-Online-Preservation-Program, 10Wikimedia-Mailing-lists, 07Chinese-Sites: Request creation of mailing list for zhwikisource sysops - https://phabricator.wikimedia.org/T423520#11830694 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists... 
[16:26:01] (03CR) 10CI reject: [V:04-1] nf_conntrack_buckets: use default value [puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) (owner: 10JHathaway) [16:26:08] (03PS1) 10CDanis: swift::proxy: re-try some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1272775 (https://phabricator.wikimedia.org/T328872) [16:27:21] (03CR) 10Jdlrobson: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [16:27:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P90956 and previous config saved to /var/cache/conftool/dbconfig/20260416-162727-fceratto.json [16:27:35] (03PS2) 10CDanis: swift::proxy: re-try some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1272775 (https://phabricator.wikimedia.org/T328872) [16:27:57] !log upgrade envoyproxy, restbase[1031,2024] (canary) — T419637 & T410975 [16:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:02] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [16:28:02] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [16:29:02] (03CR) 10RLazarus: [C:03+1] swift::proxy: re-try some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1272775 (https://phabricator.wikimedia.org/T328872) (owner: 10CDanis) [16:29:13] (03CR) 10CDanis: [C:03+2] swift::proxy: re-try some tracing context propagation [puppet] - 10https://gerrit.wikimedia.org/r/1272775 (https://phabricator.wikimedia.org/T328872) (owner: 10CDanis) [16:29:39] (03CR) 10Jdlrobson: [C:03+1] Restore PageImages functionality to Wikisources and Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) 
[16:29:45] (03PS3) 10Ryan Kemper: opensearch: allowlist upstream plugins + overwrite [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) [16:30:11] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:swift-fe' 'disable-puppet "cdanis deploy 8ad070a466 T328872"' [16:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:15] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [16:32:30] (03PS1) 10Effie Mouzeli: mcrouter: update to 1.3.5 (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272777 (https://phabricator.wikimedia.org/T421360) [16:32:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P90957 and previous config saved to /var/cache/conftool/dbconfig/20260416-163237-fceratto.json [16:33:07] (03PS5) 10Jdlrobson: Restore PageImages functionality to Wikisources and Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [16:33:43] (03CR) 10Ottomata: "Ya, that makes sense. The dependent images will have to be updated to use /opt/lib/venv and stop using use-system-site-packages." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [16:33:48] (03CR) 10Ryan Kemper: "This does feel like a more proper longer-term fix. 
We should look into this after getting the immediate OS work unblocked" [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [16:33:56] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271947 (https://phabricator.wikimedia.org/T423327) (owner: 10Ryan Kemper) [16:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:28] (03PS3) 10Atsuko: flink: Install flink in blubber-compatible venv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) [16:37:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T419635)', diff saved to https://phabricator.wikimedia.org/P90958 and previous config saved to /var/cache/conftool/dbconfig/20260416-163736-fceratto.json [16:37:40] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:37:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance [16:38:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T419635)', diff saved to https://phabricator.wikimedia.org/P90959 and previous config saved to /var/cache/conftool/dbconfig/20260416-163800-fceratto.json [16:38:37] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:swift-fe' 'enable-puppet "cdanis deploy 8ad070a466 T328872"' [16:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:41] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [16:38:47] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [16:40:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:41:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:41:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:23] (03CR) 10Atsuko: "Dependent images that uses blubber are using `/opt/lib/venv`, this is blubber default behaviour." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [16:42:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P90960 and previous config saved to /var/cache/conftool/dbconfig/20260416-164245-fceratto.json [16:44:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:44:27] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 37.77 ms [16:45:28] (03CR) 10Ottomata: "Ack! Nice!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [16:46:19] (03CR) 10Ottomata: "If you can run docker-pkg locally, you might be able to test by locally editing blubber.yaml and pointing it to your local docker image ra" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [16:46:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:48:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11830771 (10Papaul) [16:52:19] (03PS1) 10Effie Mouzeli: mcrouter: Add EXTRA_ARGS env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272785 (https://phabricator.wikimedia.org/T421360) [16:52:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T419961)', diff saved to https://phabricator.wikimedia.org/P90961 and previous config saved to /var/cache/conftool/dbconfig/20260416-165253-fceratto.json [16:53:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [16:53:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T419961)', diff saved to https://phabricator.wikimedia.org/P90962 and previous config saved to /var/cache/conftool/dbconfig/20260416-165326-fceratto.json [16:55:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:59:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, 
wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:00:05] bd808: How many deployers does it take to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1700) [17:00:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T419961)', diff saved to https://phabricator.wikimedia.org/P90963 and previous config saved to /var/cache/conftool/dbconfig/20260416-170033-fceratto.json [17:00:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:01:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:02:01] o/ It looks like I have a few new translations I can push for developer-portal during this window. 
[17:03:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:05:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:06:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:09:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:10:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P90964 and previous config saved to /var/cache/conftool/dbconfig/20260416-171041-fceratto.json [17:10:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:11:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:13:37] (03PS1) 10BryanDavis: developer-portal: Bump version to 2026-04-13-122511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272794 [17:15:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, 
wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:16:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:17:58] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump version to 2026-04-13-122511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272794 (owner: 10BryanDavis) [17:19:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:20:01] (03Merged) 10jenkins-bot: developer-portal: Bump version to 2026-04-13-122511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272794 (owner: 10BryanDavis) [17:20:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:20:26] (03CR) 10Ottomata: [C:03+1] "Great stuff! one final comment nit but if it works LGTM. I'll +1 and you can merge at will." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1272726 (https://phabricator.wikimedia.org/T418525) (owner: 10Atsuko) [17:20:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P90966 and previous config saved to /var/cache/conftool/dbconfig/20260416-172050-fceratto.json [17:21:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:22:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:25:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:25:56] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11830878 (10thcipriani) [17:26:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:26:45] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:26:53] PROBLEM - MegaRAID on pc1011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:26:55] ACKNOWLEDGEMENT - MegaRAID on pc1011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T423630 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:26:57] 
RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:26:59] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:27:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on pc1011 - https://phabricator.wikimedia.org/T423630 (10ops-monitoring-bot) 03NEW [17:27:06] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:27:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:27:26] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:27:34] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:28:23] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:30:18] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11830895 (10thcipriani) >>! In T423622#11830659, @Jdforrester-WMF wrote: > To confirm, specifically is this task talking about SRE's base Debian d... 
[17:30:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T419961)', diff saved to https://phabricator.wikimedia.org/P90967 and previous config saved to /var/cache/conftool/dbconfig/20260416-173058-fceratto.json [17:31:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [17:34:30] Done with my window [17:36:33] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [17:36:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2204 (T419961)', diff saved to https://phabricator.wikimedia.org/P90968 and previous config saved to /var/cache/conftool/dbconfig/20260416-173640-fceratto.json [17:37:15] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): New base images without mirrors.wikimedia.org - https://phabricator.wikimedia.org/T423622#11830927 (10MoritzMuehlenhoff) I'll take care of this tomorrow. 
[17:40:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) (owner: 10Jforrester) [17:41:33] (03PS1) 10Andrew Bogott: designate codfw1dev: add tooz_backend variable to switch between memc and zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) [17:42:03] (03CR) 10CI reject: [V:04-1] designate codfw1dev: add tooz_backend variable to switch between memc and zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [17:42:59] (03PS2) 10Andrew Bogott: designate codfw1dev: add tooz_backend variable to switch between memc and zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) [17:43:27] (03CR) 10CI reject: [V:04-1] designate codfw1dev: add tooz_backend variable to switch between memc and zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [17:43:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T419961)', diff saved to https://phabricator.wikimedia.org/P90969 and previous config saved to /var/cache/conftool/dbconfig/20260416-174350-fceratto.json [17:45:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:45:45] (03PS3) 10Andrew Bogott: designate codfw1dev: add tooz_backend variable (mcrouter or zookeeper) [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) [17:48:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: 
PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:49:17] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [17:50:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:50:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:53:48] (03PS4) 10Andrew Bogott: designate codfw1dev: add tooz_backend variable (mcrouter or zookeeper) [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) [17:53:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P90970 and previous config saved to /var/cache/conftool/dbconfig/20260416-175358-fceratto.json [17:54:07] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [17:55:48] (03CR) 10Dzahn: [C:03+1] gerrit: make daemon_user_dir configurable and set it to /srv for gerrit2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1272718 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [17:57:14] (03PS5) 10Andrew Bogott: designate codfw1dev: add tooz_backend variable (mcrouter or zookeeper) [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) [17:58:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:00:05] dduvall and dancy: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1800). [18:02:30] (03CR) 10Andrew Bogott: [C:03+2] designate codfw1dev: add tooz_backend variable (mcrouter or zookeeper) [puppet] - 10https://gerrit.wikimedia.org/r/1272810 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:04:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P90971 and previous config saved to /var/cache/conftool/dbconfig/20260416-180407-fceratto.json [18:04:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:05:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:12:04] (03PS1) 10JHathaway: ensure net.netfilter.nf_conntrack_max is updated [puppet] - 10https://gerrit.wikimedia.org/r/1272832 [18:14:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T419961)', diff saved to https://phabricator.wikimedia.org/P90972 and previous config saved to /var/cache/conftool/dbconfig/20260416-181415-fceratto.json [18:14:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2225.codfw.wmnet with reason: Maintenance [18:14:45] (03PS2) 10JHathaway: nf_conntrack_buckets: use default value [puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) [18:14:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272832 (owner: 10JHathaway) [18:14:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T419961)', diff saved to https://phabricator.wikimedia.org/P90973 and previous config saved to 
/var/cache/conftool/dbconfig/20260416-181447-fceratto.json [18:14:54] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) (owner: 10JHathaway) [18:17:35] (03CR) 10Dzahn: "Function lookup() did not find a value for the name 'profile::zuul::executor::http_proxy' on zuul1002 - let's see ..." [puppet] - 10https://gerrit.wikimedia.org/r/1271948 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [18:18:32] (03CR) 10Dzahn: [C:04-1] "the main class is not applied on executor nodes. we gotta put it in base or individual roles.. let me amend" [puppet] - 10https://gerrit.wikimedia.org/r/1271948 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [18:18:48] (03CR) 10JHathaway: [C:03+1] Avoid false positive alerts after Ganeti master failover [puppet] - 10https://gerrit.wikimedia.org/r/1272701 (owner: 10Muehlenhoff) [18:20:35] (03PS3) 10Dzahn: zuul: Configure environment variables for http(s) proxy [puppet] - 10https://gerrit.wikimedia.org/r/1271948 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [18:21:08] (03CR) 10CDanis: fundraising_data_import maintenance script wrapper & timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [18:21:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T419961)', diff saved to https://phabricator.wikimedia.org/P90974 and previous config saved to /var/cache/conftool/dbconfig/20260416-182157-fceratto.json [18:23:12] (03PS3) 10JHathaway: nf_conntrack_buckets: use default value [puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) [18:23:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272774 (https://phabricator.wikimedia.org/T105307) (owner: 10JHathaway) [18:25:28] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.24 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272833 (https://phabricator.wikimedia.org/T420482) [18:25:30] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272833 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [18:26:23] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272833 (https://phabricator.wikimedia.org/T420482) (owner: 10TrainBranchBot) [18:27:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T419635)', diff saved to https://phabricator.wikimedia.org/P90975 and previous config saved to /var/cache/conftool/dbconfig/20260416-182707-fceratto.json [18:27:12] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:27:35] (03CR) 10JHathaway: [C:03+1] "looks good, one minor question" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1271631 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [18:28:46] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:31:10] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1271948/8432/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1271948 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [18:31:37] (03CR) 10Dzahn: [C:03+2] zuul: Configure environment variables for http(s) proxy [puppet] - 10https://gerrit.wikimedia.org/r/1271948 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [18:31:59] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.24 refs T420482 [18:32:03] T420482: 1.46.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T420482 [18:32:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2225', diff saved to https://phabricator.wikimedia.org/P90976 and previous config saved to /var/cache/conftool/dbconfig/20260416-183205-fceratto.json [18:35:38] (03PS1) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [18:36:27] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:37:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P90977 and previous config saved to /var/cache/conftool/dbconfig/20260416-183715-fceratto.json [18:37:51] (03PS2) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [18:37:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:39:07] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:42:04] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:42:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P90978 and previous config saved to /var/cache/conftool/dbconfig/20260416-184213-fceratto.json [18:46:48] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:47:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to 
https://phabricator.wikimedia.org/P90979 and previous config saved to /var/cache/conftool/dbconfig/20260416-184723-fceratto.json [18:49:16] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:52:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T419961)', diff saved to https://phabricator.wikimedia.org/P90980 and previous config saved to /var/cache/conftool/dbconfig/20260416-185222-fceratto.json [18:52:27] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:52:46] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2226.codfw.wmnet with reason: Maintenance [18:52:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T419961)', diff saved to https://phabricator.wikimedia.org/P90981 and previous config saved to /var/cache/conftool/dbconfig/20260416-185253-fceratto.json [18:53:30] (03PS3) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) [18:54:18] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:55:06] (03PS3) 10Dzahn: lists: notify apache2 service when config changes [puppet] - 10https://gerrit.wikimedia.org/r/1271019 (https://phabricator.wikimedia.org/T323208) [18:55:52] (03CR) 10Dzahn: lists: notify apache2 service when config changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1271019 (https://phabricator.wikimedia.org/T323208) (owner: 10Dzahn) [18:56:06] (03CR) 10Dzahn: [C:03+2] lists: notify apache2 service when config changes [puppet] - 10https://gerrit.wikimedia.org/r/1271019 
(https://phabricator.wikimedia.org/T323208) (owner: 10Dzahn) [18:57:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T419635)', diff saved to https://phabricator.wikimedia.org/P90982 and previous config saved to /var/cache/conftool/dbconfig/20260416-185731-fceratto.json [18:57:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [18:57:49] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1252.eqiad.wmnet with reason: Maintenance [18:57:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T419635)', diff saved to https://phabricator.wikimedia.org/P90983 and previous config saved to /var/cache/conftool/dbconfig/20260416-185757-fceratto.json [18:59:38] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:00:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T419961)', diff saved to https://phabricator.wikimedia.org/P90984 and previous config saved to /var/cache/conftool/dbconfig/20260416-190004-fceratto.json [19:01:11] (03CR) 10Andrew Bogott: [C:04-1] "WIP because I need to think through how this affects VMs (if at all)" [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [19:02:44] !log jasmine@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2004.codfw.wmnet with OS trixie [19:03:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:03:35] !log jasmine@cumin1003 START - Cookbook sre.hosts.reimage for host 
wikikube-ctrl2005.codfw.wmnet with OS trixie [19:06:45] (03PS1) 10Ottomata: html-enrich - bump to v1.51.0 and apply some flink tuning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272843 (https://phabricator.wikimedia.org/T421216) [19:08:33] (03CR) 10Ottomata: [V:03+2 C:03+2] html-enrich - bump to v1.51.0 and apply some flink tuning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272843 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [19:10:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P90985 and previous config saved to /var/cache/conftool/dbconfig/20260416-191012-fceratto.json [19:11:41] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:11:45] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:12:52] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:12:57] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:14:44] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:14:49] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:15:07] !log jasmine@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2004.codfw.wmnet with reason: host reimage [19:16:05] !log jasmine@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: host reimage [19:16:39] (03CR) 10Dzahn: "naturally this became outdated and the IP addresses changed or were removed" [puppet] - 10https://gerrit.wikimedia.org/r/681246 (owner: 10Legoktm) [19:17:56] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272771 (owner: 10C. Scott Ananian) [19:18:44] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272750 (https://phabricator.wikimedia.org/T422780) (owner: 10C. Scott Ananian) [19:18:52] (03PS1) 10Dzahn: lists: remove outdated IP address exemptions for monitoring servers [puppet] - 10https://gerrit.wikimedia.org/r/1272849 (https://phabricator.wikimedia.org/T323208) [19:18:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272750 (https://phabricator.wikimedia.org/T422780) (owner: 10C. Scott Ananian) [19:19:01] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2004.codfw.wmnet with reason: host reimage [19:20:04] (03CR) 10Dzahn: [V:03+1 C:03+1] "host 208.80.154.88" [puppet] - 10https://gerrit.wikimedia.org/r/1272849 (https://phabricator.wikimedia.org/T323208) (owner: 10Dzahn) [19:20:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P90986 and previous config saved to /var/cache/conftool/dbconfig/20260416-192020-fceratto.json [19:21:29] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2005.codfw.wmnet with reason: host reimage [19:24:56] 06SRE-OnFire, 06Release-Engineering-Team, 10Scap, 06serviceops-deprecated, 07Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? 
- https://phabricator.wikimedia.org/T390531#11831389 (10dancy) 05Open→03Resolved a:03dancy >>!... [19:26:34] (03CR) 10Dzahn: [V:03+1 C:03+2] lists: remove outdated IP address exemptions for monitoring servers [puppet] - 10https://gerrit.wikimedia.org/r/1272849 (https://phabricator.wikimedia.org/T323208) (owner: 10Dzahn) [19:26:54] (03CR) 10Bking: flink-app - default to setting metrics.internal.query-service.port (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [19:27:41] (03CR) 10Btullis: flink-app - default to setting metrics.internal.query-service.port (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [19:30:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T419961)', diff saved to https://phabricator.wikimedia.org/P90987 and previous config saved to /var/cache/conftool/dbconfig/20260416-193028-fceratto.json [19:30:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: Maintenance [19:31:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T419961)', diff saved to https://phabricator.wikimedia.org/P90988 and previous config saved to /var/cache/conftool/dbconfig/20260416-193100-fceratto.json [19:31:03] jouncebot: nowandnext [19:31:03] For the next 0 hour(s) and 28 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T1800) [19:31:03] In 0 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T2000) [19:34:44] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2004.codfw.wmnet with OS trixie [19:36:53] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host wikikube-ctrl2005.codfw.wmnet with OS trixie [19:38:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T419961)', diff saved to https://phabricator.wikimedia.org/P90989 and previous config saved to /var/cache/conftool/dbconfig/20260416-193814-fceratto.json [19:39:57] (03CR) 10Zabe: [C:03+2] Set $wgGlobalUsageSharedRepoWiki for testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [19:39:59] (03CR) 10Zabe: [C:03+2] Also disable updates for GloballyWantedFiles on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269759 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [19:41:01] (03Merged) 10jenkins-bot: Set $wgGlobalUsageSharedRepoWiki for testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269758 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [19:41:05] (03Merged) 10jenkins-bot: Also disable updates for GloballyWantedFiles on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269759 (https://phabricator.wikimedia.org/T421914) (owner: 10Zabe) [19:41:34] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1269758|Set $wgGlobalUsageSharedRepoWiki for testcommonswiki (T421914)]], [[gerrit:1269759|Also disable updates for GloballyWantedFiles on testcommonswiki (T421914)]] [19:41:39] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [19:43:14] !log zabe@deploy1003 zabe: Backport for [[gerrit:1269758|Set $wgGlobalUsageSharedRepoWiki for testcommonswiki (T421914)]], [[gerrit:1269759|Also disable updates for GloballyWantedFiles on testcommonswiki (T421914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:44:30] !log zabe@deploy1003 zabe: Continuing with sync [19:48:23] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269758|Set $wgGlobalUsageSharedRepoWiki for testcommonswiki (T421914)]], [[gerrit:1269759|Also disable updates for GloballyWantedFiles on testcommonswiki (T421914)]] (duration: 06m 48s) [19:48:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P90990 and previous config saved to /var/cache/conftool/dbconfig/20260416-194823-fceratto.json [19:48:27] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [19:58:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P90991 and previous config saved to /var/cache/conftool/dbconfig/20260416-195831-fceratto.json [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T2000). [20:00:05] Tran, maryum, arlolra, AaronSchulz, bodhisattwa, James_F, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] o/ [20:00:12] o/ [20:00:19] o/ [20:00:20] o/ [20:00:30] Who's deploying? [20:00:39] I have a private settings bit to deploy as well as my patch [20:00:53] OK maryum do you need to go first then? [20:01:22] I can, I don't have to [20:01:24] i can spiderpig, and i can deploy arlolra's patch as well if he's not here. [20:01:41] Let's have maryum go first. 
[20:01:52] okay great, I'll do the regular patch with spiderpig first [20:01:58] then the private settings deploy [20:02:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mstyles@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267116 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:02:19] o/ [20:02:43] o/ [20:02:51] we can probably combine some of the config patches [20:03:16] (03Merged) 10jenkins-bot: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267116 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:03:30] !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1267116|config: Enable EmailConfirmationBanner on selected wikis (T421366)]] [20:03:35] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [20:05:12] !log mstyles@deploy1003 mmartorana, mstyles: Backport for [[gerrit:1267116|config: Enable EmailConfirmationBanner on selected wikis (T421366)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:05:49] !log mstyles@deploy1003 mmartorana, mstyles: Continuing with sync [20:08:32] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:08:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T419961)', diff saved to https://phabricator.wikimedia.org/P90992 and previous config saved to /var/cache/conftool/dbconfig/20260416-200839-fceratto.json [20:09:36] !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267116|config: Enable EmailConfirmationBanner on selected wikis (T421366)]] (duration: 06m 06s) [20:09:40] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [20:10:23] cscott: yeah, mine can be combined with stuff [20:10:36] Mine too. [20:10:36] okay my spiderpig finished, preparing to run scap for PS.php [20:15:10] (03PS1) 10CDanis: varnish: trace all file uploads [puppet] - 10https://gerrit.wikimedia.org/r/1272869 [20:17:16] !log Removed private mitigation for T419137 [20:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:21] my deploys are finished! [20:19:09] Who's next? Tran? [20:19:11] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11831547 (10bd808) >>! In T416707#11830231, @MoritzMuehlenhoff wrote: > These are both true. We will no longer operate a mirror (which is running under mirrors... [20:19:26] (03PS1) 10C. 
Scott Ananian: Move language variant parser option setting from Article to WikiPage [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272875 (https://phabricator.wikimedia.org/T423534) [20:19:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272875 (https://phabricator.wikimedia.org/T423534) (owner: 10C. Scott Ananian) [20:19:38] Sure. I can self-deploy [20:21:20] k, starting. Both of mine should go together as one is the other's dependency [20:22:05] ah you can't deploy something that's a formal dependency in the same backport :\ [20:22:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [20:22:59] Tran: You should be able to. [20:23:17] Just C+2 them manually and at worst once they're merged SpiderPig will let you go ahead. [20:24:31] (03Restored) 10STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [20:25:51] James_F: I think I need to fix this change id error but also already started the first backport. Is it worth cancelling to try and batch the two again? [20:26:10] What's the error? [20:26:34] https://www.irccloud.com/pastebin/0yWG48Fa/ [20:26:50] and I assume it's this patch causing the problem: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReportIncident/+/1270887/1 [20:27:19] Oh, yes, never abandon re-used Change-IDs if you want Depends-On to not break. [20:27:29] It's fine, just edit it out of the config patch's commit. 
[20:27:34] (03Merged) 10jenkins-bot: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1270888 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [20:27:38] (03PS3) 10Jforrester: Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [20:27:40] Done. [20:27:50] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1270888|Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten (T423042)]] [20:27:54] T423042: Deploy IRS "as-is" to Event Talk namespace - https://phabricator.wikimedia.org/T423042 [20:28:24] (03PS2) 10CDanis: varnish: trace all file uploads [puppet] - 10https://gerrit.wikimedia.org/r/1272869 [20:29:27] !log stran@deploy1003 stran: Backport for [[gerrit:1270888|Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten (T423042)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:30:02] (03CR) 10CDanis: [V:03+1] "VTCs say: 0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1272869 (owner: 10CDanis) [20:30:53] James_F: So worth cancelling and doing over? Or just keep moving forward? It's at testing step for the first patch. [20:31:10] (03Abandoned) 10STran: Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten [extensions/ReportIncident] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1270887 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [20:31:55] Tran: I don't know your code, sorry. But if the images are built. let's roll and move on to the config? 
[20:32:07] :+1 [20:32:13] (03CR) 10Pmiazga: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272770 (https://phabricator.wikimedia.org/T419545) (owner: 10Aaron Schulz) [20:32:13] (03CR) 10BBlack: [C:03+1] "SGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1272869 (owner: 10CDanis) [20:32:20] We're halfway through the whole window and so far we've done 1.5 out of 7 people. :-( [20:33:08] !log stran@deploy1003 stran: Continuing with sync [20:36:30] cscott will take over deploying my patch, it can be combined with other config patches [20:36:57] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270888|Allow the 'ReportIncidentEnabledNamespaces' config to be ovewritten (T423042)]] (duration: 09m 07s) [20:37:01] T423042: Deploy IRS "as-is" to Event Talk namespace - https://phabricator.wikimedia.org/T423042 [20:37:24] I'm about to start a config deploy. Do we want to batch some of the other configs in it? [20:37:38] Fine from my end. [20:37:55] Mine as well [20:37:56] arlolra's is a bit delicate though, maybe best for cscott to handle? [20:38:02] OK, go for it then. :-) [20:38:23] Yeah maybe leave out the arlolra patch and I'll handle it a little later [20:38:37] Ha [20:38:39] the Attribution also can go together, it's enabling one module in sandbox that was already enabled on testwiki [20:38:42] Or not. Either way is fine really. [20:39:04] Tran: Just go for it with yours, pmiazga's, and mine, then. [20:39:11] k I'm adding 1254359 and 1272770 [20:40:29] https://www.irccloud.com/pastebin/Q8kK5Pmc/ [20:40:38] James_F: I assume ok to proceed? 
[20:41:19] yes from my end [20:41:30] yep [20:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:41:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [20:41:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) (owner: 10Jforrester) [20:41:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272770 (https://phabricator.wikimedia.org/T419545) (owner: 10Aaron Schulz) [20:42:51] (03Merged) 10jenkins-bot: Deploy IRS to enwiki's Event Talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270872 (https://phabricator.wikimedia.org/T423042) (owner: 10STran) [20:42:54] (03Merged) 10jenkins-bot: Make abstractwiki a multi-lingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254359 (https://phabricator.wikimedia.org/T420420) (owner: 10Jforrester) [20:42:58] (03Merged) 10jenkins-bot: Enable attribution.v0-beta in RestSandboxSpecs for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272770 (https://phabricator.wikimedia.org/T419545) (owner: 10Aaron Schulz) [20:43:13] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1270872|Deploy IRS to enwiki's Event Talk namespace (T423042)]], [[gerrit:1254359|Make abstractwiki a multi-lingual Wikidata client (T420420)]], [[gerrit:1272770|Enable attribution.v0-beta in RestSandboxSpecs for all wikis (T419545)]] [20:43:21] T423042: Deploy IRS "as-is" to Event 
Talk namespace - https://phabricator.wikimedia.org/T423042 [20:43:22] T420420: Register Abstract Wikipedia as a special kind of wiki with Wikidata so it can be linked to/from distinctly from being just a Wikipedia? - https://phabricator.wikimedia.org/T420420 [20:43:22] T419545: Enable "Attribution API (beta)" in all REST Sandboxes - https://phabricator.wikimedia.org/T419545 [20:43:32] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:44:54] !log stran@deploy1003 aaron, stran, jforrester: Backport for [[gerrit:1270872|Deploy IRS to enwiki's Event Talk namespace (T423042)]], [[gerrit:1254359|Make abstractwiki a multi-lingual Wikidata client (T420420)]], [[gerrit:1272770|Enable attribution.v0-beta in RestSandboxSpecs for all wikis (T419545)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:45:28] (03PS5) 10Jforrester: [bnwikisource] Enable PageImages on NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [20:46:11] please test patches if necessary pmiazga James_F [20:47:29] sorry, bit rusty when it comes to those deploys, it's my first in a very long time [20:47:42] Go from me. [20:47:45] I can see the change on k8s-mwdebug, but not on prod yet [20:47:52] perfect, that's the testing [20:47:54] this is correct, right? 
on mwdebug looks good [20:47:57] moving forward [20:48:02] !log stran@deploy1003 aaron, stran, jforrester: Continuing with sync [20:50:55] * AaronSchulz looked at https://meta.wikimedia.org/w/index.php?api=attribution.v0-beta&title=Special%3ARestSandbox with mwdebug on and it seemed fine [20:51:49] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270872|Deploy IRS to enwiki's Event Talk namespace (T423042)]], [[gerrit:1254359|Make abstractwiki a multi-lingual Wikidata client (T420420)]], [[gerrit:1272770|Enable attribution.v0-beta in RestSandboxSpecs for all wikis (T419545)]] (duration: 08m 36s) [20:52:01] I've ticked a bunch of the ones done on the list. [20:52:01] T423042: Deploy IRS "as-is" to Event Talk namespace - https://phabricator.wikimedia.org/T423042 [20:52:01] T420420: Register Abstract Wikipedia as a special kind of wiki with Wikidata so it can be linked to/from distinctly from being just a Wikipedia? - https://phabricator.wikimedia.org/T420420 [20:52:02] T419545: Enable "Attribution API (beta)" in all REST Sandboxes - https://phabricator.wikimedia.org/T419545 [20:52:16] Done, thanks for everyone's patience 🙇 [20:52:58] thank you! [20:52:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T419635)', diff saved to https://phabricator.wikimedia.org/P90993 and previous config saved to /var/cache/conftool/dbconfig/20260416-205258-fceratto.json [20:53:03] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [20:53:45] cscott: Can you handle bodhisattwa's patch with all of yours? 
[20:56:24] 4 mins left :) [20:58:40] (03CR) 10Ladsgroup: "Grafana dashboard for checking https://grafana.wikimedia.org/d/000000559/mediawiki-action-api-breakdown?orgId=1&from=now-24h&to=now&timezo" [puppet] - 10https://gerrit.wikimedia.org/r/1272869 (owner: 10CDanis) [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260416T2100) [21:02:24] James_F: I can. Is that all that's left now? [21:02:33] cscott: Yes, plus yours and arlolra's. [21:03:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P90994 and previous config saved to /var/cache/conftool/dbconfig/20260416-210307-fceratto.json [21:04:34] Readers: are you here, or can we continue the backports? [21:04:52] (the official guidance says we should wait until 5 min past to see if they show up) [21:05:59] Thought you were addressing the readers as in the channel lurkers that are reading your message. 
[21:07:13] Ok, I think I can continue [21:09:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1034:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1034 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:13:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P90995 and previous config saved to /var/cache/conftool/dbconfig/20260416-211315-fceratto.json [21:14:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1034:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1034 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:14:49] (03PS1) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) [21:14:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270518 (https://phabricator.wikimedia.org/T423188) (owner: 10Arlolra) [21:14:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [21:15:06] sorry for the delay bodhisattwa , we're getting started now [21:15:24] sure [21:15:43] (03CR) 10CI reject: [V:04-1] Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [21:15:49] (03Merged) 10jenkins-bot: Deploy PRV to 4 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270518 
(https://phabricator.wikimedia.org/T423188) (owner: 10Arlolra) [21:15:53] (03Merged) 10jenkins-bot: [bnwikisource] Enable PageImages on NS:4, NS:100, NS:104, NS:106, NS:114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270567 (owner: 10Bodhisattwa) [21:16:10] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1270518|Deploy PRV to 4 wikis (T423188)]], [[gerrit:1270567|[bnwikisource] Enable PageImages on NS:4, NS:100, NS:104, NS:106, NS:114]] [21:16:14] T423188: Parsoid Read Views to deploy ~2026-04-16 - https://phabricator.wikimedia.org/T423188 [21:16:38] (03PS2) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) [21:17:52] !log cscott@deploy1003 cscott, arlolra, bodhisattwa: Backport for [[gerrit:1270518|Deploy PRV to 4 wikis (T423188)]], [[gerrit:1270567|[bnwikisource] Enable PageImages on NS:4, NS:100, NS:104, NS:106, NS:114]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:18:04] bodhisattwa: ok, time to test [21:19:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1034:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1034 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:21:26] (03PS3) 10SBassett: Set CSP to enforce with currently-allow-listed domains on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1272895 (https://phabricator.wikimedia.org/T419612) [21:23:11] bodhisattwa: is your patch testable on bn.wikisource? 
[21:23:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T419635)', diff saved to https://phabricator.wikimedia.org/P90996 and previous config saved to /var/cache/conftool/dbconfig/20260416-212323-fceratto.json [21:23:28] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [21:23:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1260.eqiad.wmnet with reason: Maintenance [21:23:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T419635)', diff saved to https://phabricator.wikimedia.org/P90997 and previous config saved to /var/cache/conftool/dbconfig/20260416-212348-fceratto.json [21:24:12] yes [21:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1034:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1034 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:24:55] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11831821 (10A_smart_kitten) >>! In T416707#11831547, @bd808 wrote: > And going backward? Is there a way we can help [[https://www.w3.org/Provider/Style/URI|"co... 
[21:28:54] bodhisattwa: it seems like https://bn.wikisource.org/w/api.php?action=query&format=json&prop=pageimages&titles=%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%B8%E0%A6%82%E0%A6%95%E0%A6%B2%E0%A6%A8%3A%E0%A6%AA%E0%A7%8D%E0%A6%B0%E0%A6%A7%E0%A6%BE%E0%A6%A8_%E0%A6%AA%E0%A6%BE%E0%A6%A4%E0%A6%BE%7C%E0%A6%B2%E0%A7%8B%E0%A6%95%E0%A6%B8%E0%A6%BE%E0%A6%B9%E0%A6%BF%E0%A6%A4%E0%A7%8D%E0%A6%AF&formatversion=2 has pageimages info in it [21:29:06] (03PS8) 10Bking: opensearch on k8s: Add semantic-search and ipoid to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) [21:29:25] bodhisattwa: can we say this is good & continue the sync? [21:29:45] yes, this is good [21:29:48] !log cscott@deploy1003 cscott, arlolra, bodhisattwa: Continuing with sync [21:30:32] (03CR) 10Bking: opensearch on k8s: Add semantic-search and ipoid to services proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [21:30:38] (03PS1) 10C. Scott Ananian: ConverterRule: convert `null` to `false` when needed [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272908 (https://phabricator.wikimedia.org/T423639) [21:32:24] (03PS1) 10Bking: opensearch on k8s: Activate semantic-search and ipoid in services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1272909 (https://phabricator.wikimedia.org/T421293) [21:33:36] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270518|Deploy PRV to 4 wikis (T423188)]], [[gerrit:1270567|[bnwikisource] Enable PageImages on NS:4, NS:100, NS:104, NS:106, NS:114]] (duration: 17m 26s) [21:33:40] T423188: Parsoid Read Views to deploy ~2026-04-16 - https://phabricator.wikimedia.org/T423188 [21:34:26] bodhisattwa: ok, all done! 
[21:34:33] thanks [21:34:34] i'm moving on to the wmf.24 backports [21:34:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272908 (https://phabricator.wikimedia.org/T423639) (owner: 10C. Scott Ananian) [21:34:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272771 (owner: 10C. Scott Ananian) [21:34:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272750 (https://phabricator.wikimedia.org/T422780) (owner: 10C. Scott Ananian) [21:34:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272875 (https://phabricator.wikimedia.org/T423534) (owner: 10C. Scott Ananian) [21:45:41] (03Merged) 10jenkins-bot: ConverterRule: convert `null` to `false` when needed [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272908 (https://phabricator.wikimedia.org/T423639) (owner: 10C. Scott Ananian) [21:45:48] (03Merged) 10jenkins-bot: Convert language to internal code in tests [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272771 (owner: 10C. Scott Ananian) [21:45:55] (03Merged) 10jenkins-bot: ParsoidCachePrewarmJob: Define the title in the req context [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272750 (https://phabricator.wikimedia.org/T422780) (owner: 10C. Scott Ananian) [21:46:01] (03CR) 10CI reject: [V:04-1] Move language variant parser option setting from Article to WikiPage [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272875 (https://phabricator.wikimedia.org/T423534) (owner: 10C. Scott Ananian) [21:47:33] (03CR) 10C. 
Scott Ananian: [C:03+2] "recheck" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272875 (https://phabricator.wikimedia.org/T423534) (owner: 10C. Scott Ananian) [21:47:40] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar): Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11831873 (10bd808) >>! In T416707#11831821, @A_smart_kitten wrote: > But maybe e.g. #collaboration-services could host a microsite at the `mirrors.wikimedia.or... [21:48:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272875 (https://phabricator.wikimedia.org/T423534) (owner: 10C. Scott Ananian) [21:57:55] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [21:58:12] (03Merged) 10jenkins-bot: Move language variant parser option setting from Article to WikiPage [core] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1272875 (https://phabricator.wikimedia.org/T423534) (owner: 10C. Scott Ananian) [21:58:32] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1272908|ConverterRule: convert `null` to `false` when needed (T423639)]], [[gerrit:1272771|Convert language to internal code in tests]], [[gerrit:1272750|ParsoidCachePrewarmJob: Define the title in the req context (T422780)]], [[gerrit:1272875|Move language variant parser option setting from Article to WikiPage (T423534)]] [21:58:39] T423639: TypeError: Cannot assign null to property MediaWiki\Language\ConverterRule::$mRuleDisplay of type Wikimedia\Parsoid\DOM\DocumentFragment|string|false - https://phabricator.wikimedia.org/T423639 [21:58:39] T422780: Production error: MediaWiki\Context\RequestContext::getTitle called with no title set. 
- https://phabricator.wikimedia.org/T422780 [21:58:40] T423534: Edit previews do not render with correct variant when using new Parsoid LanguageConverter implementation - https://phabricator.wikimedia.org/T423534 [21:58:42] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [22:00:12] !log cscott@deploy1003 cscott: Backport for [[gerrit:1272908|ConverterRule: convert `null` to `false` when needed (T423639)]], [[gerrit:1272771|Convert language to internal code in tests]], [[gerrit:1272750|ParsoidCachePrewarmJob: Define the title in the req context (T422780)]], [[gerrit:1272875|Move language variant parser option setting from Article to WikiPage (T423534)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:04:22] !log cscott@deploy1003 cscott: Continuing with sync [22:04:26] looks good [22:08:13] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1272908|ConverterRule: convert `null` to `false` when needed (T423639)]], [[gerrit:1272771|Convert language to internal code in tests]], [[gerrit:1272750|ParsoidCachePrewarmJob: Define the title in the req context (T422780)]], [[gerrit:1272875|Move language variant parser option setting from Article to WikiPage (T423534)]] (duration: 09m 41s) [22:08:20] T423639: TypeError: Cannot assign null to property MediaWiki\Language\ConverterRule::$mRuleDisplay of type Wikimedia\Parsoid\DOM\DocumentFragment|string|false - https://phabricator.wikimedia.org/T423639 [22:08:20] T422780: Production error: MediaWiki\Context\RequestContext::getTitle called with no title set. - https://phabricator.wikimedia.org/T422780 [22:08:21] T423534: Edit previews do not render with correct variant when using new Parsoid LanguageConverter implementation - https://phabricator.wikimedia.org/T423534 [22:08:21] ok, done. 
[22:14:04] !log jforrester@deploy1003:/srv/mediawiki-staging$ foreachwikiindblist sul extensions/Wikibase/lib/maintenance/populateSitesTable.php # T423660 [22:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:08] T423660: Adjust Abstract Wikipedia's entry in the sites table to move out of the 'wikipedia' group (it's a special wiki) - https://phabricator.wikimedia.org/T423660 [22:28:32] FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [22:46:30] (03PS1) 10Dduvall: zuul: Name service containers and remove them when stopped [puppet] - 10https://gerrit.wikimedia.org/r/1272961 (https://phabricator.wikimedia.org/T406384) [22:53:13] (03CR) 10Jdlrobson: Restore PageImages functionality to Wikisources and Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [22:58:02] (03PS1) 10Dduvall: zuul: Provide tenant configuration [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) [23:00:06] (03CR) 10CI reject: [V:04-1] zuul: Provide tenant configuration [puppet] - 10https://gerrit.wikimedia.org/r/1272970 (https://phabricator.wikimedia.org/T406384) (owner: 10Dduvall) [23:03:32] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [23:03:42] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:17:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1268652 (https://phabricator.wikimedia.org/T399673) (owner: 10MusikAnimal) [23:18:24] (03Merged) 10jenkins-bot: CommonSettings: use CodeMirror instead of CodeEditor in AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268652 (https://phabricator.wikimedia.org/T399673) (owner: 10MusikAnimal) [23:18:39] !log musikanimal@deploy1003 Started scap sync-world: Backport for [[gerrit:1268652|CommonSettings: use CodeMirror instead of CodeEditor in AbuseFilter (T399673)]] [23:18:43] T399673: Add CodeMirror mode for AbuseFilter syntax - https://phabricator.wikimedia.org/T399673 [23:20:18] !log musikanimal@deploy1003 musikanimal: Backport for [[gerrit:1268652|CommonSettings: use CodeMirror instead of CodeEditor in AbuseFilter (T399673)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:20:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T419635)', diff saved to https://phabricator.wikimedia.org/P90999 and previous config saved to /var/cache/conftool/dbconfig/20260416-232036-fceratto.json [23:20:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [23:21:22] !log musikanimal@deploy1003 musikanimal: Continuing with sync [23:25:14] !log musikanimal@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268652|CommonSettings: use CodeMirror instead of CodeEditor in AbuseFilter (T399673)]] (duration: 06m 35s) [23:25:19] T399673: Add CodeMirror mode for AbuseFilter syntax - https://phabricator.wikimedia.org/T399673 [23:30:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P91000 and previous config saved to /var/cache/conftool/dbconfig/20260416-233044-fceratto.json [23:39:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/1273007 [23:39:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1273007 (owner: 10TrainBranchBot) [23:40:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P91001 and previous config saved to /var/cache/conftool/dbconfig/20260416-234052-fceratto.json [23:44:36] (03CR) 10Ignacio Rodríguez: Restore PageImages functionality to Wikisources and Wikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271862 (https://phabricator.wikimedia.org/T417538) (owner: 10Ignacio Rodríguez) [23:50:09] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1273007 (owner: 10TrainBranchBot) [23:51:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T419635)', diff saved to https://phabricator.wikimedia.org/P91002 and previous config saved to /var/cache/conftool/dbconfig/20260416-235059-fceratto.json [23:51:04] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [23:51:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1261.eqiad.wmnet with reason: Maintenance [23:51:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T419635)', diff saved to https://phabricator.wikimedia.org/P91003 and previous config saved to /var/cache/conftool/dbconfig/20260416-235123-fceratto.json [23:56:06] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART