[00:00:34] (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:04:08] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[00:04:21] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1018.eqiad.wmnet with reason: host reimage
[00:04:24] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1019.eqiad.wmnet with reason: host reimage
[00:04:27] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1024.eqiad.wmnet with reason: host reimage
[00:07:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1018.eqiad.wmnet with reason: host reimage
[00:08:59] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1022.eqiad.wmnet with reason: host reimage
[00:09:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1019.eqiad.wmnet with reason: host reimage
[00:10:33] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:12:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1022.eqiad.wmnet with reason: host reimage
[00:14:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1024.eqiad.wmnet with reason: host reimage
[00:22:17] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:23:09] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:24:35] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:24:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:24:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1018.eqiad.wmnet with OS bullseye
[00:24:59] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1018.eqiad.wmnet with OS bullseye completed: - wdqs1018 (**PASS**) - Removed from Pupp...
[00:25:26] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:26:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:26:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1019.eqiad.wmnet with OS bullseye
[00:26:56] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1019.eqiad.wmnet with OS bullseye completed: - wdqs1019 (**PASS**) - Removed from Pupp...
[00:27:24] (PS34) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[00:27:40] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:27:51] (CR) CI reject: [V: -1] designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott)
[00:28:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:28:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1022.eqiad.wmnet with OS bullseye
[00:28:45] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1022.eqiad.wmnet with OS bullseye completed: - wdqs1022 (**PASS**) - Remov...
[00:29:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:30:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:31:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1024.eqiad.wmnet with OS bullseye
[00:31:06] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1024.eqiad.wmnet with OS bullseye completed: - wdqs1024 (**PASS**) - Remov...
[00:31:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T343198)', diff saved to https://phabricator.wikimedia.org/P52624 and previous config saved to /var/cache/conftool/dbconfig/20230926-003109-arnaudb.json
[00:31:41] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (Jclark-ctr)
[00:32:13] (PS35) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[00:32:30] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (Jclark-ctr)
[00:33:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1021']
[00:33:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017']
[00:34:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1017']
[00:34:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1021']
[00:34:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1022']
[00:35:58] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[00:36:00] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye
[00:36:06] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[00:36:07] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[00:36:08] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[00:36:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1021.eqiad.wmnet with OS bullseye
[00:36:14] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove...
[00:36:17] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**...
[00:37:06] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[00:37:07] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:37:13] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[00:37:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[00:37:20] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove...
[00:38:13] (PS36) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[00:38:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960667
[00:38:21] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960667 (owner: TrainBranchBot)
[00:38:29] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:32] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[00:40:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[00:40:40] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[00:40:45] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove...
[00:40:51] (PS37) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[00:42:47] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:44:11] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:45] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P52625 and previous config saved to /var/cache/conftool/dbconfig/20230926-004616-arnaudb.json
[00:49:39] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:52:31] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:53:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/960667 (owner: TrainBranchBot)
[01:01:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P52626 and previous config saved to /var/cache/conftool/dbconfig/20230926-010123-arnaudb.json
[01:03:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T347368 (phaultfinder)
[01:16:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T343198)', diff saved to https://phabricator.wikimedia.org/P52627 and previous config saved to /var/cache/conftool/dbconfig/20230926-011629-arnaudb.json
[01:16:33] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[01:16:38] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[01:16:46] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[01:16:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[01:17:01] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[01:17:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T343198)', diff saved to https://phabricator.wikimedia.org/P52628 and previous config saved to /var/cache/conftool/dbconfig/20230926-011707-arnaudb.json
[01:18:06] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[01:18:12] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[01:18:24] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[01:18:30] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove...
[01:19:21] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye
[01:19:28] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[01:19:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1021.eqiad.wmnet with OS bullseye
[01:19:45] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**...
[01:19:52] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye
[01:19:58] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye
[01:20:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye
[01:20:06] SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove...
[01:20:53] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye
[01:20:59] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye
[01:21:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1021.eqiad.wmnet with OS bullseye
[01:21:07] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**...
[01:24:08] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[01:39:11] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:27] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:17] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:48:05] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:50:01] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:51:03] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:51:23] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:52:25] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:59:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1226.eqiad.wmnet with OS bullseye
[01:59:17] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1226.eqiad.wmnet with OS bullseye
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T0200)
[02:01:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1227.eqiad.wmnet with OS bullseye
[02:01:50] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1227.eqiad.wmnet with OS bullseye
[02:02:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1228.eqiad.wmnet with OS bullseye
[02:03:00] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1228.eqiad.wmnet with OS bullseye
[02:03:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye
[02:03:53] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye
[02:04:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1230.eqiad.wmnet with OS bullseye
[02:04:51] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1230.eqiad.wmnet with OS bullseye
[02:05:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1231.eqiad.wmnet with OS bullseye
[02:05:48] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1231.eqiad.wmnet with OS bullseye
[02:06:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1232.eqiad.wmnet with OS bullseye
[02:06:36] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1232.eqiad.wmnet with OS bullseye
[02:06:52] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.28 [core] (wmf/1.41.0-wmf.28) - https://gerrit.wikimedia.org/r/960668 (https://phabricator.wikimedia.org/T345889)
[02:06:58] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.28 [core] (wmf/1.41.0-wmf.28) - https://gerrit.wikimedia.org/r/960668 (https://phabricator.wikimedia.org/T345889) (owner: TrainBranchBot)
[02:07:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1233.eqiad.wmnet with OS bullseye
[02:07:27] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1233.eqiad.wmnet with OS bullseye
[02:08:44] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1226.eqiad.wmnet with reason: host reimage
[02:14:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1226.eqiad.wmnet with reason: host reimage
[02:17:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage
[02:18:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1231.eqiad.wmnet with reason: host reimage
[02:19:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1232.eqiad.wmnet with reason: host reimage
[02:20:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage
[02:20:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage
[02:21:38] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.28 [core] (wmf/1.41.0-wmf.28) - https://gerrit.wikimedia.org/r/960668 (https://phabricator.wikimedia.org/T345889) (owner: TrainBranchBot)
[02:23:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage
[02:23:45] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:51] (PS38) Andrew Bogott: designate/pdns: refactor a bunch of address settings [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385)
[02:23:53] (PS1) Andrew Bogott: pdns: eliminate profile::openstack::base::pdns::auth::listen_on [puppet] - https://gerrit.wikimedia.org/r/960753
[02:23:55] (PS1) Andrew Bogott: pdns: eliminate pdns::query_local_address in hiera [puppet] - https://gerrit.wikimedia.org/r/960754
[02:23:57] (PS1) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/960755 (https://phabricator.wikimedia.org/T346385)
[02:24:31] (CR) CI reject: [V: -1] pdns: eliminate pdns::query_local_address in hiera [puppet] - https://gerrit.wikimedia.org/r/960754 (owner: Andrew Bogott)
[02:25:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1231.eqiad.wmnet with reason: host reimage
[02:27:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1232.eqiad.wmnet with reason: host reimage
[02:28:26] (PS2) Andrew Bogott: pdns: eliminate profile::openstack::base::pdns::auth::listen_on [puppet] - https://gerrit.wikimedia.org/r/960753
[02:28:28] (PS2) Andrew Bogott: pdns: eliminate pdns::query_local_address in hiera [puppet] - https://gerrit.wikimedia.org/r/960754
[02:28:30] (PS2) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/960755 (https://phabricator.wikimedia.org/T346385)
[02:29:07] (CR) CI reject: [V: -1] pdns: eliminate pdns::query_local_address in hiera [puppet] - https://gerrit.wikimedia.org/r/960754 (owner: Andrew Bogott)
[02:29:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:30:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:30:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1226.eqiad.wmnet with OS bullseye
[02:30:43] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1226.eqiad.wmnet with OS bullseye completed: - db1226 (**PASS**) - Removed f...
[02:31:49] (PS3) Andrew Bogott: pdns: eliminate profile::openstack::base::pdns::auth::listen_on [puppet] - https://gerrit.wikimedia.org/r/960753
[02:31:51] (PS3) Andrew Bogott: pdns: eliminate pdns::query_local_address in hiera [puppet] - https://gerrit.wikimedia.org/r/960754
[02:31:53] (PS3) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/960755 (https://phabricator.wikimedia.org/T346385)
[02:32:29] (CR) CI reject: [V: -1] pdns: eliminate pdns::query_local_address in hiera [puppet] - https://gerrit.wikimedia.org/r/960754 (owner: Andrew Bogott)
[02:32:51] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:34:01] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:34:09] (PS4) Andrew Bogott: pdns: eliminate pdns::query_local_address in hiera [puppet] - https://gerrit.wikimedia.org/r/960754
[02:34:11] (PS4) Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - https://gerrit.wikimedia.org/r/960755 (https://phabricator.wikimedia.org/T346385)
[02:37:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:37:06] (CR) Andrew Bogott: designate/pdns: refactor a bunch of address settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: Andrew Bogott)
[02:37:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:38:44] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:39:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1230.eqiad.wmnet with OS bullseye
[02:39:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:39:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1233.eqiad.wmnet with OS bullseye
[02:39:34] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1230.eqiad.wmnet with OS bullseye completed: - db1230 (**PASS**) - Removed f...
[02:39:37] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1233.eqiad.wmnet with OS bullseye completed: - db1233 (**WARN**) - Removed f...
[02:41:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:42:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:42:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1231.eqiad.wmnet with OS bullseye [02:42:03] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:42:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1231.eqiad.wmnet with OS bullseye completed: - db1231 (**PASS**) - Removed f... 
[02:42:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:43:23] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:43:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:44:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1232.eqiad.wmnet with OS bullseye [02:44:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1232.eqiad.wmnet with OS bullseye completed: - db1232 (**PASS**) - Removed f... [02:54:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jhancock.wm) [02:57:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jhancock.wm) @Jclark-ctr or @VRiley-WMF can you check the cables on 1227 and 1228? they're showing as connected but won't pxe. it could be the port on the switch side, the cabl... 
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T0300) [03:01:38] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960759 (https://phabricator.wikimedia.org/T345889) [03:01:42] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960759 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [03:02:22] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960759 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [03:02:49] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.28 refs T345889 [03:02:56] T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889 [03:06:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)...
[03:07:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [03:07:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye [03:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:31:05] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:32:31] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:38:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE ipamhandles) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:39:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:45:19] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:46:43] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:50:51] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:51:17] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:52:17] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:52:21] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.28 refs T345889 (duration: 49m 31s) [03:52:28] T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889 [03:54:36] !log mwpresync@deploy2002 Pruned MediaWiki: 1.41.0-wmf.26 (duration: 02m 13s) [04:09:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [04:09:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install 
db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)... [04:21:13] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:22:36] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:50:19] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:51:45] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:24:08] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [05:49:17] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:50:43] RECOVERY - restbase endpoints health on 
restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:52:09] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:53:33] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:54:55] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:56:21] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T0600).
[06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter2003:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:34:06] (03PS1) 10Majavah: admin: update tmux config for taavi [puppet] - 10https://gerrit.wikimedia.org/r/960942 [06:35:16] (03CR) 10Majavah: [C: 03+2] admin: update tmux config for taavi [puppet] - 10https://gerrit.wikimedia.org/r/960942 (owner: 10Majavah) [06:36:10] (03PS4) 10Majavah: site: re-assign role for cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) [06:38:47] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:41:31] (03CR) 10Majavah: [C: 03+2] site: re-assign role for cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960642 (https://phabricator.wikimedia.org/T346892) (owner: 10Majavah) [06:42:44] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1007.eqiad.wmnet with OS bullseye [06:42:58] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudcontrol1007.eqi... [06:44:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [06:45:04] (03CR) 10Filippo Giunchedi: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [06:45:14] (03CR) 10Filippo Giunchedi: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/960638 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [06:47:23] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:48:49] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:53:32] (03CR) 10DCausse: [C: 03+1] add search update pipeline streams (update + fetch_error) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [06:53:58] (03PS1) 10Ilias Sarantopoulos: ml-services: allows CORS in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/960998 (https://phabricator.wikimedia.org/T347367) [06:54:56] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: allows CORS in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/960998 (https://phabricator.wikimedia.org/T347367) (owner: 10Ilias Sarantopoulos) [06:55:46] (03Merged) 10jenkins-bot: ml-services: allows CORS in ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/960998 (https://phabricator.wikimedia.org/T347367) (owner: 10Ilias Sarantopoulos) [06:56:36] !log 
isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [06:57:33] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [06:58:57] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [07:00:05] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T0700) [07:00:05] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:03:20] (03PS1) 10Majavah: openstack: Temp exclude cloud-private from cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960999 [07:03:38] (03CR) 10Majavah: [C: 03+2] openstack: Temp exclude cloud-private from cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960999 (owner: 10Majavah) [07:03:40] (03CR) 10Majavah: [V: 03+2 C: 03+2] openstack: Temp exclude cloud-private from cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/960999 (owner: 10Majavah) [07:05:06] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage [07:06:32] o/ [07:06:56] o/ I can deploy unless no-one else is around [07:07:09] (03CR) 10Muehlenhoff: [C: 03+2] standard_packages: Remove Python 3.7 packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/960634 (owner: 10Muehlenhoff) [07:07:47] (03PS3) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) [07:07:49] (03PS3) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 
(https://phabricator.wikimedia.org/T325565) [07:08:09] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage [07:09:17] (03CR) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [07:09:46] (03PS2) 10Majavah: guwikisource: add audiobook namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960200 (https://phabricator.wikimedia.org/T347189) (owner: 10Anzx) [07:10:07] (03PS5) 10Majavah: add throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [07:10:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx) [07:10:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960200 (https://phabricator.wikimedia.org/T347189) (owner: 10Anzx) [07:10:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [07:11:13] (03PS3) 10Majavah: guwikisource: add audiobook namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960200 (https://phabricator.wikimedia.org/T347189) (owner: 10Anzx) [07:11:19] (03PS6) 10Majavah: add throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [07:11:21] (03CR) 10CI reject: [V: 04-1] add throttle rule for UIUC Wikipedia edit-a-thon October 13, 
2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [07:11:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx) [07:11:34] (03CR) 10TrainBranchBot: "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960200 (https://phabricator.wikimedia.org/T347189) (owner: 10Anzx) [07:11:36] (03CR) 10TrainBranchBot: "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [07:11:42] (03Merged) 10jenkins-bot: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx) [07:12:06] (03PS4) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) [07:12:08] (03PS4) 10DCausse: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) [07:12:35] (03CR) 10DCausse: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [07:12:37] (03Merged) 10jenkins-bot: guwikisource: add audiobook namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960200 (https://phabricator.wikimedia.org/T347189) (owner: 10Anzx) [07:12:39] (03Merged) 10jenkins-bot: add throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) 
(owner: 10Anzx) [07:13:09] (03PS1) 10Filippo Giunchedi: service: allow disabling icinga checks for 'node' [puppet] - 10https://gerrit.wikimedia.org/r/961002 (https://phabricator.wikimedia.org/T314118) [07:13:11] (03PS1) 10Filippo Giunchedi: restbase: disable per-host icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/961003 (https://phabricator.wikimedia.org/T314118) [07:13:40] !log taavi@deploy2002 Started scap: Backport for [[gerrit:958049|Enable wgMinervaEnableSiteNotice for knwiki (T346582)]], [[gerrit:960200|guwikisource: add audiobook namespace (T347189)]], [[gerrit:958946|add throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T346043)]] [07:13:53] T346043: Lift IP caps for UIUC Wikipedia edit-a-thon (Oct13, Nov13, 2023) - https://phabricator.wikimedia.org/T346043 [07:13:53] T346582: Enable wgMinervaEnableSiteNotice for knwiki - https://phabricator.wikimedia.org/T346582 [07:13:54] T347189: Audio Book namespace creation On guwikisource - https://phabricator.wikimedia.org/T347189 [07:15:19] !log taavi@deploy2002 anzx and taavi: Backport for [[gerrit:958049|Enable wgMinervaEnableSiteNotice for knwiki (T346582)]], [[gerrit:960200|guwikisource: add audiobook namespace (T347189)]], [[gerrit:958946|add throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T346043)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:15:27] aanzx: please test [07:15:36] ok [07:17:09] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10jcrespo) The host had previously power issues (unsure if due to actual power issues or maintenance or power supply issues): ` The power supplies are redundant. Thu 27 Apr 2023 17:45:34 The input power for power supply 2...
[07:17:57] taavi: looks good [07:18:07] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/961005 [07:18:54] !log taavi@deploy2002 anzx and taavi: Continuing with sync [07:20:31] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10taavi) [07:22:48] (03PS1) 10Majavah: Revert "openstack: Temp exclude cloud-private from cloudcontrol1007" [puppet] - 10https://gerrit.wikimedia.org/r/960733 [07:25:22] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:958049|Enable wgMinervaEnableSiteNotice for knwiki (T346582)]], [[gerrit:960200|guwikisource: add audiobook namespace (T347189)]], [[gerrit:958946|add throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T346043)]] (duration: 11m 41s) [07:25:32] T346043: Lift IP caps for UIUC Wikipedia edit-a-thon (Oct13, Nov13, 2023) - https://phabricator.wikimedia.org/T346043 [07:25:32] T346582: Enable wgMinervaEnableSiteNotice for knwiki - https://phabricator.wikimedia.org/T346582 [07:25:33] T347189: Audio Book namespace creation On guwikisource - https://phabricator.wikimedia.org/T347189 [07:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:26:15] and namespaceDupes.php on guwikisource reports no changes [07:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:26:23] aanzx: all done! 
[07:27:53] taavi: this one is not done https://gerrit.wikimedia.org/r/c/mediawiki/core/+/936120/ [07:28:59] (03CR) 10Filippo Giunchedi: [C: 03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/961005 (owner: 10Muehlenhoff) [07:29:16] aanzx: that one is a core patch, not a config patch, meaning that it'll be deployed next week via the deployment train (https://wikitech.wikimedia.org/wiki/Deployments/Train) [07:29:29] ok thanks taavi [07:30:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960603 (owner: 10Muehlenhoff) [07:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:33:44] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:34:42] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:35:18] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:36:16] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/960632 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [07:37:52] (03PS3) 10Muehlenhoff: Create DNS records for new LDAP Bookworm cluster [dns] - 10https://gerrit.wikimedia.org/r/959203 
(https://phabricator.wikimedia.org/T331699) [07:38:45] (PuppetConstantChange) resolved: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:40:56] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [07:44:16] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:42] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1007 - taavi@cumin1001" [07:44:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:45:59] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1007 - taavi@cumin1001" [07:46:11] (03CR) 10Majavah: [C: 03+2] Revert "openstack: Temp exclude cloud-private from cloudcontrol1007" [puppet] - 10https://gerrit.wikimedia.org/r/960733 (owner: 10Majavah) [07:46:27] (03CR) 10Muehlenhoff: [C: 03+2] Create DNS records for new LDAP Bookworm cluster [dns] - 10https://gerrit.wikimedia.org/r/959203 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [07:50:24] (03PS3) 10Muehlenhoff: Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) [07:50:46] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML 
for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:52:12] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:53:14] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:54:26] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001" [07:54:42] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:55:15] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001" [07:55:21] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1007.eqiad.wmnet with OS bullseye [07:55:29] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudcontrol1007.eqiad.wmnet with OS bullseye... 
[07:56:09] !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1007.eqiad.wmnet [07:56:29] (03PS1) 10Majavah: interface: ensure vlan package is installed when running command [puppet] - 10https://gerrit.wikimedia.org/r/961018 [07:58:01] (03PS1) 10Majavah: ipmi: ensure directory exists for package [puppet] - 10https://gerrit.wikimedia.org/r/961019 [07:58:22] (03PS1) 10DCausse: search: simplify flink parallelism configuration [alerts] - 10https://gerrit.wikimedia.org/r/961020 (https://phabricator.wikimedia.org/T346456) [07:58:28] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: remove deprecated grafana feature [puppet] - 10https://gerrit.wikimedia.org/r/959689 (owner: 10Jelto) [07:58:46] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: delay restore timer 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/959683 (owner: 10Jelto) [07:59:08] (03CR) 10Muehlenhoff: [C: 03+2] Extend acmechief config with new names of Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/959201 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [08:00:02] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:00:44] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/961019 (owner: 10Majavah) [08:02:19] (03PS2) 10Majavah: ipmi: ensure directory exists for package [puppet] - 10https://gerrit.wikimedia.org/r/961019 [08:02:56] (03CR) 10Majavah: [C: 03+2] ipmi: ensure directory exists for package [puppet] - 10https://gerrit.wikimedia.org/r/961019 (owner: 10Majavah) [08:03:28] !log taavi@cumin1001 END (PASS) 
- Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1007.eqiad.wmnet [08:05:23] (PS1) Majavah: hieradata: drop some now-unnecessary openstack overrides [puppet] - https://gerrit.wikimedia.org/r/961023 [08:05:32] (CR) Majavah: [C: +2] hieradata: drop some now-unnecessary openstack overrides [puppet] - https://gerrit.wikimedia.org/r/961023 (owner: Majavah) [08:06:22] (PS1) DCausse: rdf-streaming-updater: simplify parallelism configuration [deployment-charts] - https://gerrit.wikimedia.org/r/961024 (https://phabricator.wikimedia.org/T346456) [08:07:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:08:28] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:09:34] (PS1) Hashar: ci: add Gerrit ssh key to ssh_known_hosts [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) [08:09:40] (CR) DCausse: [C: -1] "deployment procedure:" [deployment-charts] - https://gerrit.wikimedia.org/r/961024 (https://phabricator.wikimedia.org/T346456) (owner: DCausse) [08:11:57] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (taavi) [08:12:33] (CR) Hashar: "I have cherry picked it on the integration Puppet master:" [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar) [08:13:34] (PS3) Peter Fischer: add search update pipeline streams (update + fetch_error) [mediawiki-config] - https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) [08:13:44] 
(JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:21:28] (PS1) Ammarpad: arwikisource: Increase autoconfirm edit count to 10 [mediawiki-config] - https://gerrit.wikimedia.org/r/961030 (https://phabricator.wikimedia.org/T347264) [08:28:59] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1007: move to new network setup - https://phabricator.wikimedia.org/T346892 (taavi) Open→Resolved [08:29:03] SRE, ops-eqiad, Goal, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (taavi) [08:29:50] SRE, ops-eqiad, Goal, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (taavi) [08:31:30] (CR) Muehlenhoff: [C: +2] ssh: Disable ChallengeResponseAuthentication for cloud [puppet] - https://gerrit.wikimedia.org/r/959894 (owner: Muehlenhoff) [08:37:08] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:44] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [08:38:55] (PS1) Clément Goubert: httpbb: Fix test.wikidata.org item [puppet] - https://gerrit.wikimedia.org/r/961037 [08:39:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:40:17] (PS1) 
Ammarpad: Enable Minerva site notice for wikifunctions wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/961038 (https://phabricator.wikimedia.org/T345463) [08:43:03] (PS4) Muehlenhoff: Configure ldap-rw1001/2001 as LDAP servers [puppet] - https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) [08:43:44] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:45:54] (PS2) Majavah: wikitech: Disable password resets [mediawiki-config] - https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) [08:45:55] jouncebot: nowandnext [08:45:56] No deployments scheduled for the next 1 hour(s) and 14 minute(s) [08:45:56] In 1 hour(s) and 14 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1000) [08:45:56] (PS1) Majavah: wikitech: Block account creation by sysops too [mediawiki-config] - https://gerrit.wikimedia.org/r/961042 (https://phabricator.wikimedia.org/T345226) [08:46:33] (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: Majavah) [08:46:35] (CR) CI reject: [V: -1] wikitech: Disable password resets [mediawiki-config] - https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: Majavah) [08:50:02] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: Majavah) [08:50:04] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/961042 
(https://phabricator.wikimedia.org/T345226) (owner: Majavah) [08:50:46] (Merged) jenkins-bot: wikitech: Disable password resets [mediawiki-config] - https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: Majavah) [08:50:49] (Merged) jenkins-bot: wikitech: Block account creation by sysops too [mediawiki-config] - https://gerrit.wikimedia.org/r/961042 (https://phabricator.wikimedia.org/T345226) (owner: Majavah) [08:51:12] !log taavi@deploy2002 Started scap: Backport for [[gerrit:954076|wikitech: Disable password resets (T345226)]], [[gerrit:961042|wikitech: Block account creation by sysops too (T345226)]] [08:51:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:51:19] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [08:52:38] !log taavi@deploy2002 taavi: Backport for [[gerrit:954076|wikitech: Disable password resets (T345226)]], [[gerrit:961042|wikitech: Block account creation by sysops too (T345226)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:52:59] !log taavi@deploy2002 taavi: Continuing with sync [08:53:30] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:56:12] (CR) Vgutierrez: [C: -1] "Looking good, it needs some adjustments as mentioned on the comments 😊" [puppet] - https://gerrit.wikimedia.org/r/960112 
(https://phabricator.wikimedia.org/T347192) (owner: Fabfur) [08:56:16] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:57:37] <_joe_> jouncebot: now [08:57:38] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [08:57:44] <_joe_> jouncebot: now and next [08:57:44] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [08:57:54] <_joe_> jouncebot: nowandnext [08:57:55] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [08:57:55] In 1 hour(s) and 2 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1000) [08:57:56] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:08] <_joe_> taavi: are you done with deployments? [08:58:34] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [08:58:37] _joe_: not quite, php-fpm-restarts are just about to finish [08:58:52] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:954076|wikitech: Disable password resets (T345226)]], [[gerrit:961042|wikitech: Block account creation by sysops too (T345226)]] (duration: 07m 40s) [08:58:59] now I'm done [08:59:05] <_joe_> taavi: yeah I meant after this was completed :) [08:59:12] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:59:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:00:36] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:02:44] (CR) Muehlenhoff: [C: +2] Configure ldap-rw1001/2001 as LDAP servers [puppet] - https://gerrit.wikimedia.org/r/959200 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff) [09:03:22] (CR) Arturo Borrero Gonzalez: [C: +1] interface: ensure vlan package is installed when running command [puppet] - https://gerrit.wikimedia.org/r/961018 (owner: Majavah) [09:03:45] (CR) Majavah: [C: +2] interface: ensure vlan package is installed when running command [puppet] - https://gerrit.wikimedia.org/r/961018 (owner: Majavah) [09:04:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:41] (PS1) Jaime Nuche: phabricator: add configuration for the remote aphlict server [puppet] - https://gerrit.wikimedia.org/r/961045 (https://phabricator.wikimedia.org/T346321) [09:06:06] (PS1) Giuseppe Lavagetto: mw-api-int: bump replicas before moving wikifeeds [deployment-charts] - https://gerrit.wikimedia.org/r/961046 [09:06:46] SRE, Infrastructure-Foundations, cloud-services-team, netbox, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (aborrero) [09:07:03] (CR) Giuseppe Lavagetto: [C: +2] eventgate: migrate to mw-api-int [deployment-charts] - https://gerrit.wikimedia.org/r/957933 (https://phabricator.wikimedia.org/T346448) (owner: 
Giuseppe Lavagetto) [09:07:23] (PS2) Giuseppe Lavagetto: eventgate: migrate to mw-api-int [deployment-charts] - https://gerrit.wikimedia.org/r/957933 (https://phabricator.wikimedia.org/T346448) [09:08:57] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [09:09:01] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [09:09:33] (CR) DCausse: add search update pipeline streams (update + fetch_error) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: Peter Fischer) [09:11:19] (PS1) Muehlenhoff: Fix Hiera entries for ldap master nodes [puppet] - https://gerrit.wikimedia.org/r/961047 (https://phabricator.wikimedia.org/T331699) [09:11:50] (CR) Effie Mouzeli: [C: +2] Update mathoid to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/953261 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli) [09:11:56] (PS1) Clément Goubert: mw-web, mw-api-ext: Raise replicas, raise apcu size [deployment-charts] - https://gerrit.wikimedia.org/r/961048 (https://phabricator.wikimedia.org/T346422) [09:12:24] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [09:12:43] (Merged) jenkins-bot: Update mathoid to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/953261 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli) [09:13:12] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [09:13:15] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [09:13:25] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [09:14:37] (CR) Muehlenhoff: [C: +2] Fix Hiera entries for ldap master nodes [puppet] - https://gerrit.wikimedia.org/r/961047 
(https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff) [09:14:40] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [09:14:44] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [09:15:01] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [09:15:01] (PS1) Majavah: wikitech: Properly disable password resets [mediawiki-config] - https://gerrit.wikimedia.org/r/961051 (https://phabricator.wikimedia.org/T345226) [09:15:04] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [09:15:20] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (JMeybohm) [09:15:31] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [09:15:44] (CR) CI reject: [V: -1] wikitech: Properly disable password resets [mediawiki-config] - https://gerrit.wikimedia.org/r/961051 (https://phabricator.wikimedia.org/T345226) (owner: Majavah) [09:15:47] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [09:15:53] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [09:15:57] (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/961051 (https://phabricator.wikimedia.org/T345226) (owner: Majavah) [09:16:02] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply [09:16:40] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [09:17:09] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [09:17:19] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [09:18:59] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [09:19:09] (CR) Jaime Nuche: 
"PCC: https://puppet-compiler.wmflabs.org/output/961045/43604/" [puppet] - https://gerrit.wikimedia.org/r/961045 (https://phabricator.wikimedia.org/T346321) (owner: Jaime Nuche) [09:19:21] (PS1) Ilias Sarantopoulos: ml-services: fix origins in ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/961052 [09:19:39] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [09:19:56] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [09:20:12] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [09:20:15] (CR) Vgutierrez: [C: -1] vanish: allow PURGE requests only from dedicated socket (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur) [09:20:44] (CR) Giuseppe Lavagetto: [C: +1] "Assuming the new URL exists, LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/961037 (owner: Clément Goubert) [09:20:55] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [09:20:58] (CR) Ilias Sarantopoulos: [C: +2] ml-services: fix origins in ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/961052 (owner: Ilias Sarantopoulos) [09:21:32] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [09:21:45] (Merged) jenkins-bot: ml-services: fix origins in ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/961052 (owner: Ilias Sarantopoulos) [09:22:32] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:22:33] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [09:22:59] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:23:02] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [09:23:23] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:23:56] (CR) Clément Goubert: [C: +2] httpbb: Fix test.wikidata.org item [puppet] - https://gerrit.wikimedia.org/r/961037 (owner: Clément Goubert) [09:24:32] (PS10) Fabfur: vanish: allow PURGE requests only from dedicated socket [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [09:24:41] (CR) Giuseppe Lavagetto: [C: +1] mw-web, mw-api-ext: Raise replicas, raise apcu size [deployment-charts] - https://gerrit.wikimedia.org/r/961048 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert) [09:24:57] (CR) CI reject: [V: -1] vanish: allow PURGE requests only from dedicated socket [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur) [09:25:03] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [09:25:07] (CR) Clément Goubert: [C: +1] trafficserver: move 6.5% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/957857 (https://phabricator.wikimedia.org/T346422) (owner: Giuseppe Lavagetto) [09:25:15] (CR) Fabfur: vanish: allow PURGE requests only from dedicated socket (5 comments) [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur) [09:26:22] (PS11) Fabfur: vanish: allow PURGE requests only from dedicated socket [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [09:26:47] (CR) CI reject: 
[V: -1] vanish: allow PURGE requests only from dedicated socket [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur) [09:26:52] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [09:26:52] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:27:04] (PS1) Majavah: cr-cloud: Remove unused terms [homer/public] - https://gerrit.wikimedia.org/r/961054 (https://phabricator.wikimedia.org/T346439) [09:27:14] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [09:27:36] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:50] (CR) Vgutierrez: vanish: allow PURGE requests only from dedicated socket (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) (owner: Fabfur) [09:28:10] (PS12) Fabfur: vanish: allow PURGE requests only from dedicated socket [puppet] - https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [09:28:20] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:28:53] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [09:29:41] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [09:29:42] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:29:55] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [09:30:38] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [09:31:01] (PS1) Majavah: cr-labs: Permit wiki replica account creation related flows [homer/public] - https://gerrit.wikimedia.org/r/961055 (https://phabricator.wikimedia.org/T347381) [09:32:03] (CR) EoghanGaffney: [C: +1] "Sorry this one flew under the radar!" [puppet] - https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172) (owner: Hashar) [09:32:33] (CR) Clément Goubert: [C: +2] mw-web, mw-api-ext: Raise replicas, raise apcu size [deployment-charts] - https://gerrit.wikimedia.org/r/961048 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert) [09:32:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:32:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:33:16] (Merged) jenkins-bot: mw-web, mw-api-ext: Raise replicas, raise apcu size [deployment-charts] - https://gerrit.wikimedia.org/r/961048 (https://phabricator.wikimedia.org/T346422) (owner: Clément Goubert) [09:33:43] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [09:33:47] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [09:34:21] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [09:34:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [09:34:31] 
(CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43607/console" [puppet] - https://gerrit.wikimedia.org/r/960603 (owner: Muehlenhoff) [09:34:51] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [09:34:56] (CR) Giuseppe Lavagetto: [C: +1] service: allow disabling icinga checks for 'node' [puppet] - https://gerrit.wikimedia.org/r/961002 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi) [09:34:58] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [09:35:05] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [09:35:11] (CR) Giuseppe Lavagetto: [C: +1] restbase: disable per-host icinga checks [puppet] - https://gerrit.wikimedia.org/r/961003 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi) [09:35:30] !log Raised replicas to 20 for mw-api-ext and mw-web - T346422 [09:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:36] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [09:35:37] T346422: Move 10% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T346422 [09:35:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [09:36:14] (CR) EoghanGaffney: [C: +1] phabricator deployment: restart php when finalizing deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956486 (https://phabricator.wikimedia.org/T314460) (owner: Brennen Bearnes) [09:36:17] (CR) Filippo Giunchedi: [C: +2] service: allow disabling icinga checks for 'node' [puppet] - https://gerrit.wikimedia.org/r/961002 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi) [09:36:26] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [09:36:56] (CR) Gmodena: [C: 
+1] add search update pipeline streams (update + fetch_error) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: Peter Fischer) [09:36:59] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [09:37:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:45] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [09:37:48] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/mcrouter] - https://gerrit.wikimedia.org/r/960637 (owner: FNegri) [09:38:15] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [09:38:21] (CR) Filippo Giunchedi: [C: +2] restbase: disable per-host icinga checks [puppet] - https://gerrit.wikimedia.org/r/961003 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi) [09:38:41] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [09:38:55] (CR) Jbond: [V: +1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/960603 (owner: Muehlenhoff) [09:39:09] (PS3) Klausman: APIGW: add entry for multilingual readability LW isvc [deployment-charts] - https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) [09:40:14] (CR) Klausman: [C: +2] APIGW: add entry for multilingual readability LW isvc [deployment-charts] - https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [09:40:24] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:40:32] (CR) FNegri: [V: +2] Package for Debian Bookworm [debs/mcrouter] - 
https://gerrit.wikimedia.org/r/959212 (https://phabricator.wikimedia.org/T346762) (owner: FNegri) [09:41:01] (Merged) jenkins-bot: APIGW: add entry for multilingual readability LW isvc [deployment-charts] - https://gerrit.wikimedia.org/r/959684 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [09:41:28] (CR) FNegri: [C: +2] d/changelog: bump version [debs/mcrouter] - https://gerrit.wikimedia.org/r/959316 (owner: David Caro) [09:41:30] (CR) FNegri: [V: +2 C: +2] d/changelog: bump version [debs/mcrouter] - https://gerrit.wikimedia.org/r/959316 (owner: David Caro) [09:41:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:41:41] (CR) FNegri: [C: +2] Add more details to Readme [debs/mcrouter] - https://gerrit.wikimedia.org/r/960637 (owner: FNegri) [09:41:43] (CR) FNegri: [V: +2 C: +2] Add more details to Readme [debs/mcrouter] - https://gerrit.wikimedia.org/r/960637 (owner: FNegri) [09:43:27] (CR) Arturo Borrero Gonzalez: cr-cloud: Remove unused terms (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/961054 (https://phabricator.wikimedia.org/T346439) (owner: Majavah) [09:44:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:45:25] (CR) Arturo Borrero Gonzalez: [C: +1] cr-labs: Permit wiki replica account creation related flows [homer/public] - https://gerrit.wikimedia.org/r/961055 (https://phabricator.wikimedia.org/T347381) (owner: Majavah) [09:46:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:46:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:47:37] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: sync [09:47:48] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [09:47:56] SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi) [09:48:16] !log remove per-host restbase healthchecks, replaced by service-level swagger-exporter checks - T314118 [09:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:22] T314118: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 [09:48:38] (PS1) Muehlenhoff: Don't include the OpenLDAP exporter on Bookworm [puppet] - https://gerrit.wikimedia.org/r/961059 (https://phabricator.wikimedia.org/T266147) [09:49:26] (PS2) Muehlenhoff: Don't include the OpenLDAP exporter on Bookworm [puppet] - https://gerrit.wikimedia.org/r/961059 (https://phabricator.wikimedia.org/T266147) [09:49:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:49:34] (PS1) FNegri: Fix reprepro command [debs/mcrouter] - https://gerrit.wikimedia.org/r/961060 [09:50:24] PROBLEM - Check systemd state on ldap-rw2001 is CRITICAL: CRITICAL - degraded: The following units failed: slapd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:51:56] (PS2) Giuseppe Lavagetto: eventstreams: migrate to mw-api-int [deployment-charts] - https://gerrit.wikimedia.org/r/957934 [09:51:59] (CR) Muehlenhoff: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43609/console" [puppet] - https://gerrit.wikimedia.org/r/960603 (owner: Muehlenhoff) [09:52:14] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [09:52:43] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [09:53:27] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [09:53:44] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:45] (CR) Arturo Borrero Gonzalez: [C: +1] "it can be `sudo -i` too." [debs/mcrouter] - https://gerrit.wikimedia.org/r/961060 (owner: FNegri) [09:53:53] (CR) David Caro: [C: +1] Fix reprepro command [debs/mcrouter] - https://gerrit.wikimedia.org/r/961060 (owner: FNegri) [09:54:14] PROBLEM - LDAP -writable server- on ldap-rw2001 is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [09:54:14] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [09:54:48] SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) @Jhancock.wm not 100%, I will try to chase on that. 
[09:54:53] (03CR) 10Muehlenhoff: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43610/console" [puppet] - 10https://gerrit.wikimedia.org/r/961059 (https://phabricator.wikimedia.org/T266147) (owner: 10Muehlenhoff) [09:56:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] eventstreams: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957934 (owner: 10Giuseppe Lavagetto) [09:56:30] (03CR) 10FNegri: [C: 03+2] Fix reprepro command [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/961060 (owner: 10FNegri) [09:56:33] (03CR) 10FNegri: [V: 03+2 C: 03+2] Fix reprepro command [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/961060 (owner: 10FNegri) [09:57:00] (03Merged) 10jenkins-bot: eventstreams: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957934 (owner: 10Giuseppe Lavagetto) [09:57:12] (03Abandoned) 10Clément Goubert: httpbb: Switch to a different entity for testwikidata [puppet] - 10https://gerrit.wikimedia.org/r/960693 (owner: 10RLazarus) [09:57:59] (03CR) 10Muehlenhoff: [V: 03+1 C: 03+2] Don't include the OpenLDAP exporter on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/961059 (https://phabricator.wikimedia.org/T266147) (owner: 10Muehlenhoff) [09:58:08] (03CR) 10Jbond: [C: 04-1] "lgtp but please use dnsquery over iplookup" [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [09:58:16] (03PS1) 10Filippo Giunchedi: maps: remove per-host healthchck [puppet] - 10https://gerrit.wikimedia.org/r/961062 (https://phabricator.wikimedia.org/T314118) [09:58:44] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:04] (ProbeDown) firing: Service centrallog2002:6514 
has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1000) [10:00:16] (03CR) 10Jbond: [C: 03+1] postgresql: fix ordering on a new install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [10:00:22] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [10:00:36] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [10:00:52] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [10:01:54] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [10:02:24] (03CR) 10Majavah: cr-cloud: Remove unused terms (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/961054 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [10:02:36] (03CR) 10Majavah: [C: 03+2] cr-labs: Permit wiki replica account creation related flows [homer/public] - 10https://gerrit.wikimedia.org/r/961055 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [10:02:56] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [10:03:09] (03Merged) 10jenkins-bot: cr-labs: Permit wiki replica account creation related flows [homer/public] - 10https://gerrit.wikimedia.org/r/961055 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [10:03:27] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [10:03:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/960641 
(owner: 10Muehlenhoff) [10:03:47] !log update CR firewall policy to permit wiki replica account creation in the new cloud-private network setup, https://gerrit.wikimedia.org/r/961055 T347381 [10:03:53] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [10:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:55] T347381: Support maintain-dbusers on the new network layout - https://phabricator.wikimedia.org/T347381 [10:04:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:04:06] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [10:04:08] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:18] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [10:04:55] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [10:05:19] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [10:05:43] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [10:07:52] (03PS1) 10Filippo Giunchedi: sre: swagger probe failure to critical [alerts] - 10https://gerrit.wikimedia.org/r/961063 [10:10:09] (03CR) 10Jbond: [C: 04-1] prometheus-postgres-exporter: install configs before service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [10:10:24] (03CR) 
10Arturo Borrero Gonzalez: [C: 03+1] cr-cloud: Remove unused terms (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/961054 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [10:13:58] (03CR) 10Jbond: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960603 (owner: 10Muehlenhoff) [10:18:44] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:08] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:22:21] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:22:27] (03PS1) 10Muehlenhoff: On Bookworm ship ppolicy.schema via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/961066 (https://phabricator.wikimedia.org/T331699) [10:25:13] (03CR) 10Muehlenhoff: firewall: Also support Stdlib::Port::Unprivileged in Ferm::Port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:26:05] (03Abandoned) 10Muehlenhoff: firewall: Also support Stdlib::Port::Unprivileged in Ferm::Port [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:26:41] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb: Remove obsolete Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/960641 (owner: 10Muehlenhoff) [10:27:30] (03PS1) 10Majavah: wiki-replicas.sql: Drop grants for old labstore hosts [puppet] - 
10https://gerrit.wikimedia.org/r/961067 [10:27:32] (03PS1) 10Majavah: Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) [10:28:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [10:29:04] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: change service_name on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/960632 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [10:29:46] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): deomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (10jbond) p:05Triage→03Medium [10:31:22] (03PS1) 10Jbond: site.pp: puppetboard[12]002 move top insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/961069 (https://phabricator.wikimedia.org/T347285) [10:31:24] (03PS1) 10Jbond: site.pp: puppetdb[12]002 migrate to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/961070 (https://phabricator.wikimedia.org/T347285) [10:32:58] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [10:33:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) @Jhancock.wm I think (from netbox) that moss-be2003 is a PowerEdge R740xd2 - ConfigJ 202107 system. We have a number of these nodes wh... 
[10:35:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961066 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [10:35:16] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [10:35:49] (03PS13) 10Fabfur: varnish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [10:35:54] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:36:35] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate all eventgate installations to mw-api-int - https://phabricator.wikimedia.org/T346448 (10Joe) 05Open→03Resolved a:03Joe [10:36:36] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:36:42] (03Abandoned) 10Giuseppe Lavagetto: mw-api-int: increase replicas for movement of wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/957935 (https://phabricator.wikimedia.org/T346447) (owner: 10Giuseppe Lavagetto) [10:36:46] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [10:37:10] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-worker1086.eqiad.wmnet with reason: Downtiming host for RAID controller battery replacement [10:37:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1086.eqiad.wmnet with reason: Downtiming host for RAID controller battery replacement [10:37:41] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:37:45] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:37:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - 
https://phabricator.wikimedia.org/T347287 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d9fc4cc1-c0d4-4a6d-83d0-127f1d08a401) set by btullis@cumin1001 for 3 days, 0:00:00 on 1 host(s)... [10:38:01] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:38:05] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:38:20] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:38:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10BTullis) I have shut down the host, so it is ready for work. Feel free to boot it normally when finished. Thanks @Jclark-ctr. [10:39:15] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:40:35] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:40:36] <_joe_> jouncebot: now [10:40:36] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1000) [10:40:39] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:41:01] (03PS2) 10Jelto: gitlab: use one sshkey for gitlab and remove suffix [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) [10:41:05] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:41:08] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:42:30] (03PS2) 10Giuseppe Lavagetto: wikifeeds: add networkpolicy for egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/957936 (https://phabricator.wikimedia.org/T346447) [10:42:36] (03CR) 10Jelto: gitlab: use one sshkey for gitlab and remove suffix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [10:44:31] 
(03PS1) 10Muehlenhoff: Add Cumin aliases for maps masters [puppet] - 10https://gerrit.wikimedia.org/r/961072 [10:44:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikifeeds: add networkpolicy for egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/957936 (https://phabricator.wikimedia.org/T346447) (owner: 10Giuseppe Lavagetto) [10:45:45] (03Merged) 10jenkins-bot: wikifeeds: add networkpolicy for egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/957936 (https://phabricator.wikimedia.org/T346447) (owner: 10Giuseppe Lavagetto) [10:46:35] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:46:39] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:46:54] (03CR) 10Majavah: [C: 04-1] gitlab: use one sshkey for gitlab and remove suffix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [10:46:54] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:46:58] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:47:31] (03PS4) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179 [10:51:15] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:51:32] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:53:43] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [10:54:27] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [10:55:01] PROBLEM - DPKG on stat1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:55:24] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [10:55:52] !log oblivian@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/wikifeeds: apply [10:56:12] (03PS2) 10Giuseppe Lavagetto: wikifeeds: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957937 (https://phabricator.wikimedia.org/T346447) [10:56:23] (03CR) 10Effie Mouzeli: [C: 03+1] Update chromium-render to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:56:44] (03CR) 10Effie Mouzeli: [C: 03+1] Update machinetranslation to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/960625 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:57:06] (03PS1) 10Muehlenhoff: Add a cookbook to restart/reboot maps masters [cookbooks] - 10https://gerrit.wikimedia.org/r/961074 [11:00:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [11:01:21] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (10Aklapper) [11:05:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:55] _joe_: are you still deploying or can I sneak in one more mediawiki thing? 
[11:07:07] !log joal@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [11:07:20] !log joal@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [11:08:09] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/960671 [11:09:05] <_joe_> taavi:sorry, please do [11:09:33] thanks [11:09:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961051 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [11:10:14] <_joe_> nemo-yiannis: are you deploying wikifeeds? [11:10:19] (03Merged) 10jenkins-bot: wikitech: Properly disable password resets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961051 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [11:10:22] _joe_: no [11:10:25] <_joe_> if so, I'll pause on the switch to mw on k8s [11:10:30] <_joe_> ah ok :) [11:10:44] the patch is automatically created on each merge [11:10:45] !log taavi@deploy2002 Started scap: Backport for [[gerrit:961051|wikitech: Properly disable password resets (T345226)]] [11:10:52] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [11:10:57] <_joe_> yeah, hence I was asking :) [11:11:13] (03CR) 10Michael Große: [C: 03+1] "I checked that this covers all things listed in https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Enable_Client step 5.2." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/960066 (https://phabricator.wikimedia.org/T342857) (owner: 10Lucas Werkmeister (WMDE)) [11:12:15] !log taavi@deploy2002 taavi: Backport for [[gerrit:961051|wikitech: Properly disable password resets (T345226)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:12:19] !log taavi@deploy2002 taavi: Continuing with sync [11:16:15] (03PS1) 10Effie Mouzeli: Update tegola-vector-tiles to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961077 (https://phabricator.wikimedia.org/T300033) [11:17:23] (03PS1) 10Effie Mouzeli: Update push-notifications to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961078 (https://phabricator.wikimedia.org/T300033) [11:18:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T343198)', diff saved to https://phabricator.wikimedia.org/P52631 and previous config saved to /var/cache/conftool/dbconfig/20230926-111834-arnaudb.json [11:18:41] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:18:43] (03CR) 10Jbond: [C: 03+2] site.pp: puppetboard[12]002 move top insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/961069 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [11:18:46] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:961051|wikitech: Properly disable password resets (T345226)]] (duration: 08m 00s) [11:18:46] (03CR) 10Jbond: [C: 03+2] site.pp: puppetdb[12]002 migrate to insetup::infrastructure_foundations [puppet] - 10https://gerrit.wikimedia.org/r/961070 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [11:18:54] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [11:20:39] 
(03PS1) 10Majavah: wikitech: $wgPasswordResetRoutes takes an empty array, not false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961079 [11:20:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961079 (owner: 10Majavah) [11:21:33] (03Merged) 10jenkins-bot: wikitech: $wgPasswordResetRoutes takes an empty array, not false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961079 (owner: 10Majavah) [11:21:56] !log taavi@deploy2002 Started scap: Backport for [[gerrit:961079|wikitech: $wgPasswordResetRoutes takes an empty array, not false]] [11:22:06] (03CR) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:22:20] (03CR) 10Jbond: [C: 03+1] Add Cumin aliases for maps masters [puppet] - 10https://gerrit.wikimedia.org/r/961072 (owner: 10Muehlenhoff) [11:23:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/961074 (owner: 10Muehlenhoff) [11:23:22] !log taavi@deploy2002 taavi: Backport for [[gerrit:961079|wikitech: $wgPasswordResetRoutes takes an empty array, not false]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:23:27] !log taavi@deploy2002 taavi: Continuing with sync [11:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:26:57] (03CR) 10Jbond: [C: 04-1] "lgtm but still need to update the gid on the gourp" [puppet] - 10https://gerrit.wikimedia.org/r/949001 
(https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:27:13] (03PS1) 10Muehlenhoff: firewall: Add explicit check for provider == 'none' [puppet] - 10https://gerrit.wikimedia.org/r/961081 (https://phabricator.wikimedia.org/T336497) [11:27:27] (03PS2) 10Muehlenhoff: firewall: Add explicit check for provider == 'none' [puppet] - 10https://gerrit.wikimedia.org/r/961081 (https://phabricator.wikimedia.org/T336497) [11:28:06] (03CR) 10Ayounsi: [C: 03+1] "Nice! thanks for taking the time to clean unused things up!" [homer/public] - 10https://gerrit.wikimedia.org/r/961054 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [11:29:25] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:961079|wikitech: $wgPasswordResetRoutes takes an empty array, not false]] (duration: 07m 28s) [11:29:34] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:29:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961081 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:29:49] (03PS10) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [11:31:18] (03PS1) 10Muehlenhoff: linuxbridge: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/961082 [11:31:20] (03CR) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:31:25] (03CR) 10Ayounsi: [C: 03+1] cr-labs: Permit wiki replica account creation related flows [homer/public] - 10https://gerrit.wikimedia.org/r/961055 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [11:31:53] <_joe_> taavi: 
are you done? [11:32:15] yes [11:33:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P52632 and previous config saved to /var/cache/conftool/dbconfig/20230926-113340-arnaudb.json [11:34:34] (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:34:56] sigh, found the issue :D https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/LdapAuthentication/+/refs/heads/master/includes/LdapAuthenticationHooks.php#155 overrides the setting I'm trying to set [11:36:25] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin aliases for maps masters [puppet] - 10https://gerrit.wikimedia.org/r/961072 (owner: 10Muehlenhoff) [11:36:37] <_joe_> taavi: so there's another patch coming? [11:36:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] arwikisource: Increase autoconfirm edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961030 (https://phabricator.wikimedia.org/T347264) (owner: 10Ammarpad) [11:36:56] no, not right now at least [11:37:12] I'll need to dig a bit more into the relevant code to figure out how to best do this [11:37:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable Minerva site notice for wikifunctions wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961038 (https://phabricator.wikimedia.org/T345463) (owner: 10Ammarpad) [11:38:01] <_joe_> ack [11:38:32] <_joe_> jouncebot: next [11:38:32] In 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1200) [11:38:50] <_joe_> heh I guess I'll do the wikifeeds move then :) [11:41:03] (03PS6) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [11:45:40] (03CR) 10CI reject: [V: 04-1] puppet: Add new PuppetServer class 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [11:47:38] (03PS1) 10Jbond: puppetmasters: remove puppetmasters[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/961085 (https://phabricator.wikimedia.org/T345067) [11:48:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P52633 and previous config saved to /var/cache/conftool/dbconfig/20230926-114848-arnaudb.json [11:49:19] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create backups for puppetservers - https://phabricator.wikimedia.org/T347390 (10jbond) [11:49:41] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create backups for puppetservers - https://phabricator.wikimedia.org/T347390 (10jbond) p:05Triage→03Medium [11:50:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:51:42] (03CR) 10Jbond: [C: 03+2] puppetmasters: remove puppetmasters[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/961085 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [11:55:48] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create backups for puppetservers - https://phabricator.wikimedia.org/T347390 (10jcrespo) I remember Moritz asking me about this some time ago, I think. I thought this was done back then? [11:56:05] 10SRE, 10Prod-Kubernetes, 10Kubernetes: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171 (10JMeybohm) From IRC discussion today: * we do have already some for loops in the dns that generate `kubernetes-pod-10-64-x-y.eqiad.wmnet` records (grep for 'range' in dns repo) * the ranges are... 
[11:56:56] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create backups for puppetservers - https://phabricator.wikimedia.org/T347390 (10jbond) >>! In T347390#9198967, @jcrespo wrote: > I remember Moritz asking me about this some time ago, I think. I... [11:58:17] (03PS1) 10Jbond: site.pp: move puppetmasters[12]004 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/961089 (https://phabricator.wikimedia.org/T345067) [11:58:45] (03CR) 10Jbond: [C: 03+2] site.pp: move puppetmasters[12]004 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/961089 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [12:00:01] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create backups for puppetservers - https://phabricator.wikimedia.org/T347390 (10jcrespo) >>! In T347390#9198984, @jbond wrote: >>>! In T347390#9198967, @jcrespo wrote: >> I remember Moritz aski... 
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1200) [12:00:25] !log jbond@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster2004.codfw.wmnet [12:01:56] (03PS1) 10Jbond: puppetdb::bookworm: drop puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/961090 (https://phabricator.wikimedia.org/T345067) [12:02:10] (03CR) 10Jbond: [C: 03+2] puppetdb::bookworm: drop puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/961090 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [12:03:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T343198)', diff saved to https://phabricator.wikimedia.org/P52634 and previous config saved to /var/cache/conftool/dbconfig/20230926-120355-arnaudb.json [12:03:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:04:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:04:15] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [12:04:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52635 and previous config saved to /var/cache/conftool/dbconfig/20230926-120417-arnaudb.json [12:05:39] (03PS1) 10Majavah: galera: Fix some ordering issues [puppet] - 
10https://gerrit.wikimedia.org/r/961092 [12:05:54] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetmaster1004.eqiad.wmnet [12:06:15] (03CR) 10Majavah: [C: 03+2] cr-cloud: Remove unused terms [homer/public] - 10https://gerrit.wikimedia.org/r/961054 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [12:07:08] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43623/console" [puppet] - 10https://gerrit.wikimedia.org/r/961092 (owner: 10Majavah) [12:07:22] (03Merged) 10jenkins-bot: cr-cloud: Remove unused terms [homer/public] - 10https://gerrit.wikimedia.org/r/961054 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [12:08:32] (03PS7) 10AOkoth: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) [12:08:42] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [12:10:02] !log jbond@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:10:03] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts puppetmaster2004.codfw.wmnet [12:10:13] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin2002 for hosts: `puppetmaster2004.co... 
[12:10:35] !log deploy https://gerrit.wikimedia.org/r/961054 via homer [12:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:33] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [12:12:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetdb2002.codfw.wmnet [12:15:01] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [12:16:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [12:16:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:16:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster1004.eqiad.wmnet [12:16:15] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetmaster1004.eqiad.wmnet` - puppetmas... 
[12:16:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetdb2002.codfw.wmnet [12:17:09] (03PS7) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [12:17:46] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [12:20:42] (03CR) 10EoghanGaffney: [C: 03+1] phabricator: add configuration for the remote aphlict server [puppet] - 10https://gerrit.wikimedia.org/r/961045 (https://phabricator.wikimedia.org/T346321) (owner: 10Jaime Nuche) [12:22:39] (03CR) 10Majavah: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/961081 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:22:41] (03PS1) 10Muehlenhoff: Add OIDC/datahub stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/961094 (https://phabricator.wikimedia.org/T305874) [12:26:01] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:27:04] (03PS1) 10EoghanGaffney: [gitlab/failover] Fix command line args [cookbooks] - 10https://gerrit.wikimedia.org/r/961096 [12:28:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [12:28:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [12:30:14] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:57] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/961096 (owner: 10EoghanGaffney) [12:31:15] (03CR) 10AOkoth: [C: 03+1] [gitlab/failover] Fix command line args [cookbooks] - 10https://gerrit.wikimedia.org/r/961096 (owner: 10EoghanGaffney) [12:31:28] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Fix command 
line args [cookbooks] - 10https://gerrit.wikimedia.org/r/961096 (owner: 10EoghanGaffney) [12:31:30] (03CR) 10Jelto: [C: 03+1] [gitlab/failover] Fix command line args (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/961096 (owner: 10EoghanGaffney) [12:31:49] (03CR) 10Peter Fischer: [C: 03+1] cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [12:32:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961081 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:33:14] (03CR) 10David Caro: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/961082 (owner: 10Muehlenhoff) [12:33:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [12:34:03] (03Merged) 10jenkins-bot: [gitlab/failover] Fix command line args [cookbooks] - 10https://gerrit.wikimedia.org/r/961096 (owner: 10EoghanGaffney) [12:34:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [12:34:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [12:35:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [12:35:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [12:36:03] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [12:37:28] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Rename puppetmaster1004 to puppetserver1004 - https://phabricator.wikimedia.org/T347395 (10jbond) [12:38:12] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster2004 to puppetserver2004 - https://phabricator.wikimedia.org/T347396 (10jbond) [12:38:37] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster1004 to puppetserver1004 - https://phabricator.wikimedia.org/T347395 (10jbond) [12:42:11] (03PS1) 10EoghanGaffney: [gitlab/failover] Fix arg for task-id [cookbooks] - 10https://gerrit.wikimedia.org/r/961097 [12:42:41] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster1004 to puppetserver1003 - https://phabricator.wikimedia.org/T347395 (10jbond) [12:43:00] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster2004 to puppetserver2002 - https://phabricator.wikimedia.org/T347396 (10jbond) [12:44:16] 10SRE, 10Gerrit, 10Release-Engineering-Team (Seen): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (10LSobanski) [12:44:25] (03CR) 10Jelto: [C: 03+1] "lgtm, one comment in line" [cookbooks] - 10https://gerrit.wikimedia.org/r/961097 (owner: 10EoghanGaffney) [12:45:14] (03CR) 10AOkoth: [C: 03+1] [gitlab/failover] Fix arg for task-id [cookbooks] - 10https://gerrit.wikimedia.org/r/961097 (owner: 10EoghanGaffney) [12:45:24] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (10Jdforrester-WMF) [12:46:02] (03PS1) 10JMeybohm: wikifunctions: 
Enable/Use service-mesh to reach mw-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/961099 (https://phabricator.wikimedia.org/T344998) [12:46:11] (03PS2) 10EoghanGaffney: [gitlab/failover] Fix arg for task-id [cookbooks] - 10https://gerrit.wikimedia.org/r/961097 [12:46:56] (03CR) 10Jelto: [C: 03+1] "lgtm, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/961097 (owner: 10EoghanGaffney) [12:48:14] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2019.codfw.wmnet with OS bullseye [12:48:21] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2019.codfw.wmnet with OS bullseye [12:49:41] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Fix arg for task-id [cookbooks] - 10https://gerrit.wikimedia.org/r/961097 (owner: 10EoghanGaffney) [12:49:44] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [12:51:26] (03CR) 10Jforrester: "Huh, does this mean without an explicit file that staging wasn't inheriting these?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/961099 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [12:52:03] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster[12]004 - jbond@cumin1001" [12:52:06] (03Merged) 10jenkins-bot: [gitlab/failover] Fix arg for task-id [cookbooks] - 10https://gerrit.wikimedia.org/r/961097 (owner: 10EoghanGaffney) [12:52:37] !log jbond@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver2002 [12:52:37] !log jbond@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver2002 [12:52:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster[12]004 - jbond@cumin1001" [12:52:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:07] (03PS1) 10Majavah: Do not set $wgPasswordResetRoutes['domain'] [extensions/LdapAuthentication] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/960742 (https://phabricator.wikimedia.org/T345226) [12:53:17] !log jbond@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver2002 [12:53:17] !log jbond@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver2002 [12:53:18] (03PS1) 10Majavah: Do not set $wgPasswordResetRoutes['domain'] [extensions/LdapAuthentication] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/960743 (https://phabricator.wikimedia.org/T345226) [12:53:47] !log jbond@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver1003 [12:54:34] !log aokoth@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab2002.wikimedia.org to gitlab1003.wikimedia.org [12:55:01] !log jbond@cumin1001 END (PASS) - 
Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver1003 [12:55:17] (03CR) 10JMeybohm: wikifunctions: Enable/Use service-mesh to reach mw-api (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961099 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [12:55:29] (03CR) 10CI reject: [V: 04-1] Do not set $wgPasswordResetRoutes['domain'] [extensions/LdapAuthentication] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/960742 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [12:55:52] (03CR) 10Majavah: "recheck" [extensions/LdapAuthentication] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/960742 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [12:56:29] James_F: you want to deploy that or should I? [12:56:46] jayme: Please do. [12:56:54] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Enable/Use service-mesh to reach mw-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/961099 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [12:57:42] (03Merged) 10jenkins-bot: wikifunctions: Enable/Use service-mesh to reach mw-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/961099 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [12:57:55] !log jbond@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver2002 [12:57:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver2002 [12:58:18] (03PS1) 10Lucas Werkmeister: Add $wgExternalLinksDomainGaps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) [12:58:25] James_F: ack. Is the curl you used in the phab task "the rigt thing to do" there clearly was a gap between what curl does and what the frontend does that lead to the outage when I changed the firewall rules [12:59:09] jayme: Yes, we fixed that a month ago. 
The one that matter is the third one in the runbook, which is what I pasted in the task yesterday, yes. [12:59:35] (03CR) 10CI reject: [V: 04-1] Add $wgExternalLinksDomainGaps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [12:59:51] James_F: ack [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1300). [13:00:05] pfischer, Ammarpad, Ammarpad, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikifeeds: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957937 (https://phabricator.wikimedia.org/T346447) (owner: 10Giuseppe Lavagetto) [13:00:11] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab2002.wikimedia.org to gitlab1003.wikimedia.org [13:00:40] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: sync [13:00:43] * urbanecm is here, but sees Lucas_WMDE has a patch scheduled [13:00:57] o/ [13:00:58] (03Merged) 10jenkins-bot: wikifeeds: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957937 (https://phabricator.wikimedia.org/T346447) (owner: 10Giuseppe Lavagetto) [13:01:04] I can deploy :) [13:01:06] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [13:01:11] ty! [13:01:15] <_joe_> urbanecm, Lucas_WMDE can I ask to wait a few mins to sync to the whole cluster? [13:01:38] sure! 
[13:01:41] 👍 [13:01:42] * Lucas_WMDE still setting up [13:01:42] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:01:54] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:02:08] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:02:49] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:04:18] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:04:38] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2019.codfw.wmnet with reason: host reimage [13:04:40] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:05:00] logspam-watch appears to be broken [13:05:04] Use of uninitialized value $bar in concatenation (.) or string at /usr/local/bin/logspam line 451. [13:05:30] (it sure looks like $bar is assigned just a few lines earlier, but I don’t know much Perl) [13:06:37] !log jbond@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver2002 [13:06:39] dancy: any idea why that error might happen? (or anyone else, really) [13:07:11] dcausse, gmodena: there’s an unresolved comment on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/960616/, is it still okay to deploy? [13:07:34] Lucas_WMDE: looking [13:07:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2019.codfw.wmnet with reason: host reimage [13:07:53] !log jbond@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver2002 [13:07:56] no hurry, I just noticed neither pfischer nor Ammarpad are around yet ^^ [13:08:00] so maybe I’ll do my change first [13:08:09] (but still waiting on _joe_ at the moment) [13:08:16] Lucas_WMDE: please go ahead yes, I'll ammend the patch [13:08:19] <_joe_> Lucas_WMDE: green light! 
[13:08:24] ok! [13:08:30] then let’s do the wikifunctions [13:08:39] (03PS2) 10Lucas Werkmeister (WMDE): Make wikifunctionswiki a multilingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960066 (https://phabricator.wikimedia.org/T342857) [13:09:24] * Lucas_WMDE notices a hundred thousand errors in logstash o_O [13:10:00] apparently there’s a fix that was merged but not backported (T346365) [13:10:01] T346365: PHP Notice: Undefined index: DEFAULT - https://phabricator.wikimedia.org/T346365 [13:10:01] eh ok then [13:10:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960066 (https://phabricator.wikimedia.org/T342857) (owner: 10Lucas Werkmeister (WMDE)) [13:10:57] Lucas_WMDE: also T347318 [13:10:58] T347318: db2109 crashed - https://phabricator.wikimedia.org/T347318 [13:11:29] (03Merged) 10jenkins-bot: Make wikifunctionswiki a multilingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960066 (https://phabricator.wikimedia.org/T342857) (owner: 10Lucas Werkmeister (WMDE)) [13:11:53] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:960066|Make wikifunctionswiki a multilingual Wikidata client (T342857)]] [13:12:11] T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857 [13:12:39] (03PS4) 10DCausse: add search update pipeline streams (update + fetch_error) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [13:12:46] (03PS1) 10Jbond: sre.netbox: Throw an error if now primary interfaces are found [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 [13:13:06] Hi, I know this is cheeky, but is there any chance to squeeze in a config patch to this backport window? 
it was meant to go yesterday but scap-backport was having issues [13:13:16] (03CR) 10DCausse: [C: 03+1] add search update pipeline streams (update + fetch_error) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [13:13:20] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:960066|Make wikifunctionswiki a multilingual Wikidata client (T342857)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:13:26] * Lucas_WMDE testing [13:13:33] HouseOfM: if there’s enough time, sure [13:13:42] right now some other people who wanted changes deployed aren’t there yet [13:13:47] so you can try it ^^ [13:13:47] Awesome, thank you! [13:14:06] Lucas_WMDE, I'm here (late) [13:14:18] ah, different name [13:14:19] ok then [13:14:22] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate wikifeeds to mw-api-int - https://phabricator.wikimedia.org/T346447 (10Joe) 05Open→03Resolved a:03Joe [13:14:26] added my deets to wiki [13:14:32] (03CR) 10Jbond: [C: 03+1] Remove minversion=1.6 from tox.ini files [puppet] - 10https://gerrit.wikimedia.org/r/960064 (owner: 10Hashar) [13:14:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [13:14:46] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [13:14:55] tada https://www.wikidata.org/w/index.php?title=Q5296&diff=prev&oldid=1981977326 [13:15:06] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [13:15:15] and after a purge it’s also in the https://www.wikifunctions.org/wiki/Wikifunctions:Main_Page sidebar \o/ [13:15:19] I’ll call that a 
successful test [13:15:20] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [13:15:33] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [13:18:17] !log aokoth@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab2002.wikimedia.org to gitlab1003.wikimedia.org [13:18:29] (03PS1) 10Elukey: role::cache::text: set pass for ores.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/961106 (https://phabricator.wikimedia.org/T347344) [13:19:04] James_F: I don't see in issue in staging or in eqiad [13:19:19] jayme: Let me check. [13:19:44] metrics say the path through envoy is used https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?forceLogin&orgId=1&var-app=function-orchestrator&var-datasource=thanos&var-destination=All&var-prometheus=k8s&var-site=eqiad&from=now-15m&to=now&refresh=30s [13:19:47] I wonder if the logspam(-watch) issue is due to the very high error level (all those “undefined index: DEFAULT” notices) [13:19:58] but the code looks like it’s supposed to take the number of defined bars into account and not go beyond that [13:20:03] (03CR) 10Vgutierrez: [C: 03+1] role::cache::text: set pass for ores.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/961106 (https://phabricator.wikimedia.org/T347344) (owner: 10Elukey) [13:20:35] taavi: was logspam(-watch) working for you when you were deploying earlier today? [13:20:53] (03CR) 10Elukey: [C: 03+2] role::cache::text: set pass for ores.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/961106 (https://phabricator.wikimedia.org/T347344) (owner: 10Elukey) [13:20:55] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Do you need me to +2 this? Who is planning to deploy it?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [13:21:11] jayme: Yeah, staging is working correctly. This is what I encountered yesterday. 
Is there a URL for the eqiad cluster rather than the discovery one? [13:21:38] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:960066|Make wikifunctionswiki a multilingual Wikidata client (T342857)]] (duration: 09m 44s) [13:21:45] T342857: Add wikidata support for wikifunctionswiki - https://phabricator.wikimedia.org/T342857 [13:21:51] James_F: you have to make curl resolve differently: curl --resolve wikifunctions.discovery.wmnet:30443:10.2.2.70 https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/ [13:22:07] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [13:22:13] to force it to go to eqiad [13:22:15] Ah, right. [13:22:28] well...probably the eqiad service record works as well [13:22:55] jayme: That indeed seems to work in eqiad too. Huh. [13:23:24] Lucas_WMDE: I can watch for pfischer's patch if it does not join in time (issues with irccloud) [13:23:38] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [13:23:39] o/ [13:23:45] jayme: In that case, I'm now very confused as to why it broke for codfw. [13:23:45] good timing :D [13:23:47] :) [13:23:51] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [13:23:54] alright, then going ahead with that [13:23:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T343198)', diff saved to https://phabricator.wikimedia.org/P52637 and previous config saved to /var/cache/conftool/dbconfig/20230926-132357-arnaudb.json [13:24:04] (03CR) 10Ayounsi: "overall lgtm 2 small comments." [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 (owner: 10Jbond) [13:24:05] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:24:05] Lucas_WMDE: thank you. Sorry for the delay. 
[13:24:08] (and assuming jayme / James_F discussion is not a deployment blocker unless they say otherwise) [13:24:12] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster[12]004 - jbond@cumin1001" [13:24:20] Lucas_WMDE: nono, sorry [13:24:24] (03PS5) 10Lucas Werkmeister (WMDE): add search update pipeline streams (update + fetch_error) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [13:24:32] we move somewhere else [13:24:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [13:24:41] no problem, just wanted to be explicit ^^ [13:24:43] I don’t mind [13:25:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster[12]004 - jbond@cumin1001" [13:25:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:25:28] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetserver2002.codfw.wmnet on all recursors [13:25:30] (03Merged) 10jenkins-bot: add search update pipeline streams (update + fetch_error) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960616 (https://phabricator.wikimedia.org/T317609) (owner: 10Peter Fischer) [13:25:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetserver2002.codfw.wmnet on all recursors [13:25:46] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2002.codfw.wmnet with OS bookworm [13:25:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host puppetserver2002.codfw.wmnet with OS... [13:25:56] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:960616|add search update pipeline streams (update + fetch_error) (T317609)]] [13:26:03] T317609: Create a schema for fetch failures - https://phabricator.wikimedia.org/T317609 [13:27:19] !log lucaswerkmeister-wmde@deploy2002 pfischer and lucaswerkmeister-wmde: Backport for [[gerrit:960616|add search update pipeline streams (update + fetch_error) (T317609)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:27:37] pfischer: can you test the change on mwdebug? [13:28:01] Sure, looking into it. [13:28:04] (03PS1) 10JMeybohm: wikifunctions: Switch all clusters to use the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/961108 (https://phabricator.wikimedia.org/T344998) [13:28:44] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:29:14] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetserver1003.eqiad.wmnet with OS bookworm [13:31:01] Lucas_WMDE: I can confirm that the changes are effective (I can see the streams listed for meta) [13:31:23] ok, thanks! 
[13:31:24] !log lucaswerkmeister-wmde@deploy2002 pfischer and lucaswerkmeister-wmde: Continuing with sync [13:31:25] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Switch all clusters to use the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/961108 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [13:31:39] (03CR) 10Muehlenhoff: [C: 03+2] firewall: Add explicit check for provider == 'none' [puppet] - 10https://gerrit.wikimedia.org/r/961081 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:31:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2019.codfw.wmnet with OS bullseye [13:31:58] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2019.codfw.wmnet with OS bullseye completed: - restbase20... 
[13:32:26] (03Merged) 10jenkins-bot: wikifunctions: Switch all clusters to use the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/961108 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [13:34:47] (03PS2) 10Jbond: sre.netbox: Throw an error if now primary interfaces are found [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 [13:34:48] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [13:35:20] (03PS5) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179 [13:35:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [13:35:38] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [13:36:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [13:36:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [13:36:27] (03PS3) 10Jbond: sre.netbox: Throw an error if now primary interfaces are found [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 [13:36:34] (03CR) 10Jbond: "thanks updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 (owner: 10Jbond) [13:36:45] (03PS2) 10Lucas Werkmeister: Add $wgExternalLinksDomainGaps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) [13:36:49] jouncebot: now [13:36:50] For the next 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1300) [13:36:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [13:37:24] 10SRE, 10Infrastructure-Foundations, 10netops: Move cr1-esams<->cr2-esams link to QSFP port - https://phabricator.wikimedia.org/T347323 (10ayounsi) Thanks, I 
remembered there was a reason but forgot what it was! I guess it doesn't make much sens to buy a `MIC3-3D-2X40GE-QSFPP` seeing the [[ https://www.juni... [13:37:34] (03CR) 10CI reject: [V: 04-1] Add $wgExternalLinksDomainGaps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [13:37:51] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:960616|add search update pipeline streams (update + fetch_error) (T317609)]] (duration: 11m 54s) [13:37:58] T317609: Create a schema for fetch failures - https://phabricator.wikimedia.org/T317609 [13:38:17] alright, Ammar next [13:38:19] 10SRE, 10Infrastructure-Foundations, 10netops: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (10ayounsi) [13:38:25] OK [13:38:26] (03PS2) 10Lucas Werkmeister (WMDE): arwikisource: Increase autoconfirm edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961030 (https://phabricator.wikimedia.org/T347264) (owner: 10Ammarpad) [13:38:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961030 (https://phabricator.wikimedia.org/T347264) (owner: 10Ammarpad) [13:38:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:39:06] (03PS3) 10Lucas Werkmeister: Add $wgExternalLinksDomainGaps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) [13:39:11] (03Merged) 10jenkins-bot: arwikisource: Increase autoconfirm edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961030 (https://phabricator.wikimedia.org/T347264) (owner: 10Ammarpad) [13:39:37] !log 
lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:961030|arwikisource: Increase autoconfirm edit count to 10 (T347264)]] [13:39:57] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [13:40:50] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2019.codfw.wmnet [13:40:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2019.codfw.wmnet [13:40:59] !log lucaswerkmeister-wmde@deploy2002 ammarpad and lucaswerkmeister-wmde: Backport for [[gerrit:961030|arwikisource: Increase autoconfirm edit count to 10 (T347264)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:41:05] Ammar: please test :) [13:41:10] hm, although… [13:41:14] can this be tested? [13:41:24] (not sure whether this setting is shown anywhere) [13:41:24] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2021.codfw.wmnet with OS bullseye [13:41:32] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2021.codfw.wmnet with OS bullseye [13:41:44] HouseOfM: there’s an all-caps “do not merge” on your change – can you add another comment if it is now okay to merge? 
:P [13:42:01] (not sure whether there will be enough time to deploy this now, but imho that comment needs addressing whether the deployment happens now or later ^^) [13:42:39] stashbot noooo /s [13:42:48] ooooh noooo [13:42:56] (03CR) 10Mhorsey: "OK to merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) (owner: 10Mhorsey) [13:43:04] (03PS3) 10Mhorsey: Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) [13:43:15] Done and done [13:43:21] thanks ^^ [13:43:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT endpoints) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:43:35] (03CR) 10Ayounsi: [C: 03+1] sre.netbox: Throw an error if now primary interfaces are found (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 (owner: 10Jbond) [13:44:00] (03CR) 10Ssingh: [C: 03+2] durum: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [13:44:06] hm, in the code it doesn’t look like the wgAutoConfirmCount is shown anywhere [13:44:15] it’s only used for one comparison in UserGroupManager afaict [13:44:20] Lucas_WMDE yes it's not [13:44:26] ok, then I’ll just sync [13:44:28] !log lucaswerkmeister-wmde@deploy2002 ammarpad and lucaswerkmeister-wmde: Continuing with sync [13:44:34] 10SRE, 10Infrastructure-Foundations, 10netops: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (10cmooney) >>! In T347323#9199448, @ayounsi wrote: > Thanks, I remembered there was a reason but forgot what it was! Yeah it's a shame. I made the same mistake while planning th... 
[13:44:35] the diff looks safe enough after all [13:45:00] hm, though perhaps I should’ve waited for stashbot to come back [13:45:01] I was thinking to check account creation log, but AutoConfirmAge is already set to >= 4 days everywhere [13:45:07] that log is lost now [13:47:20] wb stashbot [13:47:36] !log lucaswerkmeister-wmde@deploy2002 ammarpad and lucaswerkmeister-wmde: Continuing with sync [originally 13:44 UTC] [13:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:07] I *think* that’s the only log that was missed [13:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:04] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:961030|arwikisource: Increase autoconfirm edit count to 10 (T347264)]] (duration: 11m 27s) [13:51:14] T347264: Set minimum requirements of autoconfirmed users to 10 edits on arwikisource - https://phabricator.wikimedia.org/T347264 [13:51:17] (03PS2) 10Lucas Werkmeister (WMDE): Enable Minerva site notice for wikifunctions wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961038 (https://phabricator.wikimedia.org/T345463) (owner: 10Ammarpad) [13:51:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961038 (https://phabricator.wikimedia.org/T345463) (owner: 10Ammarpad) [13:51:32] jouncebot: next [13:51:32] In 2 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1600) [13:52:07] (03Merged) 10jenkins-bot: Enable Minerva site notice for wikifunctions wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961038 (https://phabricator.wikimedia.org/T345463) (owner: 
10Ammarpad) [13:52:12] maybe I’ll just deploy the CampaignEvents change in the break after the window [13:52:22] Looks like I'm not gonna get in, I'll move my patch. Thanks anyway @Lucas_WMDE [13:52:30] effie: or did you want to deploy something? [13:52:33] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:961038|Enable Minerva site notice for wikifunctions wiki (T345463)]] [13:52:40] I won't be around to test after the window [13:52:42] T345463: Enable wgMinervaEnableSiteNotice for wikifunctions - https://phabricator.wikimedia.org/T345463 [13:52:48] HouseOfM: ok, then good luck next time [13:53:03] Lucas_WMDE: we wanted to bump traffic to mw-on-k8s [13:53:10] oooh [13:53:23] I might slightly overrun the window, but I won’t have anything to deploy after it after all [13:53:32] so it hopefully won’t be too long [13:53:37] I can wait, just give me a shout when [13:53:41] no rush [13:53:42] will do [13:53:46] cheers [13:54:08] !log lucaswerkmeister-wmde@deploy2002 ammarpad and lucaswerkmeister-wmde: Backport for [[gerrit:961038|Enable Minerva site notice for wikifunctions wiki (T345463)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:54:21] Ammar: can you test this one? [13:54:39] * Lucas_WMDE sees a site notice on https://m.wikifunctions.org/wiki/Wikifunctions:Main_Page on mwdebug [13:55:13] (03PS4) 10Jbond: sre.netbox: Throw an error if now primary interfaces are found [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 [13:55:23] (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 (owner: 10Jbond) [13:55:27] Lucas_WMDE Yes, I can see the notice on mobile. It looks ok to me [13:55:32] ok thanks! 
[13:55:34] !log lucaswerkmeister-wmde@deploy2002 ammarpad and lucaswerkmeister-wmde: Continuing with sync [13:55:35] (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:27] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2021.codfw.wmnet with reason: host reimage [13:59:49] (03PS1) 10Jbond: site.pp: rename puppetmasters[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/961110 (https://phabricator.wikimedia.org/T345067) [14:00:03] (03CR) 10Jbond: [C: 03+2] site.pp: rename puppetmasters[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/961110 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [14:00:33] (03PS1) 10JMeybohm: Revert "wikifunctions: Switch all clusters to use the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961111 (https://phabricator.wikimedia.org/T347397) [14:01:31] (03CR) 10JMeybohm: [C: 03+2] Revert "wikifunctions: Switch all clusters to use the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961111 (https://phabricator.wikimedia.org/T347397) (owner: 10JMeybohm) [14:01:57] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2021.codfw.wmnet with reason: host reimage [14:01:58] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver1003.eqiad.wmnet with reason: host reimage [14:02:04] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:02:20] (03Merged) 10jenkins-bot: Revert "wikifunctions: Switch all clusters to use the service mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961111 (https://phabricator.wikimedia.org/T347397) (owner: 
10JMeybohm) [14:02:24] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:961038|Enable Minerva site notice for wikifunctions wiki (T345463)]] (duration: 09m 51s) [14:02:29] !log UTC afternoon backport+config window done [14:02:31] T345463: Enable wgMinervaEnableSiteNotice for wikifunctions - https://phabricator.wikimedia.org/T345463 [14:02:31] effie: all yours [14:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Isaac) access confirmed -- thanks! [14:04:17] (03PS1) 10Jbond: site.pp: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/961112 (https://phabricator.wikimedia.org/T345067) [14:04:32] (03PS2) 10Jbond: site.pp: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/961112 (https://phabricator.wikimedia.org/T345067) [14:04:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] site.pp: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/961112 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [14:04:38] oh, logspam-watch is working again [14:04:44] 🤷 [14:04:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [14:05:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver1003.eqiad.wmnet with reason: host reimage [14:07:05] (KubernetesAPILatency) resolved: (33) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:08:04] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 (10MoritzMuehlenhoff) [14:08:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host 
wdqs1021.eqiad.wmnet with OS bullseye [14:08:35] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [14:08:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye [14:08:46] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye [14:10:18] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2002.codfw.wmnet with reason: host reimage [14:11:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10Jclark-ctr) a:03Jclark-ctr @BTullis Replaced Raid controller battery server is coming back up now [14:12:50] (03CR) 10Muehlenhoff: "With the various patches that went into supporting profile::firewall::provider='none' this is now correctly a NOP for systems without a Pu" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [14:13:25] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2002.codfw.wmnet with reason: host reimage [14:15:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T343198)', diff saved to https://phabricator.wikimedia.org/P52638 and previous config saved to /var/cache/conftool/dbconfig/20230926-141508-arnaudb.json [14:15:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10Jclark-ctr) This is second battery replacement for this server. T326127 was 1st. 
although battery did not physically look bad i did still replace it. If it r... [14:15:17] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [14:15:31] 10ops-esams: esams: breakout between cr1 and cr2 - https://phabricator.wikimedia.org/T347403 (10ayounsi) [14:16:22] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin1001" [14:17:19] 10SRE, 10Infrastructure-Foundations, 10netops: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (10ayounsi) That's a great idea! Opened {T347403} [14:17:22] !log prune obsolete nginx packages from durum hosts after migration to new library scheme T329529 [14:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:29] T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 [14:21:25] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye [14:21:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye [14:21:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1021.eqiad.wmnet with OS bullseye [14:21:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors: - wdqs1021 (**FAIL**... 
[14:21:46] (03CR) 10Majavah: [C: 03+1] profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [14:22:32] (03CR) 10Ayounsi: [C: 03+1] "LGTM! no need for more reviews after my comments :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 (owner: 10Jbond) [14:22:54] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Jhancock.wm) I was trying to make it work and I consulted with @RobH. He said it wasn't going to work on this machine because of limitations of the ha... [14:23:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin1001" [14:23:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver1003.eqiad.wmnet with OS bookworm [14:24:07] (03CR) 10FNegri: [C: 03+1] linuxbridge: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/961082 (owner: 10Muehlenhoff) [14:24:32] (03PS1) 10Jbond: service.yaml: drop the puppetdb-api-next service [puppet] - 10https://gerrit.wikimedia.org/r/961113 (https://phabricator.wikimedia.org/T347285) [14:24:34] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin2002" [14:24:36] (03PS1) 10Jbond: puppetboard-next: drop domain as services have now been migrated [puppet] - 10https://gerrit.wikimedia.org/r/961114 (https://phabricator.wikimedia.org/T347286) [14:25:15] (03CR) 10Muehlenhoff: [C: 03+2] profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [14:25:23] !log jbond@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered 
by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin2002" [14:25:24] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2002.codfw.wmnet with OS bookworm [14:27:15] (03PS2) 10Jbond: puppetboard-next: drop domain as services have now been migrated [puppet] - 10https://gerrit.wikimedia.org/r/961114 (https://phabricator.wikimedia.org/T347286) [14:27:17] (03PS1) 10Jbond: puppetboard-next: drop domain as services have now been migrated [puppet] - 10https://gerrit.wikimedia.org/r/961119 (https://phabricator.wikimedia.org/T347286) [14:27:36] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "puppetserver2002.codfw.wmnet - jbond@cumin2002" [14:27:37] (03CR) 10Filippo Giunchedi: [C: 03+1] service.yaml: drop the puppetdb-api-next service [puppet] - 10https://gerrit.wikimedia.org/r/961113 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [14:27:57] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host puppetserver2002.codfw.wmnet with OS book... [14:28:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10RobH) >>! In T342674#9199689, @Jhancock.wm wrote: > I was trying to make it work and I consulted with @RobH. He said it wasn't going to work on this m... 
[14:28:19] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10RobH) 05Resolved→03Open [14:28:31] (03CR) 10Muehlenhoff: [C: 03+2] linuxbridge: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/961082 (owner: 10Muehlenhoff) [14:29:17] (03PS5) 10Jbond: sre.netbox: Throw an error if now primary interfaces are found [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 [14:29:27] (03CR) 10Jbond: [C: 03+2] sre.netbox: Throw an error if now primary interfaces are found (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 (owner: 10Jbond) [14:29:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10RobH) 05Open→03Resolved >>! In T342674#9198699, @MatthewVernon wrote: > @Jhancock.wm I think (from netbox) that moss-be2003 is a PowerEdge R740xd... [14:30:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:30:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:30:09] (03PS6) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) [14:30:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P52639 and previous config saved to /var/cache/conftool/dbconfig/20230926-143015-arnaudb.json [14:30:33] (03CR) 10Jbond: [C: 03+2] service.yaml: drop the puppetdb-api-next service [puppet] - 10https://gerrit.wikimedia.org/r/961113 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [14:31:30] (03CR) 10CDanis: [C: 03+2] admin: Create analytics-wmde 
system user and airflow admin group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:31:49] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10CDanis) We discussed this in our Monday I/F meeting and approved it. [14:32:07] (03Merged) 10jenkins-bot: sre.netbox: Throw an error if now primary interfaces are found [cookbooks] - 10https://gerrit.wikimedia.org/r/961105 (owner: 10Jbond) [14:32:22] (03CR) 10Andrew Bogott: [C: 03+2] designate/pdns: refactor a bunch of address settings [puppet] - 10https://gerrit.wikimedia.org/r/959379 (https://phabricator.wikimedia.org/T346385) (owner: 10Andrew Bogott) [14:32:58] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10CDanis) 05Open→03Resolved Will be live in half an hour. 
[14:33:19] (03CR) 10Andrew Bogott: [C: 03+2] pdns: eliminate profile::openstack::base::pdns::auth::listen_on [puppet] - 10https://gerrit.wikimedia.org/r/960753 (owner: 10Andrew Bogott) [14:33:26] (03CR) 10Andrew Bogott: [C: 03+2] pdns: eliminate pdns::query_local_address in hiera [puppet] - 10https://gerrit.wikimedia.org/r/960754 (owner: 10Andrew Bogott) [14:33:46] (03PS4) 10Andrew Bogott: pdns: eliminate profile::openstack::base::pdns::auth::listen_on [puppet] - 10https://gerrit.wikimedia.org/r/960753 [14:33:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:33:54] (03PS5) 10Andrew Bogott: pdns: eliminate pdns::query_local_address in hiera [puppet] - 10https://gerrit.wikimedia.org/r/960754 [14:33:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:34:52] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "puppetserver2002.codfw.wmnet - jbond@cumin2002" [14:35:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [14:36:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:36:19] PROBLEM - gdnsd checkconf on dns4004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:36:23] (03PS5) 10Andrew Bogott: designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/960755 (https://phabricator.wikimedia.org/T346385) [14:37:09] (03CR) 10Effie Mouzeli: [C: 03+2] trafficserver: move 6.5% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/957857 
(https://phabricator.wikimedia.org/T346422) (owner: 10Giuseppe Lavagetto) [14:37:11] (03CR) 10Andrew Bogott: [C: 03+2] designate pools.yaml: contact pdns webserver on private IP [puppet] - 10https://gerrit.wikimedia.org/r/960755 (https://phabricator.wikimedia.org/T346385) (owner: 10Andrew Bogott) [14:37:21] PROBLEM - gdnsd checkconf on dns5004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:37:55] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [14:38:05] !log Rump up traffic to mw-on-k8s to 6.5% - T346422 [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:12] T346422: Move 10% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T346422 [14:38:41] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43628/console" [puppet] - 10https://gerrit.wikimedia.org/r/959222 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:38:44] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [14:38:44] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:51] PROBLEM - gdnsd checkconf on dns2004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:40:06] 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 10Wikimedia-Performance-recommendation, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10Kelson) @MatthewVernon That would be welcome on MWoffliner side! For the rest I can not say, but respect... [14:43:16] (03PS5) 10Bking: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [14:43:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/961114 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [14:43:31] PROBLEM - gdnsd checkconf on dns3004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:44:00] (03CR) 10Bking: [C: 03+1] k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [14:44:09] PROBLEM - gdnsd checkconf on dns3003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:44:46] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:44:57] PROBLEM - gdnsd checkconf on dns1006 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:44:57] PROBLEM - gdnsd checkconf on dns6002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:45:22] !log arnaudb@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P52640 and previous config saved to /var/cache/conftool/dbconfig/20230926-144521-arnaudb.json [14:45:39] (03CR) 10Bking: "Adding Janis and Alexandros, since this touches the kubernetes deployment global config." [puppet] - 10https://gerrit.wikimedia.org/r/960662 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [14:46:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2021.codfw.wmnet with OS bullseye [14:46:33] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2021.codfw.wmnet with OS bullseye completed: - restbase20... [14:47:02] (03PS1) 10Jbond: puppetserver1003: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961123 (https://phabricator.wikimedia.org/T345067) [14:47:17] PROBLEM - gdnsd checkconf on dns6001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:47:24] (03CR) 10Jbond: [C: 03+2] puppetserver1003: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961123 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [14:47:55] !log installing lldpd security updates [14:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:02] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) 05Open→03Resolved a:03RobH I think we can close that one. @RobH did the audit afaik. 
[14:48:37] PROBLEM - gdnsd checkconf on dns1004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:48:44] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:45] (03PS1) 10Jbond: puppetserver2002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961124 (https://phabricator.wikimedia.org/T345067) [14:49:55] PROBLEM - gdnsd checkconf on dns4003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:49:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:19] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:51:27] PROBLEM - gdnsd checkconf on dns5003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:51:57] PROBLEM - gdnsd checkconf on dns2006 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:52:52] (03CR) 10Jbond: [C: 03+2] puppetserver2002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/961124 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [14:52:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wdqs1017-20 - jclark@cumin1001" [14:53:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wdqs1017-20 - jclark@cumin1001" [14:53:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:24] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:57:14] !log jclark@cumin1001 START 
- Cookbook sre.dns.netbox [14:57:26] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002" [14:57:33] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:58] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:08] (03PS1) 10FNegri: cloud_management: remove unneeded tokens, profiles [puppet] - 10https://gerrit.wikimedia.org/r/961125 (https://phabricator.wikimedia.org/T324986) [14:58:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002" [14:58:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:32] (03CR) 10CI reject: [V: 04-1] cloud_management: remove unneeded tokens, profiles [puppet] - 10https://gerrit.wikimedia.org/r/961125 (https://phabricator.wikimedia.org/T324986) (owner: 10FNegri) [14:58:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:14] (03PS2) 10FNegri: cloud_management: remove unneeded tokens, profiles [puppet] - 10https://gerrit.wikimedia.org/r/961125 (https://phabricator.wikimedia.org/T324986) [15:00:00] (03PS6) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [15:00:04] PROBLEM - gdnsd checkconf on dns1005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [15:00:08] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt 20 
- jclark@cumin1001" [15:00:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T343198)', diff saved to https://phabricator.wikimedia.org/P52641 and previous config saved to /var/cache/conftool/dbconfig/20230926-150028-arnaudb.json [15:00:31] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:00:41] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:00:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:00:46] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:00:50] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:00:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52642 and previous config saved to /var/cache/conftool/dbconfig/20230926-150056-arnaudb.json [15:00:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt 20 - jclark@cumin1001" [15:00:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:01:30] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:01:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:01:52] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1020.mgmt.eqiad.wmnet with reboot policy FORCED [15:01:54] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43629/console" [puppet] - 10https://gerrit.wikimedia.org/r/961125 (https://phabricator.wikimedia.org/T324986) (owner: 10FNegri) [15:03:21] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator maintenance [15:03:26] PROBLEM - gdnsd checkconf on dns2005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [15:03:32] (03PS1) 10Jbond: site.pp: migrate puppetserver to puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/961126 (https://phabricator.wikimedia.org/T345067) [15:03:34] hmm?? 
[15:03:36] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator maintenance [15:03:51] (03CR) 10Jbond: [C: 03+2] site.pp: migrate puppetserver to puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/961126 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [15:03:54] !log beginning routine phabricator update shortly [15:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:16] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:04:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:05:32] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:35] jbond: ^ [15:05:41] Sep 26 15:01:07 dns2005 gdnsd[3229433]: plugin_geoip: Invalid resource name 'disc-puppetdb-api-next' detected from zonefile lookup [15:05:54] !log brennen@deploy2002 Started deploy [phabricator/deployment@d895dde]: test deploy to phab2002 [15:06:29] !log brennen@deploy2002 Finished deploy [phabricator/deployment@d895dde]: test deploy to phab2002 (duration: 00m 35s) [15:06:51] !log brennen@deploy2002 Started deploy [phabricator/deployment@d895dde]: deploy to phab1004 for weekly updates [15:07:36] !log brennen@deploy2002 Finished deploy [phabricator/deployment@d895dde]: deploy to phab1004 for weekly updates (duration: 00m 44s) [15:08:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:08:21] thumbor, acking [15:09:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:20] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline." [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [15:09:31] (03PS1) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [15:09:33] (03PS1) 10Herron: services: add pyrra conftool-data and service stub entry [puppet] - 10https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) [15:09:34] who to call for thumbor? [15:09:35] (03PS1) 10Herron: pyrra: use load balancing [puppet] - 10https://gerrit.wikimedia.org/r/961130 (https://phabricator.wikimedia.org/T302995) [15:09:37] (03PS1) 10Herron: pyrra: add serveraliases and redirect to apache config [puppet] - 10https://gerrit.wikimedia.org/r/961131 (https://phabricator.wikimedia.org/T302995) [15:09:41] kamila_ you may know? 
[15:09:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:09:54] (03CR) 10CI reject: [V: 04-1] pyrra: use load balancing [puppet] - 10https://gerrit.wikimedia.org/r/961130 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:09:58] (03CR) 10CI reject: [V: 04-1] services: add pyrra conftool-data and service stub entry [puppet] - 10https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:09:59] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:10:02] (03CR) 10CI reject: [V: 04-1] pyrra: add serveraliases and redirect to apache config [puppet] - 10https://gerrit.wikimedia.org/r/961131 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:10:05] (03CR) 10BCornwall: dns::dotls: expose and gather haproxy internal metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [15:10:06] there was 2 spikes [15:10:28] autoregister says the graph [15:10:35] (03PS1) 10Herron: pyrra add service dns entries [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) [15:10:37] (03PS1) 10Herron: pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) [15:11:01] jbond: ./wmnet:puppetdb-api-next 300/10 IN DYNA geoip!disc-puppetdb-api-next [15:11:12] ^ this doesn't have a matching discovery part, so it's breaking DNS updates [15:11:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:11:22] I think it is not stable-bad, but it would be nice to know the cause of those 2 spikes [15:11:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1020.mgmt.eqiad.wmnet with reboot policy FORCED [15:11:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
wdqs1020.mgmt.eqiad.wmnet with reboot policy FORCED [15:11:58] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:12:24] bad behaviour started at 14:52 [15:12:45] is this registry-related? [15:12:50] claime: ^ [15:12:54] bblack: thanks, sending patch now. I removed it from the service catalog earlier but forgot the dns patch [15:13:07] ok, makes sense! [15:13:07] in an interview sorry [15:13:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:23] who that knows k8s can help me? [15:13:27] (03CR) 10CI reject: [V: 04-1] pyrra add service dns entries [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:13:29] (03CR) 10CI reject: [V: 04-1] pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:13:36] jayme: you around? 
[15:13:46] (03PS1) 10Jbond: wmet: drop puppetdb-api-next [dns] - 10https://gerrit.wikimedia.org/r/961134 (https://phabricator.wikimedia.org/T347285) [15:14:00] issues seems gone now [15:14:43] yeah: swift-account-stats_docker:registry.service Failed on ms-fe2009:9100 now resolved [15:14:57] (03PS2) 10Herron: pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) [15:15:13] (03PS1) 10Jbond: wikimedia.org: drop puppetboard-next [dns] - 10https://gerrit.wikimedia.org/r/961135 (https://phabricator.wikimedia.org/T347286) [15:15:20] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Prevent Prometheus from scraping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [15:15:22] (03PS2) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [15:15:38] jynus: is thumbor a known issue? [15:15:51] I think it is k8s registry [15:15:54] not thumbor [15:16:02] (03PS1) 10Jbond: site.pp: use puppetserver not puppetmaster for puppetserver1003 [puppet] - 10https://gerrit.wikimedia.org/r/961136 (https://phabricator.wikimedia.org/T345067) [15:16:08] several k8s thingies complained about that [15:16:14] oh. 
[15:16:27] (03CR) 10Ssingh: [C: 03+1] wmet: drop puppetdb-api-next [dns] - 10https://gerrit.wikimedia.org/r/961134 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [15:16:30] we are trying to reach jayme as claim* is busy atm [15:16:42] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:16:56] sorry, did not see first ping [15:17:02] (03CR) 10Jbond: [C: 03+2] wmet: drop puppetdb-api-next [dns] - 10https://gerrit.wikimedia.org/r/961134 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [15:17:02] what's up? [15:17:26] thumbor paged, at first we thought it was the software [15:17:38] but then we saw it was complaining about autoregistry [15:17:40] I am here too [15:17:43] (03CR) 10AOkoth: [C: 03+2] gitlab: change service_name on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/960632 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [15:17:50] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active, ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:50] RECOVERY - gdnsd checkconf on dns1004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:17:51] jynus: where did you see that? 
[15:18:00] RECOVERY - gdnsd checkconf on dns2005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:01] and registry.service failures on other k8s services [15:18:07] (03CR) 10CI reject: [V: 04-1] pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:18:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:08] jbond: thanks! [15:18:10] could it be dns? [15:18:18] RECOVERY - gdnsd checkconf on dns5003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:28] that is autoregistry? [15:18:31] *what is [15:18:32] RECOVERY - gdnsd checkconf on dns2004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:32] RECOVERY - gdnsd checkconf on dns4003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:41] effie: on the linked grafana [15:18:42] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:45] sukhe: np thanks for the review [15:18:46] RECOVERY - gdnsd checkconf on dns3003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:46] RECOVERY - gdnsd checkconf on dns4004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:48] RECOVERY - gdnsd checkconf on dns1005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:52] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: 
backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:53] https://grafana.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s-staging&orgId=1&from=now-1h&to=now [15:18:53] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:58] RECOVERY - gdnsd checkconf on dns3004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:18:59] ^effie [15:19:00] RECOVERY - gdnsd checkconf on dns1006 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:19:00] RECOVERY - gdnsd checkconf on dns6002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:19:03] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:19:07] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:19:08] RECOVERY - gdnsd checkconf on dns6001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:19:11] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:19:12] RECOVERY - gdnsd checkconf on dns2006 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:19:16] (Primary outbound port utilisation 
over 80% #page) firing: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:20:03] (03CR) 10Jbond: [C: 03+2] site.pp: use puppetserver not puppetmaster for puppetserver1003 [puppet] - 10https://gerrit.wikimedia.org/r/961136 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [15:20:28] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:00] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:22:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:22:29] (03CR) 10EoghanGaffney: [C: 03+2] phabricator: add configuration for the remote aphlict server [puppet] - 10https://gerrit.wikimedia.org/r/961045 (https://phabricator.wikimedia.org/T346321) (owner: 10Jaime Nuche) [15:23:13] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:23:51] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/961125 (https://phabricator.wikimedia.org/T324986) (owner: 10FNegri) [15:24:09] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2021.codfw.wmnet [15:24:10] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2021.codfw.wmnet [15:24:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1020.mgmt.eqiad.wmnet with reboot policy FORCED [15:24:19] (03CR) 10Jelto: [C: 03+2] phabricator deployment: restart php when finalizing deploy [puppet] - 10https://gerrit.wikimedia.org/r/956486 (https://phabricator.wikimedia.org/T314460) (owner: 10Brennen Bearnes) [15:24:36] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1020.eqiad.wmnet with OS bullseye [15:24:44] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:24:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1020.eqiad.wmnet with OS bullseye [15:25:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:20] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:25:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:27:21] !log 
jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002" [15:27:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/961135 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [15:28:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002" [15:28:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:13] (ATSBackendErrorsHigh) firing: (3) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:28:28] RECOVERY - gdnsd checkconf on dns5004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [15:28:44] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:45] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:28:47] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye [15:28:50] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page 
[15:28:54] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove... [15:30:02] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:32:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:54] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:33:44] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:46] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:34:46] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:35:00] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:24] 10ops-codfw: Inbound interface errors - 
https://phabricator.wikimedia.org/T347368 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [15:37:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:39:04] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): update netbox sync to also sync to puppetservers - https://phabricator.wikimedia.org/T347410 (10jbond) [15:39:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:11] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1017.mgmt.eqiad.wmnet with reboot policy FORCED [15:39:23] (03PS1) 10Jbond: wmnet: Add new puppetservers [dns] - 10https://gerrit.wikimedia.org/r/961140 (https://phabricator.wikimedia.org/T345067) [15:39:35] (03PS3) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [15:39:37] (03PS2) 10Herron: services: add pyrra conftool-data and service stub entry [puppet] - 10https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) [15:40:20] (03CR) 10Btullis: [C: 03+1] [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [15:40:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52643 and previous config saved to /var/cache/conftool/dbconfig/20230926-154027-arnaudb.json [15:40:36] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 
[15:40:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1017.mgmt.eqiad.wmnet with reboot policy FORCED [15:40:57] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster2004 to puppetserver2002 - https://phabricator.wikimedia.org/T347396 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm server has been relabled [15:41:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:05] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10Jhancock.wm) [15:41:59] (03CR) 10EoghanGaffney: [C: 03+2] zuul: replace zuul-gearman.py by gearman-tools [puppet] - 10https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172) (owner: 10Hashar) [15:42:46] (03CR) 10Jbond: [C: 03+2] wmnet: Add new puppetservers [dns] - 10https://gerrit.wikimedia.org/r/961140 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [15:43:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:43:42] (03PS17) 10Ryan Kemper: [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [15:43:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: 
Servers ms-fe2010.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:44:37] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:38] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:44:44] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [15:44:48] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [15:45:00] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:45:02] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:45:06] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:45:14] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, moss-fe2001.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet are m [15:45:14] wn but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [15:45:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): update netbox sync to also sync to puppetservers - https://phabricator.wikimedia.org/T347410 (10jbond) p:05Triage→03Medium [15:46:02] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 9.090 second response time https://wikitech.wikimedia.org/wiki/Docker [15:46:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:46:34] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1021'] [15:46:40] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:46:44] (HaproxyUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:46:50] (03CR) 10Brouberol: [C: 03+2] [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [15:46:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs1021'] [15:46:58] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:46:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1021'] 
[15:47:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) @RobH are you sure that's correct? e.g. ms-be2073 has all 26 drives as Non-RAID disks (which is what I was wanting for these nodes), wh... [15:47:08] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:47:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1021'] [15:47:32] (03PS1) 10Ahmon Dancy: logspam-watch: Avoid termination via SIGPIPE [puppet] - 10https://gerrit.wikimedia.org/r/961142 [15:47:37] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1021.eqiad.wmnet with OS bullseye [15:47:44] (VarnishUnavailable) firing: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [15:47:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye [15:48:07] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:48:14] PROBLEM - Swift https frontend on moss-fe2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:48:16] PROBLEM - Swift https backend on moss-fe2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [15:48:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:48:22] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:48:33] (03PS18) 10Ryan Kemper: [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [15:48:45] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:48:50] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:49:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:52] (03CR) 10Brouberol: [sre.kafka] Use broker in-sync status as a gate between broker restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [15:50:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:52:17] (03Merged) 10jenkins-bot: [sre.kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [15:52:44] (VarnishUnavailable) resolved: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [15:53:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:53:31] (03CR) 10Ahmon Dancy: logspam.pl: Consolidate another database-related message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959802 (https://phabricator.wikimedia.org/T347064) (owner: 10Ahmon Dancy) [15:53:45] (Primary inbound port utilisation over 80% #page) resolved: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:53:45] (Primary inbound port utilisation over 80% #page) resolved: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:53:54] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10cmooney) p:05Triage→03Low [15:55:04] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2004.codfw.wmnet:443/v2/bullseye/manifests/latest - 362 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Docker [15:55:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:55:18] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:55:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P52644 and previous config saved to /var/cache/conftool/dbconfig/20230926-155534-arnaudb.json [15:56:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] logspam.pl: Consolidate another database-related message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959802 (https://phabricator.wikimedia.org/T347064) (owner: 10Ahmon Dancy) [15:56:34] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 5.245 second response time https://wikitech.wikimedia.org/wiki/Docker [15:57:14] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.462 second response time https://wikitech.wikimedia.org/wiki/Swift [15:57:28] (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [15:57:31] (03CR) 10Cathal Mooney: [C: 03+2] Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal 
Mooney) [15:57:48] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.780 second response time https://wikitech.wikimedia.org/wiki/Docker [15:58:16] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:58:29] (03Merged) 10jenkins-bot: Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [15:59:08] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.372 second response time https://wikitech.wikimedia.org/wiki/Swift [15:59:10] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.398 second response time https://wikitech.wikimedia.org/wiki/Swift [15:59:30] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1017.eqiad.wmnet with OS bullseye [15:59:36] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye executed with errors: - wdqs1017 (**FAIL**) - Remove... [15:59:50] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 9.344 second response time https://wikitech.wikimedia.org/wiki/Swift [16:00:05] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1600) [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] o/ [16:00:14] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:28] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [16:00:55] (03CR) 10Dr0ptp4kt: [C: 03+1] wiki-replicas.sql: Drop grants for old labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/961067 (owner: 10Majavah) [16:00:57] dancy: 👋 interview running over but I'll be right with you [16:01:04] OK [16:01:20] (03CR) 10Dr0ptp4kt: [C: 03+1] Allow cloudcontrol1005 and 1007 to connect to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/961068 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [16:02:18] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [16:03:04] (03CR) 10FNegri: [V: 03+1 C: 03+2] cloud_management: remove unneeded tokens, profiles [puppet] - 10https://gerrit.wikimedia.org/r/961125 (https://phabricator.wikimedia.org/T324986) (owner: 10FNegri) [16:03:14] PROBLEM - puppet last run on flink-zk1001 is CRITICAL: CRITICAL: Puppet has been disabled for 604827 seconds, message: btullis-T341792 - btullis, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:03:16] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:03:30] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 5.533 second response time 
https://wikitech.wikimedia.org/wiki/Swift [16:03:44] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:03:50] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:03:54] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:04:09] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:04:12] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.236 second response time https://wikitech.wikimedia.org/wiki/Swift [16:04:32] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:04:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:04:42] (03PS8) 10EoghanGaffney: gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [16:05:18] PROBLEM - Docker registry HTTPS interface 
on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [16:05:28] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 3.774 second response time https://wikitech.wikimedia.org/wiki/Swift [16:05:32] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift [16:06:15] dancy: looking now (cc rzl) [16:06:28] RECOVERY - Swift https backend on moss-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 8.420 second response time https://wikitech.wikimedia.org/wiki/Swift [16:06:30] ah thanks [16:06:48] (03CR) 10Jbond: [C: 03+2] logspam.pl: Consolidate another database-related message [puppet] - 10https://gerrit.wikimedia.org/r/959802 (https://phabricator.wikimedia.org/T347064) (owner: 10Ahmon Dancy) [16:07:10] Thx! [16:07:52] RECOVERY - Swift https frontend on moss-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 3.955 second response time https://wikitech.wikimedia.org/wiki/Swift [16:08:01] jbond: Can you process https://gerrit.wikimedia.org/r/c/operations/puppet/+/961142 as well please? [16:08:04] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 3.636 second response time https://wikitech.wikimedia.org/wiki/Swift [16:08:06] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.251 second response time https://wikitech.wikimedia.org/wiki/Docker [16:08:16] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:08:22] dancy: sure [16:08:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wdqs1021.eqiad.wmnet with OS bullseye [16:08:29] Awesome. thanks! 
[16:08:45] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:08:45] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:08:51] (03CR) 10CI reject: [V: 04-1] gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [16:09:06] (03CR) 10Jbond: [C: 03+2] logspam-watch: Avoid termination via SIGPIPE [puppet] - 10https://gerrit.wikimedia.org/r/961142 (owner: 10Ahmon Dancy) [16:09:24] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 2.301 second response time https://wikitech.wikimedia.org/wiki/Swift [16:09:26] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:09:27] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1021.eqiad.wmnet with reason: host reimage [16:09:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:41] (03CR) 10EoghanGaffney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [16:09:54] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.445 second response time https://wikitech.wikimedia.org/wiki/Swift [16:10:14] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - No 
response from remote host 198.35.26.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:40] dancy: merged and deployed [16:10:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P52645 and previous config saved to /var/cache/conftool/dbconfig/20230926-161041-arnaudb.json [16:10:50] PROBLEM - Swift https backend on moss-fe2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.389 second response time https://wikitech.wikimedia.org/wiki/Swift [16:10:59] jbond: Thanks again. Everything's working well. [16:11:45] no probs [16:11:58] (03PS1) 10BCornwall: varnish: Post name in HighThreadCount summary [alerts] - 10https://gerrit.wikimedia.org/r/961148 [16:12:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1021.eqiad.wmnet with reason: host reimage [16:12:59] (03CR) 10CI reject: [V: 04-1] varnish: Post name in HighThreadCount summary [alerts] - 10https://gerrit.wikimedia.org/r/961148 (owner: 10BCornwall) [16:13:46] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: swap replica records [dns] - 10https://gerrit.wikimedia.org/r/960633 (https://phabricator.wikimedia.org/T345590) (owner: 10AOkoth) [16:14:10] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:14:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:54] (03CR) 10BCornwall: "recheck" [alerts] - 
10https://gerrit.wikimedia.org/r/961148 (owner: 10BCornwall) [16:14:57] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:15:10] !log aokoth@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [16:15:15] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [16:15:47] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.191 second response time https://wikitech.wikimedia.org/wiki/Swift [16:16:03] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.160 second response time https://wikitech.wikimedia.org/wiki/Swift [16:16:10] (03PS1) 10Vgutierrez: mtail::cache_haproxy: Don't consider queue time for SLO purposes [puppet] - 10https://gerrit.wikimedia.org/r/961151 [16:16:33] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 8.835 second response time https://wikitech.wikimedia.org/wiki/Swift [16:16:35] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift [16:16:45] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 4.642 second response time https://wikitech.wikimedia.org/wiki/Swift [16:17:13] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 1.988 second response time https://wikitech.wikimedia.org/wiki/Docker [16:17:15] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift [16:17:43] !log 
aokoth@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [16:17:48] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [16:17:55] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:18:33] (03CR) 10CI reject: [V: 04-1] mtail::cache_haproxy: Don't consider queue time for SLO purposes [puppet] - 10https://gerrit.wikimedia.org/r/961151 (owner: 10Vgutierrez) [16:18:44] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:10] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:19:39] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:19:44] (03PS2) 10Vgutierrez: mtail::cache_haproxy: Don't consider queue time for SLO purposes [puppet] - 10https://gerrit.wikimedia.org/r/961151 [16:20:03] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:20:17] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on 
https://registry2003.codfw.wmnet:443/v2/bullseye/manifests/latest - 362 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Docker [16:20:55] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.266 second response time https://wikitech.wikimedia.org/wiki/Swift [16:21:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:21:19] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.364 second response time https://wikitech.wikimedia.org/wiki/Swift [16:21:31] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.228 second response time https://wikitech.wikimedia.org/wiki/Swift [16:23:03] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [16:23:22] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:25] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/961151 (owner: 10Vgutierrez) [16:23:40] !log aokoth@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [16:23:44] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [16:23:53] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.394 second response time https://wikitech.wikimedia.org/wiki/Swift [16:23:59] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.225 second response time https://wikitech.wikimedia.org/wiki/Swift [16:23:59] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.530 second response time https://wikitech.wikimedia.org/wiki/Docker [16:24:01] (03PS1) 10Jbond: cumin: Add puppetserver alias [puppet] - 10https://gerrit.wikimedia.org/r/961153 (https://phabricator.wikimedia.org/T330490) [16:24:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:49] PROBLEM - Swift https frontend on moss-fe2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:25:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:25:35] (03CR) 10Dbrant: [C: 03+1] .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: 10Samtar) [16:25:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52646 and previous config saved to 
/var/cache/conftool/dbconfig/20230926-162547-arnaudb.json [16:25:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [16:25:59] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [16:26:03] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [16:26:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P52647 and previous config saved to /var/cache/conftool/dbconfig/20230926-162609-arnaudb.json [16:26:15] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Swift [16:26:49] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 1.852 second response time https://wikitech.wikimedia.org/wiki/Docker [16:27:52] !log aokoth@cumin1001 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab2002.wikimedia.org to gitlab1003.wikimedia.org [16:28:04] (03CR) 10Volans: pyrra add service dns entries (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:28:07] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:07] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 6.295 second response time https://wikitech.wikimedia.org/wiki/Swift [16:28:40] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
jclark@cumin1001" [16:28:40] (03CR) 10Vgutierrez: [C: 03+2] mtail::cache_haproxy: Don't consider queue time for SLO purposes [puppet] - 10https://gerrit.wikimedia.org/r/961151 (owner: 10Vgutierrez) [16:28:45] (Primary inbound port utilisation over 80% #page) resolved: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:28:45] (Primary inbound port utilisation over 80% #page) resolved: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:28:50] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:28:54] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:29:08] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:49] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:30:16] 
(MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:30:31] RECOVERY - Swift https frontend on moss-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Swift [16:30:43] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.183 second response time https://wikitech.wikimedia.org/wiki/Swift [16:31:01] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) [16:31:55] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 5.240 second response time https://wikitech.wikimedia.org/wiki/Swift [16:32:17] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.266 second response time https://wikitech.wikimedia.org/wiki/Swift [16:32:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:32:58] (03CR) 10Jbond: [C: 03+2] cumin: Add puppetserver alias [puppet] - 10https://gerrit.wikimedia.org/r/961153 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:33:01] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.574 second response time https://wikitech.wikimedia.org/wiki/Swift [16:33:17] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.513 second response time https://wikitech.wikimedia.org/wiki/Swift [16:33:41] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) (owner: 
10Jbond) [16:35:31] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:35:45] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:35:53] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Swift [16:36:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2010.codfw.wmnet, moss-fe2001.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:36:27] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 4.959 second response time https://wikitech.wikimedia.org/wiki/Swift [16:36:37] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Swift [16:36:44] (HaproxyUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:36:49] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [16:36:51] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Swift [16:37:05] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.149 second response time 
https://wikitech.wikimedia.org/wiki/Swift [16:37:19] PROBLEM - Check systemd state on arclamp2001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service,arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:27] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:38:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr4-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:38:50] (Primary outbound port utilisation over 80% #page) resolved: Device cr4-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:40:09] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Swift [16:40:55] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:41:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:42:05] (03PS1) 10Andrew Bogott: cloudservices2004-dev: switch ldap backend to lmdb [puppet] - 10https://gerrit.wikimedia.org/r/961159 [16:42:09] (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) [16:42:31] PROBLEM - Swift https 
frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.414 second response time https://wikitech.wikimedia.org/wiki/Swift [16:42:38] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices2004-dev: switch ldap backend to lmdb [puppet] - 10https://gerrit.wikimedia.org/r/961159 (owner: 10Andrew Bogott) [16:42:39] RECOVERY - Swift https backend on moss-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift [16:42:39] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Swift [16:43:43] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.271 second response time https://wikitech.wikimedia.org/wiki/Swift [16:43:43] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Swift [16:44:49] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Swift [16:44:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1020.eqiad.wmnet with OS bullseye [16:44:57] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Swift [16:44:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1020.eqiad.wmnet with OS bullseye executed with errors: - wdqs1020 (**FAIL**... 
[16:45:06] (03PS1) 10Giuseppe Lavagetto: cache: add netmap file for known-clients from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/961160 [16:45:21] !log bblack@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [16:45:43] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.250 second response time https://wikitech.wikimedia.org/wiki/Swift [16:46:01] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [16:46:15] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.323 second response time https://wikitech.wikimedia.org/wiki/Swift [16:46:37] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.188 second response time https://wikitech.wikimedia.org/wiki/Swift [16:46:44] (03CR) 10CI reject: [V: 04-1] cache: add netmap file for known-clients from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/961160 (owner: 10Giuseppe Lavagetto) [16:46:56] (03CR) 10BCornwall: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/961148 (owner: 10BCornwall) [16:47:00] (03PS2) 10Giuseppe Lavagetto: cache: add netmap file for known-clients from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/961160 [16:47:13] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Docker [16:47:19] (03CR) 10BBlack: [C: 03+1] varnish: Post name in HighThreadCount summary [alerts] - 10https://gerrit.wikimedia.org/r/961148 (owner: 10BCornwall) [16:47:45] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:47:57] PROBLEM - Swift 
https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.434 second response time https://wikitech.wikimedia.org/wiki/Swift [16:48:07] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:21] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift [16:48:51] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Swift [16:48:57] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Swift [16:49:02] (03CR) 10BCornwall: [C: 03+2] varnish: Post name in HighThreadCount summary [alerts] - 10https://gerrit.wikimedia.org/r/961148 (owner: 10BCornwall) [16:49:14] (03CR) 10CI reject: [V: 04-1] cache: add netmap file for known-clients from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/961160 (owner: 10Giuseppe Lavagetto) [16:49:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:49:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:49:37] (ProbeDown) resolved: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:35] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.136 second response time 
https://wikitech.wikimedia.org/wiki/Swift [16:50:36] (03PS3) 10Giuseppe Lavagetto: cache: add netmap file for known-clients from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/961160 [16:50:39] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Swift [16:50:41] (03PS1) 10Andrew Bogott: cloudservices2004-dev: switch ldap backend to mdb [puppet] - 10https://gerrit.wikimedia.org/r/961161 [16:51:53] !log bblack@cumin1001 END (FAIL) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=1) rolling restart_daemons on A:swift-fe-codfw [16:51:57] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.298 second response time https://wikitech.wikimedia.org/wiki/Swift [16:52:54] !log ms-fe2009 - restart swift_dispersion_stats + swift_dispersion_stats_lowlatency services (failing in systemctl) [16:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:37] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:43] (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) 
[16:53:51] !log bblack@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [16:54:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache: add netmap file for known-clients from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/961160 (owner: 10Giuseppe Lavagetto) [16:55:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:05] (03PS2) 10Herron: pyrra add service dns entries [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) [16:56:07] (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: Add puppetservers to sync [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) [16:57:06] (03PS6) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [16:57:40] (03CR) 10CI reject: [V: 04-1] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [16:58:37] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:45] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Swift [16:59:05] (03PS7) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [16:59:20] !log bblack@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling 
restart_daemons on A:swift-fe-codfw [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1700) [17:00:37] (ProbeDown) resolved: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:51] (03CR) 10Jbond: "tested and working, ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/961155 (https://phabricator.wikimedia.org/T347410) (owner: 10Jbond) [17:01:38] !log A:swift-fe-codfw: manually rolling systemctl restart of swift-proxy and nginx [17:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:06] (03PS2) 10Andrew Bogott: cloudservices2004-dev: switch ldap backend to mdb [puppet] - 10https://gerrit.wikimedia.org/r/961161 [17:02:30] (03CR) 10Ebernhardson: [C: 03+1] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:02:32] (03CR) 10CI reject: [V: 04-1] cloudservices2004-dev: switch ldap backend to mdb [puppet] - 10https://gerrit.wikimedia.org/r/961161 (owner: 10Andrew Bogott) [17:03:13] (ATSBackendErrorsHigh) resolved: (3) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:03:31] (03CR) 10Bking: [C: 03+2] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:03:37] (ProbeDown) resolved: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:41] 
(03PS8) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [17:04:26] (03PS3) 10Andrew Bogott: cloudservices2004-dev: switch ldap backend to mdb [puppet] - 10https://gerrit.wikimedia.org/r/961161 [17:05:38] <_joe_> !incidents [17:05:38] 4082 (RESOLVED) [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:05:39] 4087 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [17:05:39] 4083 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [17:05:39] 4080 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [17:05:39] 4084 (RESOLVED) HaproxyUnavailable cache_upload global sre () [17:05:39] 4086 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [17:05:40] 4079 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [17:05:40] 4085 (RESOLVED) VarnishUnavailable global sre (varnish-upload) [17:05:40] 4081 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [17:05:41] 4078 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [17:05:50] <_joe_> bblack: ^^ everything is resolved [17:06:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P52648 and previous config saved to /var/cache/conftool/dbconfig/20230926-170639-arnaudb.json [17:06:46] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices2004-dev: switch ldap backend to mdb [puppet] - 10https://gerrit.wikimedia.org/r/961161 (owner: 10Andrew Bogott) [17:06:50] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [17:12:12] !log herron@cumin1001 START - Cookbook sre.dns.netbox [17:14:37] !log herron@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding pyrra.svc records - herron@cumin1001" [17:14:46] (03PS1) 10Bking: cloudelastic: Add new hosts into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961167 (https://phabricator.wikimedia.org/T342538) [17:15:11] (03CR) 10CI reject: [V: 04-1] cloudelastic: Add new hosts into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961167 (https://phabricator.wikimedia.org/T342538) (owner: 10Bking) [17:15:29] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding pyrra.svc records - herron@cumin1001" [17:15:29] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:37] (03PS1) 10BBlack: varnish: add X-Known-Client netmap [puppet] - 10https://gerrit.wikimedia.org/r/961168 [17:16:39] jouncebot: nowandnext [17:16:40] For the next 0 hour(s) and 43 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1700) [17:16:40] In 0 hour(s) and 43 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1800) [17:16:44] (03PS3) 10Herron: pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) [17:17:09] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:13] (03CR) 10Herron: "this one is for initial deployment alongside If265e8aaa4b46261e69bd21f11ec5334e5b9ae95" [puppet] - 10https://gerrit.wikimedia.org/r/961128 
(https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:18:19] RECOVERY - Check systemd state on arclamp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:20] (03CR) 10BBlack: [C: 03+2] varnish: add X-Known-Client netmap [puppet] - 10https://gerrit.wikimedia.org/r/961168 (owner: 10BBlack) [17:19:43] (03PS4) 10Herron: pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) [17:19:51] (03PS3) 10Herron: services: add pyrra conftool-data and service stub entry [puppet] - 10https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) [17:19:57] (03CR) 10CI reject: [V: 04-1] pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:19:59] (03PS2) 10Bking: cloudelastic: Add new hosts into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961167 (https://phabricator.wikimedia.org/T342538) [17:20:25] (03CR) 10CI reject: [V: 04-1] cloudelastic: Add new hosts into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961167 (https://phabricator.wikimedia.org/T342538) (owner: 10Bking) [17:20:59] (03CR) 10Herron: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/961131 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:21:36] (03CR) 10Herron: pyrra: add trafficserver mapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:21:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P52649 and previous config saved to /var/cache/conftool/dbconfig/20230926-172146-arnaudb.json [17:22:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:27] (03PS4) 10Herron: pyrra: add public dns entries [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) [17:24:36] (03PS3) 10Jdlrobson: Wordmarks for Wikinews projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959876 (https://phabricator.wikimedia.org/T341258) [17:25:03] (03PS1) 10Andrew Bogott: designate pools.yaml: remove a domain-terminating '.' 
[puppet] - 10https://gerrit.wikimedia.org/r/961170 [17:26:18] (03PS3) 10Bking: cloudelastic: Add new hosts into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961167 (https://phabricator.wikimedia.org/T342538) [17:26:44] (03CR) 10Jbond: nginx: mount lib on tmpfs vol in cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [17:26:45] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [17:27:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2024.codfw.wmnet with OS bullseye [17:27:57] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2024.codfw.wmnet with OS bullseye [17:27:59] (03CR) 10RobH: [C: 03+2] cloudelastic: Add new hosts into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/961167 (https://phabricator.wikimedia.org/T342538) (owner: 10Bking) [17:29:48] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netops, and 2 others: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) [17:29:54] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10jbond) [17:31:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:48] (03PS1) 10Reedy: incubatorwiki: Disable xml upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961171 (https://phabricator.wikimedia.org/T341565) [17:33:19] (03PS2) 10Reedy: incubatorwiki: Disable xml upload [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/961171 (https://phabricator.wikimedia.org/T341565) [17:35:05] (03PS3) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) [17:36:03] (03CR) 10Reedy: [C: 03+2] incubatorwiki: Disable xml upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961171 (https://phabricator.wikimedia.org/T341565) (owner: 10Reedy) [17:36:06] (03CR) 10Herron: pyrra add service dns entries (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:36:12] (03PS3) 10Herron: pyrra add service dns entries [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) [17:36:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P52650 and previous config saved to /var/cache/conftool/dbconfig/20230926-173653-arnaudb.json [17:37:49] (03Merged) 10jenkins-bot: incubatorwiki: Disable xml upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961171 (https://phabricator.wikimedia.org/T341565) (owner: 10Reedy) [17:38:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:27] (03PS4) 10Herron: services: add pyrra conftool-data and service stub entry [puppet] - 10https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) [17:40:24] (03CR) 10Herron: "the corresponding dns patch is at I04bf7783e60355d30b837c0b9b280d5f59925a5e" [puppet] - 10https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:41:01] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 
(10cmooney) [17:41:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) a:05Jclark-ctr→03None [17:41:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) [17:41:41] (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/961130 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:43:10] (03PS4) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) [17:43:24] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2024.codfw.wmnet with reason: host reimage [17:43:40] (03CR) 10Muehlenhoff: cloudelastic: new partman recipe (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:43:56] (03PS1) 10Bking: site.pp: Fix number of cloudelastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/961173 (https://phabricator.wikimedia.org/T342538) [17:44:33] (03CR) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:44:45] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:29] (03CR) 10Muehlenhoff: aptrepo: Add Bookworm HAProxy third party repos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957766 
(https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:45:53] (03CR) 10Ebernhardson: [C: 03+1] site.pp: Fix number of cloudelastic hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961173 (https://phabricator.wikimedia.org/T342538) (owner: 10Bking) [17:45:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2024.codfw.wmnet with reason: host reimage [17:46:08] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10cmooney) [17:46:32] (03PS2) 10Bking: site.pp: Fix number of cloudelastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/961173 (https://phabricator.wikimedia.org/T342538) [17:48:22] (03CR) 10Bking: site.pp: Fix number of cloudelastic hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961173 (https://phabricator.wikimedia.org/T342538) (owner: 10Bking) [17:49:56] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10cmooney) [17:51:16] 10SRE, 10Infrastructure-Foundations: Drive host network config from Netbox, and move away from ifupdown - https://phabricator.wikimedia.org/T347411 (10jbond) @cmooney thanks for the write up, i attached on older task to this which list some of the edge cases and issues etc. overall i think the the proposal is... 
[17:51:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P52651 and previous config saved to /var/cache/conftool/dbconfig/20230926-175201-arnaudb.json [17:52:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:52:09] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [17:52:10] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@94ac23e]: tune parallelism of process_sparql_query_hourly [17:52:17] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:52:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P52652 and previous config saved to /var/cache/conftool/dbconfig/20230926-175222-arnaudb.json [17:52:38] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@94ac23e]: tune parallelism of process_sparql_query_hourly (duration: 00m 27s) [17:56:49] (03PS2) 10Jdlrobson: Update README clarifying the use of local images. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/955393 [17:58:18] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [17:58:25] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:59:55] (03PS5) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) [18:00:05] dduvall and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1800). [18:00:05] (03CR) 10BCornwall: aptrepo: Add Bookworm HAProxy third party repos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:00:35] (03CR) 10BCornwall: "The diff is somewhat confusing since I sorted one of the entries for consistency." 
[puppet] - 10https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:00:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:57] (03CR) 10Herron: [C: 03+1] prometheus: Enable selective scraping for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [18:01:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:01:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1021.eqiad.wmnet with OS bullseye [18:01:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1021.eqiad.wmnet with OS bullseye completed: - wdqs1021 (**WARN**) - Remov... [18:01:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1017'] [18:01:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1020'] [18:03:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1017'] [18:03:17] o/ [18:03:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [18:03:57] just checking if dduvall is about. will go ahead as backup after a bit if needed. 
[18:04:54] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1017.eqiad.wmnet with OS bullseye [18:05:00] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye [18:05:38] (03PS1) 10Jbond: puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) [18:07:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:52] (03CR) 10CI reject: [V: 04-1] puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [18:08:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1020'] [18:08:45] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1020.eqiad.wmnet with OS bullseye [18:08:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1020.eqiad.wmnet with OS bullseye [18:09:15] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:09:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2024.codfw.wmnet with OS bullseye [18:10:03] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2024.codfw.wmnet with OS bullseye completed: - restbase20... [18:10:34] (03CR) 10RobH: [C: 03+2] site.pp: Fix number of cloudelastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/961173 (https://phabricator.wikimedia.org/T342538) (owner: 10Bking) [18:12:04] (03PS2) 10Jbond: puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) [18:12:06] (03PS1) 10Jbond: backups: Ad new filesets for puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/961179 (https://phabricator.wikimedia.org/T347390) [18:12:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:12:47] (03PS2) 10Jbond: backups: Add new filesets for puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/961179 (https://phabricator.wikimedia.org/T347390) [18:12:59] (03PS3) 10Jbond: puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) [18:15:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:15] RECOVERY - puppet last run on flink-zk1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:16:16] (03CR) 10CI reject: [V: 04-1] puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [18:16:19] (03CR) 10CI reject: [V: 04-1] backups: Add new filesets for puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/961179 
(https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [18:16:57] (03CR) 10Gmodena: [C: 03+1] "LGTM. Feel free to merge once the v1.25.0 image has been pushed to docker-registry." [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [18:17:37] (03PS3) 10Jbond: backups: Add new filesets for puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/961179 (https://phabricator.wikimedia.org/T347390) [18:17:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:18:27] (03PS4) 10Jbond: puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) [18:18:41] !log train 1.41.0-wmf.28 (T345889): no current blockers, rolling to group0 [18:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:49] T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889 [18:19:09] (03PS4) 10Jbond: backups: Add new filesets for puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/961179 (https://phabricator.wikimedia.org/T347390) [18:19:11] (03PS5) 10Jbond: puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) [18:22:07] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961181 (https://phabricator.wikimedia.org/T345889) [18:22:10] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961181 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [18:22:51] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.28 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/961181 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [18:22:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:07] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [18:24:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [18:26:45] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1017.eqiad.wmnet with reason: host reimage [18:27:43] (03PS7) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [18:28:21] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1020.eqiad.wmnet with reason: host reimage [18:29:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1017.eqiad.wmnet with reason: host reimage [18:30:23] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.28 refs T345889 [18:30:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:30:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:42] T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889 [18:32:27] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1020.eqiad.wmnet with reason: host reimage [18:33:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P52653 and previous config saved to /var/cache/conftool/dbconfig/20230926-183323-arnaudb.json [18:33:32] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:35:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:36:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:38] (03PS1) 10Ebernhardson: k8s config: Include the cluster name in the exported configuration [puppet] - 10https://gerrit.wikimedia.org/r/961182 [18:39:23] 10SRE, 10Prod-Kubernetes, 10Kubernetes: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171 (10CDanis) p:05Triage→03Low [18:40:54] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [18:41:01] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:43:21] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [18:44:46] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:45:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:47] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:46:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:46:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1017.eqiad.wmnet with OS bullseye [18:46:53] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2024.codfw.wmnet [18:46:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2024.codfw.wmnet [18:46:55] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1017.eqiad.wmnet with OS bullseye completed: - wdqs1017 (**PASS**) - Removed from Pupp... 
[18:47:22] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2015.codfw.wmnet with OS bullseye [18:47:30] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2015.codfw.wmnet with OS bullseye [18:47:44] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) [18:47:56] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10Jclark-ctr) 05Open→03Resolved [18:48:16] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:48:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P52654 and previous config saved to /var/cache/conftool/dbconfig/20230926-184830-arnaudb.json [18:48:54] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1023'] [18:48:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) a:03bking [18:49:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:50:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host wdqs1020.eqiad.wmnet with OS bullseye [18:50:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1020.eqiad.wmnet with OS bullseye completed: - wdqs1020 (**PASS**) - Remov... [18:54:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1023'] [18:56:54] (03PS1) 10Bking: cloudelastic: correct partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961186 (https://phabricator.wikimedia.org/T342463) [18:57:39] (03CR) 10Jcrespo: [C: 03+1] backups: Add new filesets for puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/961179 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [18:57:48] (03PS1) 10Andrew Bogott: slapd: introduce new slapd.conf template for ldap >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/961188 (https://phabricator.wikimedia.org/T331699) [18:57:55] (03CR) 10Jcrespo: [C: 03+1] puppetserver: add backups [puppet] - 10https://gerrit.wikimedia.org/r/961177 (https://phabricator.wikimedia.org/T347390) (owner: 10Jbond) [18:58:12] (03CR) 10CI reject: [V: 04-1] slapd: introduce new slapd.conf template for ldap >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/961188 (https://phabricator.wikimedia.org/T331699) (owner: 10Andrew Bogott) [18:58:26] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye [18:58:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... 
[18:58:38] (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: correct partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961186 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [18:58:48] (03CR) 10Bking: [C: 03+2] cloudelastic: correct partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961186 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [18:59:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1023.eqiad.wmnet with OS bullseye [18:59:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye [18:59:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [19:00:08] (03CR) 10JHathaway: [C: 03+2] nginx: mount lib on tmpfs vol in cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960708 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [19:00:12] (03CR) 10Andrew Bogott: "when I include this file I get a failure when starting slapd about a duplicate schema; I believe these settings have moved from config to " [puppet] - 10https://gerrit.wikimedia.org/r/961066 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [19:00:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) a:03Jclark-ctr [19:00:40] (03CR) 10Bking: "See Id32c270eb0b3a652bb2dadf3106d47fca8116663 for corrected recipe" [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [19:02:04] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2015.codfw.wmnet with reason: host reimage [19:02:42] !log bking@cumin1001 START - Cookbook 
sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [19:02:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [19:02:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [19:03:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [19:03:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P52655 and previous config saved to /var/cache/conftool/dbconfig/20230926-190336-arnaudb.json [19:05:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2015.codfw.wmnet with reason: host reimage [19:07:49] (03PS1) 10Bking: dse-k8s: trigger savepoint for flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/961190 (https://phabricator.wikimedia.org/T346231) [19:13:58] (03CR) 10Ryan Kemper: [C: 03+1] dse-k8s: trigger savepoint for flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/961190 (https://phabricator.wikimedia.org/T346231) (owner: 10Bking) [19:14:22] (03CR) 10Bking: [C: 03+2] dse-k8s: trigger savepoint for flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/961190 (https://phabricator.wikimedia.org/T346231) (owner: 10Bking) [19:14:24] (03PS1) 10Jclark-ctr: add pc10(15-16) T342164 [puppet] - 10https://gerrit.wikimedia.org/r/961191 (https://phabricator.wikimedia.org/T342164) [19:14:49] (03CR) 
10CI reject: [V: 04-1] add pc10(15-16) T342164 [puppet] - 10https://gerrit.wikimedia.org/r/961191 (https://phabricator.wikimedia.org/T342164) (owner: 10Jclark-ctr) [19:16:00] (03PS1) 10Ryan Kemper: Revert "wdqs: silence alerts on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/961206 [19:16:04] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:16:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:16:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:16:33] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:16:43] (03PS2) 10Ryan Kemper: Revert "wdqs: silence alerts on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/961206 (https://phabricator.wikimedia.org/T345475) [19:16:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:21] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1023.eqiad.wmnet with reason: host reimage [19:18:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P52656 and previous config saved to /var/cache/conftool/dbconfig/20230926-191843-arnaudb.json [19:18:45] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [19:18:58] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:18:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [19:19:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T343198)', 
diff saved to https://phabricator.wikimedia.org/P52657 and previous config saved to /var/cache/conftool/dbconfig/20230926-191904-arnaudb.json [19:21:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, 10Patch-For-Review: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr) [19:21:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1023.eqiad.wmnet with reason: host reimage [19:21:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:42] (03PS1) 10Joal: Update eventgate services docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/961197 (https://phabricator.wikimedia.org/T325565) [19:23:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:23:22] (03CR) 10Gmodena: [C: 03+1] Update eventgate services docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/961197 (https://phabricator.wikimedia.org/T325565) (owner: 10Joal) [19:24:17] (03CR) 10Gmodena: [C: 03+2] Update eventgate services docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/961197 (https://phabricator.wikimedia.org/T325565) (owner: 10Joal) [19:24:52] (03Abandoned) 10Jclark-ctr: add pc10(15-16) T342164 [puppet] - 10https://gerrit.wikimedia.org/r/961191 (https://phabricator.wikimedia.org/T342164) (owner: 10Jclark-ctr) [19:25:15] (03Merged) 10jenkins-bot: Update eventgate services docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/961197 (https://phabricator.wikimedia.org/T325565) (owner: 10Joal) [19:26:16] (NodeTextfileStale) firing: Stale textfile for 
cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:27:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:27:34] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:28:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:47] !log joal@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [19:31:07] !log joal@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [19:32:46] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [19:32:55] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [19:32:56] !log joal@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [19:32:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host 
cloudelastic1007.e... [19:33:05] (03CR) 10Bking: [C: 03+1] Revert "wdqs: silence alerts on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/961206 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper) [19:33:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc1016.eqiad.wmnet with OS bullseye [19:33:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc1015.eqiad.wmnet with OS bullseye [19:33:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [19:33:20] (03CR) 10Ryan Kemper: [C: 03+2] Revert "wdqs: silence alerts on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/961206 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper) [19:33:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc1016.eqiad.wmnet with OS bullseye [19:33:28] !log joal@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [19:33:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc1015.eqiad.wmnet with OS bullseye [19:33:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:20] (03PS5) 10Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 
(https://phabricator.wikimedia.org/T325773) [19:36:26] (03CR) 10Andrew Bogott: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [19:37:03] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [19:37:30] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [19:38:28] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [19:40:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [19:40:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1023.eqiad.wmnet with OS bullseye [19:40:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host wdqs1023.eqiad.wmnet with OS bullseye completed: - wdqs1023 (**PASS**) - Remov... 
[19:41:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Replace RAID controller battery on an-worker1086 - https://phabricator.wikimedia.org/T347287 (10Jclark-ctr) 05Open→03Resolved [19:41:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [19:41:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [19:42:08] (03CR) 10CI reject: [V: 04-1] Openstack backend: make use of all_tenants nova api flag [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: 10Andrew Bogott) [19:42:31] !log joal@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [19:42:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) [19:42:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10Jclark-ctr) 05Open→03Resolved [19:43:01] !log joal@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [19:43:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10nskaggs) [19:43:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2015.codfw.wmnet with OS bullseye [19:43:54] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2015.codfw.wmnet with OS bullseye completed: - restbase20... 
[19:45:13] !log joal@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [19:45:57] !log joal@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [19:46:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:51] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [19:46:59] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [19:47:18] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2015.codfw.wmnet [19:47:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2015.codfw.wmnet [19:47:44] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [19:48:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS bullseye [19:48:40] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2016.codfw.wmnet with OS bullseye [19:49:12] (03PS6) 10CDanis: haproxy: Add support for filter bwlim-(in|out) [puppet] - 10https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [19:50:40] (03PS9) 10CDanis: hiera: Test HAProxy bw limits per URL on cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [19:51:10] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [19:51:13] (03PS1) 10Jclark-ctr: add new 
lists server lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/961204 (https://phabricator.wikimedia.org/T342374) [19:51:18] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [19:52:01] (03CR) 10Jclark-ctr: [C: 03+2] add new lists server lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/961204 (https://phabricator.wikimedia.org/T342374) (owner: 10Jclark-ctr) [19:52:52] !log joal@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [19:52:53] (03CR) 10RobH: [C: 03+2] add new lists server lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/961204 (https://phabricator.wikimedia.org/T342374) (owner: 10Jclark-ctr) [19:53:12] !log joal@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [19:53:42] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye [19:53:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, 10Patch-For-Review: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye [19:54:53] !log joal@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [19:54:57] (03PS1) 10Bking: rdf-streaming-updater: restore from checkpoint (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961205 (https://phabricator.wikimedia.org/T346231) [19:55:23] !log joal@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [19:55:38] (03CR) 10Ebernhardson: [C: 03+1] rdf-streaming-updater: restore from checkpoint (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961205 (https://phabricator.wikimedia.org/T346231) (owner: 10Bking) [19:55:51] (03CR) 10Ryan Kemper: [C: 03+1] 
rdf-streaming-updater: restore from checkpoint (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961205 (https://phabricator.wikimedia.org/T346231) (owner: 10Bking) [19:56:28] (03PS10) 10CDanis: hiera: Test HAProxy bw limits per URL on cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [19:56:37] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [19:57:27] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [19:57:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P52659 and previous config saved to /var/cache/conftool/dbconfig/20230926-195750-arnaudb.json [19:57:58] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [19:57:59] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:58:01] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: restore from checkpoint (dse-k8s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961205 (https://phabricator.wikimedia.org/T346231) (owner: 10Bking) [19:58:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1015.eqiad.wmnet with reason: host reimage [19:59:33] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:59:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T2000). 
[20:00:04] lucaswerkmeister and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] here [20:00:12] o/ [20:00:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:00:17] o/ [20:00:18] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:00:48] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:00:48] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:00:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:00:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:01:08] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:01:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:01:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:02:08] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1015.eqiad.wmnet with reason: host reimage [20:02:10] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:02:16] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:02:25] !log joal@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [20:02:31] (03PS11) 10CDanis: hiera: Test HAProxy bw limits per URL on cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [20:02:32] starting with Jdlrobson's patches, since Lucas's has some very magical-looking numbers that I'm not 100% sure about [20:02:38] !log joal@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [20:02:50] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:02:55] (03PS4) 10Majavah: Update wikiquote wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959848 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [20:02:58] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [20:03:00] 
RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:03:00] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:03:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959876 (https://phabricator.wikimedia.org/T341258) (owner: 10Jdlrobson) [20:03:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959848 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [20:03:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955393 (owner: 10Jdlrobson) [20:03:50] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:04:01] !log joal@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [20:04:11] (03PS1) 10Bking: flink-app: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/961227 (https://phabricator.wikimedia.org/T346231) [20:04:20] !log joal@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [20:04:25] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:04:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:04:36] (03CR) 10Ryan Kemper: [C: 03+1] flink-app: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/961227 (https://phabricator.wikimedia.org/T346231) (owner: 10Bking) [20:04:38] (03Merged) 10jenkins-bot: Wordmarks for Wikinews projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959876 (https://phabricator.wikimedia.org/T341258) (owner: 10Jdlrobson) [20:04:41] (03Merged) 10jenkins-bot: Update wikiquote wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959848 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson) [20:04:49] (03Merged) 10jenkins-bot: Update README clarifying the use of local images. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955393 (owner: 10Jdlrobson) [20:04:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:04:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:05:17] !log taavi@deploy2002 Started scap: Backport for [[gerrit:959876|Wordmarks for Wikinews projects (T341258)]], [[gerrit:959848|Update wikiquote wordmarks (T341260)]], [[gerrit:955393|Update README clarifying the use of local images.]] [20:05:56] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:06:04] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [20:06:06] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name 
java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:06:06] T341258: Provide wordmarks for Wikinews projects - https://phabricator.wikimedia.org/T341258 [20:06:07] T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260 [20:06:14] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:06:23] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [20:06:39] !log taavi@deploy2002 taavi and jdlrobson: Backport for [[gerrit:959876|Wordmarks for Wikinews projects (T341258)]], [[gerrit:959848|Update wikiquote wordmarks (T341260)]], [[gerrit:955393|Update README clarifying the use of local images.]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:06:44] Jdlrobson: please test [20:06:56] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:07:06] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:07:22] taavi: on it [20:07:41] (03Abandoned) 
10Ryan Kemper: flink-app: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/961227 (https://phabricator.wikimedia.org/T346231) (owner: 10Bking) [20:07:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [20:08:00] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:08:02] (03PS2) 10Andrew Bogott: slapd: introduce new slapd.conf template for ldap >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/961188 (https://phabricator.wikimedia.org/T331699) [20:08:28] taavi: wikiquote LGTM. Looking at Wikinews now [20:08:48] Amir1: hi :D is it ok to deploy lucaswerkmeister's patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/961102/)? [20:08:57] taavi: wikinews LGTM [20:09:04] !log taavi@deploy2002 taavi and jdlrobson: Continuing with sync [20:09:09] :D [20:09:20] if he is around and willing to do the work, sure :P [20:10:04] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:10:05] (03PS4) 10Majavah: Add $wgExternalLinksDomainGaps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [20:10:24] (03CR) 10CDanis: "Valentín, I believe I've brought this up-to-date with recent changes -- PTAL?" 
[puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [20:10:33] I should be able to test it [20:10:50] sounds good [20:10:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [20:12:16] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:12:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:12:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P52660 and previous config saved to /var/cache/conftool/dbconfig/20230926-201256-arnaudb.json [20:13:16] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:13:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:13:51] what's up with kafka? 
[20:14:23] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2016.codfw.wmnet with OS bullseye [20:14:31] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2016.codfw.wmnet with OS bullseye executed with errors: -... [20:14:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:14:38] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:14:55] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS bullseye [20:14:55] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:15:04] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:15:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[20:15:07] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2016.codfw.wmnet with OS bullseye [20:15:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:15:20] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:15:22] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:959876|Wordmarks for Wikinews projects (T341258)]], [[gerrit:959848|Update wikiquote wordmarks (T341260)]], [[gerrit:955393|Update README clarifying the use of local images.]] (duration: 10m 04s) [20:15:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [20:15:54] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:16:02] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:16:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[20:16:06] T341258: Provide wordmarks for Wikinews projects - https://phabricator.wikimedia.org/T341258 [20:16:07] T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260 [20:16:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:16:23] (03Merged) 10jenkins-bot: Add $wgExternalLinksDomainGaps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961102 (https://phabricator.wikimedia.org/T341000) (owner: 10Lucas Werkmeister) [20:16:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:16:48] !log taavi@deploy2002 Started scap: Backport for [[gerrit:961102|Add $wgExternalLinksDomainGaps (T341000)]] [20:16:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:16:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1015.eqiad.wmnet with OS bullseye [20:16:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc1015.eqiad.wmnet with OS bullseye completed: - pc1015 (**PASS**) - Removed fro... 
[20:17:06] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:17:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [20:18:15] !log taavi@deploy2002 taavi and lucaswerkmeister: Backport for [[gerrit:961102|Add $wgExternalLinksDomainGaps (T341000)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:19:06] LGTM, the query linked at the top of https://phabricator.wikimedia.org/T341000 is now quite fast [20:19:21] and the example from m3api’s readme also runs well enough again if I hack the code to add x-w-d [20:19:28] (03CR) 10Andrew Bogott: "This is failing because of a CI issue which I don't understand." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [20:19:37] !log taavi@deploy2002 taavi and lucaswerkmeister: Continuing with sync [20:19:42] (though I might still want to look for a faster example query anyway) [20:21:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:30] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:25:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:26:33] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:961102|Add $wgExternalLinksDomainGaps (T341000)]] (duration: 09m 44s) [20:26:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/LdapAuthentication] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/960742 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [20:27:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/LdapAuthentication] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/960743 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [20:28:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P52661 and previous config saved to /var/cache/conftool/dbconfig/20230926-202803-arnaudb.json [20:29:02] (03Merged) 10jenkins-bot: Do not set $wgPasswordResetRoutes['domain'] 
[extensions/LdapAuthentication] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/960742 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [20:29:05] (03Merged) 10jenkins-bot: Do not set $wgPasswordResetRoutes['domain'] [extensions/LdapAuthentication] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/960743 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [20:29:08] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:29:32] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:29:32] !log taavi@deploy2002 Started scap: Backport for [[gerrit:960742|Do not set $wgPasswordResetRoutes['domain'] (T345226)]], [[gerrit:960743|Do not set $wgPasswordResetRoutes['domain'] (T345226)]] [20:29:39] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [20:29:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:29:52] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:29:54] PROBLEM - Kafka 
MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:30:04] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:30:35] (KubernetesAPILatency) resolved: (27) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:31:02] !log taavi@deploy2002 taavi: Backport for [[gerrit:960742|Do not set $wgPasswordResetRoutes['domain'] (T345226)]], [[gerrit:960743|Do not set $wgPasswordResetRoutes['domain'] (T345226)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:31:16] !log taavi@deploy2002 taavi: Continuing with sync [20:31:18] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:31:28] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:31:49] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:31:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1016.eqiad.wmnet with OS bullseye [20:31:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc1016.eqiad.wmnet with OS bullseye completed: - pc1016 (**PASS**) - Removed fro... [20:32:20] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:32:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jhancock.wm) [20:32:30] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:32:40] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [20:34:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jhancock.wm) 05Open→03Resolved @Marostegui finished up [20:37:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE replicasets) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:38:08] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:960742|Do not set $wgPasswordResetRoutes['domain'] (T345226)]], [[gerrit:960743|Do not set $wgPasswordResetRoutes['domain'] (T345226)]] (duration: 08m 35s) [20:38:15] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [20:38:34] (03PS1) 10Majavah: Set WRITE_NEW for Wikitech on OATHAuth multiple devices migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961236 (https://phabricator.wikimedia.org/T242031) [20:38:42] (03PS1) 10Majavah: Set WRITE_BOTH for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961237 (https://phabricator.wikimedia.org/T242031) [20:39:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.eqiad.wmnet with OS bullseye [20:39:22] (03PS1) 10Jdlrobson: Add wordmark for li wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961238 (https://phabricator.wikimedia.org/T341258) [20:39:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye executed with errors: - lists1004... 
[20:39:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961236 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [20:40:27] (03Merged) 10jenkins-bot: Set WRITE_NEW for Wikitech on OATHAuth multiple devices migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961236 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [20:40:50] !log taavi@deploy2002 Started scap: Backport for [[gerrit:961236|Set WRITE_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]] [20:41:10] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [20:41:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:42:01] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2016.codfw.wmnet with OS bullseye [20:42:19] !log taavi@deploy2002 taavi: Backport for [[gerrit:961236|Set WRITE_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:42:23] !log taavi@deploy2002 taavi: Continuing with sync [20:42:34] (KubernetesAPILatency) resolved: (16) High Kubernetes API latency (GET blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:54] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2016.codfw.wmnet with OS bullseye 
executed with errors: -... [20:43:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P52662 and previous config saved to /var/cache/conftool/dbconfig/20230926-204309-arnaudb.json [20:43:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [20:43:22] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [20:43:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [20:43:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T343198)', diff saved to https://phabricator.wikimedia.org/P52663 and previous config saved to /var/cache/conftool/dbconfig/20230926-204331-arnaudb.json [20:46:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:29] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:961236|Set WRITE_NEW for Wikitech on OATHAuth multiple devices migration (T242031)]] (duration: 07m 38s) [20:48:36] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [20:48:44] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:48:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... 
[20:49:58] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [20:50:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [20:50:29] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [20:50:33] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [20:55:44] (03PS2) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [20:55:48] (03PS2) 10Jdlrobson: WIP: Logos for Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960148 (https://phabricator.wikimedia.org/T341257) [20:56:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:31] (03CR) 10CI reject: [V: 04-1] WIP: Logos for Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960148 (https://phabricator.wikimedia.org/T341257) (owner: 10Jdlrobson) [20:59:31] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [20:59:37] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:00:12] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:00:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:57] (03CR) 10Andrea Denisse: [C: 03+2] 
prometheus: Enable selective scraping for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [21:01:05] (03CR) 10Andrea Denisse: [C: 03+2] "PCC results: https://puppet-compiler.wmflabs.org/output/960723/43637/" [puppet] - 10https://gerrit.wikimedia.org/r/960723 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [21:08:22] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:09:02] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:09:13] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:09:26] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:09:33] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:22:27] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:22:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T343198)', diff saved to https://phabricator.wikimedia.org/P52664 and previous config saved to /var/cache/conftool/dbconfig/20230926-212240-arnaudb.json [21:22:48] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:23:11] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:23:35] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:23:38] !log eevans@cumin1001 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:23:46] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:23:58] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2016.codfw.wmnet'] [21:30:04] (03PS1) 10Bking: partman: remove all traces of cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/961245 (https://phabricator.wikimedia.org/T342463) [21:32:39] (03CR) 10Eevans: [C: 03+1] partman: remove all traces of cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/961245 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [21:33:01] (03CR) 10Bking: [C: 03+2] partman: remove all traces of cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/961245 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [21:36:44] (03CR) 10Volans: "reply to question" [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [21:37:09] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye [21:37:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye [21:37:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P52665 and previous config saved to /var/cache/conftool/dbconfig/20230926-213747-arnaudb.json [21:37:58] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS bullseye [21:38:05] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - 
https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2016.codfw.wmnet with OS bullseye [21:52:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P52666 and previous config saved to /var/cache/conftool/dbconfig/20230926-215254-arnaudb.json [21:53:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:24] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2016.codfw.wmnet with reason: host reimage [21:59:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2016.codfw.wmnet with reason: host reimage [22:00:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:01:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:03:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:03:47] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:04:08] (03CR) 10Andrea Denisse: [C: 03+2] superset: Disable Prometheus scraping for superset metrics [puppet] - 10https://gerrit.wikimedia.org/r/960638 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [22:04:09] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:04:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:05:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:05:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:06:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1005 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties 
https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:06:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1007 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:06:47] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:06:57] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:08:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T343198)', diff saved to https://phabricator.wikimedia.org/P52667 and previous config saved to /var/cache/conftool/dbconfig/20230926-220801-arnaudb.json [22:08:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [22:08:06] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [22:08:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52668 and previous config saved to /var/cache/conftool/dbconfig/20230926-220812-arnaudb.json [22:08:13] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:15:49] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:17:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:17:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.eqiad.wmnet with OS bullseye [22:22:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye executed with errors: - lists1004... 
[22:24:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lists1004'] [22:25:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lists1004'] [22:25:19] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lists1004'] [22:26:11] (03CR) 10Cwhite: [C: 03+2] sre: swagger probe failure to critical [alerts] - 10https://gerrit.wikimedia.org/r/961063 (owner: 10Filippo Giunchedi) [22:26:15] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:26:15] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:26:15] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:26:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:26:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lists1004'] [22:26:50] !log 
jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lists1004'] [22:26:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:27:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lists1004'] [22:27:27] (03Merged) 10jenkins-bot: sre: swagger probe failure to critical [alerts] - 10https://gerrit.wikimedia.org/r/961063 (owner: 10Filippo Giunchedi) [22:27:39] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:27:39] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1006 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:27:39] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:27:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:28:19] RECOVERY - 
Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [22:30:04] (03CR) 10Cwhite: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [22:30:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:33:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:34:10] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961131 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [22:34:58] (03CR) 10Cwhite: [C: 03+1] pyrra: add trafficserver mapping [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [22:35:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2016.codfw.wmnet with OS bullseye [22:36:08] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2016.codfw.wmnet with OS bullseye completed: - restbase20... 
[22:36:35] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:41:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:44:21] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:44:46] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:45:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:46:01] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [22:47:24] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2020.codfw.wmnet'] [22:47:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2020.codfw.wmnet'] [22:49:07] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host lists1004.eqiad.wmnet with OS bullseye [22:49:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye [22:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:53:12] (03PS3) 10Andrew Bogott: slapd: introduce new slapd.conf template for ldap >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/961188 (https://phabricator.wikimedia.org/T331699) [22:55:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:55:54] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/961128 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [22:57:43] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/961131 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [22:58:26] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/961133 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [23:04:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52669 and previous config saved to /var/cache/conftool/dbconfig/20230926-230445-arnaudb.json [23:04:53] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [23:19:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P52670 and previous config saved to /var/cache/conftool/dbconfig/20230926-231951-arnaudb.json [23:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:34:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P52671 and previous config saved to /var/cache/conftool/dbconfig/20230926-233458-arnaudb.json [23:36:48] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2020.codfw.wmnet with OS bullseye [23:36:56] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2020.codfw.wmnet with OS bullseye [23:41:12] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2022.codfw.wmnet [23:41:23] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase2022.codfw.wmnet [23:41:33] !log eevans@cumin1001 
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase2022.codfw.wmnet [23:41:46] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase2022.codfw.wmnet [23:46:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.eqiad.wmnet with OS bullseye [23:46:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host lists1004.eqiad.wmnet with OS bullseye executed with errors: - lists1004... [23:50:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52672 and previous config saved to /var/cache/conftool/dbconfig/20230926-235005-arnaudb.json [23:50:07] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:50:13] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [23:50:21] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:50:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52673 and previous config saved to /var/cache/conftool/dbconfig/20230926-235026-arnaudb.json [23:52:39] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2020.codfw.wmnet with reason: host reimage [23:55:14] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2020.codfw.wmnet with reason: host reimage [23:56:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T343198)', 
diff saved to https://phabricator.wikimedia.org/P52674 and previous config saved to /var/cache/conftool/dbconfig/20230926-235602-arnaudb.json [23:56:10] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198