[00:38:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1088799
[00:38:24] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1088799 (owner: TrainBranchBot)
[01:08:41] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1088814
[01:08:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1088814 (owner: TrainBranchBot)
[01:12:00] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1088799 (owner: TrainBranchBot)
[01:19:56] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:41:32] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1088814 (owner: TrainBranchBot)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:46:02] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:46:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:48:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:49:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241110T0800)
[09:14:10] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers kubernetes2046.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2120.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2046.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, kub
[09:14:10] 052.codfw.wmnet, wikikube-worker2113.codfw.wmnet, kubernetes2014.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2366.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2082.codfw.wmnet, wikikube-worke
[09:14:10] dfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worke https://wikitech.wikimedia.org/wiki/PyBal
[09:14:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2102.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet,
[09:14:12] odfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2091.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2086.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2043.codfw.wmnet,
[09:14:12] e-worker2096.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikub https://wikitech.wikimedia.org/wiki/PyBal
[09:18:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:19:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:24:10] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker2021.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2120.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2117.codfw.wmne
[09:24:10] ube-worker2026.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, parse2009.codfw.wmnet, mw2370.codfw.wmnet, mw2368.codfw.wmnet, wikikube-worker2113.codfw.wmnet, wikikube-worker2091.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2059.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2043.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2
[09:24:10] w.wmnet, wikikube-worker2041.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2058.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2449.codfw.wmnet, wikikube- https://wikitech.wikimedia.org/wiki/PyBal
[09:24:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers kubernetes2046.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2052.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2076.codfw.wmnet, par
[09:24:12] odfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2092.codfw.wmnet, parse2020.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2124.codfw.wmnet, wikikube-worker2090.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2353.codfw.wmnet,
[09:24:12] e-worker2123.codfw.wmnet, wikikube-worker2050.codfw.wmnet, mw2356.codfw.wmnet, wikikube-worker2110.codfw.wmnet, mw2440.codfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2098.codfw https://wikitech.wikimedia.org/wiki/PyBal
[09:27:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:27:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:45:10] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2079.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2120.codfw.wmnet, wikikube-worker2086.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, mw2375.codfw.wmnet, wikik
[09:45:10] er2026.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2022.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet,
[09:45:10] e-worker2023.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2124.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikube-worker2055.codfw.w https://wikitech.wikimedia.org/wiki/PyBal
[09:45:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2117.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2113.codfw.wmnet, kubernetes201
[09:45:12] wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2076.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikiku
[09:45:12] r2125.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2111.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2016.codf https://wikitech.wikimedia.org/wiki/PyBal
[09:46:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[09:46:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:47:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:51:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[10:02:42] (CR) Milimetric: [C:+1] Move start day of dump_fillin_wd job from the 7th to the 10th of the month [puppet] - https://gerrit.wikimedia.org/r/1088599 (https://phabricator.wikimedia.org/T379393) (owner: Xcollazo)
[11:18:55] (CR) Gergő Tisza: [C:+1] systemd job to create missing local accounts on loginwiki/metawiki [puppet] - https://gerrit.wikimedia.org/r/1088552 (https://phabricator.wikimedia.org/T378401) (owner: ArielGlenn)
[12:09:08] (PS1) TheDJ: Correct range of A-z [puppet] - https://gerrit.wikimedia.org/r/1089077 (https://phabricator.wikimedia.org/T362829)
[12:11:35] PROBLEM - MariaDB Replica SQL: s6 #page on db2217 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table archive is corrupt: try to repair it on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:19:08] Depool?
[12:19:11] PROBLEM - MariaDB Replica Lag: s6 #page on db2217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 619.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:24] !incidents
[12:20:25] You're not allowed to perform this action.
[12:21:46] !incidents
[12:21:46] 5390 (UNACKED) db2217 (paged)/MariaDB Replica SQL: s6 (paged)
[12:21:47] 5391 (UNACKED) db2217 (paged)/MariaDB Replica Lag: s6 (paged)
[12:21:59] !ack 5390
[12:22:00] 5390 (ACKED) db2217 (paged)/MariaDB Replica SQL: s6 (paged)
[12:22:05] !ack 5391
[12:22:06] 5391 (ACKED) db2217 (paged)/MariaDB Replica Lag: s6 (paged)
[12:23:30] claime: o/ here if needed
[12:23:53] * kamila_ too
[12:24:04] +1 to depool
[12:24:08] the error index bla bla bla seems the issue that Amir mentioned a while ago
[12:24:15] namely rebuilding the index fixes
[12:24:23] not sure if there is a task or not
[12:24:39] anyway, from orchestrator it seems safe to depool, plenty of replicas
[12:25:24] no task afaict
[12:25:33] !log slyngshede@cumin1002 dbctl commit (dc=all): 'Depool db2217', diff saved to https://phabricator.wikimedia.org/P70997 and previous config saved to /var/cache/conftool/dbconfig/20241110-122532-slyngshede.json
[12:25:54] slyngs: thanks :)
[12:26:02] kamila_: slyngs created a task for that particular issue https://phabricator.wikimedia.org/T379491#10307348 so DBA can look at it on monday
[12:26:12] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2217.codfw.wmnet with reason: Corrupt Index
[12:26:25] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2217.codfw.wmnet with reason: Corrupt Index
[12:26:54] Also yes, there's the procedure from Amir's email, but tbh doing a corrupt index repair on frwiki's archive on a sunday seems ill advised when a depool should be enough to bring us back to steady state
[12:29:46] the table is archive, it seems to have 11M records, not incredibly huge
[12:31:26] elukey: How does one go about finding the broken table? Just to logs?
[12:32:15] !log optimize table `archive` on db2217 - frwiki db - corrupt index error (host already depooled)
[12:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:26] slyngs: just show slave status after sudo mysql
[12:33:36] claime: I went forward just to prep the node if something goes sideways later on, already depooled, seems safe enough (and the db folks will have to do it anyway, hope to give them some help)
[12:34:35] RECOVERY - MariaDB Replica SQL: s6 #page on db2217 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:34:39] \o/
[12:34:44] all right I think we are good
[12:34:48] You fixed it :-)
[12:35:11] RECOVERY - MariaDB Replica Lag: s6 #page on db2217 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:35:59] I am not 100% confident to declare the replica ready for production, so I'll leave a note in #sre for data persistence
[12:36:27] May also add a comment to: https://phabricator.wikimedia.org/T379491
[12:38:36] done!
[12:39:39] all right, I think we can get back to our Sunday
[12:39:41] thanks folks!
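(Editorial note: for readers following the db2217 incident above, the sequence the responders describe, check replication status, depool, downtime, rebuild the index, corresponds roughly to the shell sketch below. This is a hedged illustration, not the runbook: the dbctl and sre.hosts.downtime invocations are assumptions about exact subcommands and flag names, and the wikitech MariaDB/troubleshooting page linked in the alerts is the procedure of record.)

# Hedged sketch of the db2217/frwiki handling above; dbctl and cookbook flags
# are assumptions, check the MariaDB/troubleshooting runbook before copying.

# 1. Confirm the replication failure and the affected table (on db2217).
sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_SQL_Running|Last_SQL_Error'
# Last_SQL_Error here pointed at a corrupt index on frwiki's `archive` table.

# 2. Depool the replica before touching it (from a cumin host).
sudo dbctl instance db2217 depool            # assumed subcommand form
sudo dbctl config commit -m 'Depool db2217'

# 3. Downtime the host so it stops paging (3 days in the log above).
sudo cookbook sre.hosts.downtime --hours 72 -r 'Corrupt Index' 'db2217.codfw.wmnet'   # flag names assumed

# 4. Rebuild the corrupt index; OPTIMIZE TABLE recreates the InnoDB table and
#    its indexes, which is why replication on db2217 recovered afterwards.
sudo mysql -D frwiki -e 'OPTIMIZE TABLE archive;'

# 5. Leave the host depooled until a DBA verifies it is safe to repool.

Depooling first matters because the rebuild locks and rewrites a large table; doing it on a host that is still serving production reads would be far riskier, which is the point claime makes at 12:26:54.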
[12:42:58] Thank you :-)
[12:53:56] :o
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:18:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[15:23:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[15:51:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:52:02] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:53:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:53:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:19:18] (PS1) Hamish: Allow wgGroupsRemoveFromSelf for templateeditor, confirmed, and abusefilter-helper in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1089182 (https://phabricator.wikimedia.org/T379500)
[18:22:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1089182 (https://phabricator.wikimedia.org/T379500) (owner: Hamish)
[18:49:04] (CR) ZhaoFJx: [C:+1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/1089182 (https://phabricator.wikimedia.org/T379500) (owner: Hamish)
[20:47:02] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:47:18] (PS1) Urbanecm: Fix WeightedTagsUpdater [extensions/CirrusSearch] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1089230 (https://phabricator.wikimedia.org/T378664)
[20:47:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:48:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52923 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:14:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:16:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:37] !log re-imaging ms-be2082 to test efi boot order
[22:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:27] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye
[22:51:35] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10307868 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye
[23:14:14] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:14:48] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage
[23:16:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:36] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage
[23:24:57] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10307915 (jhathaway) @elukey I was able to reproduce the issue, by wiping the files from the efi partition, before kicking off another re-image. I think...
[23:43:53] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye
[23:44:04] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10307927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple...
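(Editorial note: on the ms-be2082 reimage above, which was testing EFI boot order, the firmware boot entries can be inspected and reordered from the running host with efibootmgr. A minimal sketch follows; the entry IDs are invented for illustration and are not taken from the log.)

# Hedged sketch: inspect and adjust UEFI boot order on a host such as ms-be2082.
# The Boot#### IDs below are made up; read the real ones from the -v output first.
sudo efibootmgr -v                  # list boot entries and the current BootOrder
sudo efibootmgr -o 0002,0001,0000   # example: put entry Boot0002 first (assumed IDs)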