[00:00:19] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:23] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P33418 and previous config saved to /var/cache/conftool/dbconfig/20220828-001317-ladsgroup.json [00:19:53] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:28:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P33419 and previous config saved to /var/cache/conftool/dbconfig/20220828-002823-ladsgroup.json [00:43:13] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33420 and previous config saved to /var/cache/conftool/dbconfig/20220828-004329-ladsgroup.json [00:43:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [00:43:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [00:43:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:44:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:44:11] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T316186)', diff saved to https://phabricator.wikimedia.org/P33421 and previous config saved to /var/cache/conftool/dbconfig/20220828-004410-ladsgroup.json [00:50:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T316186)', diff saved to https://phabricator.wikimedia.org/P33422 and previous config saved to /var/cache/conftool/dbconfig/20220828-005015-ladsgroup.json [01:05:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P33423 and previous config saved to /var/cache/conftool/dbconfig/20220828-010522-ladsgroup.json [01:20:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P33424 and previous config saved to /var/cache/conftool/dbconfig/20220828-012028-ladsgroup.json [01:23:11] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:35:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T316186)', diff saved to https://phabricator.wikimedia.org/P33425 and previous config saved to /var/cache/conftool/dbconfig/20220828-013534-ladsgroup.json [01:35:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [01:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [01:35:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T316186)', diff saved to https://phabricator.wikimedia.org/P33426 and previous config saved to /var/cache/conftool/dbconfig/20220828-013558-ladsgroup.json [01:36:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T316186)', diff saved to https://phabricator.wikimedia.org/P33427 and previous config saved to /var/cache/conftool/dbconfig/20220828-014101-ladsgroup.json [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P33428 and previous config saved to /var/cache/conftool/dbconfig/20220828-015608-ladsgroup.json [01:58:55] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:06:45] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P33429 and previous config saved to /var/cache/conftool/dbconfig/20220828-021114-ladsgroup.json [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T316186)', diff saved to https://phabricator.wikimedia.org/P33430 and previous config saved to /var/cache/conftool/dbconfig/20220828-022620-ladsgroup.json [02:26:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [02:26:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [02:30:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [02:31:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [02:31:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T316186)', diff saved to https://phabricator.wikimedia.org/P33431 and previous config saved to /var/cache/conftool/dbconfig/20220828-023111-ladsgroup.json [02:36:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T316186)', diff saved to https://phabricator.wikimedia.org/P33432 and previous config saved to /var/cache/conftool/dbconfig/20220828-023618-ladsgroup.json [02:51:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P33433 and previous config saved to /var/cache/conftool/dbconfig/20220828-025124-ladsgroup.json [03:06:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P33434 and previous config saved to /var/cache/conftool/dbconfig/20220828-030631-ladsgroup.json [03:09:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:09:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:10:22] looking [03:12:57] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:14:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:14:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T316186)', diff saved to https://phabricator.wikimedia.org/P33435 and previous config saved to /var/cache/conftool/dbconfig/20220828-032137-ladsgroup.json [03:21:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [03:21:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [03:22:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T316186)', diff saved to https://phabricator.wikimedia.org/P33436 and previous config saved to /var/cache/conftool/dbconfig/20220828-032202-ladsgroup.json [03:23:55] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:27:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T316186)', diff saved to https://phabricator.wikimedia.org/P33437 and previous config saved to /var/cache/conftool/dbconfig/20220828-032713-ladsgroup.json [03:42:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P33438 and previous config saved to /var/cache/conftool/dbconfig/20220828-034219-ladsgroup.json [03:57:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P33439 and previous config saved to /var/cache/conftool/dbconfig/20220828-035725-ladsgroup.json [04:01:41] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:06:30] (03PS4) 10Gergő Tisza: Declare mediawiki.createaccount_blocked_user schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [04:12:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T316186)', diff saved to https://phabricator.wikimedia.org/P33440 and previous config saved to /var/cache/conftool/dbconfig/20220828-041231-ladsgroup.json [04:12:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [04:12:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [04:15:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [04:16:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [04:16:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [04:16:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [04:16:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T316186)', diff saved to https://phabricator.wikimedia.org/P33441 and previous config saved to /var/cache/conftool/dbconfig/20220828-041622-ladsgroup.json [04:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T316186)', diff saved to https://phabricator.wikimedia.org/P33442 and previous config saved to /var/cache/conftool/dbconfig/20220828-042145-ladsgroup.json [04:36:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P33443 and previous config saved to /var/cache/conftool/dbconfig/20220828-043651-ladsgroup.json [04:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P33444 and previous config saved to /var/cache/conftool/dbconfig/20220828-045157-ladsgroup.json [05:03:13] 10SRE, 10Wikimedia-Mailing-lists: ESEAP mailing list set up - https://phabricator.wikimedia.org/T316454 (10Legoktm) To clarify, by "Secondary list administrator's email address" we want each list to have a second list administrator, not a secondary email for the first administrator :) [05:07:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T316186)', diff saved to https://phabricator.wikimedia.org/P33445 and previous config saved to /var/cache/conftool/dbconfig/20220828-050704-ladsgroup.json [05:07:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [05:07:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [05:07:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T316186)', diff saved to https://phabricator.wikimedia.org/P33446 and previous config saved to /var/cache/conftool/dbconfig/20220828-050729-ladsgroup.json [05:12:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T316186)', diff saved to https://phabricator.wikimedia.org/P33447 and previous config saved to /var/cache/conftool/dbconfig/20220828-051237-ladsgroup.json [05:27:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P33448 and previous config saved to /var/cache/conftool/dbconfig/20220828-052743-ladsgroup.json [05:42:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P33449 and previous config saved to /var/cache/conftool/dbconfig/20220828-054249-ladsgroup.json [05:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T316186)', diff saved to https://phabricator.wikimedia.org/P33450 and previous config saved to /var/cache/conftool/dbconfig/20220828-055756-ladsgroup.json [05:58:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [05:58:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [05:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2113 (T316186)', diff saved to https://phabricator.wikimedia.org/P33451 and previous config saved to /var/cache/conftool/dbconfig/20220828-055821-ladsgroup.json [06:03:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2113 (T316186)', diff saved to https://phabricator.wikimedia.org/P33452 and previous config saved to /var/cache/conftool/dbconfig/20220828-060336-ladsgroup.json [06:17:01] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:18:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2113', diff saved to https://phabricator.wikimedia.org/P33453 and previous config saved to /var/cache/conftool/dbconfig/20220828-061842-ladsgroup.json [06:33:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2113', diff saved to https://phabricator.wikimedia.org/P33454 and previous config saved to /var/cache/conftool/dbconfig/20220828-063348-ladsgroup.json [06:48:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2113 (T316186)', diff saved to https://phabricator.wikimedia.org/P33455 and previous config saved to /var/cache/conftool/dbconfig/20220828-064855-ladsgroup.json [06:49:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [06:49:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [06:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33456 and previous config saved to /var/cache/conftool/dbconfig/20220828-064920-ladsgroup.json [06:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33457 and previous config saved to /var/cache/conftool/dbconfig/20220828-064952-ladsgroup.json [06:55:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33458 and previous config saved to /var/cache/conftool/dbconfig/20220828-065557-ladsgroup.json [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220828T0700) [07:00:14] (03CR) 10Physikerwelt: "@Αλέξανδρος Κοσιάρης can you deploy this?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/826957 (owner: 10PipelineBot) [07:11:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P33459 and previous config saved to /var/cache/conftool/dbconfig/20220828-071103-ladsgroup.json [07:26:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P33460 and previous config saved to /var/cache/conftool/dbconfig/20220828-072610-ladsgroup.json [07:41:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33461 and previous config saved to /var/cache/conftool/dbconfig/20220828-074116-ladsgroup.json [07:43:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33462 and previous config saved to /var/cache/conftool/dbconfig/20220828-074332-ladsgroup.json [07:58:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P33463 and previous config saved to /var/cache/conftool/dbconfig/20220828-075838-ladsgroup.json [08:13:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P33464 and previous config saved to /var/cache/conftool/dbconfig/20220828-081344-ladsgroup.json [08:28:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33465 and previous config saved to /var/cache/conftool/dbconfig/20220828-082851-ladsgroup.json [08:42:49] 10SRE, 10Wikimedia-Etherpad: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10JeanFred) That would be 1.8.16 → 1.8.18 then? There’s not much in the change log: https://github.com/ether/etherpad-lite/blob/develop/CHA... [09:05:03] (03PS1) 10Majavah: dynamicproxy: improve /zones API [puppet] - 10https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) [09:20:55] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:28:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:28:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:31:49] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:33:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [09:33:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [09:33:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T316186)', diff saved to https://phabricator.wikimedia.org/P33466 and previous config saved to /var/cache/conftool/dbconfig/20220828-093346-ladsgroup.json [09:39:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T316186)', diff saved to https://phabricator.wikimedia.org/P33467 and previous config saved to /var/cache/conftool/dbconfig/20220828-093904-ladsgroup.json [09:54:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33468 and previous config saved to /var/cache/conftool/dbconfig/20220828-095411-ladsgroup.json [09:54:54] (03PS1) 10Majavah: dynamicproxy: api: do not return 'no such project' errors [puppet] - 10https://gerrit.wikimedia.org/r/826990 [10:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33469 and previous config saved to /var/cache/conftool/dbconfig/20220828-100917-ladsgroup.json [10:12:53] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:19:59] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:24:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T316186)', diff saved to https://phabricator.wikimedia.org/P33470 and previous config saved to /var/cache/conftool/dbconfig/20220828-102423-ladsgroup.json [10:24:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:24:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:27:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [10:27:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [10:28:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T316186)', diff saved to https://phabricator.wikimedia.org/P33471 and previous config saved to /var/cache/conftool/dbconfig/20220828-102800-ladsgroup.json [10:33:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T316186)', diff saved to https://phabricator.wikimedia.org/P33472 and previous config saved to /var/cache/conftool/dbconfig/20220828-103314-ladsgroup.json [10:47:49] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P33473 and previous config saved to /var/cache/conftool/dbconfig/20220828-104820-ladsgroup.json [11:03:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P33474 and previous config saved to /var/cache/conftool/dbconfig/20220828-110326-ladsgroup.json [11:18:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T316186)', diff saved to https://phabricator.wikimedia.org/P33475 and previous config saved to /var/cache/conftool/dbconfig/20220828-111832-ladsgroup.json [11:18:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [11:18:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [11:18:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T316186)', diff saved to https://phabricator.wikimedia.org/P33476 and previous config saved to /var/cache/conftool/dbconfig/20220828-111857-ladsgroup.json [11:24:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T316186)', diff saved to https://phabricator.wikimedia.org/P33477 and previous config saved to /var/cache/conftool/dbconfig/20220828-112412-ladsgroup.json [11:39:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P33478 and previous config saved to /var/cache/conftool/dbconfig/20220828-113918-ladsgroup.json [11:54:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P33479 and previous config saved to /var/cache/conftool/dbconfig/20220828-115424-ladsgroup.json [11:59:41] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:01:51] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 16 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:09:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T316186)', diff saved to https://phabricator.wikimedia.org/P33480 and previous config saved to /var/cache/conftool/dbconfig/20220828-120931-ladsgroup.json [12:09:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [12:09:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [12:09:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:09:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [12:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T316186)', diff saved to https://phabricator.wikimedia.org/P33481 and previous config saved to /var/cache/conftool/dbconfig/20220828-121000-ladsgroup.json [12:15:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T316186)', diff saved to https://phabricator.wikimedia.org/P33482 and previous config saved to /var/cache/conftool/dbconfig/20220828-121515-ladsgroup.json [12:30:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P33483 and previous config saved to /var/cache/conftool/dbconfig/20220828-123021-ladsgroup.json [12:45:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P33484 and previous config saved to /var/cache/conftool/dbconfig/20220828-124527-ladsgroup.json [12:50:34] 10SRE, 10Performance-Team, 10Traffic: Enable HTTP compression for arclamp trace logs - https://phabricator.wikimedia.org/T305783 (10Krinkle) 05Open→03Declined [13:00:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T316186)', diff saved to https://phabricator.wikimedia.org/P33485 and previous config saved to /var/cache/conftool/dbconfig/20220828-130033-ladsgroup.json [13:00:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [13:00:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [13:00:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T316186)', diff saved to https://phabricator.wikimedia.org/P33486 and previous config saved to /var/cache/conftool/dbconfig/20220828-130059-ladsgroup.json [13:06:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T316186)', diff saved to https://phabricator.wikimedia.org/P33487 and previous config saved to /var/cache/conftool/dbconfig/20220828-130614-ladsgroup.json [13:21:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P33488 and previous config saved to /var/cache/conftool/dbconfig/20220828-132120-ladsgroup.json [13:24:44] 10SRE, 10Wikimedia-Mailing-lists: ESEAP mailing list set up - https://phabricator.wikimedia.org/T316454 (10Aklapper) What is "ESEAP"? [13:25:57] 10SRE, 10Wikimedia-Mailing-lists: ESEAP mailing list set up - https://phabricator.wikimedia.org/T316454 (10Aklapper) Ah, found something. Is this somehow related to https://meta.wikimedia.org/wiki/ESEAP_Hub ? Who to use that list? [13:28:45] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P33489 and previous config saved to /var/cache/conftool/dbconfig/20220828-133627-ladsgroup.json [13:40:53] 10SRE, 10Wikimedia-Mailing-lists: ESEAP mailing list set up - https://phabricator.wikimedia.org/T316454 (10Ladsgroup) >>! In T316454#8191716, @Legoktm wrote: > To clarify, by "Secondary list administrator's email address" we want each list to have a second list administrator, not a secondary email for the firs... [13:51:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T316186)', diff saved to https://phabricator.wikimedia.org/P33490 and previous config saved to /var/cache/conftool/dbconfig/20220828-135133-ladsgroup.json [13:51:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [13:51:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [13:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33491 and previous config saved to /var/cache/conftool/dbconfig/20220828-135158-ladsgroup.json [13:57:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33492 and previous config saved to /var/cache/conftool/dbconfig/20220828-135713-ladsgroup.json [14:12:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P33493 and previous config saved to /var/cache/conftool/dbconfig/20220828-141220-ladsgroup.json [14:27:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P33494 and previous config saved to /var/cache/conftool/dbconfig/20220828-142726-ladsgroup.json [14:29:59] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:42:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33495 and previous config saved to /var/cache/conftool/dbconfig/20220828-144232-ladsgroup.json [14:42:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [14:42:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [14:42:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33496 and previous config saved to /var/cache/conftool/dbconfig/20220828-144257-ladsgroup.json [14:43:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33497 and previous config saved to /var/cache/conftool/dbconfig/20220828-144319-ladsgroup.json [14:48:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33498 and previous config saved to /var/cache/conftool/dbconfig/20220828-144830-ladsgroup.json [14:56:25] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:58:49] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P33499 and previous config saved to /var/cache/conftool/dbconfig/20220828-150336-ladsgroup.json [15:12:52] (03PS1) 10Ori: Increase roll-out of query-sorting to 50% [puppet] - 10https://gerrit.wikimedia.org/r/826994 (https://phabricator.wikimedia.org/T314868) [15:14:30] (03CR) 10Ori: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37010/console" [puppet] - 10https://gerrit.wikimedia.org/r/826994 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [15:18:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P33500 and previous config saved to /var/cache/conftool/dbconfig/20220828-151843-ladsgroup.json [15:19:23] (03PS1) 10Ori: Increase roll-out of query-sorting to 75% [puppet] - 10https://gerrit.wikimedia.org/r/826996 (https://phabricator.wikimedia.org/T314868) [15:20:19] (03CR) 10Ori: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37011/console" [puppet] - 10https://gerrit.wikimedia.org/r/826996 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [15:30:41] (03PS1) 10Ori: Increase query-sorting to 100%, remove sampling code [puppet] - 10https://gerrit.wikimedia.org/r/826997 (https://phabricator.wikimedia.org/T314868) [15:31:36] (03CR) 10Ori: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37012/console" [puppet] - 10https://gerrit.wikimedia.org/r/826997 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [15:33:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33501 and previous config saved to /var/cache/conftool/dbconfig/20220828-153349-ladsgroup.json [15:38:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33502 and previous config saved to /var/cache/conftool/dbconfig/20220828-153806-ladsgroup.json [15:46:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:50:45] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:53:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P33503 and previous config saved to /var/cache/conftool/dbconfig/20220828-155312-ladsgroup.json [16:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P33504 and previous config saved to /var/cache/conftool/dbconfig/20220828-160818-ladsgroup.json [16:23:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33505 and previous config saved to /var/cache/conftool/dbconfig/20220828-162324-ladsgroup.json [16:23:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [16:23:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [16:23:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33506 and previous config saved to /var/cache/conftool/dbconfig/20220828-162349-ladsgroup.json [16:29:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33507 and previous config saved to /var/cache/conftool/dbconfig/20220828-162906-ladsgroup.json [16:32:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33508 and previous config saved to /var/cache/conftool/dbconfig/20220828-163211-ladsgroup.json [16:34:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [16:34:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [16:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T316186)', diff saved to https://phabricator.wikimedia.org/P33509 and previous config saved to /var/cache/conftool/dbconfig/20220828-163447-ladsgroup.json [16:40:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T316186)', diff saved to https://phabricator.wikimedia.org/P33510 and previous config saved to /var/cache/conftool/dbconfig/20220828-164004-ladsgroup.json [16:41:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [16:41:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [16:41:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:42:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T316186)', diff saved to https://phabricator.wikimedia.org/P33511 and previous config saved to /var/cache/conftool/dbconfig/20220828-164211-ladsgroup.json [16:47:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T316186)', diff saved to https://phabricator.wikimedia.org/P33512 and previous config saved to /var/cache/conftool/dbconfig/20220828-164722-ladsgroup.json [17:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P33513 and previous config saved to /var/cache/conftool/dbconfig/20220828-170228-ladsgroup.json [17:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P33514 and previous config saved to /var/cache/conftool/dbconfig/20220828-171734-ladsgroup.json [17:18:23] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T316186)', diff saved to https://phabricator.wikimedia.org/P33515 and previous config saved to /var/cache/conftool/dbconfig/20220828-173241-ladsgroup.json [17:32:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [17:32:57] hmm, db1101 seems to be fine from the mariadb point of view [17:32:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [17:33:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T316186)', diff saved to https://phabricator.wikimedia.org/P33516 and previous config saved to /var/cache/conftool/dbconfig/20220828-173304-ladsgroup.json [17:39:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:40:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling failed', diff saved to https://phabricator.wikimedia.org/P33517 and previous config saved to /var/cache/conftool/dbconfig/20220828-174002-ladsgroup.json [17:40:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:40:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:40:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T316186)', diff saved to https://phabricator.wikimedia.org/P33518 and previous config saved to /var/cache/conftool/dbconfig/20220828-174059-ladsgroup.json [17:46:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T316186)', diff saved to https://phabricator.wikimedia.org/P33519 and previous config saved to /var/cache/conftool/dbconfig/20220828-174630-ladsgroup.json [17:46:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [17:46:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [17:46:55] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:46:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T316186)', diff saved to https://phabricator.wikimedia.org/P33520 and previous config saved to /var/cache/conftool/dbconfig/20220828-174655-ladsgroup.json [17:52:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T316186)', diff saved to https://phabricator.wikimedia.org/P33521 and previous config saved to /var/cache/conftool/dbconfig/20220828-175246-ladsgroup.json [17:52:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [17:53:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [17:53:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2140 (T316186)', diff saved to https://phabricator.wikimedia.org/P33522 and previous config saved to /var/cache/conftool/dbconfig/20220828-175311-ladsgroup.json [18:00:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T316186)', diff saved to https://phabricator.wikimedia.org/P33523 and previous config saved to /var/cache/conftool/dbconfig/20220828-180042-ladsgroup.json [18:00:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [18:01:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [18:01:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T316186)', diff saved to https://phabricator.wikimedia.org/P33524 and previous config saved to /var/cache/conftool/dbconfig/20220828-180108-ladsgroup.json [18:07:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T316186)', diff saved to https://phabricator.wikimedia.org/P33525 and previous config saved to /var/cache/conftool/dbconfig/20220828-180725-ladsgroup.json [18:07:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [18:07:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [18:07:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T316186)', diff saved to https://phabricator.wikimedia.org/P33526 and previous config saved to /var/cache/conftool/dbconfig/20220828-180751-ladsgroup.json [18:14:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T316186)', diff saved to https://phabricator.wikimedia.org/P33527 and previous config saved to /var/cache/conftool/dbconfig/20220828-181421-ladsgroup.json [18:14:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [18:14:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [18:17:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [18:17:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [18:18:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33528 and previous config saved to /var/cache/conftool/dbconfig/20220828-181805-ladsgroup.json [18:18:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33529 and previous config saved to /var/cache/conftool/dbconfig/20220828-181830-ladsgroup.json [18:19:35] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:23:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33530 and previous config saved to /var/cache/conftool/dbconfig/20220828-182350-ladsgroup.json [18:26:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33531 and previous config saved to /var/cache/conftool/dbconfig/20220828-182605-ladsgroup.json [18:26:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [18:26:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [18:26:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T316186)', diff saved to https://phabricator.wikimedia.org/P33532 and previous config saved to /var/cache/conftool/dbconfig/20220828-182630-ladsgroup.json [18:31:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T316186)', diff saved to https://phabricator.wikimedia.org/P33533 and previous config saved to /var/cache/conftool/dbconfig/20220828-183156-ladsgroup.json [18:32:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [18:32:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [18:32:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:32:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:32:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T316186)', diff saved to https://phabricator.wikimedia.org/P33534 and previous config saved to /var/cache/conftool/dbconfig/20220828-183226-ladsgroup.json [18:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T316186)', diff saved to https://phabricator.wikimedia.org/P33535 and previous config saved to /var/cache/conftool/dbconfig/20220828-183850-ladsgroup.json [18:38:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [18:39:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [18:39:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T316186)', diff saved to https://phabricator.wikimedia.org/P33536 and previous config saved to /var/cache/conftool/dbconfig/20220828-183915-ladsgroup.json [18:45:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T316186)', diff saved to https://phabricator.wikimedia.org/P33537 and previous config saved to /var/cache/conftool/dbconfig/20220828-184542-ladsgroup.json [18:50:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:50:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:50:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T316186)', diff saved to https://phabricator.wikimedia.org/P33538 and previous config saved to /var/cache/conftool/dbconfig/20220828-185022-ladsgroup.json [18:55:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T316186)', diff saved to https://phabricator.wikimedia.org/P33539 and previous config saved to /var/cache/conftool/dbconfig/20220828-185536-ladsgroup.json [18:55:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [18:55:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [18:55:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:55:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:56:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T316186)', diff saved to https://phabricator.wikimedia.org/P33540 and previous config saved to /var/cache/conftool/dbconfig/20220828-185606-ladsgroup.json [19:02:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T316186)', diff saved to https://phabricator.wikimedia.org/P33541 and previous config saved to /var/cache/conftool/dbconfig/20220828-190238-ladsgroup.json [19:02:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [19:02:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [19:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T316186)', diff saved to https://phabricator.wikimedia.org/P33542 and previous config saved to /var/cache/conftool/dbconfig/20220828-190303-ladsgroup.json [19:08:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T316186)', diff saved to https://phabricator.wikimedia.org/P33543 and previous config saved to /var/cache/conftool/dbconfig/20220828-190824-ladsgroup.json [19:08:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [19:08:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [19:08:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2107 (T316186)', diff saved to https://phabricator.wikimedia.org/P33544 and previous config saved to /var/cache/conftool/dbconfig/20220828-190849-ladsgroup.json [19:14:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T316186)', diff saved to https://phabricator.wikimedia.org/P33545 and previous config saved to /var/cache/conftool/dbconfig/20220828-191414-ladsgroup.json [19:14:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [19:14:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [19:14:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T316186)', diff saved to https://phabricator.wikimedia.org/P33546 and previous config saved to /var/cache/conftool/dbconfig/20220828-191440-ladsgroup.json [19:19:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T316186)', diff saved to https://phabricator.wikimedia.org/P33547 and previous config saved to /var/cache/conftool/dbconfig/20220828-191951-ladsgroup.json [19:19:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:20:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33548 and previous config saved to /var/cache/conftool/dbconfig/20220828-192016-ladsgroup.json [19:20:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33549 and previous config saved to /var/cache/conftool/dbconfig/20220828-192042-ladsgroup.json [19:21:57] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:25:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33550 and previous config saved to /var/cache/conftool/dbconfig/20220828-192550-ladsgroup.json [19:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33551 and previous config saved to /var/cache/conftool/dbconfig/20220828-192705-ladsgroup.json [19:27:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [19:27:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [19:34:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [19:34:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [19:35:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T316186)', diff saved to https://phabricator.wikimedia.org/P33552 and previous config saved to /var/cache/conftool/dbconfig/20220828-193500-ladsgroup.json [19:41:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T316186)', diff saved to https://phabricator.wikimedia.org/P33553 and previous config saved to /var/cache/conftool/dbconfig/20220828-194119-ladsgroup.json [19:56:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P33554 and previous config saved to /var/cache/conftool/dbconfig/20220828-195625-ladsgroup.json [20:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P33555 and previous config saved to /var/cache/conftool/dbconfig/20220828-201131-ladsgroup.json [20:18:39] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:18:57] !log mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set energy-performance preference to 0 via 'x86_energy_perf_policy --hwp-epp 0' T315398 [20:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:02] T315398: Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 [20:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T316186)', diff saved to https://phabricator.wikimedia.org/P33556 and previous config saved to /var/cache/conftool/dbconfig/20220828-202638-ladsgroup.json [20:26:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [20:26:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [20:27:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T316186)', diff saved to https://phabricator.wikimedia.org/P33557 and previous config saved to /var/cache/conftool/dbconfig/20220828-202701-ladsgroup.json [20:32:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T316186)', diff saved to https://phabricator.wikimedia.org/P33558 and previous config saved to /var/cache/conftool/dbconfig/20220828-203223-ladsgroup.json [20:47:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P33559 and previous config saved to /var/cache/conftool/dbconfig/20220828-204729-ladsgroup.json [21:02:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P33560 and previous config saved to /var/cache/conftool/dbconfig/20220828-210235-ladsgroup.json [21:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P33561 and previous config saved to /var/cache/conftool/dbconfig/20220828-210336-ladsgroup.json [21:14:11] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10ori) I tried setting EPP to 0 using `x86_energy_perf_policy`, thinking that bypassing the sysfs interface and writing directly to the MSR would make the setti... [21:48:17] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:52:51] (03PS1) 10Legoktm: Use shell webservice-runner for python35/python37 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/827007 (https://phabricator.wikimedia.org/T293552) [22:06:32] (03PS1) 10Legoktm: Use shell webservice-runner for node16 image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552) [22:21:17] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:25:55] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:50:53] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:52:46] 10SRE, 10Wikimedia-Mailing-lists: ESEAP mailing list set up - https://phabricator.wikimedia.org/T316454 (10Samwilson) a:05Samwilson→03None Other #WMAU lists have `accounts@wikimedia.org.au` as a secondary admin (Mailman username `wmau`). Would that be suitable, or is it best to have a single person nominat...