[00:51:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:13:47] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:14:59] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:51:21] (CR) Gergő Tisza: snapshot: Dump information about Growth mentorship (2 comments) [puppet] - https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: Urbanecm)
[07:23:31] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:24:39] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:20:00] (CR) Peachey88: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous)
[10:20:19] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:21:27] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:23:59] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:33] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:53] SRE, Wikimedia-Incident: The connection to Wikimedia sites is slow on 2021-12-27 - https://phabricator.wikimedia.org/T298332 (Zabe)
[13:39:53] SRE, Wikimedia-Incident: The connection to Wikimedia sites is slow on 2021-12-27 - https://phabricator.wikimedia.org/T298332 (RhinosF1) Please follow https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue
[13:40:10] SRE, Wikimedia-Incident: The connection to Wikimedia sites is slow on 2021-12-27 - https://phabricator.wikimedia.org/T298332 (RhinosF1) FYI for others, the IP is text-lb.esqin
[13:41:48] SRE: The connection to Wikimedia sites is slow on 2021-12-27 - https://phabricator.wikimedia.org/T298332 (Majavah)
[13:44:17] SRE: connection to Wikimedia sites is slow 2021-12-27 via eqsin - https://phabricator.wikimedia.org/T298332 (RhinosF1)
[13:44:29] SRE, Traffic: connection to Wikimedia sites is slow 2021-12-27 via eqsin - https://phabricator.wikimedia.org/T298332 (RhinosF1)
[13:51:47] FYI I could reproduce that at time of filing via online test but now stopped
[13:52:02] SRE, Traffic, Security: connection to Wikimedia sites is slow 2021-12-27 via eqsin - https://phabricator.wikimedia.org/T298332 (Xiplus)
[13:53:01] There's a jump of 5xx's at time too
[13:53:03] https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=14&orgId=1&var-site=eqsin&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
[13:53:17] SRE, Traffic, Security: connection to Wikimedia sites is slow 2021-12-27 via eqsin - https://phabricator.wikimedia.org/T298332 (Xiplus) How to make the task private? Should I open another private task?
[13:58:33] I'm not sure it's eqsin specific now
[14:31:23] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:36] (PS1) Zabe: puppetmaster::gitsync: remove absented crons and logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/750251 (https://phabricator.wikimedia.org/T273673)
[16:45:15] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:53] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:17:49] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:42:01] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:53] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:18:59] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:15:21] _joe_: should have updated the task when I figured that
[20:25:49] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:11:11] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:12:19] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:22:17] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:33:43] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:13:29] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:23:19] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:30:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[22:34:49] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:35:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org