[00:02:05] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1013'] [00:04:33] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:39] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1013'] [00:23:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1013'] [00:29:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1013'] [00:46:15] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:17] (03CR) 10Ssingh: [C: 03+2] cp5018: Set cp role via site.pp and related config [puppet] - 10https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) (owner: 10BCornwall) [00:56:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5018.eqsin.wmnet with OS buster [00:56:36] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5018.eqsin.wmnet with OS buster [01:03:13] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:04:57] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:13:05] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:13:37] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) [01:14:43] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.879 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:14:55] (03PS1) 10Ssingh: cp5019: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858671 (https://phabricator.wikimedia.org/T322048) [01:14:57] (03PS1) 10Ssingh: cp5020: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858672 (https://phabricator.wikimedia.org/T322048) [01:14:59] (03PS1) 10Ssingh: cp5028: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858673 (https://phabricator.wikimedia.org/T322048) [01:15:01] (03PS1) 10Ssingh: cp5029: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858674 (https://phabricator.wikimedia.org/T322048) [01:15:03] (03PS1) 10Ssingh: cp5030: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858675 (https://phabricator.wikimedia.org/T322048) [01:15:05] (03PS1) 10Ssingh: cp5031: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858676 (https://phabricator.wikimedia.org/T322048) [01:23:13] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:24:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:24:53] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:24:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [01:24:58] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:25:11] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 6.889 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:25:55] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:26:43] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.660 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:28:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [01:29:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:32:45] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:37:41] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 2.392 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:21] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:40:25] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:53] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:43:31] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:43:49] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:44:13] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:44:53] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.467 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:48:03] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:50:03] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:27] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:55:07] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 2.289 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:58:39] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:59:07] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:00:29] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:01:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5018.eqsin.wmnet with OS buster [02:01:55] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5018.eqsin.wmnet with OS buster completed: - cp5018 (**PASS**) -... [02:02:03] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:02:45] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:03:53] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:04:15] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:06:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:37] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:10:49] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:12:13] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:12:25] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:12:45] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:13:59] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:17:47] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:59] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:18:05] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:18:29] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:19:51] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 9.148 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:20:01] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.189 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:22:29] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:53] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:25:03] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:25:43] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:27:03] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.413 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:29:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:34:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:35:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:37:43] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:40:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:41:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:44:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:46:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:47:09] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:47:27] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:53:05] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:56:19] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:56:29] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:59:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:01:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:04:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:06:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:08:29] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:11:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:12:19] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:26:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:27:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:29:21] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:30:15] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:31:09] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:31:17] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 6.442 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:32:03] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.674 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:32:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:32:49] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:33:59] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:34:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:35:49] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:38:55] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:39:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:39:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:41:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1446.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:42:07] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:43:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:43:57] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:44:11] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:45:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:45:57] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:47:59] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:48:03] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:50:03] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:51:53] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:52:09] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:54:07] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 8.116 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:54:39] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:55:53] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 8.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:56:05] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:57:55] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.763 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:58:45] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:59:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:59:57] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 8.146 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:00:49] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:04:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:04:25] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:04:33] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:07:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:12:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:13:55] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:15:47] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:17:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:17:51] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.055 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:19:39] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 2.421 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:20:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:20:27] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:21:33] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1446.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:21:37] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:21:57] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:22:19] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:22:51] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:23:17] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:26:31] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:27:07] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:27:11] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:27:31] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:27:35] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:29:51] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:29:53] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:30:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:30:47] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:32:23] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:34:01] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:34:57] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:35:49] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 5.860 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:35:55] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:36:55] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.577 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:38:35] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:39:41] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:40:57] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1446.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:41:55] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:41:57] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:42:59] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:43:49] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.880 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:43:55] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:46:17] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:47:41] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 5.250 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:47:41] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.187 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:47:55] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:48:49] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:49:43] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:50:21] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:51:21] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:53:53] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:56:29] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:57:23] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1446.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:57:45] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:57:45] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:57:47] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:59:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:00:35] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:09:27] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.714 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:09:35] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 6.314 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:10:03] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:10:11] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:14:17] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:15:23] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:15:45] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:18:41] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:19:42] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:20:09] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:20:33] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.233 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:21:13] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:21:15] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:22:15] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:26:01] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:26:59] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:27:53] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:50:17] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:45:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:04:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:05:28] working on --^ [08:09:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:10:40] !log re-created knative pods misbehaving for ml-serve-codfw (causing latency alerts) [08:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:59] (KubernetesAPILatency) firing: High Kubernetes API latency (GET clusterinformations) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:12:19] PROBLEM - SSH on mw1331.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:14:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET clusterinformations) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:19:57] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:13:17] RECOVERY - SSH on mw1331.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:20:41] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [10:24:45] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:10:09] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:01] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:34:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:39:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:34:39] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:40:39] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:08:20] (03CR) 10Vgutierrez: [C: 04-1] node: Exclude trafficserver promfile mtime check (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/858658 (owner: 10BCornwall) [16:28:48] (03PS1) 10Daniel Kinzler: Set parser cache write propability for /page/html endpoint. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858687 (https://phabricator.wikimedia.org/T322672) [16:32:47] PROBLEM - NTP peers on dns4003 is CRITICAL: NTP CRITICAL: Offset -0.803178 secs (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [16:38:53] RECOVERY - NTP peers on dns4003 is OK: NTP OK: Offset 0.000283 secs https://wikitech.wikimedia.org/wiki/NTP [16:49:47] PROBLEM - SSH on db1118.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:06:41] (03PS30) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:22:35] PROBLEM - Host elastic1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:22:43] PROBLEM - Host elastic1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:23:03] PROBLEM - Host an-presto1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:00:09] RECOVERY - cassandra-b CQL 10.64.16.78:9042 on aqs1017 is OK: TCP OK - 0.000 second response time on 10.64.16.78 port 9042 https://phabricator.wikimedia.org/T93886 [19:43:03] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:53] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:22:01] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:28:09] (03CR) 10Ssingh: [C: 03+2] cp5019: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858671 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [20:29:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS buster [20:29:56] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5019.eqsin.wmnet with OS buster [20:34:31] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:37:25] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:40:29] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:42:53] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:44:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:50:39] PROBLEM - Check systemd state on ms-be1045 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:53:01] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:53:37] RECOVERY - SSH on db1118.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:56:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [20:59:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [20:59:58] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:17] PROBLEM - Check systemd state on ms-be1066 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:51] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:16:25] RECOVERY - Check systemd state on ms-be1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:19] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1066 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:26:15] RECOVERY - Check systemd state on ms-be1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5019.eqsin.wmnet with OS buster [21:31:08] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5019.eqsin.wmnet with OS buster completed: - cp5019 (**PASS**) -... [21:38:45] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1045 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:41:09] !log initiating Cassandra bootstrap, aqs1020-a -- T307802 [21:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:14] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [21:42:03] RECOVERY - cassandra-a SSL 10.64.131.14:7001 on aqs1020 is OK: SSL OK - Certificate aqs1020-a valid until 2024-11-08 15:06:35 +0000 (expires in 719 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:42:31] RECOVERY - cassandra-a service on aqs1020 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:47:12] (03CR) 10Ssingh: [C: 03+2] cp5020: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858672 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [21:48:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS buster [21:48:39] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS buster [21:49:17] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1066 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:54:58] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:59:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:59:58] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:04:03] (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:09:03] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:10] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: / (root with no query params) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:14:03] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:14:25] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:15:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [22:19:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [22:51:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5020.eqsin.wmnet with OS buster [22:52:03] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5020.eqsin.wmnet with OS buster completed: - cp5020 (**PASS**) -... [23:27:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:28:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:34:46] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:41:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48976 bytes in 4.835 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:42:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:43:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring