[00:16:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:17:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:33:55] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder) [00:42:02] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:07] (PS1) Stang: kowiki: Add logo (legacy vector and vector-2022) for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822717 (https://phabricator.wikimedia.org/T315127) [00:42:56] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:27] (PS1) Stang: kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718 (https://phabricator.wikimedia.org/T315127) [00:58:42] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:02:55] (PS2) Stang: kowiki: Add logo (legacy vector and vector-2022) for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822717 (https://phabricator.wikimedia.org/T315127) [01:03:21] (PS2) Stang: kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718 (https://phabricator.wikimedia.org/T315127) [01:04:06] (CR) CI reject: [V: -1] kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718
(https://phabricator.wikimedia.org/T315127) (owner: Stang) [01:08:32] (PS3) Stang: kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718 (https://phabricator.wikimedia.org/T315127) [01:30:52] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:39:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:54] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:49:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:08] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:50:28] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:52:02] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:52:50] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [02:01:34] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:09:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:52] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [02:13:18] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:40] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:32] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [02:17:58] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:19:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:18] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:24] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:12] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [02:31:28] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [02:31:54] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:08] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:58] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:59] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:59:06] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:16] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:03:10] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:03:46] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:14:44] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units 
failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:08] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:37:30] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (Andrew) Open→Resolved all good now! [03:40:58] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:38] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:57:04] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:57:22] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:57:46] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:02:04] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:28] PROBLEM - Check
systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:14:16] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:15:58] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:16:12] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:04] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:24] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:32:14] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:34:38] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 10 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:54:22] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:29:36] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:00] RECOVERY - 
Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:50] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:46] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [05:55:08] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [05:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [06:00:24] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:41:36] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220814T0700) [07:06:38] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:17:50] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:12] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:35:34] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:54:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:54:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:54:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32380 and previous config saved to /var/cache/conftool/dbconfig/20220814-085443-ladsgroup.json [08:54:47] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [08:57:28] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:40] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - 
OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [10:13:06] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:28] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:27:28] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:14] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:56] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:19:13] (PS1) Urbanecm: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/822725 (https://phabricator.wikimedia.org/T315182) [11:22:57] (CR) Urbanecm: [C:
+1] "change would work, but 'w' is the more standard prefix used in import sources." [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil) [11:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:28:50] (CR) Urbanecm: [C: -1] "extension needs to be present in at least two trains to be addable to extension-list (scap sync-world breaks otherwise). looks to be only " [mediawiki-config] - https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: Samtar) [11:46:06] (CR) Urbanecm: [C: -1] "Personally speaking, I don't like w.wiki URLs in user agents. Contact/link info in user agents is usually important when the infrastructur" [mediawiki-config] - https://gerrit.wikimedia.org/r/821246 (owner: Samtar) [11:59:01] (CR) JMeybohm: Create basic haproxy container (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/821275 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [12:09:18] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:26] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:14] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:18] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:14] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:22:40] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder) [13:32:08] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:44] (PS12) MdsShakil: Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) [13:40:17] (CR) MdsShakil: Add bnwiki in wgImportSources to bnwikibooks (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil) [13:40:49] (CR) MdsShakil: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/819071 (https://phabricator.wikimedia.org/T314820) (owner: MdsShakil) [13:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:03:48] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder) [14:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder) [15:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:03:57] (PS3) Urbanecm: Pin the reason migration stage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz) [16:04:28] (CR) Urbanecm: [C: +1] "LGTM, will merge tomorrow."
[mediawiki-config] - https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz) [16:05:12] (PS4) Urbanecm: Pin wgCheckUserLogReasonMigrationStage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz) [16:09:10] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:06] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:24:56] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [16:34:04] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:04] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [17:15:26] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:21:50] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:23:09] (CR) BryanDavis: Introduce DriverInterface (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699719 (owner: Giuseppe Lavagetto) [17:30:18] (PS1) BryanDavis: Add missing attrs dependency [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/822734 [17:31:02] (CR) BryanDavis: Introduce DriverInterface (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699719 (owner: Giuseppe Lavagetto) [17:32:58] (CR) CI reject: [V: -1] Add missing attrs dependency [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/822734 (owner: BryanDavis) [17:50:05] (PS1) David Caro: openstack: update control nodes after refresh [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/822735 [17:50:07] (PS1) David Caro: wmcs.quota_increase: fix not needed parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/822736 [17:55:13] (ThanosCompactIsDown) firing: Thanos component
has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [17:56:44] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:59:10] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:34] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:08] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:36] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 0.83 ms [18:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:26:03] (CR) Andrew Bogott: "Apologies!
I duplicated most of this patch in https://gerrit.wikimedia.org/r/c/operations/puppet/+/822145" [puppet] - https://gerrit.wikimedia.org/r/800949 (owner: Majavah) [18:26:51] (PS5) Andrew Bogott: P:openstack::glance: tidy up monitoring params [puppet] - https://gerrit.wikimedia.org/r/800949 (owner: Majavah) [18:27:42] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:30:53] (CR) Andrew Bogott: [C: +2] P:openstack::glance: tidy up monitoring params [puppet] - https://gerrit.wikimedia.org/r/800949 (owner: Majavah) [18:31:40] (PS5) Andrew Bogott: openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950 (owner: Majavah) [18:34:33] (CR) Andrew Bogott: [C: +2] openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950 (owner: Majavah) [18:34:55] (PS5) Andrew Bogott: openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951 (owner: Majavah) [18:36:43] (CR) Andrew Bogott: [C: +2] openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951 (owner: Majavah) [18:37:29] (PS5) Andrew Bogott: P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [19:12:32] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:35] (CR) Andrew Bogott: [C: +2] P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [19:19:06] (PS5) Andrew Bogott: P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [19:22:46] RECOVERY -
Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [19:26:21] (CR) Andrew Bogott: [C: +2] P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [19:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:36:36] (PS5) Andrew Bogott: P:openstack::designate::firewall: cleanup [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [19:39:18] (CR) Andrew Bogott: [C: +2] "diff lgtm https://puppet-compiler.wmflabs.org/pcc-worker1003/36736/cloudcontrol1005.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [19:42:41] (PS5) Andrew Bogott: P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [19:49:02] (CR) Andrew Bogott: [C: +2] P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [20:01:41] (PS1) Andrew Bogott: profile::openstack::base::designate::firewall::api: add a missing ) [puppet] - https://gerrit.wikimedia.org/r/822738 (https://phabricator.wikimedia.org/T267194) [20:04:51] (CR) Andrew Bogott: [C: +2] profile::openstack::base::designate::firewall::api: add a missing ) [puppet] - https://gerrit.wikimedia.org/r/822738 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott) [20:13:44] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:32] PROBLEM -
SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:53:56] (CR) Samtar: extension-list: Add Phonos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: Samtar) [21:23:52] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder) [21:25:48] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:38:48] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 11.90 ms [21:47:52] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:55:13] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [22:06:52] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [22:20:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:26:30] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:26:36] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:02] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:31:20] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:58] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.01 ms [22:45:12] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:19:20] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:23:09] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Base) [23:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:45:30] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms