[00:10:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:33] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947399 [00:38:23] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947399 (owner: 10TrainBranchBot) [00:42:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:25] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:10] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947399 (owner: 10TrainBranchBot) [01:00:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:03:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:10:51] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10leila) Approved on my end. Thank you! [01:23:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:27:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:29:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:31:35] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:48:21] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:49:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:54:25] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:55:57] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:58:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 
211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:02:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:13] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:36] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:35] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:46:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[02:51:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:58:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:14:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:18:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:20:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:24:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:26:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:49:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [03:56:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [03:56:06] 10ops-codfw: 
PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [03:56:08] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [03:56:35] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:58:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:01:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:01:06] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [04:05:49] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:39] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:27:47] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:14:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:27:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:30:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:01:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:06:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:28:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:25] (03CR) 10Ayounsi: [C: 03+1] Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [06:49:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:59:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230813T0700)
[07:09:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:13:17] SRE, Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (Aklapper)
[07:16:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:11] SRE, Wikimedia-Mailing-lists: Shut down two en-arbcom mailing lists (audit, appeals-en) - https://phabricator.wikimedia.org/T344112 (Ladsgroup) a:Ladsgroup Hi, I'll do it, can you link to the exact mailing list you want closed?
[07:56:18] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [07:56:20] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [07:56:22] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [07:56:24] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:01:17] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:01:19] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:01:21] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [08:19:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:13] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:49] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:52:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:38] (03PS11) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [11:21:48] (03PS12) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [11:35:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:17] (03PS13) 
10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [11:35:19] (03PS1) 10Andrew Bogott: backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 [11:38:28] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [11:38:34] (03CR) 10CI reject: [V: 04-1] backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 (owner: 10Andrew Bogott) [11:39:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:01:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:01:06] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:01:08] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:05:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:06:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) 
[12:06:06] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:27:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:45:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:36] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:33] (03PS2) 10Andrew Bogott: backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 [14:20:35] (03PS14) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [14:23:44] (03CR) 10CI reject: [V: 04-1] backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 (owner: 10Andrew Bogott) [14:23:48] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [14:26:17] (03PS15) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [14:29:00] (03CR) 10CI reject: [V: 04-1] wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [14:32:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:33] (03PS3) 10Andrew Bogott: backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 [14:45:35] (03PS16) 10Andrew Bogott: 
wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [14:48:43] (03CR) 10CI reject: [V: 04-1] backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 (owner: 10Andrew Bogott) [14:49:57] (03PS1) 10Andrew Bogott: Add new dummy yaml for backy2/cinder_backups [labs/private] - 10https://gerrit.wikimedia.org/r/948231 (https://phabricator.wikimedia.org/T344065) [14:50:32] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add new dummy yaml for backy2/cinder_backups [labs/private] - 10https://gerrit.wikimedia.org/r/948231 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [14:55:49] (03PS4) 10Andrew Bogott: backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 [14:55:51] (03PS17) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [14:58:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:26] (03PS5) 10Andrew Bogott: backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 [15:11:28] (03PS18) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [15:15:38] (03PS6) 10Andrew Bogott: backy2: split out the instance backup logic into a separate profile [puppet] - 
10https://gerrit.wikimedia.org/r/948228 [15:15:40] (03PS19) 10Andrew Bogott: wip Convert cloudbackup200[12] from cinder-backup nodes to backy2 nodes [puppet] - 10https://gerrit.wikimedia.org/r/946963 (https://phabricator.wikimedia.org/T344065) [15:19:15] (03CR) 10Andrew Bogott: [C: 03+2] backy2: split out the instance backup logic into a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/948228 (owner: 10Andrew Bogott) [15:26:34] !log disable transit and peering links on cr2-esams & cr3-esams before decom T329219 [15:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:38] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [15:33:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:09] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:25] !log Disable transport cct cr2-esams to cr2-eqiad prior to disconnect T329219 [15:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:29] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [15:46:45] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:54:20] !log Disabling esams peering at AMS-IX prior to removing router [15:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:39] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:00:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:01:18] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [16:01:20] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [16:01:22] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [16:05:05] !log powering down cr2-esams [16:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:17] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [16:06:19] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [16:06:21] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [16:06:59] RECOVERY - BFD status on cr3-esams is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:07:43] !log powering down cr3-esams [16:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:07] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:09] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:14] ^^^ these all related to our work on site in esams. 
site depooled and devices downtime but this was an oversight
[16:19:17] apologies for the noise
[16:19:47] <_joe_> !incidents
[16:19:48] 3944 (ACKED) [4x] ProbeDown sre (probes/service esams)
[16:19:56] <_joe_> topranks: thanks
[16:20:00] ack thanks! good luck with the work in the DC
[16:20:36] thanks.. so far so good
[16:22:07] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:30:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:31:17] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:41:15] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T344127 (phaultfinder)
[16:59:02] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (Bugreporter)
[16:59:18] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (Bugreporter) p:Triage→Unbreak!
[17:00:18] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (RhinosF1) Esams is depooled for planned maintenance. Why do you need to reach it?
[17:01:00] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (taavi) p:Unbreak!→Triage
[17:01:04] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (RhinosF1) p:Triage→Unbreak!
[17:01:19] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (RhinosF1) p:Unbreak!→Triage
[17:01:35] taavi: apologies for edit conflict
[17:01:49] no worries, phab could really be a bit smarter about those
[17:02:20] taavi: yes it really could
[17:03:03] That task slightly concerns me that something isn't depooled but from what top.ranks said, esams is basically in the back of a van at this point so very expected to be broken
[17:03:42] The task is effectively an XY problem
[17:03:54] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (Bugreporter) Is it to be replaced with text-lb.knams.wikimedia.org? that is not yet resolvable
[17:05:23] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (RhinosF1) >>! In T344128#9088791, @Bugreporter wrote: > Is it to be replaced with text-lb.knams.wikimedia.org? that is not yet resolvable I'm not sure whether they are renaming the actual domains etc but one of the router is curr...
[17:09:31] SRE, Traffic: esams is broken - https://phabricator.wikimedia.org/T344128 (Bugreporter) Open→Invalid Closed in favor of {T344129} [17:10:23] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (RhinosF1) [17:11:17] taavi: they just closed the ticket [17:11:25] Still confused as to why it was opened in the first place [17:12:02] And thank you to everyone in SRE and DC Ops that are working today to facilitate the work [17:13:09] PROBLEM - Recursive DNS on 2620:0:862:1:91:198:174:61 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:13:15] PROBLEM - Recursive DNS on 91.198.174.62 is CRITICAL: DNS_QUERY CRITICALError response or zero answers: https://wikitech.wikimedia.org/wiki/DNS [17:13:37] PROBLEM - Recursive DNS on 2620:0:862:1:91:198:174:62 is CRITICAL: DNS_QUERY CRITICALError response or zero answers: https://wikitech.wikimedia.org/wiki/DNS [17:13:43] PROBLEM - Recursive DNS on 91.198.174.61 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:13:48] dns3001 and dns3002, so expected [17:16:33] downtimed [17:17:09] Thanks sukhe [17:28:47] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:25:03] RECOVERY - Check systemd state on maps2006 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:29:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:46:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:58:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:16:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:31] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:02] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (phaultfinder) [20:06:04] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (phaultfinder) [20:06:06] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (phaultfinder) [20:06:08] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (phaultfinder) [20:11:03] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (phaultfinder) [20:11:05] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (phaultfinder) [20:11:07] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (phaultfinder) [20:18:09] (ProbeDown) firing: 
(14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:31] ops-codfw, DBA: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (Marostegui) Did we just lose a pdu? [20:22:07] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:26:50] Sry… just finished in the dc will sort those alarms out properly when back to hotel [20:27:17] all good, resolved! [20:27:38] or acked [20:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [21:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [21:37:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:00:09] PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Active - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - 
Anycast, AS64600/IPv4: Active - [22:00:09] S64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:09:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:10:33] RECOVERY - BFD status on cr2-esams is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:11:27] RECOVERY - Recursive DNS on 2620:0:862:1:91:198:174:61 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [22:11:43] RECOVERY - Recursive DNS on 91.198.174.62 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [22:11:59] RECOVERY - Recursive DNS on 2620:0:862:1:91:198:174:62 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [22:12:07] RECOVERY - Recursive DNS on 91.198.174.61 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [22:12:07] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:12:21] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 12, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:12:31] RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 12, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:13:10] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:33] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state