[00:00:38] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:03:28] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:12:02] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:18:16] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:21:06] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:23:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:29:58] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:50] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:34:48] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:37:38] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:38:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:26] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:44:16] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:46:14] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:48:44] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:51:56] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:58:12] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:00:08] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:00:50] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:07:04] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:08:40] PROBLEM - OSPF status on cr2-eqsin is 
CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:09:04] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:10:52] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:18:30] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:20:30] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:25:40] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:28:54] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:34:10] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:37:02] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:37:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [01:37:45] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:40:14] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:42:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:45:38] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:48:28] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:48:48] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:00:00] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:00:42] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:02:34] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:03:04] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:08:24] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:14:08] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:15:24] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:17:54] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:22:40] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:23:54] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:28:46] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:34:04] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:34:28] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:42:40] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:43:56] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:45:56] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:46:46] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:54:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:02:42] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:03:56] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:05:34] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:08:48] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:09:18] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:09:38] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:12:28] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:14:30] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:15:00] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:23:04] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:25:54] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:27:16] PROBLEM - SSH on kubernetes2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:31:16] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:34:30] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:37:22] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:42:38] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:43:36] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:45:54] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:46:28] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:48:44] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:54:44] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:59:16] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:01:14] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:56] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:11:16] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:12:16] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:46] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:21:18] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:27:56] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:29:02] RECOVERY - SSH on kubernetes2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:33:22] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:35:36] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:36:14] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, 
excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:44:44] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:08:10] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:32:44] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:33:20] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:37:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org [05:39:24] 10SRE, 10Znuny, 10serviceops, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Peachey88) [05:41:22] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:22] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:48:12] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:49:48] PROBLEM - OSPF status on cr2-eqsin 
is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:49:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:55:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:58:38] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:06:52] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:09:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:12:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:13:58] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:16:50] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:19:56] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:21:12] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:21:34] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:24:24] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:28:12] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:29:38] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:33:36] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:35:02] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:35:22] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:35:42] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:37:52] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:41:02] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:47:24] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:47:42] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
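The repeating OSPF CRITICAL/RECOVERY pairs above all follow one pattern: the check reports the v2 and v3 neighbour counts and goes CRITICAL either when a neighbour is down or when the count of v2 point-to-point interfaces disagrees with the v3 count. A minimal Python sketch of that criterion (hypothetical function name; the message format is copied from the log, and this is not the actual Icinga plugin):

```python
# Sketch of the alert criterion seen in the log (assumed logic, not the
# real check_ospf plugin): CRITICAL if any neighbour is down, or if the
# OSPFv2 and OSPFv3 P2P interface totals differ.
def ospf_state(v2_up, v2_total, v3_up, v3_total):
    parts = [f"OSPFv2: {v2_up}/{v2_total} UP", f"OSPFv3: {v3_up}/{v3_total} UP"]
    ok = v2_up == v2_total and v3_up == v3_total and v2_total == v3_total
    if v2_total != v3_total:
        # Mismatched interface counts get called out explicitly, as in the log.
        parts.append(f"{v2_total} v2 P2P interfaces vs. {v3_total} v3 P2P interfaces")
    return ("OK" if ok else "CRITICAL"), " : ".join(parts)
```

With the counts from the 06:35:22 cr4-ulsfo alert, `ospf_state(4, 4, 3, 3)` reproduces the CRITICAL state even though every neighbour is up, because only 3 of the 4 P2P interfaces are running OSPFv3.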
[06:50:18] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:50:34] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:52:24] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220320T0700) [07:00:38] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:09:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:09:30] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:09:50] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:14:54] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:18:48] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:21:38] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:21:44] PROBLEM - BGP status on cr2-esams is 
CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:29:24] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300775)', diff saved to https://phabricator.wikimedia.org/P22846 and previous config saved to /var/cache/conftool/dbconfig/20220320-073150-marostegui.json [07:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:56] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:32:14] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:32:36] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:34:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:18] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:56] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:40:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22847 and previous config saved to /var/cache/conftool/dbconfig/20220320-074655-marostegui.json [07:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:02] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:56:00] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:57:24] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22848 and previous config saved to /var/cache/conftool/dbconfig/20220320-080200-marostegui.json [08:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:22] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:04:36] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:05:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:46] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[08:09:00] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:09:38] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:10:04] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:10:20] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:14:40] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:16:41] downtimed for a day the cr2-eqsin and cr4-ulsfo alerts [08:16:58] Cc: XioNoX --^ [08:17:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300775)', diff saved to https://phabricator.wikimedia.org/P22849 and previous config saved to /var/cache/conftool/dbconfig/20220320-081705-marostegui.json [08:17:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [08:17:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [08:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:10] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [08:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T300775)', diff saved to https://phabricator.wikimedia.org/P22850 and previous config saved to 
/var/cache/conftool/dbconfig/20220320-081713-marostegui.json
[08:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:30] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:26:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:02:50] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:25:20] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:37:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[09:47:56] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:08:38] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:14:40] PROBLEM - SSH on thumbor2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:49:38] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:16:26] RECOVERY - SSH on thumbor2003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:07:54] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestCl
[12:07:54] apt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[12:16:22] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[12:49:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:44] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:01] 10SRE, 10Gerrit, 10GitLab, 10Horizon, and 2 others: Take some pointers from GitHub security updates - https://phabricator.wikimedia.org/T304231 (10Reedy)
[13:37:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[15:20:22] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:20] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:16] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:54:51] PROBLEM - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Certificate appservers.svc.eqiad.wmnet expires in 7 day(s) (Mon 28 Mar 2022 04:54:40 PM GMT +0000). https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:55:21] here - do we have a problem?
[16:55:45] Seems like the cert just needs to be renewed?
[16:55:47] PROBLEM - LVS appservers-https codfw port 443/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet -https- IPv4 #page on appservers.svc.codfw.wmnet is CRITICAL: CRITICAL - Certificate appservers.svc.eqiad.wmnet expires in 7 day(s) (Mon 28 Mar 2022 04:54:40 PM GMT +0000). https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:56:22] But that could be done tomorrow..
[16:56:44] Yes. Sorry, saw the CRITICAL in my MUA alerts, came here to say hi before actually reading it /o\
[16:57:27] I wonder if I can downtime it until first thing tomorrow
[16:57:53] hi
[16:57:59] Yup... it doesn't seem THAT critical
[16:58:06] oh weird, yeah
[16:58:14] * akosiaris around
[16:58:50] wait, this is a cert alert? and it paged ?
[16:59:00] And it's not immediately clear from the description
[16:59:24] akosiaris: It came through via LVS
[16:59:36] LVS monitoring*
[16:59:53] Yes, looks like I can ACK until 09:00 UTC on Monday. I'll do that, unless anyone wants to object
[16:59:54] yeah, maybe that alert is a bit too smart
[16:59:56] * volans|off here
[17:00:25] Emperor: +1, a cert that expires in 7d can wait until Monday
[17:00:35] Emperor: sgtm
[17:00:41] yes but... does this downtime the alert for appservers too?
[17:00:43] Emperor: maybe a bit later than 9?
[17:01:01] I'm wondering if it's the http_check that checks also the availability
[17:01:05] taavi: 9 UTC is midmorning for a bunch of our team :)
[17:01:12] * volans checking
[17:01:13] as long as someone starts working on it tomorrow it probably shouldn't page again
[17:01:34] ACKNOWLEDGEMENT - LVS appservers-https codfw port 443/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet -https- IPv4 #page on appservers.svc.codfw.wmnet is CRITICAL: CRITICAL - Certificate appservers.svc.eqiad.wmnet expires in 7 day(s) (Mon 28 Mar 2022 04:54:40 PM GMT +0000). MVernon Cert expiry needs doing, but can wait til Monday - The acknowledgement expires at: 2022-03-21 10:00:00. https://wikitech.wikimedia.
[17:01:34] /LVS%23Diagnosing_problems
[17:01:35] ACKNOWLEDGEMENT - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Certificate appservers.svc.eqiad.wmnet expires in 7 day(s) (Mon 28 Mar 2022 04:54:40 PM GMT +0000). MVernon Cert expiry needs doing, but can wait til Monday - The acknowledgement expires at: 2022-03-21 10:00:00. https://wikitech.wikimedia.
[17:01:35] /LVS%23Diagnosing_problems
[17:02:04] darn it, s/expiry/renewal/ oh well
[17:02:06] Love how the acks page as well :)
[17:02:22] so now everyone gets a copy of my typo
[17:02:54] so yes, this is the actual alert for LVS appservers-https codfw port 443/tcp
[17:03:00] that does
[17:03:01] check_https_url!en.wikipedia.org!/wiki/Special:BlankPage
[17:03:06] if we downtime it we're blind
[17:03:09] until tomorrow
[17:03:31] volans: you think my ACK is bad, then?
[17:03:33] * volans checking if we can make it ignore the cert
[17:06:11] (obviously if I un-ACK it everyone will get emailed again (again))
[17:06:20] so the definition for check_https_url calls check_http with -C9,7
[17:06:56] -C, --certificate=INTEGER[,INTEGER] Minimum number of days a certificate has to be valid.
[17:07:08] also it's interesting that it says "(when this option is used the URL is not checked.)"
[17:08:04] Emperor: let me do a quick hotfix patch to duplicate that command and change the critical value to 5 days
[17:08:32] and then we can probably remove the downtime and re-discuss details tomorrow during working hours
[17:08:36] volans: cool, thanks
[17:11:57] fwiw the api.svc.{eqiad,codfw}.wmnet cert will expire on the same day, a couple hours earlier
[17:12:06] we just don't have a check_https_url for it, I guess
[17:12:28] cert-fettling definitely feels like a Monday thing :)
[17:12:44] TIL the word fettling :)
[17:12:56] and yeah, the reason I was checking was if it was a couple hours *later* I didn't want to get paged again
[17:13:07] rzl: the interesting part is that we're not checking the URL in the command
[17:15:06] oh nice... now to use the new check... it's tricky
[17:15:14] because the data comes from services.yaml directly
[17:15:23] volans: yeah, and it looks like we *do* also have check_https_url!en.wikipedia.org!/w/api.php?action=query&meta=siteinfo
[17:15:36] pointed at api.svc.%{::site}.wmnet
[17:15:42] so I'm not sure why that didn't page too
[17:16:13] it has not the -C
[17:16:17] ah same
[17:16:18] sorry
[17:16:20] not sure
[17:16:58] patch in 3 minutes
[17:17:01] \o/
[17:17:09] 👍
[17:17:21] writing a phab task to sort this all out tomorrow
[17:17:27] thx
[17:17:44] sound plan :)
[17:18:00] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 865540 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[17:22:49] 10SRE, 10Infrastructure-Foundations, 10serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10RLazarus) p:05Triage→03High
[17:22:59] volans: T304237 if your patch wants a bug number
[17:22:59] T304237: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237
[17:23:05] rzl: thx
[17:23:21] (03PS1) 10Volans: Mediawiki appservers-https check: temporary tweak [puppet] - 10https://gerrit.wikimedia.org/r/772032 (https://phabricator.wikimedia.org/T304237)
[17:23:22] here the patch ^^^
[17:23:23] that's a bit rough, to be refined but feel free to edit
[17:23:36] volans: looking
[17:24:41] volans: sound approach
[17:24:46] (03CR) 10RLazarus: [C: 03+1] Mediawiki appservers-https check: temporary tweak [puppet] - 10https://gerrit.wikimedia.org/r/772032 (https://phabricator.wikimedia.org/T304237) (owner: 10Volans)
[17:25:09] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/772032 (https://phabricator.wikimedia.org/T304237) (owner: 10Volans)
[17:25:10] make sure you don't send it to volans for review, he won't let you get away with something like this
[17:25:17] lol
[17:25:19] * Emperor +1
[17:25:38] lol
[17:25:46] running pcc just in case
[17:26:09] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10RLazarus)
[17:26:10] added the revert to that task, so we don't lose track
[17:28:43] (03CR) 10Volans: [C: 03+2] "PCC looks ok to me, merging:" [puppet] - 10https://gerrit.wikimedia.org/r/772032 (https://phabricator.wikimedia.org/T304237) (owner: 10Volans)
[17:28:49] I've added the link to the task
[17:28:54] to the patch
[17:28:56] sorry
[17:29:16] puppet-merging and running puppet on alert1001
[17:30:28] when that's done, I'll reschedule a check for the alerts I ack'd
[17:30:55] Emperor: they already run every minute
[17:31:16] both check_interval and retry_interval are 1
[17:31:17] <-- definitely a patient person ;-)
[17:32:18] puppet run completed just now
[17:33:43] one back to warning
[17:34:13] I'll remove the Ack from the warning one
[17:34:40] both back to warning to me
[17:34:53] likewise.
[17:35:30] OK, both Acks cleared, so I think we're done here?
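The `-C9,7` argument discussed above pairs two day counts with the two plugin severities; assuming the usual monitoring-plugins convention (warn below the first value, critical below the second, exit codes 0/1/2), the threshold logic can be sketched like this (the `cert_status` helper is hypothetical, not part of check_http):

```shell
#!/bin/sh
# Sketch of check_http -C WARN,CRIT threshold handling (assumed semantics:
# -C9,7 = warn under 9 days, critical under 7). Not the real plugin code;
# the exact boundary behaviour of check_http may differ.
cert_status() {
    # $1 = days until expiry, $2 = warning threshold, $3 = critical threshold
    if [ "$1" -lt "$3" ]; then
        echo "CRITICAL - certificate expires in $1 day(s)"
        return 2   # plugin exit code for CRITICAL
    elif [ "$1" -lt "$2" ]; then
        echo "WARNING - certificate expires in $1 day(s)"
        return 1   # plugin exit code for WARNING
    fi
    echo "OK - certificate valid for $1 more day(s)"
    return 0       # plugin exit code for OK
}

# With the hotfix lowering the critical value to 5 days, 7 days remaining
# lands in the warning band instead of paging:
cert_status 7 9 5 || true
```

This also shows why the hotfix worked: dropping the critical threshold from 7 to 5 moved the "expires in 7 day(s)" state from CRITICAL down to WARNING.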
[17:35:39] +1
[17:36:11] I think so
[17:36:20] thanks all - see you tomorrow
[17:37:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[17:37:45] oh, I'm resolving in VO too
[17:37:56] Around but on phone
[17:38:34] Amir1: you can go back to your sunday, just a certificate expiration become critical, can wait tomorrow
[17:38:40] rzl: thanks
[17:38:55] Awesome
[17:39:34] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) As for: > Evaluate whether there's a better monitoring strategy for future expiries than paging on Sunday :) I totally agr...
[17:58:16] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) As for: > Identify whether there are any more about to expire, for which we weren't alerted I've done a quick check and se...
[18:04:23] Emperor, rzl: I'm modifying the check for the apis to use the _tmp version too, something wrong over there in icinga-land, it should have paged too
[18:04:32] (in case you're still around)
[18:18:47] * Emperor didn't see another page
[18:19:02] I now know wjy
[18:19:03] *why
[18:19:13] Ah; do you want some more code review?
[18:22:21] (03PS1) 10Volans: Mediawiki api-https check: temporary tweak [puppet] - 10https://gerrit.wikimedia.org/r/772034 (https://phabricator.wikimedia.org/T304237)
[18:22:25] Emperor: if you're around ^^
[18:23:23] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) As for: > We didn't get a matching page, but the cert for api.svc.[site].wmnet will also expire Monday, 2022-03-28 14:39:16...
[18:23:47] 👀
[18:24:21] more details in the task
[18:25:32] (03CR) 10Volans: "PCC results:" [puppet] - 10https://gerrit.wikimedia.org/r/772034 (https://phabricator.wikimedia.org/T304237) (owner: 10Volans)
[18:25:51] (03CR) 10MVernon: [C: 03+1] "Good catch, I'm always keen on quoting..." [puppet] - 10https://gerrit.wikimedia.org/r/772034 (https://phabricator.wikimedia.org/T304237) (owner: 10Volans)
[18:26:19] to be clear, I don't want to modify the base check today because something might start alerting
[18:26:38] Yeah, we don't need any more cert-related pages on Sunday evening
[18:26:54] that wouldn't happen, at least for the puppet ones, I've checked them all (see task)
[18:26:57] but yeah
[18:27:01] anything else can happen
[18:27:02] :D
[18:27:12] (03CR) 10Volans: [C: 03+2] Mediawiki api-https check: temporary tweak [puppet] - 10https://gerrit.wikimedia.org/r/772034 (https://phabricator.wikimedia.org/T304237) (owner: 10Volans)
[18:30:23] thanks a lot for the review
[18:30:29] * volans monitoring icinga
[18:30:34] puppet run just completed
[18:33:55] there we go, now one is in warning
[18:34:06] 👍
[18:35:33] both of them now (yes I cheated, I forced a recheck, the next scheduled one seemed too off in the future, I hope was an artifact of the icinga refresh)
[18:35:54] but don't want to start going down another rabbit hole
[18:37:21] PROBLEM - SSH on kubernetes2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:37:51] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) And now both checks for api are in warning, as they should be, added item to the task description.
[18:42:16] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans)
[18:42:39] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans)
[18:42:56] * volans done for today and this task :)
[18:53:55] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:53:55] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:56:25] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:56:25] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:02:51] volans: amazing, thank you
[19:31:39] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:38:59] RECOVERY - SSH on kubernetes2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:58:29] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:01:19] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:53] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:49:25] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:17] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms
[21:07:55] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:19] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:24:55] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:35:11] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:37:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[22:17:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[22:23:17] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:25:51] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:26:25] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:26:41] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:27:39] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:15] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:30:29] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:34:21] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:40:35] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimed
[22:40:35] iki/PyBal
[22:41:05] I guess someone is hammering it
[22:42:49] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:45:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[22:50:13] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:53:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300775)', diff saved to https://phabricator.wikimedia.org/P22853 and previous config saved to /var/cache/conftool/dbconfig/20220320-225835-marostegui.json
[22:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:42] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[23:00:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[23:10:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org
[23:13:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22854 and previous config saved to /var/cache/conftool/dbconfig/20220320-231340-marostegui.json
[23:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22855 and previous config saved to /var/cache/conftool/dbconfig/20220320-232845-marostegui.json
[23:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300775)', diff saved to https://phabricator.wikimedia.org/P22856 and previous config saved to /var/cache/conftool/dbconfig/20220320-234350-marostegui.json
[23:43:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[23:43:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[23:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:56] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[23:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T300775)', diff saved to https://phabricator.wikimedia.org/P22857 and previous config saved to /var/cache/conftool/dbconfig/20220320-234358-marostegui.json
[23:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:07] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook