[00:39:23] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1257598
[00:39:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1257598 (owner: TrainBranchBot)
[00:51:45] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1257598 (owner: TrainBranchBot)
[01:09:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1257613
[01:09:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1257613 (owner: TrainBranchBot)
[01:17:10] <jinxer-wm>	 FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[01:19:54] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1257613 (owner: TrainBranchBot)
[01:45:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:46:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[01:47:15] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[01:50:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:54:39] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:55:11] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.69 ms
[01:55:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[01:56:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[01:57:15] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[01:57:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:58:15] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[01:59:51] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:59:59] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 110.80 ms
[02:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[02:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[02:01:02] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:03:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:03:27] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:03:39] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:03:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:03:47] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:04:03] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.74 ms
[02:04:37] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 77%, RTA = 110.55 ms
[02:05:11] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[02:06:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:06:41] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:06:50] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:06:53] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:06:59] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:07:11] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.64 ms
[02:09:07] <icinga-wm>	 PROBLEM - SSH on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:09:07] <icinga-wm>	 PROBLEM - Wikidough DoT Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:09:07] <icinga-wm>	 PROBLEM - Wikidough DoT Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:09:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:23] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 21s)
[02:10:05] <icinga-wm>	 RECOVERY - Wikidough DoT Check -IPv4- on doh7003 is OK: TCP OK - 7.499 second response time on 195.200.68.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:11:05] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:11:05] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:11:13] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[02:11:17] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:11:35] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.49 ms
[02:12:15] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 71%, RTA = 110.98 ms
[02:12:15] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:12:27] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100%
[02:12:27] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:12:31] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:13:01] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.85 ms
[02:13:17] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING WARNING - Packet loss = 71%, RTA = 111.01 ms
[02:13:17] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms
[02:14:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:14:59] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:15:19] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 80%, RTA = 110.97 ms
[02:15:43] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:15:59] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:15:59] <icinga-wm>	 RECOVERY - SSH on doh7003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:16:07] <icinga-wm>	 PROBLEM - Wikidough DoT Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:16:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:16:21] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms
[02:16:50] <jinxer-wm>	 RESOLVED: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:18:01] <icinga-wm>	 RECOVERY - Wikidough DoT Check -IPv4- on doh7003 is OK: TCP OK - 3.264 second response time on 195.200.68.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:18:39] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 66%, RTA = 110.46 ms
[02:19:06] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:19:17] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 75%, RTA = 110.85 ms
[02:19:55] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:20:27] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 110.92 ms
[02:21:07] <icinga-wm>	 PROBLEM - SSH on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:21:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:21:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:21:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:21:27] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 75%, RTA = 110.97 ms
[02:21:59] <icinga-wm>	 RECOVERY - Wikidough DoT Check -IPv6- on doh7003 is OK: TCP OK - 0.233 second response time on 2a02:ec80:700:3:195:200:68:98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:22:07] <icinga-wm>	 PROBLEM - Wikidough DoT Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:22:17] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:22:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:22:41] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:22:41] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:23:31] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 71%, RTA = 110.44 ms
[02:24:03] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100%
[02:24:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:33] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.82 ms
[02:26:29] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:26:35] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 111.00 ms
[02:26:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:27:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:27:59] <icinga-wm>	 RECOVERY - Wikidough DoT Check -IPv4- on doh7003 is OK: TCP OK - 1.250 second response time on 195.200.68.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:28:29] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:28:37] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.77 ms
[02:29:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:30:09] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:30:17] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 3.495 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:30:29] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING WARNING - Packet loss = 90%, RTA = 110.96 ms
[02:30:38] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:15] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:31:41] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 66%, RTA = 110.91 ms
[02:32:37] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:32:45] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms
[02:33:35] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 66%, RTA = 111.11 ms
[02:34:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:34:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:34:31] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:35:01] <icinga-wm>	 RECOVERY - SSH on doh7003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:35:07] <icinga-wm>	 PROBLEM - Wikidough DoT Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:35:38] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:41] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:35:49] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 33%, RTA = 111.00 ms
[02:36:39] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:36:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:36:49] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.46 ms
[02:37:25] <icinga-wm>	 PROBLEM - Wikidough DoH Check -IPv4- on doh7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:37:31] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100%
[02:38:15] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv4- on doh7004 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:40:01] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 110.93 ms
[02:42:41] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:42:59] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 111.09 ms
[02:43:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:43:59] <icinga-wm>	 RECOVERY - Wikidough DoT Check -IPv6- on doh7003 is OK: TCP OK - 1.263 second response time on 2a02:ec80:700:3:195:200:68:98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:44:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:44:17] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 1.470 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:44:33] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:47] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 33%, RTA = 110.98 ms
[02:45:15] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 50%, RTA = 110.93 ms
[02:45:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:47:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:47:43] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:48:05] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms
[02:48:07] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms
[02:48:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:48:15] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[02:48:21] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:49:27] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh7004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[02:49:59] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:50:00] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:03] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doh7003.wikimedia.org with reason: depooled host
[02:50:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms
[02:50:26] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doh7004.wikimedia.org with reason: depooled host
[02:51:21] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:52:11] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.46 ms
[02:52:11] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms
[02:52:11] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms
[02:53:11] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:54:03] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:54:15] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.47 ms
[02:55:27] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:55:27] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:55:37] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 110.97 ms
[02:55:41] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:59] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[02:58:21] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.04 ms
[02:58:41] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[02:59:39] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:59:53] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms
[03:00:29] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:02:25] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms
[03:03:39] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:04:19] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms
[03:04:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:05:05] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:05:21] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:31] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms
[03:06:31] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 80%, RTA = 110.90 ms
[03:07:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:07:33] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms
[03:09:37] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 111.08 ms
[03:09:37] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.55 ms
[03:12:01] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:01] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:07] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:14:35] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:43] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms
[03:19:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:25] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:21:53] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms
[03:22:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:55] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms
[03:23:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:24:43] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:57] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.72 ms
[03:26:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:29:43] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:30:07] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.79 ms
[03:32:11] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.55 ms
[03:32:11] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms
[03:33:11] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:34:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:35:01] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:35:11] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms
[03:35:53] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:36:13] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms
[03:42:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:44:17] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:44:27] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.81 ms
[03:45:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:48:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:49:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:49:47] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:50:33] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 111.03 ms
[03:51:37] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms
[03:52:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:53:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:54:29] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:41] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.06 ms
[03:57:37] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:57:43] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms
[03:59:32] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:00:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:00:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:01:41] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:01:49] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms
[04:02:43] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[04:03:37] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:04:35] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:04:53] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:05:57] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms
[04:06:43] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:06:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:06:57] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms
[04:09:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:10:53] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:11:03] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms
[04:11:11] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:12:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:12:57] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:13:05] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 77%, RTA = 110.55 ms
[04:13:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:13:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms
[04:14:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:14:23] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:15:48] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 - https://phabricator.wikimedia.org/T420833 (10AlexisJazz) 03NEW
[04:16:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:17:11] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[04:17:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:18:03] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:18:11] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms
[04:22:29] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:23:19] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.87 ms
[04:23:43] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:25:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:26:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:26:25] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms
[04:26:55] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 - https://phabricator.wikimedia.org/T420833#11735592 (10AlexisJazz)
[04:27:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:27:15] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:27:29] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.78 ms
[04:29:29] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms
[04:30:05] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:30:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:31:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:32:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:33:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:33:47] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[04:33:55] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11735594 (10AlexisJazz)
[04:34:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:34:35] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 60%, RTA = 110.54 ms
[04:37:55] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:38:45] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[04:39:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:41:37] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:42:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:43:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:43:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:43:51] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms
[04:44:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:45:39] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:51] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms
[04:46:43] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[04:49:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:50:35] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 60%, RTA = 110.60 ms
[04:51:59] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 80%, RTA = 111.02 ms
[04:53:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:53:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:54:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:54:49] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:55:05] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[04:55:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:56:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:56:48] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11735601 (10AlexisJazz) Seems to work much better now. I'll leave the task open for a little while in case someone wants to comment on the cause.
[04:58:37] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:59:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:01:09] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:01:11] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:02:05] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:02:15] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.85 ms
[05:03:31] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:04:17] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 60%, RTA = 110.93 ms
[05:04:24] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:04:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:17] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:06:19] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 80%, RTA = 110.95 ms
[05:06:23] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms
[05:07:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:07:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:07:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:09:05] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:09:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:11:15] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:11:27] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms
[05:12:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:14:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:14:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:15:19] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:15:35] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[05:16:21] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:16:35] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms
[05:16:37] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 111.01 ms
[05:17:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:17:37] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms
[05:17:37] <jinxer-wm>	 FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[05:19:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:54] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:24:43] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:24:49] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms
[05:25:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:27:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:29:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:30:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:30:55] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms
[05:31:41] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:31:55] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 66%, RTA = 110.61 ms
[05:33:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:33:29] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:33:49] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:33:59] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 77%, RTA = 110.54 ms
[05:33:59] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[05:34:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:57] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:35:03] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.17 ms
[05:35:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:36:01] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.77 ms
[05:36:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:36:51] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:37:03] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.04 ms
[05:37:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:38:01] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:38:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:39:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:21] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:41:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 33%, RTA = 110.91 ms
[05:41:11] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.63 ms
[05:41:11] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:42:09] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:43:03] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:43:13] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms
[05:44:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:47:13] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:47:17] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 75%, RTA = 111.01 ms
[05:47:19] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[05:48:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:48:21] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms
[05:48:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:49:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:49:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:49:21] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.93 ms
[05:50:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:51:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:54:15] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:54:29] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms
[05:55:04] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:55:25] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:55:31] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.80 ms
[05:55:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:56:27] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:56:35] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[05:57:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:58:47] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:59:21] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[05:59:35] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 33%, RTA = 110.49 ms
[05:59:35] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms
[06:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[06:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[06:01:17] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:01:37] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms
[06:03:19] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:03:41] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[06:05:33] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:05:47] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms
[06:06:37] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:06:47] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[06:09:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:47] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:10:51] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms
[06:11:21] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:13:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:14:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:14:43] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:57] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.88 ms
[06:15:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:16:51] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:16:59] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 50%, RTA = 110.91 ms
[06:18:11] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:18:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:18:53] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:18:58] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:19:05] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms
[06:19:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:21:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:23:19] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:24:01] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:24:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 110.80 ms
[06:24:11] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.56 ms
[06:24:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:28:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:28:17] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.87 ms
[06:29:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:29:17] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms
[06:29:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:32:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:33:13] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:33:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:33:23] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.69 ms
[06:34:15] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:34:23] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.84 ms
[06:34:41] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:36:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:39:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:39:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:42:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:43:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:43:51] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:44:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:44:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:44:37] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms
[06:49:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:50:31] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:50:47] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.85 ms
[06:51:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:54:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:54:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:57:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:57:47] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:57:57] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[07:00:03] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260322T0700)
[07:01:53] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:02:03] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms
[07:02:41] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:04:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:04:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:04:55] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:05:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[07:05:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:08:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:09:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:09:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:10:15] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms
[07:10:29] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:10:57] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 75%, RTA = 110.56 ms
[07:11:07] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:13:33] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:14:01] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:14:19] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 66%, RTA = 110.89 ms
[07:15:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:18:05] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:27] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms
[07:19:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:20:09] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:20:31] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms
[07:20:31] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms
[07:21:29] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 80%, RTA = 111.59 ms
[07:22:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:23:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:23:33] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.96 ms
[07:24:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:25:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:26:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:27:37] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:28:41] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[07:31:43] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:33:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:33:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:33:47] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms
[07:34:41] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:34:49] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms
[07:35:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:35:47] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:35:53] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[07:36:41] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:36:51] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 75%, RTA = 110.55 ms
[07:36:53] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms
[07:37:41] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:39:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:59] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.02 ms
[07:44:01] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:44:15] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:44:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:45:01] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:46:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms
[07:47:55] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.88 ms
[07:48:11] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.49 ms
[07:48:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:49:01] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:49:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.88 ms
[07:49:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:50:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:51:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:52:03] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:52:13] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms
[07:53:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:53:15] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.02 ms
[07:54:07] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:55:55] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 90%, RTA = 111.08 ms
[07:58:09] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:58:15] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:58:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:23] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.74 ms
[07:59:03] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:59:25] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.69 ms
[08:01:01] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:01:25] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:02:27] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.81 ms
[08:02:29] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.56 ms
[08:04:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:05:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:33] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms
[08:07:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:07:37] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms
[08:09:31] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:39] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms
[08:10:03] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:10:51] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:41] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms
[08:16:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:17:31] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:17:49] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.90 ms
[08:19:49] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:20:41] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:20:53] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 111.00 ms
[08:21:45] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:21:55] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms
[08:23:09] <icinga-wm>	 PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:55] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:01] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.87 ms
[08:24:13] <icinga-wm>	 RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.72 ms
[08:24:55] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:27:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:27:51] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:03] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:05] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.06 ms
[08:28:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:29:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:30:21] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 90%, RTA = 110.75 ms
[08:33:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:36:09] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:15] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms
[08:40:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:40:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:40:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:40:21] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.84 ms
[08:41:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:42:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:43:07] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:44:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:44:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:31] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms
[08:48:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:48:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:49:25] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:49:25] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:35] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.20 ms
[08:50:37] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms
[08:52:29] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:52:39] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.15 ms
[08:54:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:54:25] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:56:21] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:56:29] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:45] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 75%, RTA = 111.01 ms
[08:58:09] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:59:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:59:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:59:47] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 80%, RTA = 110.91 ms
[09:01:51] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 110.95 ms
[09:03:47] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:03:55] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms
[09:04:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:04:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:04:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:05:43] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:05:51] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:05:57] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 33%, RTA = 110.60 ms
[09:05:59] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.08 ms
[09:08:25] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 50%, RTA = 111.02 ms
[09:08:51] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:09:01] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[09:09:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:57] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:10:03] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms
[09:11:23] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:12:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:12:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:13:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms
[09:14:49] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:14:49] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:16:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:17:11] <jinxer-wm>	 FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:19:45] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:19:59] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:15] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:21:19] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[09:23:03] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:23:23] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.51 ms
[09:25:13] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:25:19] <icinga-wm>	 PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:26:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:26:29] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms
[09:29:19] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:29:41] <icinga-wm>	 PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:30:07] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:30:31] <icinga-wm>	 RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 50%, RTA = 110.44 ms
[09:30:33] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms
[09:30:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:31:47] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:32:35] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.89 ms
[09:32:37] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 111.10 ms
[09:33:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:34:33] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:36:45] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms
[09:37:19] <icinga-wm>	 PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:38:35] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:38:45] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.88 ms
[09:39:33] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:40:19] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 111.77 ms
[09:40:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:42:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:42:39] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:42:49] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[09:43:43] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:43:53] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms
[09:44:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:44:51] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:45:49] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:45:55] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms
[09:46:23] <icinga-wm>	 PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:47:21] <wikibugs>	 10SRE-swift-storage, 06Commons: Server error 500 after uploading chunk - https://phabricator.wikimedia.org/T340917#11735705 (10Yann) This issue seems to have disappeared.
[09:52:55] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:53:03] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.83 ms
[09:54:09] <icinga-wm>	 RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:54:09] <icinga-wm>	 RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:54:30] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:55:59] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:56:09] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.78 ms
[09:57:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:57:49] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:58:11] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms
[09:59:13] <icinga-wm>	 RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:59:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:59:35] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[10:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[10:02:45] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:03:11] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%