[00:39:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1257598 [00:39:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1257598 (owner: TrainBranchBot) [00:51:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1257598 (owner: TrainBranchBot) [01:09:27] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1257613 [01:09:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1257613 (owner: TrainBranchBot) [01:17:10] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [01:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:21:22] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1257613 (owner: TrainBranchBot) [01:45:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:46:25] PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:47:15] RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:50:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:54:39] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [01:55:11] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.69 ms [01:55:25] PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:56:25] PROBLEM - Wikidough DoH Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:57:15] RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [01:58:15] RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time 
https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:59:51] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% [01:59:59] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 110.80 ms [02:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [02:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [02:01:02] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:03:25] PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:03:27] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:03:39] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:03:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:03:47] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:04:03] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.74 ms [02:04:37] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 77%, RTA = 110.55 ms [02:05:11] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [02:06:25] PROBLEM - Wikidough DoH Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:06:41] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:06:50] FIRING: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:06:53] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:06:59] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:07:11] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.64 ms [02:09:07] PROBLEM - SSH on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:09:07] PROBLEM - Wikidough DoT Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:09:07] PROBLEM - Wikidough DoT Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:23] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 21s) [02:10:05] RECOVERY - Wikidough DoT Check -IPv4- on doh7003 is OK: TCP OK - 7.499 second response time on 195.200.68.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:11:05] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet 
loss = 100% [02:11:05] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:11:13] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [02:11:17] RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:11:35] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.49 ms [02:12:15] RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 71%, RTA = 110.98 ms [02:12:15] RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:12:27] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% [02:12:27] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:12:31] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:13:01] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.85 ms [02:13:17] RECOVERY - Host doh7003 is UP: PING WARNING - Packet loss = 71%, RTA = 111.01 ms [02:13:17] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms [02:14:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:14:59] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:15:19] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 80%, RTA = 110.97 ms [02:15:43] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:15:59] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:15:59] RECOVERY - SSH on doh7003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:16:07] PROBLEM - Wikidough DoT 
Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:16:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:16:21] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms [02:16:50] RESOLVED: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:18:01] RECOVERY - Wikidough DoT Check -IPv4- on doh7003 is OK: TCP OK - 3.264 second response time on 195.200.68.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:18:39] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 66%, RTA = 110.46 ms [02:19:06] FIRING: ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:19:17] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 75%, RTA = 110.85 ms [02:19:55] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:20:27] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 110.92 ms [02:21:07] PROBLEM - SSH on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:21:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:21:25] PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:21:25] PROBLEM - Wikidough DoH Check -IPv4- on 
doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:21:27] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 75%, RTA = 110.97 ms [02:21:59] RECOVERY - Wikidough DoT Check -IPv6- on doh7003 is OK: TCP OK - 0.233 second response time on 2a02:ec80:700:3:195:200:68:98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:22:07] PROBLEM - Wikidough DoT Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:22:17] RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:22:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:22:41] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:22:41] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:23:31] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 71%, RTA = 110.44 ms [02:24:03] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% [02:24:19] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:33] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.82 ms [02:26:29] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:26:35] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 111.00 ms [02:26:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:27:59] RECOVERY - Wikidough DoT Check -IPv4- on doh7003 is OK: TCP OK - 1.250 second response time on 195.200.68.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:28:29] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:28:37] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.77 ms [02:29:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:30:09] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:30:17] RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 3.495 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:30:29] RECOVERY - Host doh7003 is UP: PING WARNING - Packet loss = 90%, RTA = 110.96 ms [02:30:38] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:15] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:31:41] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 66%, RTA = 110.91 ms [02:32:37] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:32:45] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 
0%, RTA = 111.03 ms [02:33:35] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 66%, RTA = 111.11 ms [02:34:25] PROBLEM - Wikidough DoH Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:34:25] PROBLEM - Wikidough DoH Check -IPv4- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:34:31] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:35:01] RECOVERY - SSH on doh7003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:35:07] PROBLEM - Wikidough DoT Check -IPv6- on doh7003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:35:38] FIRING: [5x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:41] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:35:49] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 33%, RTA = 111.00 ms [02:36:39] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:36:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:36:49] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.46 ms [02:37:25] PROBLEM - Wikidough DoH Check -IPv4- on doh7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:37:31] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% [02:38:15] RECOVERY - Wikidough DoH Check -IPv4- on 
doh7004 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:40:01] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 110.93 ms [02:42:41] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:42:59] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 111.09 ms [02:43:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:43:59] RECOVERY - Wikidough DoT Check -IPv6- on doh7003 is OK: TCP OK - 1.263 second response time on 2a02:ec80:700:3:195:200:68:98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:44:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:44:17] RECOVERY - Wikidough DoH Check -IPv4- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 1.470 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:44:33] FIRING: [5x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:47] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 33%, RTA = 110.98 ms [02:45:15] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 50%, RTA = 110.93 ms [02:45:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:47:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.54% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:47:43] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:48:05] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms [02:48:07] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms [02:48:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:48:15] RECOVERY - Wikidough DoH Check -IPv6- on doh7003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:48:21] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:49:27] PROBLEM - Bird Internet Routing Daemon on doh7004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [02:49:59] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:50:00] RESOLVED: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:03] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doh7003.wikimedia.org with reason: depooled host [02:50:09] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [02:50:26] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doh7004.wikimedia.org with reason: depooled host [02:51:21] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - 
Packet loss = 100% [02:52:11] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.46 ms [02:52:11] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms [02:52:11] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms [02:53:11] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:54:03] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:54:15] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.47 ms [02:55:27] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:55:27] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:55:37] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 110.97 ms [02:55:41] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:57:59] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [02:58:21] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.04 ms [02:58:41] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [02:59:39] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [02:59:53] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms [03:00:29] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:07] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:02:25] 
RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms [03:03:39] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:19] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms [03:04:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:05:05] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:05:21] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:31] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms [03:06:31] RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 80%, RTA = 110.90 ms [03:07:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:07:33] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms [03:09:37] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 111.08 ms [03:09:37] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.55 ms [03:12:01] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:01] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:07] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:19] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:14:35] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:43] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 
0%, RTA = 110.98 ms [03:19:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:21:25] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:21:53] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms [03:22:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:22:55] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms [03:23:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:24:43] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:24:57] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.72 ms [03:26:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:29:43] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:30:07] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.79 ms [03:32:11] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.55 ms [03:32:11] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms [03:33:11] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:34:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:35:01] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss 
= 100% [03:35:11] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.94 ms [03:35:53] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:36:13] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms [03:42:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:44:17] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:44:27] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.81 ms [03:45:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:48:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:49:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:49:47] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [03:50:33] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 111.03 ms [03:51:37] RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [03:52:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:53:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:54:29] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [03:54:41] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.06 ms [03:57:37] PROBLEM - Host ncredir7004 is DOWN: PING 
CRITICAL - Packet loss = 100% [03:57:43] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms [03:59:32] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:00:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:00:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:01:41] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:01:49] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms [04:02:43] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [04:03:37] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:04:35] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:04:53] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:05:57] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms [04:06:43] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:06:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:06:57] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms [04:09:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:10:53] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:11:03] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms [04:11:11] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:12:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:12:57] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:13:05] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 77%, RTA = 110.55 ms [04:13:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:13:09] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms [04:14:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:14:23] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:15:48] SRE, Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 - https://phabricator.wikimedia.org/T420833 (AlexisJazz) NEW [04:16:07] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:17:11] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [04:17:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:18:03] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:18:11] RECOVERY - Host ncredir7004 
is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms [04:22:29] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:23:19] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.87 ms [04:23:43] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [04:25:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:26:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:26:25] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [04:26:55] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 - https://phabricator.wikimedia.org/T420833#11735592 (10AlexisJazz) [04:27:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:27:15] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:27:29] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.78 ms [04:29:29] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms [04:30:05] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:30:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:31:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:32:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:33:21] PROBLEM - SSH on ncredir7004 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:33:47] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [04:33:55] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11735594 (10AlexisJazz) [04:34:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:34:35] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 60%, RTA = 110.54 ms [04:37:55] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:38:45] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [04:39:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:41:37] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [04:42:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:43:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:43:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:43:51] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms [04:44:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:45:39] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% 
[04:45:51] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [04:46:43] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [04:49:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:50:35] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 60%, RTA = 110.60 ms [04:51:59] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 80%, RTA = 111.02 ms [04:53:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:53:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:54:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:54:49] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [04:55:05] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [04:55:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:56:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:56:48] 06SRE, 10Beta-Cluster-Infrastructure: Beta cluster is slow as sludge / serves 503 and 504 - https://phabricator.wikimedia.org/T420833#11735601 (10AlexisJazz) Seems to work much better now. 
I'll leave the task open for a little while in case someone wants to comment on the cause. [04:58:37] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:01:09] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [05:01:11] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:02:05] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:02:15] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.85 ms [05:03:31] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:04:17] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 60%, RTA = 110.93 ms [05:04:24] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:04:29] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:17] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:06:19] RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 80%, RTA = 110.95 ms [05:06:23] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [05:07:19] PROBLEM - SSH on hcaptcha-proxy7001 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:07:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:07:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:09:05] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [05:09:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:11:15] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:11:27] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms [05:12:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:14:19] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:14:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:15:19] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:15:35] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 
110.93 ms [05:16:21] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:16:35] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms [05:16:37] RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 111.01 ms [05:17:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:17:37] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms [05:17:37] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [05:19:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:22:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:24:43] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:49] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms [05:25:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:27:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:29:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:30:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:30:55] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms [05:31:41] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [05:31:55] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 66%, RTA = 110.61 ms [05:33:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:33:29] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [05:33:49] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:33:59] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 77%, RTA = 110.54 ms [05:33:59] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [05:34:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:57] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:35:03] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.17 ms [05:35:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:36:01] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.77 ms [05:36:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:36:51] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:37:03] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.04 ms 
[05:37:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:38:01] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:38:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:40:21] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:41:09] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 33%, RTA = 110.91 ms [05:41:11] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.63 ms [05:41:11] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:42:09] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [05:43:03] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:43:13] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms [05:44:19] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:47:13] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:47:17] RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 75%, RTA = 111.01 ms [05:47:19] RECOVERY - Host 
ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [05:48:07] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:48:21] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms [05:48:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:49:07] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:49:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:49:21] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.93 ms [05:50:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:51:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:54:15] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:54:29] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms [05:55:04] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:55:25] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:55:31] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.80 ms [05:55:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:56:27] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet 
loss = 100% [05:56:35] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [05:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:58:47] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:21] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:35] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 33%, RTA = 110.49 ms [05:59:35] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms [06:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [06:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [06:01:17] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:01:37] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms [06:03:19] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:03:41] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [06:05:33] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:47] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms [06:06:37] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:06:47] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, 
RTA = 110.93 ms [06:09:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:09:47] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:10:51] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms [06:11:21] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:13:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:14:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:14:43] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:14:57] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.88 ms [06:15:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:16:51] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:16:59] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 50%, RTA = 110.91 ms [06:18:11] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:18:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:18:53] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:18:58] FIRING: ProbeDown: Service ganeti7002:1811 has 
failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:19:05] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms [06:19:19] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:21:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:23:19] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:24:01] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [06:24:09] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 110.80 ms [06:24:11] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.56 ms [06:24:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:25:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:28:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:28:17] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.87 ms [06:29:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:29:17] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms [06:29:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in 
ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:32:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:33:13] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:33:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:33:23] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.69 ms [06:34:15] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:34:23] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.84 ms [06:34:41] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:36:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:39:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:42:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:43:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:43:51] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:44:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:44:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:44:37] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms [06:49:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:50:31] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:50:47] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.85 ms [06:51:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:54:19] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:54:35] RECOVERY - SSH on bast7002 is OK: SSH OK - 
OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:57:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:57:47] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [06:57:57] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [07:00:03] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260322T0700) [07:01:53] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:02:03] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms [07:02:41] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:04:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:04:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:04:55] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:05:09] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [07:05:10] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:05:45] PROBLEM - SSH on 
bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:08:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:09:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:09:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:10:15] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms [07:10:29] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [07:10:57] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 75%, RTA = 110.56 ms [07:11:07] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [07:13:33] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:14:01] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [07:14:19] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 66%, RTA = 110.89 ms [07:15:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:18:05] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:18:27] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms [07:19:19] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:20:09] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:20:31] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms [07:20:31] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.86 ms 
[07:21:29] RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 80%, RTA = 111.59 ms [07:22:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:23:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:23:33] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.96 ms [07:24:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:25:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:26:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:27:37] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:28:41] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [07:31:43] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [07:33:07] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:33:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:33:47] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms [07:34:41] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:34:49] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms [07:35:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:35:47] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% 
[07:35:53] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [07:36:41] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:36:51] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 75%, RTA = 110.55 ms [07:36:53] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [07:37:41] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [07:38:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:39:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:39:59] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.02 ms [07:44:01] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:15] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:44:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:45:01] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:46:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:46:09] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [07:47:55] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 110.88 ms [07:48:11] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.49 ms [07:48:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:49:01] PROBLEM - Host ncredir7004 is DOWN: PING 
CRITICAL - Packet loss = 100% [07:49:09] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.88 ms [07:49:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:50:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:51:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:52:03] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:52:13] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms [07:53:07] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:53:15] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.02 ms [07:54:07] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [07:55:55] RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 90%, RTA = 111.08 ms [07:58:09] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [07:58:15] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:58:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:58:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:58:23] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.74 ms [07:59:03] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [07:59:25] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.69 ms [08:01:01] PROBLEM - Host ganeti7002 is 
DOWN: PING CRITICAL - Packet loss = 100% [08:01:25] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:02:27] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.81 ms [08:02:29] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.56 ms [08:04:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:05:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:33] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms [08:07:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:07:37] RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms [08:09:31] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:09:39] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.91 ms [08:10:03] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:10:51] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:11:41] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms [08:16:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:17:31] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:17:49] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.90 ms [08:19:49] FIRING: [3x] JobUnavailable: Reduced 
availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:20:41] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:53] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 90%, RTA = 111.00 ms [08:21:45] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:21:55] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.99 ms [08:23:09] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:23:55] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:24:01] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.87 ms [08:24:13] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.72 ms [08:24:55] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:27:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:27:51] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:03] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:05] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.06 ms [08:28:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:29:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:30:21] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 90%, RTA = 110.75 ms [08:33:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:34:19] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:36:09] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:36:15] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.05 ms [08:40:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:40:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:40:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:40:21] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.84 ms [08:41:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:42:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:43:07] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [08:44:19] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[08:44:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:44:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:44:31] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms [08:48:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:48:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:49:25] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:25] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:49:35] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.20 ms [08:50:37] RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms [08:52:29] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:52:39] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.15 ms [08:54:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:54:25] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:56:21] PROBLEM - Host hcaptcha-proxy7001 is DOWN: PING CRITICAL - Packet loss = 100% [08:56:29] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:57:45] RECOVERY - Host 
ncredir7004 is UP: PING WARNING - Packet loss = 75%, RTA = 111.01 ms [08:58:09] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:59:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:47] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 80%, RTA = 110.91 ms [09:01:51] RECOVERY - Host hcaptcha-proxy7001 is UP: PING WARNING - Packet loss = 71%, RTA = 110.95 ms [09:03:47] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:03:55] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.98 ms [09:04:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:04:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:04:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:05:43] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [09:05:51] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:05:57] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 33%, 
RTA = 110.60 ms [09:05:59] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.08 ms [09:08:25] RECOVERY - Host bast7002 is UP: PING WARNING - Packet loss = 50%, RTA = 111.02 ms [09:08:51] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:09:01] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [09:09:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:09:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:09:57] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:10:03] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.90 ms [09:11:23] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:12:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:13:09] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.89 ms [09:14:49] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [09:14:49] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:16:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:17:11] 
FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:19:45] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:19:59] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:20:15] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:21:19] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [09:23:03] PROBLEM - Host ganeti7002 is DOWN: PING CRITICAL - Packet loss = 100% [09:23:23] RECOVERY - Host ganeti7002 is UP: PING OK - Packet loss = 0%, RTA = 110.51 ms [09:25:13] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:25:19] PROBLEM - SSH on ganeti7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:26:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:26:29] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.82 ms [09:29:19] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:29:41] PROBLEM - Host ganeti7002 is DOWN: PING 
CRITICAL - Packet loss = 100% [09:30:07] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:30:31] RECOVERY - Host ganeti7002 is UP: PING WARNING - Packet loss = 50%, RTA = 110.44 ms [09:30:33] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [09:30:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:31:47] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:32:35] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 110.89 ms [09:32:37] RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 111.10 ms [09:33:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:34:33] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:36:45] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.07 ms [09:37:19] PROBLEM - SSH on hcaptcha-proxy7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:38:35] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:38:45] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.88 ms [09:39:33] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:19] RECOVERY - Host ncredir7004 is UP: PING WARNING - Packet loss = 71%, RTA = 111.77 ms [09:40:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:42:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:42:39] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:49] RECOVERY - Host ncredir7004 is UP: PING OK - Packet 
loss = 0%, RTA = 110.93 ms [09:43:43] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:53] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.92 ms [09:44:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:44:51] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:49] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:45:55] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms [09:46:23] PROBLEM - SSH on ncredir7004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:47:21] SRE-swift-storage, Commons: Server error 500 after uploading chunk - https://phabricator.wikimedia.org/T340917#11735705 (Yann) This issue seems to have disappeared.
[09:52:55] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:53:03] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.83 ms [09:54:09] RECOVERY - SSH on ganeti7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:54:09] RECOVERY - SSH on hcaptcha-proxy7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:54:30] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:59] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:56:09] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.78 ms [09:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:57:49] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100% [09:58:11] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 110.96 ms [09:59:13] RECOVERY - SSH on ncredir7004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:35] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:00:00] FIRING: 
SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [10:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [10:02:45] PROBLEM - SSH on bast7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:03:11] PROBLEM - Host ncredir7004 is DOWN: PING CRITICAL - Packet loss = 100%