[00:12:25] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [00:25:59] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.04, 19.99, 18.96 [00:29:56] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.76, 19.28, 18.92 [00:38:48] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.01, 20.78, 19.71 [00:44:44] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.89, 19.48, 19.59 [00:48:40] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.03, 21.02, 20.20 [00:52:37] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.23, 19.67, 19.84 [00:56:33] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.58, 20.64, 20.28 [00:58:32] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.14, 20.09, 20.12 [01:18:00] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [01:18:18] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.78, 20.70, 20.29 [01:20:00] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present] [01:22:15] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.82, 19.47, 19.88 [01:33:06] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.76, 20.19, 19.87 [01:35:05] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 16.60, 19.05, 19.53 [01:48:54] PROBLEM - 
mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.53, 21.22, 19.96 [01:49:11] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 20.49, 20.27, 18.69 [01:50:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.67, 20.23, 19.74 [01:51:08] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 18.98, 19.47, 18.57 [02:00:43] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.77, 21.13, 20.46 [02:08:37] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.89, 20.32, 20.35 [02:12:33] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.16, 20.83, 20.52 [02:14:32] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.49, 22.08, 21.01 [02:16:31] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.72, 20.57, 20.58 [02:17:25] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [02:22:27] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.49, 19.71, 20.23 [02:29:20] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.07, 21.79, 20.86 [02:31:19] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.61, 20.25, 20.40 [02:52:02] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.81, 20.68, 19.83 [02:59:57] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.32, 20.30, 20.23 [03:03:53] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.31, 21.08, 20.58 [03:04:57] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:05:01] PROBLEM - prometheus151 Current Load 
on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 7.26, 3.34, 1.36 [03:05:52] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 18.11, 19.90, 20.20 [03:06:09] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:51] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.058 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [03:07:25] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:09:45] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:10:18] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [03:11:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.56, 3.56, 2.20 [03:14:44] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:15:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.01, 4.19, 2.73 [03:15:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:20:25] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [03:22:30] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 9.632 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [03:23:46] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [03:25:48] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present] [03:27:02] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.38, 3.19, 3.16 [03:30:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:30:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:33:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.92, 3.89, 3.49 [03:35:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.50, 4.08, 3.60 [03:35:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:37:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.48, 3.50, 3.42 [03:38:04] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:39:01] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.63, 3.00, 3.26 [03:39:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.20, 21.89, 19.82 [03:41:54] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call 
[03:43:03] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 7.82, 5.74, 4.33 [03:43:49] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.222 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [03:46:16] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 20.48, 21.02, 19.45 [03:48:13] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 25.31, 21.72, 19.83 [03:50:11] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 20.58, 21.62, 20.06 [03:51:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.61, 3.72, 4.00 [03:52:08] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 27.63, 23.68, 20.98 [03:53:04] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:53:39] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 21.90, 19.48, 16.82 [03:55:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.10, 3.84, 3.95 [03:55:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [03:55:38] PROBLEM - mw172 Current Load on mw172 is CRITICAL: LOAD CRITICAL - total load average: 25.08, 22.02, 18.09 [03:55:39] PROBLEM - mw161 Current Load on mw161 is WARNING: LOAD WARNING - total load average: 22.75, 19.52, 16.36 [03:55:51] PROBLEM - mw171 Current Load on mw171 is CRITICAL: LOAD CRITICAL - total load average: 25.39, 21.68, 17.98 [03:56:55] PROBLEM - mw152 Current Load on mw152 is WARNING: LOAD WARNING - total load average: 22.28, 20.73, 16.92 [03:57:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load 
average: 2.60, 3.55, 3.84 [03:57:38] PROBLEM - mw172 Current Load on mw172 is WARNING: LOAD WARNING - total load average: 21.50, 21.48, 18.36 [03:58:34] PROBLEM - mw162 Current Load on mw162 is CRITICAL: LOAD CRITICAL - total load average: 25.45, 21.13, 17.51 [03:59:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.38, 4.18, 4.00 [03:59:39] RECOVERY - mw161 Current Load on mw161 is OK: LOAD OK - total load average: 14.82, 18.71, 16.86 [03:59:51] PROBLEM - mw171 Current Load on mw171 is WARNING: LOAD WARNING - total load average: 17.35, 21.21, 18.77 [04:00:20] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:00:34] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 14.66, 18.37, 16.93 [04:00:55] RECOVERY - mw152 Current Load on mw152 is OK: LOAD OK - total load average: 14.74, 18.74, 17.10 [04:01:03] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.22, 3.49, 3.76 [04:01:36] RECOVERY - mw172 Current Load on mw172 is OK: LOAD OK - total load average: 12.27, 17.79, 17.71 [04:01:51] RECOVERY - mw171 Current Load on mw171 is OK: LOAD OK - total load average: 13.39, 18.37, 18.02 [04:01:54] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 20.01, 23.02, 22.61 [04:03:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.22, 22.42, 23.71 [04:05:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.29, 4.10, 3.90 [04:07:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.75, 3.28, 3.61 [04:09:01] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load 
average: 2.44, 2.81, 3.38 [04:11:39] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 11.73, 16.09, 19.54 [04:13:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.64, 3.73, 3.70 [04:13:29] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 14.37, 16.37, 20.05 [04:15:20] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:18:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:19:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.46, 4.33, 3.87 [04:24:28] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:26:23] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.068 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [04:28:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:29:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.70, 3.77, 4.00 [04:30:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:31:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.58, 4.37, 4.18 [04:32:38] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:34:33] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.072 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [04:35:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:36:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:38:57] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:40:51] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.076 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [04:41:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:42:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:52:50] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [04:55:27] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [04:56:51] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:22] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.219 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [05:01:03] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [05:02:50] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:04:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:12:46] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 4 datacenters are down: 38.46.223.205/cpweb, 38.46.223.206/cpweb, 2602:294:0:b13::110/cpweb, 2602:294:0:b23::112/cpweb [05:12:54] PROBLEM - ping6 on cp51 is CRITICAL: PING CRITICAL - Packet loss = 60%, RTA = 205.01 ms [05:13:04] PROBLEM - cp41 Varnish Backends on cp41 is CRITICAL: 7 backends are down. mw151 mw152 mw161 mw162 mw172 mw181 mw182 [05:13:05] PROBLEM - ping6 on cp41 is CRITICAL: PING CRITICAL - Packet loss = 60%, RTA = 170.96 ms [05:13:05] PROBLEM - cp41 HTTPS on cp41 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: cURL returned 28 - Operation timed out after 10003 milliseconds with 0 bytes received [05:13:14] PROBLEM - ping6 on ns2 is CRITICAL: PING CRITICAL - Packet loss = 100% [05:13:27] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 1 datacenter is down: 46.250.240.167/cpweb [05:13:46] PROBLEM - cp51 HTTPS on cp51 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: cURL returned 28 - Operation timed out after 10000 milliseconds with 0 bytes received [05:13:49] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:13:51] PROBLEM - cp51 Varnish Backends on cp51 is CRITICAL: 9 backends are down. 
mw151 mw152 mw161 mw162 mw171 mw172 mw181 mw182 mediawiki [05:14:28] PROBLEM - cp51 HTTP 4xx/5xx ERROR Rate on cp51 is WARNING: WARNING - NGINX Error Rate is 45% [05:15:12] RECOVERY - cp41 HTTPS on cp41 is OK: HTTP OK: HTTP/2 404 - Status line output matched "HTTP/2 404" - 3843 bytes in 9.055 second response time [05:17:56] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:18:41] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. [05:19:17] RECOVERY - cp51 HTTP 4xx/5xx ERROR Rate on cp51 is OK: OK - NGINX Error Rate is 36% [05:19:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:19:21] RECOVERY - ping6 on cp41 is OK: PING OK - Packet loss = 0%, RTA = 103.55 ms [05:19:27] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [05:19:28] RECOVERY - ping6 on ns2 is OK: PING OK - Packet loss = 0%, RTA = 141.02 ms [05:19:54] RECOVERY - cp51 HTTPS on cp51 is OK: HTTP OK: HTTP/2 404 - Status line output matched "HTTP/2 404" - 3843 bytes in 1.290 second response time [05:19:58] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [05:19:59] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 6.140 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [05:21:17] RECOVERY - cp41 Varnish Backends on cp41 is OK: All 19 backends are healthy [05:21:18] RECOVERY - ping6 on cp51 is OK: PING OK - Packet loss = 0%, RTA = 162.13 ms [05:21:20] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online [05:21:49] RECOVERY - cp51 Varnish Backends on cp51 is OK: All 19 backends are healthy [05:25:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.39, 17.60, 14.50 [05:26:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:31:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.51, 2.99, 3.86 [05:31:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:35:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.02, 3.38, 3.77 [05:35:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.87, 23.10, 18.61 [05:37:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 23.09, 23.82, 19.48 [05:38:10] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:39:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 1.87, 2.94, 3.50 [05:39:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 25.81, 24.17, 20.11 [05:41:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.94, 22.47, 20.01 [05:42:21] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:43:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total 
load average: 4.09, 3.39, 3.54 [05:44:50] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [05:45:06] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:45:29] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 12.02, 18.67, 19.17 [05:45:54] PROBLEM - ping6 on cp51 is CRITICAL: PING CRITICAL - Packet loss = 60%, RTA = 214.58 ms [05:45:58] PROBLEM - ping6 on cp41 is CRITICAL: PING CRITICAL - Packet loss = 80%, RTA = 180.15 ms [05:46:01] PROBLEM - ping6 on ns2 is CRITICAL: PING CRITICAL - Packet loss = 50%, RTA = 209.55 ms [05:47:24] PROBLEM - cp51 Varnish Backends on cp51 is CRITICAL: 7 backends are down. mw151 mw152 mw161 mw171 mw172 mw181 mw182 [05:47:38] PROBLEM - cp41 HTTPS on cp41 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: cURL returned 28 - Operation timed out after 10004 milliseconds with 0 bytes received [05:48:05] !tech hello i think miraheze is being ddossed again [05:48:10] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [05:48:50] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. [05:48:51] PROBLEM - cp41 Varnish Backends on cp41 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. 
[05:49:00] *.miraheze.org gives me "Error code 520" from cloudflare, non-miraheze domains give me "Error 503 Backend fetch failed" from varnish, grafana is half-502ing and half-not working [05:49:07] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.000621765852 secs [05:49:14] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:49:33] PROBLEM - cp51 HTTP 4xx/5xx ERROR Rate on cp51 is WARNING: WARNING - NGINX Error Rate is 57% [05:51:27] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 1 datacenter is down: 2407:3641:2161:9774::1/cpweb [05:51:31] RECOVERY - cp51 HTTP 4xx/5xx ERROR Rate on cp51 is OK: OK - NGINX Error Rate is 33% [05:51:42] PROBLEM - cloud.neptune.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:52:01] PROBLEM - cp51 HTTPS on cp51 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: cURL returned 28 - Operation timed out after 10000 milliseconds with 0 bytes received [05:53:12] PROBLEM - wiki.recaptime.eu.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:27] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [05:54:00] RECOVERY - cp51 HTTPS on cp51 is OK: HTTP OK: HTTP/2 404 - Status line output matched "HTTP/2 404" - 3843 bytes in 3.389 second response time [05:54:16] !log [@test151] starting deploy of {'folders': '1.42/extensions/MirahezeMagic'} to test151 [05:54:17] !log [@test151] finished deploy of {'folders': '1.42/extensions/MirahezeMagic'} to test151 - SUCCESS in 0s [05:54:22] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [05:54:27] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [05:55:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.30, 3.95, 3.92 [05:57:01] PROBLEM - prometheus151 Current 
Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.97, 4.56, 4.13 [05:58:23] PROBLEM - cp51 HTTPS on cp51 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: cURL returned 28 - Operation timed out after 10005 milliseconds with 0 bytes received [05:58:23] RECOVERY - ping6 on cp51 is OK: PING OK - Packet loss = 0%, RTA = 214.10 ms [05:59:27] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 3 datacenters are down: 46.250.240.167/cpweb, 2407:3641:2161:9774::1/cpweb, 2400:d320:2161:9775::1/cpweb [06:02:42] PROBLEM - ping6 on cp51 is CRITICAL: PING CRITICAL - Packet loss = 80%, RTA = 199.07 ms [06:03:10] [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:03:38] !log [@mwtask181] starting deploy of {'folders': '1.42/extensions/MirahezeMagic'} to all [06:03:42] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 4.470 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [06:04:32] PROBLEM - cp51 HTTP 4xx/5xx ERROR Rate on cp51 is CRITICAL: CRITICAL - NGINX Error Rate is 67% [06:04:34] RECOVERY - cp41 HTTPS on cp41 is OK: HTTP OK: HTTP/2 404 - Status line output matched "HTTP/2 404" - 3843 bytes in 4.097 second response time [06:06:52] RECOVERY - cp51 HTTP 4xx/5xx ERROR Rate on cp51 is OK: OK - NGINX Error Rate is 28% [06:07:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.74, 3.69, 3.96 [06:07:37] PROBLEM - wiki.consid.vn - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:07:42] PROBLEM - wiki.yumeka.icu - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:07:47] PROBLEM - vitriol.spoon.army - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:07:49] !log [@mwtask171] starting deploy of {'config': True} to all [06:07:54] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [06:08:39] PROBLEM - cp51 Puppet on cp51 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/etc/ferm/functions.conf],File[/etc/default/ferm],File[/home/salt-user],File[/etc/nginx/mime.types] [06:08:39] PROBLEM - mwtask181 Puppet on mwtask181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[MediaWiki-REL1_42 MirahezeMagic Sync] [06:08:51] PROBLEM - wiki.sadboyzpod.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:08:52] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. 
[06:08:58] PROBLEM - cp41 HTTPS on cp41 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: cURL returned 28 - Operation timed out after 10002 milliseconds with 0 bytes received [06:09:08] PROBLEM - cp41 Puppet on cp41 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:10:45] RECOVERY - cp51 HTTPS on cp51 is OK: HTTP OK: HTTP/2 404 - Status line output matched "HTTP/2 404" - 3821 bytes in 1.081 second response time [06:10:50] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online [06:10:50] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.0005583167076 secs [06:10:59] RECOVERY - cp41 HTTPS on cp41 is OK: HTTP OK: HTTP/2 404 - Status line output matched "HTTP/2 404" - 3843 bytes in 0.681 second response time [06:10:59] RECOVERY - cp51 Varnish Backends on cp51 is OK: All 19 backends are healthy [06:11:06] RECOVERY - cp41 Varnish Backends on cp41 is OK: All 19 backends are healthy [06:11:06] RECOVERY - ping6 on cp51 is OK: PING OK - Packet loss = 0%, RTA = 162.12 ms [06:11:16] RECOVERY - ping6 on ns2 is OK: PING OK - Packet loss = 0%, RTA = 140.66 ms [06:11:19] RECOVERY - ping6 on cp41 is OK: PING OK - Packet loss = 0%, RTA = 105.59 ms [06:11:27] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [06:13:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.72, 17.22, 14.53 [06:15:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.81, 3.85, 3.78 [06:15:29] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 19.67, 18.20, 15.25 [06:15:46] PROBLEM - mwtask171 Puppet on mwtask171 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[MediaWiki Config Sync] [06:15:49] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:18:10] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: The mediawiki job queue has more than 500 unclaimed jobs https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:19:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.33, 3.37, 3.59 [06:19:09] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [06:19:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.98, 21.66, 17.48 [06:21:07] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset -0.0007772743702 secs [06:21:32] PROBLEM - cloud.neptune.wiki - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'cloud.neptune.wiki' expires in 12 day(s) (Mon 19 Aug 2024 01:20:43 AM GMT +0000). [06:21:43] [ssl] WikiTideSSLBot pushed 1 commit to master [+0/-0/±1] https://github.com/miraheze/ssl/compare/753aef36b278...4287aa02517b [06:21:46] [ssl] WikiTideSSLBot 4287aa0 - Bot: Update SSL cert for cloud.neptune.wiki [06:23:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.05, 3.66, 3.64 [06:23:01] RECOVERY - wiki.recaptime.eu.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.recaptime.eu.org' will expire on Thu 10 Oct 2024 11:33:48 PM GMT +0000. 
[06:23:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.09, 23.61, 19.18 [06:25:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.53, 3.41, 3.54 [06:26:59] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [06:27:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 7.02, 4.70, 3.99 [06:28:54] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.070 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [06:29:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.27, 3.83, 3.77 [06:29:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.62, 23.22, 20.64 [06:30:29] RECOVERY - cp51 Puppet on cp51 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:34:55] RECOVERY - mwtask181 Puppet on mwtask181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:35:01] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.84, 2.85, 3.37 [06:36:49] RECOVERY - wiki.consid.vn - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.consid.vn' will expire on Thu 10 Oct 2024 08:20:17 AM GMT +0000. [06:36:58] RECOVERY - wiki.yumeka.icu - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.yumeka.icu' will expire on Thu 10 Oct 2024 01:41:08 AM GMT +0000. [06:37:09] RECOVERY - vitriol.spoon.army - LetsEncrypt on sslhost is OK: OK - Certificate 'vitriol.spoon.army' will expire on Wed 09 Oct 2024 10:41:15 AM GMT +0000. 
[06:37:22] RECOVERY - cp41 Puppet on cp41 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:37:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.72, 24.96, 22.14 [06:37:48] RECOVERY - wiki.sadboyzpod.com - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.sadboyzpod.com' will expire on Fri 27 Sep 2024 04:11:35 AM GMT +0000. [06:38:10] [Grafana] !tech RESOLVED: High Job Queue Backlog https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:39:45] RECOVERY - mwtask171 Puppet on mwtask171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:39:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:41:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.30, 3.38, 3.43 [06:43:01] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 1.87, 3.11, 3.35 [06:45:41] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:49:52] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.41, 3.37, 3.46 [06:51:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [06:51:32] RECOVERY - cloud.neptune.wiki - LetsEncrypt on sslhost is OK: OK - Certificate 'cloud.neptune.wiki' will expire on Mon 04 Nov 2024 05:23:08 AM GMT +0000. 
[06:53:42] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.05, 3.03, 3.33 [06:57:33] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.47, 3.87, 3.59 [06:59:10] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [07:01:07] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 2.649 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [07:05:13] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.15, 3.86, 3.81 [07:06:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:07:08] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.90, 4.28, 3.97 [07:07:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:07:55] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:09:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 18.99, 22.28, 23.97 [07:11:06] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.39, 3.88, 3.89 [07:11:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.54, 23.17, 24.08 [07:12:10] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [07:12:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:15:59] PROBLEM - ns2 Puppet on ns2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[07:16:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:17:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 8.11, 5.14, 4.26 [07:21:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:23:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:28:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:29:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.05, 3.42, 3.82 [07:32:18] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:35:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.90, 3.68, 3.72 [07:37:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.07, 3.13, 3.49 [07:37:18] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:37:55] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:39:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.53, 4.15, 3.83 [07:41:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.72, 3.60, 3.65 [07:42:55] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:44:44] RECOVERY - ns2 Puppet on ns2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:47:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load 
average: 5.85, 4.08, 3.79 [07:47:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:57:19] PROBLEM - mw182 Current Load on mw182 is CRITICAL: LOAD CRITICAL - total load average: 28.51, 22.01, 18.27 [07:57:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [07:58:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:00:24] [ImportDump] BlankEclair opened pull request #116: T12434: Fix uploading dumps - https://github.com/miraheze/ImportDump/pull/116 [08:00:33] [ImportDump] coderabbitai[bot] commented on pull request #116: T12434: Fix uploading dumps - https://github.com/miraheze/ImportDump/pull/116#issuecomment-2270631911 [08:02:34] PROBLEM - mw162 Current Load on mw162 is WARNING: LOAD WARNING - total load average: 21.12, 19.66, 15.86 [08:03:11] PROBLEM - mw182 Current Load on mw182 is WARNING: LOAD WARNING - total load average: 18.51, 23.56, 20.50 [08:03:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:04:34] RECOVERY - mw162 Current Load on mw162 is OK: LOAD OK - total load average: 12.79, 16.97, 15.35 [08:07:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.20, 3.56, 3.94 [08:07:05] RECOVERY - mw182 Current Load on mw182 is OK: LOAD OK - total load average: 13.43, 18.59, 19.17 [08:07:30] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:09:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.03, 3.71, 3.92 [08:09:58] miraheze/ImportDump - BlankEclair the build passed.
[08:11:21] [ImportDump] coderabbitai[bot] edited pull request #116: T12434: Fix uploading dumps - https://github.com/miraheze/ImportDump/pull/116 [08:11:43] [ImportDump] coderabbitai[bot] edited a comment on pull request #116: T12434: Fix uploading dumps - https://github.com/miraheze/ImportDump/pull/116#issuecomment-2270631911 [08:11:49] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [08:12:30] [Grafana] FIRING: Some MediaWiki Appservers are running out of PHP-FPM workers. https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:13:44] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 0.077 seconds response time. wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [08:17:30] [Grafana] RESOLVED: PHP-FPM Worker Usage High https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:29:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:29:26] PROBLEM - ns2 NTP time on ns2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:31:24] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset 0.0007326304913 secs [08:35:23] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [08:39:30] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 7.146 seconds response time.
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [08:44:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:46:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:51:15] PROBLEM - ns2 NTP time on ns2 is UNKNOWN: check_ntp_time: Invalid hostname/address - time.cloudflare.com Usage: check_ntp_time -H [-4|-6] [-w ] [-c ] [-v verbose] [-o ] [08:51:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [08:53:13] RECOVERY - ns2 NTP time on ns2 is OK: NTP OK: Offset 0.0007044374943 secs [09:01:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.84, 3.22, 3.96 [09:01:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.69, 21.67, 23.65 [09:01:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:03:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.33, 4.37, 4.29 [09:03:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 28.14, 23.51, 24.04 [09:05:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.38, 22.87, 23.75 [09:06:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:07:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.16, 3.43, 3.92 [09:07:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.19, 23.47, 23.86 [09:09:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 20.83, 22.68, 23.54 [09:11:01] PROBLEM - prometheus151 Current Load on prometheus151
is CRITICAL: LOAD CRITICAL - total load average: 5.56, 3.89, 3.93 [09:11:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:13:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.79, 3.48, 3.76 [09:16:23] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:17:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.61, 22.61, 22.91 [09:18:25] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:19:02] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.09, 3.80, 3.69 [09:19:28] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [09:19:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.06, 21.16, 22.35 [09:21:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.41, 3.58, 3.66 [09:21:28] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 5.331 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [09:23:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.30, 4.05, 3.80 [09:25:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.28, 3.67, 3.71 [09:25:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.27, 23.57, 22.87 [09:27:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.83, 4.38, 3.96 [09:28:25] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:29:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 21.01, 22.95, 22.84 [09:31:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:31:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 26.67, 24.28, 23.33 [09:33:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.54, 3.70, 3.82 [09:35:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.21, 4.12, 3.97 [09:35:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.87, 22.67, 22.95 [09:36:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:38:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:41:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.39, 3.74, 3.86 [09:43:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:45:29] PROBLEM - mw181 Current Load on mw181 is 
CRITICAL: LOAD CRITICAL - total load average: 26.12, 21.70, 21.58 [09:47:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 22.67, 21.63, 21.55 [09:48:21] PROBLEM - prometheus151 SSH on prometheus151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:49:46] PROBLEM - cloud15 IPMI Sensors on cloud15 is UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-cloud15.localhost: internal IPMI error-> Execution of /usr/sbin/ipmi-sel failed with return code 1.-> /usr/sbin/ipmi-sel was executed with the following parameters: sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=all [09:50:28] RECOVERY - prometheus151 SSH on prometheus151 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) [09:51:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.86, 3.68, 3.64 [09:51:48] PROBLEM - cloud15 IPMI Sensors on cloud15 is CRITICAL: IPMI Status: Critical [442 system event log (SEL) entries present] [09:53:01] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.79, 3.07, 3.40 [09:53:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:57:02] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.30, 3.84, 3.72 [09:58:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [09:59:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.92, 22.85, 22.12 [10:01:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 5.58, 4.16, 3.83 [10:01:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:01:29] PROBLEM - mw181 Current Load on 
mw181 is WARNING: LOAD WARNING - total load average: 20.42, 21.92, 21.88 [10:07:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 2.36, 3.76, 3.81 [10:09:01] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 4.77, 4.03, 3.89 [10:11:01] PROBLEM - prometheus151 Current Load on prometheus151 is WARNING: LOAD WARNING - total load average: 3.07, 3.40, 3.66 [10:11:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:13:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:17:01] RECOVERY - prometheus151 Current Load on prometheus151 is OK: LOAD OK - total load average: 2.08, 2.94, 3.39 [10:18:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:19:29] PROBLEM - mw181 Current Load on mw181 is CRITICAL: LOAD CRITICAL - total load average: 24.80, 21.53, 21.17 [10:21:29] PROBLEM - mw181 Current Load on mw181 is WARNING: LOAD WARNING - total load average: 19.69, 20.75, 20.94 [10:28:50] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:29:43] PROBLEM - prometheus151 Current Load on prometheus151 is CRITICAL: LOAD CRITICAL - total load average: 6.78, 4.32, 3.57 [10:30:09] PROBLEM - prometheus151 PowerDNS Recursor on prometheus151 is CRITICAL: CRITICAL - Plugin timed out while executing system call [10:32:13] RECOVERY - prometheus151 PowerDNS Recursor on prometheus151 is OK: DNS OK: 9.875 seconds response time. 
wikitide.net returns 2602:294:0:b13::110,2602:294:0:b23::112,38.46.223.205,38.46.223.206 [10:33:50] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:35:20] [Grafana] !tech FIRING: There has been a rise in the MediaWiki exception rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:39:29] RECOVERY - mw181 Current Load on mw181 is OK: LOAD OK - total load average: 17.68, 19.54, 20.35 [10:40:20] [Grafana] !tech RESOLVED: MediaWiki Exception Rate https://grafana.wikitide.net/d/GtxbP1Xnk?orgId=1 [10:43:11]