[09:20:46] Scoring-platform-team, MediaWiki-JobQueue, ORES, Performance-Team, and 5 others: Job queue corruption after codfw switch over (Queue growth, duplicate runs) - https://phabricator.wikimedia.org/T163337#3374328 (elukey) Interesting thing found today: https://phabricator.wikimedia.org/P5621 I verif...
[09:44:08] Scoring-platform-team, MediaWiki-JobQueue, ORES, Performance-Team, and 5 others: Job queue corruption after codfw switch over (Queue growth, duplicate runs) - https://phabricator.wikimedia.org/T163337#3374376 (elukey) Deleted by accident my previous comment, will re-do it :) So https://phabricat...
[11:21:07] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:39:33] Scoring-platform-team, Wikilabels: [Discuss] Wikilabels routes refactor - https://phabricator.wikimedia.org/T165046#3374535 (Pginer-WMF) >>! In T165046#3298616, @Halfak wrote: > @Pginer-WMF (cc @jmatazzoni), would you be able to spare a small amount of time to discuss how we're looking to arrange the pag...
[12:24:12] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 941 bytes in 7.538 second response time
[12:28:03] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 981 bytes in 0.014 second response time
[12:48:12] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 3.031 second response time
[12:51:03] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 981 bytes in 0.025 second response time
[13:12:24] ACKNOWLEDGEMENT - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds alexandros kosiaris debugging it
[13:18:12] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 943 bytes in 1.025 second response time
[13:24:22] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:26:12] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 942 bytes in 8.035 second response time
[13:29:12] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 980 bytes in 0.014 second response time
[13:59:22] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 941 bytes in 7.541 second response time
[14:04:43] Scoring-platform-team, MediaWiki-JobQueue, ORES, Performance-Team, and 5 others: Job queue corruption after codfw switch over (Queue growth, duplicate runs) - https://phabricator.wikimedia.org/T163337#3374792 (elukey) >>! In T163337#3214837, @Krinkle wrote: > Ideas for next steps: > * Figure out...
[14:06:13] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 981 bytes in 0.018 second response time
[14:09:22] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 942 bytes in 2.020 second response time
[14:12:33] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:27:23] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 957 bytes in 2.022 second response time
[14:32:23] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 981 bytes in 0.017 second response time
[15:48:39] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 939 bytes in 7.031 second response time
[15:51:29] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 979 bytes in 0.013 second response time
[15:51:39] yup
[15:58:39] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 940 bytes in 8.033 second response time
[16:53:53] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 1019 bytes in 0.086 second response time
[18:42:05] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 996 bytes in 1.055 second response time
[21:29:24] im going to upgrade to stretch on gerrit-mysql (runs icinga2) now.
[21:51:53] PROBLEM - Host ores-redis.01 is DOWN: /bin/ping -n -U -w 30 -c 5 ores-redis-01.ores.eqiad.wmflabsCRITICAL - Could not interpret output from ping command
[21:51:54] PROBLEM - Host ores-worker-05 is DOWN: /bin/ping -n -U -w 30 -c 5 ores-worker-05.ores.eqiad.wmflabsCRITICAL - Could not interpret output from ping command
[21:51:56] PROBLEM - Host ores.wmflabs.org is DOWN: /bin/ping -n -U -w 30 -c 5 ores.wmflabs.orgCRITICAL - Could not interpret output from ping command
[21:51:58] PROBLEM - ping4 on Ores-Compute-01 is UNKNOWN: /bin/ping -n -U -w 10 -c 5 ores-compute-01.ores.eqiad.wmflabsCRITICAL - Could not interpret output from ping command
[21:51:59] PROBLEM - ping4 on ores-lb-02 is UNKNOWN: /bin/ping -n -U -w 10 -c 5 ores-lb-02.ores.eqiad.wmflabsCRITICAL - Could not interpret output from ping command
[21:52:02] PROBLEM - Host ores-web-05 is DOWN: /bin/ping -n -U -w 30 -c 5 ores-web-05.ores.eqiad.wmflabsCRITICAL - Could not interpret output from ping command
[21:52:03] PROBLEM - Host Ores-Compute-01 is DOWN: /bin/ping -n -U -w 30 -c 5 ores-compute-01.ores.eqiad.wmflabsCRITICAL - Could not interpret output from ping command
[21:52:06] PROBLEM - Host ores-lb-02 is DOWN: /bin/ping -n -U -w 30 -c 5 ores-lb-02.ores.eqiad.wmflabsCRITICAL - Could not interpret output from ping command
[21:52:47] ignore ^^
[21:52:59] it's because of the upgrade. i've stopped icinga2 temp
[22:00:49] RECOVERY - Host ores-lb-02 is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms
[22:00:49] RECOVERY - ping4 on ores-lb-02 is OK: PING OK - Packet loss = 0%, RTA = 3.39 ms
[22:00:50] RECOVERY - Host ores-redis.01 is UP: PING OK - Packet loss = 0%, RTA = 1.96 ms
[22:00:52] RECOVERY - Host Ores-Compute-01 is UP: PING OK - Packet loss = 0%, RTA = 3.61 ms
[22:00:54] RECOVERY - Host ores-web-05 is UP: PING OK - Packet loss = 0%, RTA = 2.26 ms
[22:00:57] RECOVERY - Host ores-worker-05 is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms
[22:00:57] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms