[00:37:24] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [00:42:04] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [00:43:32] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1007 bytes in 3.402 second response time https://wikitech.wikimedia.org/wiki/ORES [00:43:34] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 8.954 second response time https://wikitech.wikimedia.org/wiki/ORES [01:24:11] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [01:26:29] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 7.640 second response time https://wikitech.wikimedia.org/wiki/ORES [02:27:13] 10Scoring-platform-team, 10Edit-Review-Improvements-Integrated-Filters, 10editquality-modeling, 10Chinese-Sites, and 2 others: Deploy ORES filters for zhwiki - https://phabricator.wikimedia.org/T225562 (10Shizhao) 05Open→03Resolved It's working.
[02:27:15] 10Scoring-platform-team, 10Growth-Team, 10editquality-modeling, 10artificial-intelligence: Update RC Filters for new ORES capacities (July, 2019) - https://phabricator.wikimedia.org/T227094 (10Shizhao) [06:28:02] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [06:29:18] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 8.467 second response time https://wikitech.wikimedia.org/wiki/ORES [06:49:22] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [06:51:20] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 0.893 second response time https://wikitech.wikimedia.org/wiki/ORES [06:56:10] (03CR) 10Kosta Harlan: "check codehealth" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/510166 (owner: 10Kosta Harlan) [07:22:05] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [07:22:59] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/ORES [08:11:28] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/ORES [08:11:46] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [08:13:30] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [08:16:16] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1011 
bytes in 4.959 second response time https://wikitech.wikimedia.org/wiki/ORES [08:16:24] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 5.229 second response time https://wikitech.wikimedia.org/wiki/ORES [08:16:34] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1011 bytes in 4.479 second response time https://wikitech.wikimedia.org/wiki/ORES [08:55:09] (03CR) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ORES] - 10https://gerrit.wikimedia.org/r/534353 (owner: 10L10n-bot) [09:10:24] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [09:11:50] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1007 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/ORES [11:34:46] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [11:37:14] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [11:38:28] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [11:39:06] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 8.766 second response time https://wikitech.wikimedia.org/wiki/ORES [11:39:56] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 9.191 second response time https://wikitech.wikimedia.org/wiki/ORES [11:40:08] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 5.265 second response time 
https://wikitech.wikimedia.org/wiki/ORES [12:03:55] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:05:41] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:06:37] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 2.146 second response time https://wikitech.wikimedia.org/wiki/ORES [12:07:01] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 2.869 second response time https://wikitech.wikimedia.org/wiki/ORES [12:17:13] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:18:05] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/ORES [12:37:19] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:38:17] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:38:23] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:39:51] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 9.203 second response time https://wikitech.wikimedia.org/wiki/ORES [12:39:55] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 7.268 second response time https://wikitech.wikimedia.org/wiki/ORES [12:40:13] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP 
OK: HTTP/1.1 200 OK - 1007 bytes in 3.876 second response time https://wikitech.wikimedia.org/wiki/ORES [12:55:29] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:56:09] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [12:57:43] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 2.434 second response time https://wikitech.wikimedia.org/wiki/ORES [12:58:26] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 2.740 second response time https://wikitech.wikimedia.org/wiki/ORES [13:09:17] o/ [13:09:19] I'm back! [13:13:56] 10Scoring-platform-team, 10Edit-Review-Improvements-Integrated-Filters, 10editquality-modeling, 10Chinese-Sites, and 2 others: Deploy ORES filters for zhwiki - https://phabricator.wikimedia.org/T225562 (10Halfak) @zhuyifei1999, you can finally test this on Special:RecentChanges! [13:41:49] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [13:46:51] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [13:47:01] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [13:47:29] Looks like I'll be researching all this noise. We need it to quiet down. 
[13:48:01] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 8.644 second response time https://wikitech.wikimedia.org/wiki/ORES [13:48:51] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 6.980 second response time https://wikitech.wikimedia.org/wiki/ORES [13:53:58] 10ORES, 10Scoring-platform-team, 10WMDE-Analytics-Engineering, 10Wikidata, 10User-GoranSMilovanovic: track quality of all/top 10000 Wikidata items over time - https://phabricator.wikimedia.org/T195702 (10Halfak) For clarity, making millions of calls to ORES is totally feasible. We have a utility for doi... [14:02:58] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @bd808 - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting [14:10:27] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [14:10:37] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 1.407 second response time https://wikitech.wikimedia.org/wiki/ORES [14:13:29] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 5.558 second response time https://wikitech.wikimedia.org/wiki/ORES [14:33:05] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [14:34:21] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [14:34:21] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [14:35:41] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes 
in 4.222 second response time https://wikitech.wikimedia.org/wiki/ORES [14:35:41] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 4.237 second response time https://wikitech.wikimedia.org/wiki/ORES [14:35:51] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 4.443 second response time https://wikitech.wikimedia.org/wiki/ORES [14:39:23] I see a lot of "[2019-09-04T14:38:43] statsd_send_metric()/sendto(): Resource temporarily unavailable [plugins/stats_pusher_statsd/plugin.c line 40]" [14:39:26] In the logs. [14:40:19] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [14:43:17] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 2.362 second response time https://wikitech.wikimedia.org/wiki/ORES [14:52:35] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @bd808 - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting [14:57:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [14:57:56] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [15:01:44] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 4.873 second response time https://wikitech.wikimedia.org/wiki/ORES [15:01:46] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 6.234 second response time https://wikitech.wikimedia.org/wiki/ORES [15:10:26] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/ORES [15:10:26] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [15:10:58] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [15:12:28] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 6.332 second response time https://wikitech.wikimedia.org/wiki/ORES [15:13:28] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 2.942 second response time https://wikitech.wikimedia.org/wiki/ORES [15:13:28] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 2.946 second response time https://wikitech.wikimedia.org/wiki/ORES [15:23:24] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [15:23:25] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [15:25:38] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 2.583 second response time https://wikitech.wikimedia.org/wiki/ORES [15:25:38] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 2.588 second response time https://wikitech.wikimedia.org/wiki/ORES [15:32:36] I wonder if this is something to do with our load balancer. [15:32:41] ores-lb-03 [15:32:54] CPU usage and wait are low. [15:33:48] I just restarted the nginx service. [15:37:06] Let's see what happens. 
[15:39:37] 10MediaWiki-extensions-ORES, 10ORES, 10Scoring-platform-team: ORES API latency too high - https://phabricator.wikimedia.org/T231776 (10Halfak) You're essentially getting one two scores every 0.1 seconds. That's pretty fast! It's 10x faster than our expected response time for querying for a single revision/... [15:40:17] 10MediaWiki-extensions-ORES, 10ORES, 10Scoring-platform-team: ORES API latency too high - https://phabricator.wikimedia.org/T231776 (10Halfak) Oh! Please consider using ores.wikimedia.org. That's our production service. You should expect it to be faster and better in all ways than ores.wmflabs.org. [15:40:34] 10ORES, 10Scoring-platform-team: ORES API latency too high - https://phabricator.wikimedia.org/T231776 (10Halfak) [15:43:39] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) [15:44:54] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) I just restarted nginx on ores-lb-03 to see if that was what was causing the problem. Next I'll be looking into getting a clean deploy into wmflabs. [15:45:19] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) Also, I'm seeing a lot of lines that look like this in the uwsgi logs: > [2019-09-04T14:38:43] statsd_send_metric()/sendto(): Resource temporarily unavailable [plugins/stats_pu... [15:46:16] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) See {T218567} for some related work. 
[15:51:12] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) I'm also seeing some > [2019-09-04T06:25:41] fork(): Cannot allocate memory [core/master_utils.c line 729] [16:01:02] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) Related to the statsd noise: {T189605} [16:16:54] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [16:17:52] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 6.821 second response time https://wikitech.wikimedia.org/wiki/ORES [16:21:34] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [16:22:10] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [16:22:42] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 8.753 second response time https://wikitech.wikimedia.org/wiki/ORES [16:23:08] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/ORES [16:25:19] 10ORES, 10Scoring-platform-team: Limit rev_ids in an ORES request to 50. - https://phabricator.wikimedia.org/T232005 (10Halfak) [16:32:24] 10ORES, 10Scoring-platform-team, 10WMDE-Analytics-Engineering, 10Wikidata, 10User-GoranSMilovanovic: track quality of all/top 10000 Wikidata items over time - https://phabricator.wikimedia.org/T195702 (10GoranSMilovanovic) @Halfak Thank you, Aaron. 
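The 50-revision cap proposed in T232005 means that bulk scoring has to be split into batches on the client side. A minimal sketch of that batching follows. The URL layout (`/v3/scores/<context>/?models=...&revids=...`) and the function names are assumptions made for illustration, not something stated in this log, and this is not the utility mentioned in T195702.

```python
from urllib.parse import urlencode

MAX_REVIDS = 50  # per-request cap proposed in T232005


def chunk_rev_ids(rev_ids, size=MAX_REVIDS):
    """Yield successive lists of at most `size` revision IDs."""
    for start in range(0, len(rev_ids), size):
        yield rev_ids[start:start + size]


def score_url(context, model, rev_ids, host="https://ores.wikimedia.org"):
    """Build one scoring URL for a batch (assumed v3 URL layout)."""
    query = urlencode({
        "models": model,
        "revids": "|".join(str(r) for r in rev_ids),  # pipe-separated IDs
    })
    return f"{host}/v3/scores/{context}/?{query}"


def batch_urls(context, model, rev_ids):
    """Return one URL per batch of at most MAX_REVIDS revisions."""
    return [score_url(context, model, batch) for batch in chunk_rev_ids(rev_ids)]
```

The default host is the production service recommended earlier in the log (ores.wikimedia.org) rather than ores.wmflabs.org.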
[16:40:22] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) Looks like we're restarting once per day: https://grafana-labs-admin.wikimedia.org/d/000000006/ores-labs?orgId=1&panelId=6&fullscreen&from=1567046937558&to=1567595497214 [16:43:46] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [16:44:32] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [16:44:50] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [16:48:14] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 1.259 second response time https://wikitech.wikimedia.org/wiki/ORES [16:48:34] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/ORES [16:52:58] 10ORES, 10Scoring-platform-team: ORES API latency too high - https://phabricator.wikimedia.org/T231776 (10Halfak) [16:53:00] 10Scoring-platform-team: ORES api query latency - https://phabricator.wikimedia.org/T231940 (10Halfak) [17:03:27] groceryheist, o/ [17:04:14] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 2.512 second response time https://wikitech.wikimedia.org/wiki/ORES [17:09:38] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [17:10:12] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [17:11:06] RECOVERY - ORES worker labs on 
ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 5.561 second response time https://wikitech.wikimedia.org/wiki/ORES [17:11:38] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [17:11:39] Shuddap [17:13:08] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 5.390 second response time https://wikitech.wikimedia.org/wiki/ORES [17:13:08] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 0.486 second response time https://wikitech.wikimedia.org/wiki/ORES [17:22:40] groceryheist, ping me if you want to retry meeting later. [17:23:46] yeah [17:24:03] sorry I thought it was at 10:30 [17:24:17] brain blip [17:24:59] didn't check my calendar before I got on my bike [17:27:38] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [17:29:35] ^halfak [17:30:06] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 3.503 second response time https://wikitech.wikimedia.org/wiki/ORES [17:35:18] groceryheist, want to meet now? 
[17:35:50] sure I have another call at 11 [17:37:46] halfak: i'm in the call [17:57:39] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/ORES [17:57:43] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [17:58:53] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 2.480 second response time https://wikitech.wikimedia.org/wiki/ORES [17:58:59] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 1.381 second response time https://wikitech.wikimedia.org/wiki/ORES [18:04:23] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:04:31] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:04:35] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:05:53] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 6.302 second response time https://wikitech.wikimedia.org/wiki/ORES [18:07:13] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 7.513 second response time https://wikitech.wikimedia.org/wiki/ORES [18:07:25] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 6.337 second response time https://wikitech.wikimedia.org/wiki/ORES [18:13:29] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/ORES [18:14:57] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:17:39] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 0.816 second response time https://wikitech.wikimedia.org/wiki/ORES [18:17:49] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 0.336 second response time https://wikitech.wikimedia.org/wiki/ORES [18:26:57] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:28:13] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:29:15] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.928 second response time https://wikitech.wikimedia.org/wiki/ORES [18:29:25] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 4.668 second response time https://wikitech.wikimedia.org/wiki/ORES [18:33:11] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:33:21] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:33:27] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:34:25] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/ORES [18:34:43] RECOVERY - ORES web node labs ores-web-01 on 
ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/ORES [18:36:01] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 1.867 second response time https://wikitech.wikimedia.org/wiki/ORES [18:38:53] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:39:11] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:40:13] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/ORES [18:40:35] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 3.702 second response time https://wikitech.wikimedia.org/wiki/ORES [18:58:13] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [18:59:21] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 3.004 second response time https://wikitech.wikimedia.org/wiki/ORES [19:36:21] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [19:36:49] curses! 
[19:37:45] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/ORES [19:42:11] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/ORES [19:42:23] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 4.297 second response time https://wikitech.wikimedia.org/wiki/ORES [19:42:30] * halfak works on silencing all of ^ [20:06:18] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:06:22] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:06:35] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) Looks like there are periodic timeouts coming from the wmflabs system. I ran a script to just make 1000 requests to the fakewiki pseudo-model (which should return right away) and... [20:06:45] Looks like we have a real instability issue on our hands. [20:07:04] This isn't just icinga going crazy. [20:07:08] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:07:23] As a precaution, I'm going to restart all of our machines.
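The check described in T231222 above (many quick requests to the fakewiki pseudo-model, watching for periodic timeouts) could be sketched roughly as follows. This is not the script halfak ran. The function names, the 5-second "slow" threshold, and the example URL in the comment are all assumptions for illustration.

```python
import time
import urllib.request


def make_fetch(url, timeout=10.0):
    """Return a zero-argument function that performs one GET against `url`."""
    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    return fetch


def time_requests(fetch, n=1000, clock=time.perf_counter):
    """Call fetch() n times and return each call's duration in seconds."""
    durations = []
    for _ in range(n):
        start = clock()
        try:
            fetch()
        except Exception:
            pass  # a timeout or HTTP error still counts; its duration is kept
        durations.append(clock() - start)
    return durations


def summarize(durations, slow_threshold=5.0):
    """Count how many calls took at least `slow_threshold` seconds."""
    slow = [d for d in durations if d >= slow_threshold]
    return {
        "count": len(durations),
        "slow": len(slow),
        "max": max(durations, default=0.0),
    }

# Example use (hypothetical URL, performs real network requests):
#   fetch = make_fetch("https://ores.wmflabs.org/v3/scores/fakewiki/")
#   print(summarize(time_requests(fetch, n=1000)))
```

Passing the fetch function and the clock in as arguments keeps the timing loop independent of the network, so the same loop can be pointed at the public hostname, the load balancer, or localhost on a web node.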
[20:07:42] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 7.575 second response time https://wikitech.wikimedia.org/wiki/ORES [20:07:44] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 5.959 second response time https://wikitech.wikimedia.org/wiki/ORES [20:08:26] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/ORES [20:12:16] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2103 bytes in 0.980 second response time https://wikitech.wikimedia.org/wiki/ORES [20:13:06] Redis is recovering... [20:13:06] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2103 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/ORES [20:13:22] was just going to say, could this be a redis SPOF issue? [20:13:24] I'm going to run a flush all when it comes back. [20:13:35] Yeah, I checked on the redis box and it didn't seem to be struggling. [20:13:54] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 6.457 second response time https://wikitech.wikimedia.org/wiki/ORES [20:14:40] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/ORES [20:15:57] Looks like we're back online. [20:17:55] halfak: what's the current status of availability of ORES scores in a programmatic manner, either replicated databases or the Data Lake, is anything available? [20:18:18] Working on an issue now. [20:18:26] But gist is that we provide an API. [20:18:34] And a few historic score dumps ad-hoc.
[20:18:42] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:18:48] I don't know if anything is going into the data lake. [20:19:00] Nettrom, ^ [20:19:09] halfak: no problem, thanks for taking the time to respond :) [20:19:42] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 4.916 second response time https://wikitech.wikimedia.org/wiki/ORES [20:20:43] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10JHedden) I'm curious if the timeouts are only seen from a single virtual machine, or if you're seeing the same results from multiple ORES virtual machines. Could you please run your scrip... [20:29:58] 10ORES, 10Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (10Halfak) @JHedden, I'm running this script from my local machine. But I imagine icinga is making its requests from a different location than me and it's getting the same behavior. Right... [20:30:08] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:31:26] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 8.505 second response time https://wikitech.wikimedia.org/wiki/ORES [20:31:53] Aha! Even circumventing redis, I get similar bad behavior. 
[20:34:52] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:35:46] ORES, Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (Halfak) This is a snippet from requesting model_info: ` 0.3350839614868164 0.3268578052520752 0.3501322269439697 0.3274261951446533 0.35183024406433105 0.3265092372894287 0.3343515396118... [20:37:44] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 8.591 second response time https://wikitech.wikimedia.org/wiki/ORES [20:38:05] Problem does not exist on the local worker ores-web-01 [20:38:40] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:41:24] I'm trying to access the load balancer directly but I'm struggling. [20:41:36] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 2.998 second response time https://wikitech.wikimedia.org/wiki/ORES [20:45:16] The https redirect keeps catching me. I don't quite know how to get around it. [20:46:58] aha! Got it. [20:47:08] X-Forwarded-Proto: https [20:49:32] Aha! It is the lb. [20:49:34] I think. [20:49:40] It's really slow [20:52:58] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [20:53:56] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 1.410 second response time https://wikitech.wikimedia.org/wiki/ORES [21:00:06] Oh crud. It looks like I am getting the slowdown locally too. 
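The probing described between 20:35:46 and 20:49:40 (timing repeated requests, and sending a forwarded-protocol header so the https redirect does not fire) could look roughly like the sketch below. It is not the script actually used in T231222, which is not shown in this log. The function names, the `opener` parameter, and the example URL in the trailing comment are all hypothetical, and it assumes the backend skips the redirect when it sees `X-Forwarded-Proto: https`.

```python
import statistics
import time
import urllib.request


def build_request(url, forwarded_proto="https"):
    """Build a GET request that claims it already arrived over HTTPS.

    Assumption: the server only redirects when X-Forwarded-Proto is
    missing or not "https".
    """
    return urllib.request.Request(
        url, headers={"X-Forwarded-Proto": forwarded_proto}
    )


def time_requests(url, n=20, opener=urllib.request.urlopen, timeout=10):
    """Return the wall-clock latency in seconds of n sequential requests.

    `opener` is injectable (hypothetical parameter) so the loop can be
    exercised without touching the network.
    """
    latencies = []
    for _ in range(n):
        start = time.monotonic()
        with opener(build_request(url), timeout=timeout) as resp:
            resp.read()  # include body transfer in the measurement
        latencies.append(time.monotonic() - start)
    return latencies


def summarize(latencies):
    """Condense latencies so that occasional multi-second outliers stand out."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


# Example use against a placeholder endpoint (hypothetical URL):
#   print(summarize(time_requests("http://localhost:8080/v3/scores/", n=50)))
```

Comparing the median with the max is what separates a uniformly slow service from one that is usually fast (about 0.33 s in the snippet above) with intermittent stalls near the 10 second check timeout.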
[21:02:13] ORES, Scoring-platform-team (Current): Address icinga noise from wmflabs - https://phabricator.wikimedia.org/T231222 (Halfak) I get a similar pattern running this on the ores-web nodes directly. I don't see any CPU spikes or increases in iowait. What the heck? [21:03:14] I need to head out for the day. But I've made progress. I figured out that the issue exists even when we're making requests to localhost on the labs machines. [21:07:04] have a good one halfak [21:12:54] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [21:20:26] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [21:21:22] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1011 bytes in 0.376 second response time https://wikitech.wikimedia.org/wiki/ORES [21:36:44] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 1.212 second response time https://wikitech.wikimedia.org/wiki/ORES [21:55:48] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [21:56:52] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1007 bytes in 1.387 second response time https://wikitech.wikimedia.org/wiki/ORES [22:00:40] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [22:01:48] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 975 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/ORES [23:04:08] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/ORES [23:04:30] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [23:04:30] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [23:05:26] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 4.285 second response time https://wikitech.wikimedia.org/wiki/ORES [23:05:44] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/ORES [23:05:48] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 2.568 second response time https://wikitech.wikimedia.org/wiki/ORES [23:26:50] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [23:27:56] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 2.970 second response time https://wikitech.wikimedia.org/wiki/ORES [23:42:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES [23:44:26] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 5.913 second response time https://wikitech.wikimedia.org/wiki/ORES