[15:19:47] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:18:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:21:26] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.590 second response time
[17:18:01] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:23:07] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.915 second response time
[17:54:03] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:59:11] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.626 second response time
[18:05:08] Labs is flip-flopping
[18:07:08] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:09:37] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.604 second response time
[19:04:16] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:06:34] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 704 bytes in 0.558 second response time
[20:03:48] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:08:57] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 704 bytes in 3.068 second response time
[20:32:22] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:35:00] hmm
[20:35:00] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 706 bytes in 1.096 second response time
[20:35:02] neat,
[20:35:08] ORES is as noisy as -operations
[20:35:17] Amir1 something broke?
[20:35:29] ToAruShiroiNeko: hey, on it
[20:35:34] :D
[20:35:50] I am hopelessly under-equipped for this kind of thing
[20:37:02] nah, based on the graphs, someone is sending lots of requests
[20:48:02] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:03:43] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 703 bytes in 7.108 second response time
[21:58:45] Amir1 is it malicious?
[21:58:55] We do have a persistent idiot
[21:59:09] He finds creative ways to be an idiot
[22:24:38] o/
[22:24:41] Hey folks
[22:25:39] Our median response time still seems to be high.
[22:26:49] Oh I see we did a slight bump in capacity
[22:27:07] OK. Agenda item for tomorrow: What to do about DOS attacks (good-faith or bad)
[22:31:00] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2720674 (Halfak)
[22:31:06] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2720687 (Halfak) p:Triage>High
[22:31:33] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2720674 (Halfak) @akosiaris, I figure you'd have some ideas. What do you advise that we do about this?
[22:32:07] OK. I think this is OK for now. I'll keep my phone on me so I can be pinged about any issues that recur.
[22:32:08] o/
[22:33:13] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2720674 (Legoktm) Do we know that these are malicious attacks? Or are they consumers who are just overloading it?
[22:51:31] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:53:37] halfak you can mitigate DOS attacks; there are a few services for that for websites.
[22:53:54] They tend to rely on a massive shared cloud
[22:54:00] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 1.112 second response time
[23:52:17] ToAruShiroiNeko, I think we're likely to start enforcing user-agents with IP addresses
[23:52:24] *email addresses
[23:52:35] So that we can email people who are hitting us hard
[23:55:02] That wouldn't work at all in a DDoS
[23:55:26] Yeah. That's right.
[23:55:28] but that's a separate problem
[23:55:52] We do not use CentralAuth
[23:55:56] maybe we should
[23:56:08] We have a cloud of IO (uwsgi) and CPU (celery) workers
[23:56:12] if it's tied to a user account,
[23:56:33] then you do not even need to worry about the IP
[23:56:53] easier said than done, of course.
[23:57:14] We could allow all IPs to access at low rates and allow higher rates with OAuth
[23:57:29] what I was typing
[23:57:44] could indeed be an "emergency measure"
[23:57:47] we could have 3 levels
[23:57:57] low, medium, and high
[23:58:07] system can get more paranoid depending on the beating
[23:58:29] instead of completely denying IPs, at the medium level they would be capped
[23:58:36] 10 requests per minute or something
[23:58:52] it would be more seamless
[23:59:07] just because some troll group is being an idiot doesn't mean we should punish all IPs
[23:59:19] 3 or more levels
[23:59:25] does that make sense?
[23:59:37] We can prioritize celery
[23:59:50] We do check the queue size for every request
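
The tiered throttling idea discussed above (a low baseline rate for every IP, a higher cap for OAuth-authenticated clients, and a system that "gets more paranoid" as load grows, capping rather than banning) could look roughly like the sketch below. This is a minimal illustration, not ORES's actual code: the level thresholds, the 10-requests-per-minute medium cap, and the queue_size and is_oauth_authenticated parameters are assumptions invented for the example.

# Minimal sketch of tiered per-IP throttling with an OAuth bypass.
# NOT the ORES implementation; all thresholds and cap values are assumed.
import time
from collections import defaultdict, deque

# Requests allowed per minute at each paranoia level (assumed values).
CAPS = {
    "low":    60,   # normal operation: generous per-IP cap
    "medium": 10,   # under load: "10 requests per minute or something"
    "high":   2,    # under attack: near-total throttling for anonymous IPs
}
OAUTH_MULTIPLIER = 10  # authenticated clients get a much higher cap

_hits = defaultdict(deque)  # ip -> timestamps of requests in the last minute


def current_level(queue_size):
    """Pick a paranoia level from the worker queue depth (assumed thresholds)."""
    if queue_size > 1000:
        return "high"
    if queue_size > 100:
        return "medium"
    return "low"


def allow_request(ip, queue_size, is_oauth_authenticated=False):
    """Return True if this request should be served, False if throttled."""
    now = time.time()
    window = _hits[ip]
    # Drop timestamps that have fallen out of the one-minute window.
    while window and now - window[0] > 60:
        window.popleft()

    cap = CAPS[current_level(queue_size)]
    if is_oauth_authenticated:
        cap *= OAUTH_MULTIPLIER

    if len(window) >= cap:
        return False  # capped, not banned: the IP recovers as traffic slows
    window.append(now)
    return True

In a real deployment the counters would have to live in shared storage such as Redis rather than per-process memory, since uwsgi runs many worker processes; the in-memory dict here only keeps the sketch self-contained.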