[00:00:02] In about 7 hours I will gain complete freedom. :)
[00:00:42] I don't fully understand our backend :)
[00:06:32] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:07:03] spiking again
[00:08:57] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.626 second response time
[00:09:07] https://graphite.wikimedia.org/S/Bp
[00:09:16] It's a spike in requests to enwiki's wp10 model
[00:09:21] 1 at a time
[00:10:19] 06Revision-Scoring-As-A-Service, 10ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2720777 (10Halfak) https://graphite.wikimedia.org/S/Bp Looks like someone requesting scores for the wp10 model one at a time.
[00:29:03] OK, I think the next deployment is going to enforce the email-address-in-the-user-agent requirement when request rates get high.
[00:53:55] 06Revision-Scoring-As-A-Service, 10ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2720795 (10Legoktm) Is there contact information in their user agent? I'd just block them that way (400 or something) until we can talk to them and have them use batching, etc.
[01:24:48] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:27:18] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 1.074 second response time
[01:35:07] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:37:37] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.555 second response time
[01:38:15] Emailed ops about the issue.
[02:27:57] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:32:59] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.702 second response time
[03:15:47] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:28:11] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.101 second response time
[03:42:55] halfak: Exciting problem to have!
[03:43:10] Did you figure it out, or is it still useful for me to poke around in the weblogs?
[03:46:55] That graphite link isn't very self-explanatory -- maybe you can give me a time window to focus on?
[03:48:27] FYI: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
[03:58:44] LOL, the biggest offender resolves to ISP=Wikimedia Foundation.
[04:00:40] It doesn't seem to match the pattern of abuse you were describing, though -- these are just a handful of reasonable requests per second, apparently populating the RC scores cache.
[04:16:07] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:18:27] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 441 bytes in 0.591 second response time
[04:38:47] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:43:37] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.642 second response time
[04:51:48] D'oh. I screwed up the extremely expensive query; doing it again now.
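A minimal sketch of the email-in-user-agent enforcement floated at 00:29 and the 400-style block Legoktm suggests at 00:53, assuming a Flask-style app like ORES. The rate check, its threshold, and the error wording are hypothetical, not ORES's actual implementation:

    import re

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Matches a plausible email address anywhere in the User-Agent string.
    EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

    def request_rate_is_high():
        # Hypothetical placeholder: a real check would compare a
        # per-client counter (e.g. kept in redis) against a threshold.
        return True

    @app.before_request
    def require_contact_email_under_load():
        if not request_rate_is_high():
            return None
        user_agent = request.headers.get("User-Agent", "")
        if not EMAIL_RE.search(user_agent):
            # Reject with a 4xx, as suggested above, until the client
            # adds contact info and switches to batched requests.
            return jsonify({"error": "Request rate is high; please "
                                     "include a contact email address "
                                     "in your User-Agent header."}), 400
        return None

Returning a response from a before_request hook short-circuits normal routing, so well-behaved clients with contact info are unaffected while anonymous high-volume clients get a contactable error.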
[04:52:14] But I did learn that only 16 user IPs have hit ORES since 00:00 UTC.
[04:53:27] Here's the strange thing -- none of those jump out as egregious. Here are the counts: 25197, 23487, 20372, 14357, 1136, 433, 211, 103, 37, 8, 7, 3, 2, 2, 1, 1.
[04:54:34] Oops, okay, that query was even more wrong than I thought. I was grouping by server hostname rather than user IP.
[04:56:49] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:58:46] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:59:19] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 720 bytes in 1.586 second response time
[05:01:08] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 1.148 second response time
[05:34:06] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:36:28] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.632 second response time
[05:53:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:01:18] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 458 bytes in 1.144 second response time
[06:31:28] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:34:06] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 442 bytes in 0.605 second response time
[06:41:38] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:43:58] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.628 second response time
[07:04:26] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:48] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 441 bytes in 0.602 second response time
[07:37:58] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:42:49] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 706 bytes in 1.110 second response time
[08:07:45] 06Revision-Scoring-As-A-Service, 10ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2720674 (10awight) I pulled the IP address and created a private subtask to temporarily block or throttle this client: T148356. Since our thread here is a [discussion] and not a spi...
[08:19:17] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:21:41] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.643 second response time
[08:24:20] 06Revision-Scoring-As-A-Service, 10ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2721200 (10Ladsgroup) We talked about it in `#wikimedia-operations` in IRC. It seems it was @Daniel_Mietchen doing 142 edits per minute in Wikidata without the bot flag which is agai...
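The actual query above ran against Wikimedia's webrequest data, but the fix being described at 04:54 (group by the client IP field, not the server hostname) is easy to illustrate locally. A sketch of the same per-IP tally over a common-log-format access log; the log path is hypothetical:

    from collections import Counter

    def requests_per_ip(log_path):
        """Tally requests per client IP from an access log whose first
        whitespace-separated field is the client IP."""
        counts = Counter()
        with open(log_path) as f:
            for line in f:
                fields = line.split(None, 1)
                if fields:
                    counts[fields[0]] += 1
        return counts

    # Print the per-IP tallies in descending order, like the list above.
    for ip, n in requests_per_ip("access.log").most_common():
        print(n, ip)

Grouping by the wrong column (server hostname) produces one bucket per backend host, which is why the first counts looked plausible but meaningless.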
[17:33:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:36:36] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.661 second response time
[17:55:41] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:58:17] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 458 bytes in 0.601 second response time
[18:15:27] Amir1, did you make a figshare account?
[18:16:02] I really just need to know what last name and first initial you'd like to use. This is a professional name; it doesn't need to be a legal name.
[18:17:21] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:22:33] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.618 second response time
[18:44:06] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:46:43] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 1.383 second response time
[19:00:42] OK, looks like precached was going crazy again. Looking into that now.
[19:02:08] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.598 second response time
[19:04:12] Ahhh... that feeling when I open up the editor to work on something.
[19:09:38] 06Revision-Scoring-As-A-Service, 10ORES: Investigate memory leak in precached - https://phabricator.wikimedia.org/T146500#2723662 (10Halfak) Looks like this could be useful: https://pythonhosted.org/Pympler/muppy.html
[20:06:26] ^ Amir1
[20:06:30] Whoops, he's AFK.
[20:42:45] 06Revision-Scoring-As-A-Service, 10ORES: Investigate memory leak in precached - https://phabricator.wikimedia.org/T146500#2723857 (10Halfak) I worked out that there was a data structure that would grow slowly over time. I just submitted https://github.com/wiki-ai/ores/pull/170, which should address the issue.
[22:33:37] OK, I'm out of here. Have a good one, folks!
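A sketch of how Pympler's muppy/summary tooling (linked at 19:09) can surface a slowly growing data structure like the one fixed in https://github.com/wiki-ai/ores/pull/170. The run_precached_for_a_while() call is a hypothetical stand-in for letting the daemon do its work between snapshots:

    from pympler import muppy, summary

    # Baseline snapshot of all objects the interpreter currently tracks.
    baseline = summary.summarize(muppy.get_objects())

    run_precached_for_a_while()  # hypothetical: exercise the daemon

    # Second snapshot; the diff shows which object types grew between
    # snapshots, which is how a slow leak stands out from steady-state
    # allocation noise.
    later = summary.summarize(muppy.get_objects())
    summary.print_(summary.get_diff(baseline, later))

The diff only identifies which structure is growing; the actual fix in the PR was to stop that structure from accumulating entries indefinitely.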