[07:21:09] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[07:22:39] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/ORES
[07:57:00] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[07:58:03] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:04:13] PROBLEM - ORES web node labs ores-web-06 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:05:49] RECOVERY - ORES web node labs ores-web-06 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 7.675 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:06:55] PROBLEM - ORES web node labs ores-web-04 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:08:51] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:12:03] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:13:15] RECOVERY - ORES web node labs ores-web-04 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:16:47] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 0.561 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:17:07] PROBLEM - ORES web node labs ores-web-06 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:18:23] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 977 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:21:57] RECOVERY - ORES web node labs ores-web-06 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1007 bytes in 7.988 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:24:43] PROBLEM - ORES web node labs ores-web-04 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:26:37] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:28:05] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:29:33] RECOVERY - ORES web node labs ores-web-04 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 5.501 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:36:37] PROBLEM - ORES web node labs ores-web-06 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:38:11] RECOVERY - ORES web node labs ores-web-06 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 7.439 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:39:37] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:39:37] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:40:59] PROBLEM - ORES web node labs ores-web-04 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:42:29] RECOVERY - ORES web node labs ores-web-04 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 979 bytes in 1.111 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:42:53] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 8.434 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:44:21] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1011 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:49:21] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:52:35] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[09:55:49] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 7.938 second response time https://wikitech.wikimedia.org/wiki/ORES
[09:55:49] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 7.940 second response time https://wikitech.wikimedia.org/wiki/ORES
[10:03:41] PROBLEM - ORES web node labs ores-web-04 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[10:06:55] RECOVERY - ORES web node labs ores-web-04 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 975 bytes in 7.328 second response time https://wikitech.wikimedia.org/wiki/ORES
[10:42:35] 10Jade, 10SpamBlacklist, 10Wikimedia-production-error (Shared Build Failure), 10ci-test-error: Jade unit test fails after api changes in SpamBlacklist - https://phabricator.wikimedia.org/T234609 (10Umherirrender)
[13:16:40] Well. That's a lot of stuff.
[13:24:05] I don't see a lot of memory pressure on our web workers.
[13:25:53] Looks like we're seeing a lot of requests right now.
[13:28:03] It's all for enwiki damaging and goodfaith.
[13:29:58] 10Jade: Update Jade for changes in AbuseFilter and SpamBlacklist API error responses - https://phabricator.wikimedia.org/T232684 (10matmarex)
[13:30:01] 10Jade, 10SpamBlacklist, 10Wikimedia-production-error (Shared Build Failure), 10ci-test-error: Jade unit test fails after api changes in SpamBlacklist - https://phabricator.wikimedia.org/T234609 (10matmarex)
[13:36:10] 10Jade, 10Wikimedia-production-error (Shared Build Failure), 10ci-test-error: Update Jade for changes in AbuseFilter and SpamBlacklist API error responses - https://phabricator.wikimedia.org/T232684 (10Umherirrender)
[13:46:00] I'm digging into the request logs.
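Editor's note: a minimal sketch of the kind of log digging described above, tallying which wikis and model parameters dominate the traffic. The log path, the nginx access-log format, and the /v3/scores/... path shape are assumptions, not taken from the actual ORES load balancer configuration.

```python
# Hypothetical tally of what is dominating a burst of score requests.
# LOG path and the regexes are placeholders; adjust to the real access log.
import re
from collections import Counter

LOG = "/var/log/nginx/access.log"               # placeholder path
req_re = re.compile(r'"GET /v3/scores/([^/ ]+)/[^" ]* HTTP')
models_re = re.compile(r'[?&]models=([^&" ]+)')

by_wiki, by_models = Counter(), Counter()
with open(LOG) as log:
    for line in log:
        m = req_re.search(line)
        if not m:
            continue
        by_wiki[m.group(1)] += 1                 # e.g. "enwiki"
        mm = models_re.search(line)
        if mm:
            by_models[mm.group(1)] += 1          # e.g. "damaging|goodfaith"

print(by_wiki.most_common(5))
print(by_models.most_common(5))
```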
[13:56:25] PROBLEM - ORES web node labs ores-web-04 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[14:00:36] For some reason, our backpressure isn't working again.
[14:02:49] RECOVERY - ORES web node labs ores-web-04 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 6.638 second response time https://wikitech.wikimedia.org/wiki/ORES
[14:03:03] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[14:03:19] PROBLEM - ORES web node labs ores-web-06 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[14:03:37] Curses!
[14:07:49] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 2.215 second response time https://wikitech.wikimedia.org/wiki/ORES
[14:08:07] RECOVERY - ORES web node labs ores-web-06 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 5.170 second response time https://wikitech.wikimedia.org/wiki/ORES
[14:20:25] So our celery workers are *pinned* but our uwsgi workers are doing very little.
[14:21:27] No individual worker processes are getting 100% of CPU so I don't think we've got a regex out of control situation. I think that someone is just sending a ton of requests.
[14:21:35] I wonder if our rate limiting isn't working.
[14:24:53] Right now, I'm not showing the queue having any pending requests for celery.
[14:25:05] This is our primary means of implementing backpressure.
[14:29:05] So we're getting a ton of requests, our celery hosts are maxed out, but the queue of pending celery tasks is empty. The only way I could understand this is if the tasks are passed onto individual workers and they are maintaining their own queue somehow.
[14:29:15] So a bunch of scoring tasks are backed up somewhere we don't expect.
[14:29:40] If this is true, web requests that only touch uwsgi should be fast.
[14:29:42] * halfak checks
[14:30:08] Yup. uwsgi is very very fact.
[14:30:13] *fast
[14:31:17] Oh crap. Celery worker CPU just recovered. So my tests don't really mean anything now.
[14:31:34] * halfak waits for the next icinga storm.
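Editor's note: the "workers maintaining their own queue" hypothesis matches celery's prefetching behaviour, where workers reserve tasks off the broker ahead of execution, so the broker queue can look empty while work is still backed up on the workers. A minimal sketch of how one might check this, assuming a Redis broker on localhost and the default "celery" queue name (both assumptions, not the actual labs configuration):

```python
# Sketch: distinguish "tasks waiting in the broker queue" from "tasks already
# prefetched (reserved) by individual celery workers".
# Broker URL and queue name are placeholders, not the real ORES settings.
import redis
from celery import Celery

app = Celery(broker="redis://localhost:6379/0")
r = redis.StrictRedis(host="localhost", port=6379, db=0)

print("waiting in broker queue:", r.llen("celery"))

inspect = app.control.inspect()
for worker, tasks in (inspect.reserved() or {}).items():
    print(worker, "has prefetched", len(tasks), "tasks")   # the hidden backlog
for worker, tasks in (inspect.active() or {}).items():
    print(worker, "is executing", len(tasks), "tasks")
```

If the reserved counts turn out to be large, lowering the prefetch multiplier (worker_prefetch_multiplier in newer celery, CELERYD_PREFETCH_MULTIPLIER in older configs) would keep the backlog visible in the broker queue where a queue-length backpressure check can see it.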
[14:56:12] halfak: did you get my pm yesterday?
[14:56:39] Hey Zppix
[14:56:48] Just saw it. It was past the scroll in my client :|
[14:56:55] I'll check icinga2 quick.
[14:56:58] halfak: ah no worries :)
[14:57:10] halfak: you're a busy person, my question isn't a top priority
[14:57:26] No sweat. Should be quick :)
[14:58:02] Hmm. Looks like I can't access it.
[14:58:12] I have two passwords stored and neither seems to work.
[14:58:15] halfak: okay, I'll reset your password and PM it to you
[14:58:22] Thanks
[15:19:41] halfak: o/
[15:19:53] Hi akosiaris!
[15:20:12] btw, have you seen https://phabricator.wikimedia.org/T233831#5525386 ? Are you ok with my (well, actually your) suggestion?
[15:21:59] wikimedia/revscoring#1734 (session_orientation - a243030 : halfak): The build was broken. https://travis-ci.org/wikimedia/revscoring/builds/593600419
[15:26:38] Oooh. That's a big writeup
[15:26:41] * halfak reads
[15:27:13] you can skip down to the "What we can do" part
[15:27:38] the rest is essentially an incident response doc
[15:30:10] akosiaris, just finished. Thank you for the awesome analysis. This all makes sense now. I agree with your proposal to turn off persistence entirely -- in labs. Turning it off in prod seems like it could be more complicated, though I'm interested in thinking in that direction if we see any issues there.
[15:30:24] I'm guessing prod may rarely/never see an issue with IO because we have more control of the host.
[15:30:56] We rarely restart redis in prod, and when we do, it would be best if things resume as expected.
[15:35:44] we rarely do indeed, however there is a weird tradeoff. Although we don't have IO contention issues in production, the sheer size of the AOF (it's 6GB) means that on every restart we aren't able to answer requests, as redis won't serve requests until it has loaded the entirety of the file. So, depending on how quickly we could warm up a cold cache, it might make sense or not to resume from the AOF (it's all about which of the two is faster)
[15:36:43] but I am also thinking we should do it just in labs first
[15:37:22] evaluate, perhaps get some numbers from production during some scheduled maintenance, see what they tell us
[15:39:12] Sounds good to me.
[15:39:29] Man, it's so awesome to have you look at this stuff and think it through. Thanks akosiaris :)
[15:40:30] yw, thanks for reaching out. It was refreshing to look at that incident and figure out what was going on.
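Editor's note: for reference, a minimal sketch of what "turn off persistence entirely" could look like from the redis side. The hostname is a placeholder, and the real change would presumably land in the puppet-managed redis.conf rather than an ad-hoc runtime command.

```python
# Sketch: inspect and disable AOF persistence on a redis instance.
# "ores-redis.example" is a placeholder host, not the actual labs redis.
import redis

r = redis.StrictRedis(host="ores-redis.example", port=6379)

print(r.config_get("appendonly"))             # {'appendonly': 'yes'} if AOF is on
print(r.info("persistence")["aof_enabled"])   # 1 if the server is using an AOF

# Runtime toggle; it does not survive a restart unless redis.conf is updated too.
r.config_set("appendonly", "no")
```

With appendonly off, a restart comes back empty but serving immediately, instead of blocking while it replays a multi-gigabyte AOF -- which is exactly the "which of the two is faster" tradeoff described above.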
[16:04:31] accraze, it works. try this: meet.google.com/xxi-usgu-sfa
[16:04:40] kevinbazira, ^
[16:07:18] Thanks halfak
[17:37:40] Traveling to the U. Back in a bit.
[18:35:10] Whoops. Forgot to reconnect
[18:35:23] halfak: I noticed you had trouble getting UA?
[18:35:34] Right.
[18:35:42] halfak: how are you trying to get it?
[18:35:50] I want to get the UA of a big burst of requests we saw in WMFLabs.
[18:36:19] halfak: how are you trying to get the UA?
[18:37:11] I've tried a few methods. But the biggest limiter is that our ~90 days of web request logs from varnish don't include WMFLabs, and in WMFLabs there's a proxy that strips IP and UA before our servers see the request.
[18:38:05] halfak: out of curiosity, have you tried looking at the logs of whatever webserver you use (apache/nginx)?
[18:38:51] yeah. The load balancer uses nginx.
[18:39:22] halfak: would logstash have anything?
[18:53:24] Zppix, that's a good question.
[18:53:32] Do you know where to find logstash for labs?
[18:56:25] halfak: I don't, but I know it exists
[18:57:33] halfak: try https://logstash-beta.wmflabs.org/
[18:58:21] anyway bbiab
[19:17:07] Looks like that's only for beta.
[19:36:15] halfak: have you thought about rate-limiting from the load balancer?
[19:37:43] Zppix, I need an IP to ratelimit.
[19:37:43] :(
[19:38:53] halfak: why not just set a rate limit on how many requests can come from one source?
[19:38:58] per x time
[19:39:08] How do I know what request comes from what source?
[19:39:40] I wonder if you could set a header that's unique per connected device
[19:45:12] Regretfully, we don't have any sort of handshake so we don't get to demand anything like that.
[19:45:30] I guess we could request an email address in the UA but I wouldn't want requests from people's browsers to fail.
[19:48:40] halfak: you can't set a header that gets set on the first connection from the device/
[19:48:45] ?*
[19:48:56] I mean, we could set a cookie.
[19:49:15] But the worse behaving clients won't keep cookies.
[19:49:18] *worst
[19:49:33] So we'd only be policing clients that are well-behaved.
[19:52:17] halfak: Since WM isn't in the EU, can't you force cookies to be used? since the USA doesn't have GDPR
[19:52:51] Actually, GDPR is kind of the other way. It doesn't force cookies. It forces a notification about the use of cookies -- if cookies are used.
[19:53:04] The browser can just refuse to accept cookies.
[19:54:02] halfak: that's what I mean: since we don't have GDPR we don't have to do the notification per se, and can safely force cookies or not allow the connection?
[19:54:37] Ahh. yeah. Well we do still need to notify when we serve to EU audiences.
[19:54:50] But yeah, I think disallowing in the case of cookies not working won't work.
[19:55:11] Because we need to allow that initial connection and we don't know if the client will respect cookies or not at that point.
[19:55:27] Any followup request could be a brand new client connecting for the 1st time.
[19:57:39] * halfak starts pulling his hair out for something totally unrelated.
[19:57:52] halfak: I hate proxies like this :P
[20:00:31] accraze, could I ask you to rubber-duck something quick?
[20:00:33] https://phabricator.wikimedia.org/P9242
[20:00:54] I'm stuck on the above error.
[20:01:14] To me, the import I do next from python should demonstrate that the error makes no sense.
[20:09:29] hmmm
[20:11:30] halfak: might be related to how the code is being initialized, is your branch current on github?
[20:11:48] It is. I'm checking that travis fails in the same way.
[20:12:10] Yup.
[20:12:12] Confirmed.
[20:14:41] I ran into something similar on one of the other repos and eventually gave up and added it to nitpick_ignore in the docs/conf.py file
[20:14:52] Damn.
[20:15:08] That's an important link
[20:15:11] but this seems like it should work
[20:15:32] Right! Well thanks for looking with me. At least I know it's probably the compiler
[20:16:27] https://blog.plover.com/prog/compiler-error.html -- this doesn't consider a mess like sphinx ;)
[20:17:07] lol
[20:17:19] have you tried force-building the docs to see if it builds correctly? I remember seeing some stuff about "bogus" nitpick warnings
[20:18:35] Hmm. I have been running sphinx-build locally.
[20:18:46] But the same command as travis
[20:18:57] sphinx-build -anW -b html docs dist/docs
[20:19:02] Maybe you have something else in mind?
[20:19:56] Maybe try removing the "-anW" and see if it generates the correct docs
[20:20:08] Aha! kk will try
[20:20:53] Sure enough, the link is broken.
[20:21:20] Aha! dist/docs/revscoring.datasources.session_oriented.html doesn't exist
[20:21:28] Let me just see why THAT is.
[20:21:44] * halfak is an idiot
[20:21:46] \o/
[20:22:55] I forgot to create an .rst file that would pull in the docstring for session_oriented
[20:23:04] LOL
[20:23:31] wow, well at least we didn't go down the compiler route
[20:23:53] lol
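Editor's note: the missing piece described here is typically a one-page rst stub that Sphinx's autodoc pulls the module docstring into. A sketch of what that file might look like, assuming the docs/ layout implied by the sphinx-build command above; the exact filename and options are guesses, not the actual fix that was committed.

```rst
.. hypothetical file: docs/revscoring.datasources.session_oriented.rst

revscoring.datasources.session_oriented
=======================================

.. automodule:: revscoring.datasources.session_oriented
    :members:
```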
[20:34:16] I dislike the term "memoization"
[20:34:47] It really seems like it should just be called "memorization" but there was a typo at some critical point in history
[20:41:55] * halfak rolls the dice again.
[20:45:17] wikimedia/revscoring#1744 (session_orientation - 66b5e35 : halfak): The build was fixed. https://travis-ci.org/wikimedia/revscoring/builds/593728876
[20:47:41] You're damn right.
[20:48:01] I kinda wish the travis-ci bot would just stay in the channel; I find the join/part annoying... okay rant over
[20:56:04] 10Scoring-platform-team (Current): Refactor revscoring to handle session-orientation - https://phabricator.wikimedia.org/T231214 (10Halfak) I'm way overdue in updating this task. So I have a PR: https://github.com/wikimedia/revscoring/pull/450 I've taken two major steps. 1. Write a function called list_of_...
[20:57:18] 10Scoring-platform-team (Current): Refactor revscoring to handle session-orientation - https://phabricator.wikimedia.org/T231214 (10Halfak) As of right now, `revscoring.datasources.session_oriented` is complete and works as expected. `revscoring.features.bytes.session` is complete and works as expected. I'm ho...
[20:57:21] * halfak continues updating the task.
[20:57:30] Zppix, agreed.
[20:57:56] 10Scoring-platform-team (Current), 10revscoring, 10artificial-intelligence: Refactor revscoring to handle session-orientation - https://phabricator.wikimedia.org/T231214 (10Halfak)
[20:58:10] Alright! https://phabricator.wikimedia.org/T231214 is now ready for review and waiting in the Review column.
[20:58:21] I added a bunch of notes to the task on my progress and what needs review.
[20:58:41] accraze, ^ for next week (unless you're excited for a break from Jade) :D
[20:58:54] * halfak runs to his final meeting of the day
[20:59:35] awesome, will take a look next week!
[22:11:34] * halfak --> weekend
[22:11:37] take care folks!
[22:13:38] later halfak