[17:45:06] 06Revision-Scoring-As-A-Service, 10DBA, 06Labs, 10Labs-Infrastructure, 10MediaWiki-extensions-ORES: Replicate ores_classification and ores_model on labs - https://phabricator.wikimedia.org/T148561#2726255 (10Ladsgroup)
[17:45:30] 06Revision-Scoring-As-A-Service, 10DBA, 06Labs, 10Labs-Infrastructure, 10MediaWiki-extensions-ORES: Replicate ores_classification and ores_model tables in labs - https://phabricator.wikimedia.org/T148561#2726270 (10Ladsgroup)
[17:57:40] halfak: /o\
[17:58:10] That was silly--feel free to ping me if you're still investigating the wp10 spikes
[17:58:23] awight: hey, do you have a minute to review https://gerrit.wikimedia.org/r/#/c/315661/ ?
[17:58:41] super straightforward and still no one is reviewing it :(((
[17:59:45] oh, it needs a rebase now
[18:01:53] Amir1: hi! sure thing
[18:02:34] awight: I'll rebase it now
[18:03:04] Amir1: Did you figure out the load issue? Following the task didn't quite make that clear.
[18:04:07] Per our discussion in the weekly meeting, we came to the conclusion that it was a mixture of a huge load caused by a human-like bot on Wikidata and a spike of requests
[18:04:29] we have such spikes all the time and we handle them gracefully
[18:04:44] but this time we had the back pressure from Wikidata too
[18:05:11] Was it because the high-load failure behavior is to create a backlog of requests rather than drop some?
[18:08:13] (03PS2) 10Ladsgroup: Extensive CI tests, part II [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560)
[18:10:50] I'm not sure I understood you correctly, but by "mixture" I mean the queue was almost full for several hours because of the bot edits, and the spike completely crashed the queue
[18:12:08] awight: It's ready now: https://gerrit.wikimedia.org/r/#/c/315661
[18:18:26] Holding the newborn for a bit, but I'll be able to merge that in a few hours. They look great!
[18:23:13] Have fun :)
[18:42:04] o/ awight|afk
[18:42:19] No worries dude.
I was really glad to have you pick up the task. I really wasn't sure what to do.
[18:44:43] (03CR) 10Ladsgroup: "PS2 is rebase only" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[18:45:05] o/ Amir1
[18:45:19] hey halfak!
[18:46:00] Just starting work for the day. Anything you need from me before I dig into the wikiclass --> revscoring 1.3.0 work?
[18:46:35] I'm thinking that's going to be quick, and then I'll look into what it will take to implement my prioritization proposal for ORES
[18:47:04] halfak: oh yeah, I thought about it
[18:47:32] we should determine the thresholds by checking our queue in different modes
[18:47:42] busy time, quiet time, etc.
[18:57:49] +1 Amir1
[18:57:56] Thresholds will be in config
[18:58:01] As will the whitelist
[19:29:23] I have an interest in understanding the job queue stuff, too... Maybe loop me in the next time you chat, or point me to docs?
[19:29:44] awight, are you referring to the MediaWiki job queue?
[19:29:44] Fundraising Tech is exclusively a queue-driven shop
[19:29:58] We also have a job queue inside of ORES
[19:30:07] huh. Which one crashed?
[19:30:08] It's the primary way that we check for being overloaded.
[19:30:29] Ahh yeah. The ORES queue is the one that Amir1 was referencing
[19:30:42] I'm mainly curious because fr-tech needs better queue failure handling, so I'd like to learn more best practices.
[19:30:56] Gotcha. Not sure on best practices, honestly.
[19:30:57] Plus maybe there would be some cross-pollination
[19:31:00] me neither
[19:31:09] We've really hacked together our strategy based on Yuvi's intuitions
[19:31:28] Essentially, we check the queue size every time that a request comes in and decide whether or not to serve that request
[19:31:39] If the queue is above capacity, we respond with a 503, overloaded.
[19:31:44] Wow, not too expensive a check though
[19:31:57] This is a Redis list?
[19:31:58] Na.
We're using a celery queue. It's essentially free.
[19:32:02] *Redis queue
[19:32:08] Yeah, a Redis list
[19:32:18] Compared to the time we spend serving a request
[19:32:32] We don't check it in a safe way, so the thresholding is kind of loose.
[19:32:39] yeah, it's within the data center, might as well be on the same computer
[19:32:44] Sometimes we get a few extra requests in the queue.
[19:33:01] But all in all, it seems to work OK.
[19:33:04] you could also have a shared semaphore that gets set by a watchdog process
[19:33:34] Yeah. That'd be a bit faster, I think.
[19:33:47] Another benefit would be to isolate the jobs from queue details
[19:34:03] can_process_now
[19:34:39] That process would probably look for a 90% full condition, then email someone when the alarm goes off.
[19:34:56] Right now, the jobs don't know about the queue. But the thing that starts jobs does. It'll error if the conditions aren't right to start a job.
[19:34:57] or... I guess the alert should still be its own, third component.
[19:35:02] ah okay
[19:35:05] are there docs?
[19:35:27] Not a sentence of English :D
[19:35:31] bahaha
[19:35:44] fwiw, this is how bad we have it in fr-tech: https://www.mediawiki.org/wiki/Fundraising_tech/Message_queues
[19:35:53] https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/celery_queue.py#L37
[19:35:57] "queue_maxsize"
[19:36:05] Just finished two months of killing ActiveMQ and overhauling queues though, we'll be rewriting the docs soon.
[19:36:11] https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/celery_queue.py#L209
[19:36:29] ty
[19:36:53] I'll be refactoring this to support tiers
[19:37:14] The plan is to have all requests processed when the queue is empty or has a negligible count.
[19:37:34] For moderate load, we'll only let requests through that contain an email address in the user agent.
[19:37:48] And for heavy loads, only a whitelisted set of user agents will be allowed.
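[Editor's note] The overload check and tier plan discussed above could be sketched roughly as below. This is a hypothetical illustration, not the actual ORES implementation: `queue_maxsize` is the only name taken from the real `celery_queue.py`; the threshold fractions, whitelist entries, and function names are all invented for the sketch.

```python
import re

# Illustrative values; in ORES these would come from the scoring_system config.
QUEUE_MAXSIZE = 100   # hard cap on pending tasks ("queue_maxsize")
MODERATE_LOAD = 0.5   # above this fraction, require a contact email in the UA
HEAVY_LOAD = 0.9      # above this fraction, serve whitelisted user agents only

# Hypothetical whitelist: requests from MediaWiki, Change Propagation,
# or the precaching system (public info, not a security measure).
UA_WHITELIST = {"mediawiki", "changeprop", "ores-precache"}

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def may_enqueue(queue_size, user_agent):
    """Decide whether a request may enter the queue.

    Returns True to accept, False to reject with a 503 (overloaded).
    `queue_size` would really be read from the Redis list backing celery;
    as noted above, the check isn't atomic, so the threshold is loose.
    """
    if queue_size >= QUEUE_MAXSIZE:
        return False  # completely overloaded: reject everything
    load = queue_size / QUEUE_MAXSIZE
    if load >= HEAVY_LOAD:
        return user_agent.lower() in UA_WHITELIST
    if load >= MODERATE_LOAD:
        # Reward clients who identify themselves with a contact address.
        return bool(EMAIL_RE.search(user_agent))
    return True  # negligible load: everyone gets served
```

The real check happens in the component that starts jobs (the jobs themselves never see the queue), which matches the "isolate jobs from queue details" point above.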
[19:38:11] The whitelist will be public info (public repo), so it's not intended to protect against malicious attacks.
[19:39:23] Oh, that's rad
[19:39:24] The whitelist will be requests directly from MediaWiki, Change Propagation, or our precaching system.
[19:39:47] * halfak wants to not exclude people who don't read the rules, and to reward those who do.
[19:40:07] Although we honestly don't have the "have an email address in your UA" advertised.
[19:42:00] * awight thwacks users on the knuckles
[19:43:00] :D
[19:43:23] mebbe you could also do something with the RC load...
[19:43:32] RC load?
[19:43:49] Precaching scores for the recent changes, I mean
[19:44:12] Oh yeah. That would get highest priority. When that is running, everyone gets better performance.
[19:44:24] Or are you thinking that should get lower priority?
[19:44:32] huh. right, I was imagining the opposite, but you're right
[19:44:52] If a request gets a score from the cache, it won't even see our queue.
[19:44:58] So that's advantageous.
[19:45:30] Do you have a cache stampede issue yet? Where lots of clients request scores for the latest change, and you get a race condition where nobody is served from cache?
[19:45:50] awight, just so long as the requests come in ~50ms apart, we're good.
[19:45:59] hehe ok
[19:46:05] We keep the precache system fast and close to try to stay ahead of that.
[19:46:09] That's a real nice response time
[19:46:15] As it stands, we get about 50% cache hits
[19:46:33] however, I was seeing TTFB of >1 second when reading the Hive webrequest table
[19:46:56] ^s^
[19:47:03] Oh! Also, I should say that it's not really cache hits. Nearly simultaneous hits will get references to the same job and will wait for it to finish.
[19:47:12] ideal!
[19:47:43] I have to bend over backwards to get celery to do this, but it's worth it.
[19:48:05] baby is making me something special...
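[Editor's note] The request-coalescing behavior halfak describes (nearly simultaneous requests sharing one in-flight job instead of each recomputing the score) can be sketched without celery, using a lock and a table of in-flight futures. All names here are hypothetical; the real implementation lives in `ores/scoring_systems/celery_queue.py` and uses celery task IDs rather than a local executor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CoalescingScorer:
    """Deduplicate concurrent score requests for the same revision."""

    def __init__(self, score_fn):
        self._score_fn = score_fn      # the expensive scoring function
        self._in_flight = {}           # rev_id -> Future for a running job
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=4)

    def score(self, rev_id):
        with self._lock:
            future = self._in_flight.get(rev_id)
            if future is None:
                # First requester: start the job and record it so that
                # near-simultaneous requests attach to the same future.
                future = self._pool.submit(self._run, rev_id)
                self._in_flight[rev_id] = future
        # Everyone (first requester included) waits on the same future.
        return future.result()

    def _run(self, rev_id):
        try:
            return self._score_fn(rev_id)
        finally:
            # Once the score is computed, later requests should hit the
            # score cache instead, so drop the in-flight entry.
            with self._lock:
                self._in_flight.pop(rev_id, None)
```

In celery terms, the analogous trick is routing duplicate requests to an already-dispatched task's result rather than submitting a second task, which is presumably the "bend over backwards" part mentioned above.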
[19:48:14] \o/ presents
[19:48:18] ;)
[19:54:01] * halfak observes the cross-validation of the updated enwiki wp10 model
[20:04:57] 06Revision-Scoring-As-A-Service, 10revscoring, 03Research-and-Data-2016-17-Q2: Implement PCFG features for editquality - https://phabricator.wikimedia.org/T144636#2726842 (10Halfak)
[20:05:24] 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality , 03Research-and-Data-2016-17-Q2: Build draft quality model (spam, vandalism, attack, or OK) - https://phabricator.wikimedia.org/T148038#2726843 (10Halfak)
[20:06:04] 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality , 03Research-and-Data-2016-17-Q2: Build feature set for draft quality model - https://phabricator.wikimedia.org/T148580#2726845 (10Halfak)
[20:06:36] 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality : Extract features for deleted page (draft quality model) - https://phabricator.wikimedia.org/T148581#2726858 (10Halfak)
[20:06:49] 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality : Build feature set for draft quality model - https://phabricator.wikimedia.org/T148580#2726845 (10Halfak)
[20:47:18] (03CR) 10Awight: Extensive CI tests, part II (0310 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[20:48:10] (03CR) 10Awight: [C: 04-1] "Some cleanup suggestions inlined in PS1... It'll be wonderful to have these tests!" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[20:48:40] Amir1:
[20:49:14] Amir1: Only one of those comments was a blocker, the one about getScoreRecentChangesList not taking a second parameter yet.
[20:49:36] I'd be happy to merge with that fix, but thought you might want the chance to clean up other stuff as well. Just lmk!
[20:55:27] (03PS1) 10Awight: Soften dependency on the BetaFeatures extension [extensions/ORES] - 10https://gerrit.wikimedia.org/r/316701
[20:56:59] (03CR) 10Awight: "CI is passing so this must be a problem with my setup, but fyi I get 9 failures when trying to run the tests in --group ORES." [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[21:01:22] halfak: Amir1: Would it be useful if I wrote up what I understand of the work queueing and RC precaching stuff?
[21:01:42] & should that be on mediawikiwiki or meta?
[21:26:42] awight, yes. I think mediawiki
[21:26:53] Maybe somewhere near change propagation or the mw work queue stuff?
[21:28:03] I don't see anything like that--unless you're talking about https://www.mediawiki.org/wiki/Extension:ORES#Extension_workflow
[21:28:39] (I'm checking https://www.mediawiki.org/wiki/Category:ORES )
[21:36:44] halfak: ^
[21:39:36] awight, https://www.mediawiki.org/wiki/Change_propagation ?
[21:39:56] https://www.mediawiki.org/wiki/Manual:Job_queue
[21:41:18] ooh thx, I thought you meant there was a description of ORES integrations already
[21:43:10] (sorry, in a meeting, will be more active here in a bit)
[21:44:15] * halfak doesn't intend to be curt
[21:44:47] haha it's a huge help, don't apologize
[22:25:10] Hmm, I don't see where the new change-propagation config went--oh well.
[22:31:01] awight, what config are you looking for?
[22:34:23] it's very minor--I was just chasing down https://gerrit.wikimedia.org/r/#/c/295576/6/modules/changeprop/templates/config.yaml.erb
[22:34:38] It seems that file has been removed, with the commit message "moving to scap 3 for deployment"
[22:35:33] eeek
[22:35:40] I guess releng will know then
[22:36:04] * awight assumes it still works
[22:36:51] 06Revision-Scoring-As-A-Service, 10ORES: Implement prioritization of request processing - https://phabricator.wikimedia.org/T148594#2727278 (10Halfak)
[22:37:06] btw, https://www.mediawiki.org/wiki/Extension:ORES/Components -- nothing worth looking at yet, but just so you have the URL... and can stop me from being redundant.
[22:38:46] We'll want to move that out of "Extension" at some point, but I appreciate the start.
[22:39:12] We should have https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service and https://wikitech.wikimedia.org/wiki/ORES cross-linked
[22:39:35] will do
[22:40:00] Seems we need categories on the wikitech pages
[22:40:04] Indeed
[22:45:46] 06Revision-Scoring-As-A-Service, 10ORES: Implement prioritization of request processing - https://phabricator.wikimedia.org/T148594#2727313 (10Halfak) I'm thinking of extending the scoring_system config with a block for matching user-agents. ``` queue_maxsize: 100 # pending tasks queue_thresholds:...
[22:47:19] Where is the production config for the ores backend? Does it exist in a repo?
[22:50:45] I see there aren't tests for the python ores backend--is that component too trivial to test? I guess `revscoring` is the workhorse?
[23:12:18] awight, yeah revscoring is the workhorse, but ORES backend could be tested a little more.
[23:12:39] We have tests for the basic set of internal components, but not the web API.
[23:12:56] https://phabricator.wikimedia.org/diffusion/1880/browse/master/config/
[23:13:06] There are also some configs in a private repo
[23:14:46] These are the most critical tests: https://github.com/wiki-ai/ores/tree/master/ores/scoring_systems/tests
[23:15:33] But we have other tests too. E.g. https://github.com/wiki-ai/ores/tree/master/ores/metrics_collectors/tests, https://github.com/wiki-ai/ores/tree/master/ores/score_caches/tests, and https://github.com/wiki-ai/ores/tree/master/ores/tests
[23:15:58] The hardest things to test are those that actually use redis or the MW api.
[23:16:09] But those are pretty narrow slices.