[17:45:06] <wikibugs>	 06Revision-Scoring-As-A-Service, 10DBA, 06Labs, 10Labs-Infrastructure, 10MediaWiki-extensions-ORES: Replicate ores_classification and ores_model on labs - https://phabricator.wikimedia.org/T148561#2726255 (10Ladsgroup)
[17:45:30] <wikibugs>	 06Revision-Scoring-As-A-Service, 10DBA, 06Labs, 10Labs-Infrastructure, 10MediaWiki-extensions-ORES: Replicate ores_classification and ores_model tables in labs - https://phabricator.wikimedia.org/T148561#2726270 (10Ladsgroup)
[17:57:40] <awight>	 halfak: /o\
[17:58:10] <awight>	 That was silly--feel free to ping me if you're still investigating the wp10 spikes
[17:58:23] <Amir1>	 awight: hey, do you have a minute to review https://gerrit.wikimedia.org/r/#/c/315661/ ?
[17:58:41] <Amir1>	 super straightforward and still no one is reviewing it :(((
[17:59:45] <Amir1>	 oh it needs rebase now
[18:01:53] <awight>	 Amir1: hi!  sure thing
[18:02:34] <Amir1>	 awight: I rebase it now
[18:03:04] <awight>	 Amir1: Did you figure out the load issue?  Following task didn't quite make that clear.
[18:04:07] <Amir1>	 Per our discussion in the weekly meeting we came to this conclusion that it was a mixture of a huge load caused by human-like bot in wikidata and a spike of requests 
[18:04:29] <Amir1>	 we have such spikes all the time and we handle it gracefully 
[18:04:44] <Amir1>	 but this time we had the back pressure from wikidata too 
[18:05:11] <awight>	 Was it because the high load failure behavior is to create a backlog of requests rather than drop some?
[18:08:13] <grrrit-wm>	 (03PS2) 10Ladsgroup: Extensive CI tests, part II [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) 
[18:10:50] <Amir1>	 I'm not sure I understood you correctly but by "mixture" I mean the queue was almost full for several hours because of the bot edits and the spike did completely crashed the queue 
[18:12:08] <Amir1>	 awight: It's ready now: https://gerrit.wikimedia.org/r/#/c/315661
[18:18:26] <awight>	 Holding newborn for bit, but I'll be able to merge that in a few hours.  They look great!
[18:23:13] <Amir1>	 Have fun :)
[18:42:04] <halfak>	 o/ awight|afk 
[18:42:19] <halfak>	 No worries dude.  I was really glad to have you pick up the task.  I really wasn't sure what to do. 
[18:44:43] <grrrit-wm>	 (03CR) 10Ladsgroup: "PS2 is rebase only" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[18:45:05] <halfak>	 o/ Amir1 
[18:45:19] <Amir1>	 hey halfak!
[18:46:00] <halfak>	 Just starting work for the day.  Anything you need from me before I dig into the wikiclass --> revscoring 1.3.0 work?
[18:46:35] <halfak>	 I'm thinking that's going to be quick and then I'll look into what it will take to implement my prioritization proposal for ORES
[18:47:04] <Amir1>	 halfak: oh yeah, I thought about it
[18:47:32] <Amir1>	 we should determine the thresholds by checking our queue in different modes
[18:47:42] <Amir1>	 busy time, quiet time, etc
[18:57:49] <halfak>	 +1 Amir1 
[18:57:56] <halfak>	 Thresholds will be on config
[18:58:01] <halfak>	 As will the whitelist
[19:29:23] <awight>	 I have an interest in understanding the job queue stuff, too...  Maybe loop me in the next time you chat, or point me to docs?
[19:29:44] <halfak>	 awight, is referring to the MediaWiki job queue?
[19:29:44] <awight>	 Fundraising Tech is exclusively a queue-driven shop
[19:29:58] <halfak>	 We also have a job queue inside of ORES
[19:30:07] <awight>	 huh.  Which one crashed?
[19:30:08] <halfak>	 It's the primary way that we check for being overloaded. 
[19:30:29] <halfak>	 Ahh yeah.  The the ORES queue is the one that Amir1 was referencing
[19:30:42] <awight>	 I'm mainly curious because fr-tech needs better queue failure handling, so I'd like to learn more best practices.
[19:30:56] <halfak>	 Gotcha.  Not sure on best practices honestly. 
[19:30:57] <awight>	 Plus maybe there would be some cross-pollination
[19:31:00] <awight>	 me neither
[19:31:09] <halfak>	 We've really hacked together our strategy based on Yuvi's intuitions
[19:31:28] <halfak>	 Essentially, we check the queue size every time that a request comes in and decide whether or not to serve that request
[19:31:39] <halfak>	 If the queue is above capacity, we respond with a 503, overloaded. 
[19:31:44] <awight>	 Wow, not too expensive a check though
[19:31:57] <awight>	 This is a Redis list?
[19:31:58] <halfak>	 Na.  We're using a celery queue.  It's essentially free. 
[19:32:02] <halfak>	 *Redis queue
[19:32:08] <halfak>	 Yeah a redis list
[19:32:18] <halfak>	 Compared to the time we spend serving a request
[19:32:32] <halfak>	 We don't check it in a safe way, so the thresholding is kind of loose. 
[19:32:39] <awight>	 yeah it's within the data center, might as well be on the same computer
[19:32:44] <halfak>	 Sometimes we get a few extra requests in the queue. 
[19:33:01] <halfak>	 But all in all, it seems to work OK. 
[19:33:04] <awight>	 you could also have a shared semaphore that gets set by a watchdog process
[19:33:34] <halfak>	 Yeah.  That'd be a bit faster, I think. 
[19:33:47] <awight>	 Another benefit would be to isolate the jobs from queue details
[19:34:03] <awight>	 can_process_now
[19:34:39] <awight>	 That process would probably look for a 90% full condition, then email someone when the alarm goes off.
[19:34:56] <halfak>	 Right now, the jobs don't know about the queue.  But the thing that starts jobs does.  It'll error if the conditions aren't right to start a job. 
[19:34:57] <awight>	 or... I guess the alert should still be its own, third component.
[19:35:02] <awight>	 ah okay
[19:35:05] <awight>	 are there docs?
[19:35:27] <halfak>	 Not a sentence of english :D 
[19:35:31] <awight>	 bahaha
[19:35:44] <awight>	 fwiw, this is how bad we have it in fr-tech: https://www.mediawiki.org/wiki/Fundraising_tech/Message_queues
[19:35:53] <halfak>	 https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/celery_queue.py#L37
[19:35:57] <halfak>	 "queue_maxsize"
[19:36:05] <awight>	 Just finished two months of killing ActiveMQ and overhauling queues though, we'll be rewriting the docs soon.
[19:36:11] <halfak>	 https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/celery_queue.py#L209
[19:36:29] <awight>	 ty
[19:36:53] <halfak>	 I'll be refactoring this to support tiers
[19:37:14] <halfak>	 The plan is to have all requests processed when the queue is empty or has a negligible count. 
[19:37:34] <halfak>	 For moderate load, we'll only let requests through that contain an email address in the user agent. 
[19:37:48] <halfak>	 And for heavy loads, only a whitelisted set of user agents will be allowed. 
[19:38:11] <halfak>	 The white list will be public info (public repo) so it's not intended to protect against malicious attacks. 
[19:39:23] <awight>	 Oh that's rad
[19:39:24] <halfak>	 White list will be requests directly from MediaWiki, Change Propagation or our precaching system. 
[19:39:47] * halfak wants to not exclude people who don't read the rules and reward those who do. 
[19:40:07] <halfak>	 Although we honestly don't have the "have an email address in your UA" advertised. 
[19:42:00] * awight thwacks users on the knuckles
[19:43:00] <halfak>	 :D
[19:43:23] <awight>	 mebbe you could also do something with the RC load...
[19:43:32] <halfak>	 RC load?
[19:43:49] <awight>	 Precaching scores for the recent changes, i mean
[19:44:12] <halfak>	 Oh yeah.  That would get highest priority.  When that is running, everyone gets better performance. 
[19:44:24] <halfak>	 Or are you thinking that should get lower priority?
[19:44:32] <awight>	 huh.  right, I was imagining the opposite, but you're right
[19:44:52] <halfak>	 If a request gets a score from the cache, it won't even see our queue. 
[19:44:58] <halfak>	 So that's advantageous. 
[19:45:30] <awight>	 Do you have a cache stampede issue yet?  Where lots of clients request scores for the latest change, and you get a race condition where nobody is served from cache?
[19:45:50] <halfak>	 awight, just so long as the requests come in ~50ms apart, we're good. 
[19:45:59] <awight>	 hehe ok
[19:46:05] <halfak>	 We keep the precache system fast and close to try to stay ahead of that. 
[19:46:09] <awight>	 That's a real nice response time
[19:46:15] <halfak>	 As it stands, we get about 50% cache hits
[19:46:33] <awight>	 however, I was seeing TTFB of >1 seconds when reading the Hive webrequest table
[19:46:56] <awight>	 ^s^
[19:47:03] <halfak>	 Oh!  Also, I should say that it's not really cache hits.  Nearly simultaneous hits will get references to the same job and will wait for it to finish.
[19:47:12] <awight>	 ideal!
[19:47:43] <halfak>	 I have to bend over backwards to get celery to do this, but it's worth it. 
[19:48:05] <awight|goo>	 baby is making me something special...
[19:48:14] <halfak>	 \o/ presents
[19:48:18] <halfak>	 ;) 
[19:54:01] * halfak observes the cross-validation of the updated enwiki wp10 model
[20:04:57] <wikibugs>	 06Revision-Scoring-As-A-Service, 10revscoring, 03Research-and-Data-2016-17-Q2: Implement PCFG features for editquality - https://phabricator.wikimedia.org/T144636#2726842 (10Halfak)
[20:05:24] <wikibugs>	 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality , 03Research-and-Data-2016-17-Q2: Build draft quality model (spam, vandalism, attack, or OK) - https://phabricator.wikimedia.org/T148038#2726843 (10Halfak)
[20:06:04] <wikibugs>	 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality , 03Research-and-Data-2016-17-Q2: Build feature set for draft quality model - https://phabricator.wikimedia.org/T148580#2726845 (10Halfak)
[20:06:36] <wikibugs>	 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality : Extract features for deleted page (draft quality model) - https://phabricator.wikimedia.org/T148581#2726858 (10Halfak)
[20:06:49] <wikibugs>	 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-articlequality : Build feature set for draft quality model - https://phabricator.wikimedia.org/T148580#2726845 (10Halfak)
[20:47:18] <grrrit-wm>	 (03CR) 10Awight: Extensive CI tests, part II (0310 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[20:48:10] <grrrit-wm>	 (03CR) 10Awight: [C: 04-1] "Some cleanup suggestions inlined in PS1... It'll be wonderful to have these tests!" [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[20:48:40] <awight>	 Amir1: 
[20:49:14] <awight>	 Amir1: Only one of those comments was a blocker, the one about getScoreRecentChangesList not taking a second parameter yet.
[20:49:36] <awight>	 I'd be happy to merge with that fix, but thought you might want the chance to clean up other stuff as well.  Just lmk!
[20:55:27] <grrrit-wm>	 (03PS1) 10Awight: Soften dependency on the BetaFeatures extension [extensions/ORES] - 10https://gerrit.wikimedia.org/r/316701 
[20:56:59] <grrrit-wm>	 (03CR) 10Awight: "CI is passing so this must be a problem with my setup, but fyi I get 9 failures when trying to run the tests in --group ORES." [extensions/ORES] - 10https://gerrit.wikimedia.org/r/315661 (https://phabricator.wikimedia.org/T146560) (owner: 10Ladsgroup)
[21:01:22] <awight>	 halfak: Amir1: Would it be useful if I wrote up what I understand of the work queueing and RC precaching stuff?
[21:01:42] <awight>	 & should that be on mediawikiwiki or meta?
[21:26:42] <halfak>	 awight, yes.  I think mediawiki
[21:26:53] <halfak>	 Maybe somewhere near change propagation or the mw work queue stuff?
[21:28:03] <awight>	 I don't see anything like that--unless you're talking about https://www.mediawiki.org/wiki/Extension:ORES#Extension_workflow
[21:28:39] <awight>	 (I'm checking https://www.mediawiki.org/wiki/Category:ORES )
[21:36:44] <awight>	 halfak: ^
[21:39:36] <halfak>	 awight, https://www.mediawiki.org/wiki/Change_propagation ?
[21:39:56] <halfak>	 https://www.mediawiki.org/wiki/Manual:Job_queue
[21:41:18] <awight>	 ooh thx I thought you meant there was a description of ORES integrations already
[21:43:10] <halfak>	 (sorry in meeting, will be more active here in a bit)
[21:44:15] * halfak doesn't intend to be curt
[21:44:47] <awight>	 haha it's a huge help, don't apologize
[22:25:10] <awight>	 Hmm I don't see where the new change-propagation config went--oh well.
[22:31:01] <halfak>	 awight, what config are you looking for?
[22:34:23] <awight>	 it's very minor--I was just chasing down https://gerrit.wikimedia.org/r/#/c/295576/6/modules/changeprop/templates/config.yaml.erb
[22:34:38] <awight>	 It seems that file has been removed, with the commit message "moving to scap 3 for deployment"
[22:35:33] <halfak>	 eeek
[22:35:40] <halfak>	 I guess releng will know then
[22:36:04] * awight assumes it still works
[22:36:51] <wikibugs>	 06Revision-Scoring-As-A-Service, 10ORES: Implement prioritization of request processing - https://phabricator.wikimedia.org/T148594#2727278 (10Halfak)
[22:37:06] <awight>	 btw, https://www.mediawiki.org/wiki/Extension:ORES/Components -- nothing worth looking at yet, but just so you have the URL... and can stop me from being redundant.
[22:38:46] <halfak>	 We'll want to move that out of "Extension" at some point, but I appreciate the start. 
[22:39:12] <halfak>	 We should have https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service and https://wikitech.wikimedia.org/wiki/ORES cross-linked
[22:39:35] <awight>	 will do
[22:40:00] <awight>	 Seems we need categories on the wikitech pages
[22:40:04] <halfak>	 Indeed
[22:45:46] <wikibugs>	 06Revision-Scoring-As-A-Service, 10ORES: Implement prioritization of request processing - https://phabricator.wikimedia.org/T148594#2727313 (10Halfak) I'm thinking of extending the scoring_system config with a block for matching user-agents.    ```     queue_maxsize: 100 # pending tasks     queue_thresholds:...
[22:47:19] <awight>	 Where is the production config for the ores backend?  Does it exist in a repo?
[22:50:45] <awight>	 I see there aren't tests for the python ores backend--is that component too trivial to test?  I guess `revscoring` is the workhorse?
[23:12:18] <halfak>	 awight, yeah revscoring is the workhorse, but ORES backend could be tested a little more. 
[23:12:39] <halfak>	 We have tests for the basic set of internal components, but not the web API. 
[23:12:56] <halfak>	 https://phabricator.wikimedia.org/diffusion/1880/browse/master/config/
[23:13:06] <halfak>	 There's also some configs in a private repo
[23:14:46] <halfak>	 These are the most critical tests: https://github.com/wiki-ai/ores/tree/master/ores/scoring_systems/tests
[23:15:33] <halfak>	 But we have other tests too.  E.g. https://github.com/wiki-ai/ores/tree/master/ores/metrics_collectors/tests, https://github.com/wiki-ai/ores/tree/master/ores/score_caches/tests, and https://github.com/wiki-ai/ores/tree/master/ores/tests
[23:15:58] <halfak>	 The hardest thing to test are things that actually use redis or the MW api. 
[23:16:09] <halfak>	 But those are pretty narrow slices.