[21:01:52] * OuKB looks arouns [21:01:53] hello [21:01:55] *d [21:02:15] this is https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/Job_Queue now? [21:02:20] yes! [21:02:31] o/ [21:02:32] Hello [21:02:42] We'll get started in a minute [21:02:43] yep, discuss & identify job queue issues [21:03:31] any AaronSchulz? [21:03:40] #startmeeting TechCom RFC meeting [21:03:40] Meeting started Wed Sep 13 21:03:40 2017 UTC and is due to finish in 60 minutes. The chair is RoanKattouw. Information about MeetBot at http://wiki.debian.org/MeetBot. [21:03:40] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. [21:03:40] The meeting name has been set to 'techcom_rfc_meeting' [21:03:47] #topic Job queue issues [21:03:50] He's in Europe iirc [21:03:54] #link https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/Job_Queue [21:03:54] before the meeting started - anybody wants to fix https://www.mediawiki.org/wiki/Architecture_meetings for new naming? [21:04:02] #link https://etherpad.wikimedia.org/p/JobQueue-ircmeeting [21:04:11] too slow :) [21:04:11] Yeah Aaron and Krinkle are both in Europe for the perf team offsite I believe [21:04:16] gilles anomie [21:04:18] gwicke made an etherpad for collecting notes --^^ [21:04:36] the whole perf team? that's bad timing then [21:04:42] SMalyshev: it's a wiki, be bold ;) [21:04:47] yeah, it's seeded with a copy of Daniel's notes [21:05:24] o/ [21:05:34] ok, so. there is a bunch of issues with the job queue. with how it's implemented, how it's used, and how the internal interfaces are specified. [21:06:03] the purpose of this meeting is to identify the most pressing issues, the most relevant questions to answer, and perhaps to float some possible solutions [21:06:31] <_joe_> what should we start with? [21:06:31] I have tried to dump what i know about the issue into the wiki page (resp the etherpad). [21:06:41] what would you like to start with? scheduling? [21:06:41] o/ [21:06:51] <_joe_> ok! [21:06:52] hey legoktm! [21:07:06] not much point in talking about the current wiki scheduling mechanism if it's about to be replaced by kafka, right? [21:07:17] I think it'd be good to try and define a few terms before we proceed to avoid any ambiguity. Especially since the naming isn't very good here in this area. E.g. "Job runner", "Job queue" (as in, the push/pop storage for the queue), and "Job Queue" (the concept and logic within MediaWiki core outside the runner/backend logic, such as for execution and pushing) [21:08:09] TimStarling: I think there is still value in agreeing on the problem [21:08:09] TimStarling: Is there any timeframe for the switch to kafka? [21:08:18] Some of the issues seem related to the refreshLinks system, and others with the Wikimedia job queue infrastructure [21:08:21] TimStarling: good point. gwicke, can you tell us how the scheduling works with kafka? is it still one queue per (target) wiki? how is the next wiki/job selected for processing? [21:08:23] E.g. some of these issues are out of the scope of a Kafka-based job queue store. [21:08:46] kaldari: the first production job was enabled today; we expect to migrate a substantial portion of jobs next quarter [21:09:00] do we also have Pchelolo?
[21:09:13] yup TimStarling you do [21:09:14] <_joe_> kaldari: at least 6 months for a full migration I'd say realistically [21:09:34] <_joe_> but gwicke and Pchelolo might have different opinions :) [21:09:39] * gwicke nods [21:09:52] buth this migration will mean only changing the transport for now (redis -> eventbus) [21:09:54] 6 months is a good estimate [21:10:11] <_joe_> so, the scheduling discussion is still relevant [21:10:12] mobrovac: What is used to listen to kafka and trigger RunSingleJob.php? [21:10:14] A Node.js service? [21:10:18] as I said in the previous meeting, having fairness of scheduling between wikis was a deliberate design decision, IMHO important and useful [21:10:40] scheduling will change as well [21:10:57] TimStarling: how about fairness of scheduling between jobs? [21:10:57] towards a single queue per job type, not separate per project [21:11:00] the general principle with scheduling is that the longest-running tasks should have the lowest priority [21:11:04] Krinkle: yup, change-propagation, the same mechanism we are using for propagating changes [21:11:06] so a different set of trade-offs / risks [21:11:10] Yeah, but I think the scheduling should mirror the other implementation. And both can be improved accordingly which should also simplify transitioning. [21:11:22] so that you don't starve out simple things while waiting for big things [21:11:25] <_joe_> the current design favours fair scheduling between wikis, but that creates operating issues. In an ideal world, either a human operator or the scheduler itself could dedicate more execution threads to a specific wiki/jobtype [21:11:33] as opposed to one significantly scheduling differently from the other. [21:11:52] TimStarling: but you still have to make sure you get to the big things often enough so they don't pile up. that's what we are seeing right now [21:12:08] they don't pile up if you have enough total resources [21:12:12] the scheduling in eventbus will be per job type, in order of ingress [21:12:14] pathological cases are real, and they cause trouble in prod [21:12:26] there are ways to address the starvation issue as well, such as rate limiting by project [21:12:29] gwicke: we added "some" randomness to that approach, so we could tune randomness vs fairness. what do you think of that idea? [21:12:33] <_joe_> TimStarling: we will never have enough resources for 600k jobs that take ~ 1s on average each [21:12:33] if you don't have enough throughput, the size of the queue will increase, prioritisation will not fix that [21:12:48] what would that rate limiting look like? skip over some items in the queue, then reqeuue the skipped ones at end or something? 
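To make the "randomness vs. fairness" tuning mentioned above a bit more concrete, here is a minimal PHP sketch of weighted random selection of the next (wiki, job type) queue, blending an equal share per non-empty queue with a backlog-proportional share. The function, its parameters, and the queue-naming scheme are invented for illustration; this is not how the current JobQueueAggregator or the planned Kafka consumer actually select work.

    <?php
    // Hypothetical sketch: pick the next "wiki/jobType" queue to service.
    // $backlogs: map of "wiki/jobType" => number of pending jobs.
    // $fairness: 1.0 = every non-empty queue gets an equal chance,
    //            0.0 = chance is proportional to backlog size only.
    function pickQueue( array $backlogs, $fairness ) {
        $weights = [];
        foreach ( $backlogs as $queue => $size ) {
            if ( $size > 0 ) {
                // Blend an equal share with a backlog-proportional share.
                $weights[$queue] = $fairness + ( 1.0 - $fairness ) * $size;
            }
        }
        if ( !$weights ) {
            return null; // nothing to do
        }
        $r = ( mt_rand() / mt_getrandmax() ) * array_sum( $weights );
        foreach ( $weights as $queue => $w ) {
            $r -= $w;
            if ( $r <= 0 ) {
                return $queue;
            }
        }
        return $queue; // fallback for floating point rounding (last key)
    }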
[21:12:56] #info the scheduling in eventbus will be per job type, in order of ingress [21:12:58] <_joe_> what we need is a fair scheduler that is also smart [21:13:03] it would be also nice to have safety mechanism so that if a single wiki has an abnormal number of small jobs submitted it will not affect other wikis [21:13:10] right, so the solution is obviously not to dedicate the whole infrastructure to attempting to run those 600k jobs, and starving everything else [21:13:12] * DanielK_WMDE__ encourages people to use #info liberally [21:13:28] <_joe_> TimStarling: still the current scheduler is clearly inefficient [21:13:41] DanielK_WMDE__: our current theory is that the starvation issue is not as common as we fear [21:13:49] #info having fairness of scheduling between wikis was a deliberate design decision, IMHO important and useful [21:13:51] <_joe_> basically never allowing to exhaust the queue without manual intervention [21:14:00] and that other measures such as rate limiting can address abuse [21:14:01] deduplication means that execution gets progressively more efficient as you add more jobs [21:14:10] I assume the reason fairness is considered a problem by some is not because one wikis' jobs don't run soon enough, but because overall run rate globally is lower than it should be? If that is the case, then that means we may need to figure out why that is because that's now how it was designed to function (naturally). [21:14:24] <_joe_> Krinkle: yes [21:14:53] The end use case that should remain is that if a wiki is dormant and I schedule 1 job there, it should run nearly instantly no matter what. [21:14:55] Also monitoring is an issue if starving happens, in our recent case we didn't know which type of job-wiki were responsible and we were guessing around different types [21:15:00] if we rely on deduplication to stay under capacity, then we're always going to have a lot of jobs in the queue [21:15:03] <_joe_> Krinkle: I assume the problem is we don't run as many jobs as we could because we do a lot of small per-wiki runs, most of which are almost empty [21:15:10] #info The end use case that should remain is that if a wiki is dormant and I schedule 1 job there, it should run nearly instantly no matter what. [21:15:30] the current deduplication strategy also hurts queue introspection -- it's hard to tell what's queued up when many of the items are dupes of each other with no real effect [21:15:46] brion: yeah, daniel mentioned that on wiki [21:15:46] Krinkle: why? honest question. why should this be fair per wiki, and not, say, per user, or job type, or individual job, or scaled by page count of the wiki, or whatever? [21:15:49] _joe_: we should confirm then if the problem is the "wasting of time" on subjective unimportant jobs, or the waste on cycles checking/switching wikis. The former might be a hard sell. [21:15:53] "For wikis with just a few thousand pages, we sometimes see millions of UpdateHtmlCache jobs sitting in the queue." [21:15:58] IIRC one of the reasons for the current wiki switching is that switching to a specific wiki context has some overhead, and there was a desire to amortize that [21:16:00] which is just an artifact of deduplication on pop [21:16:10] <_joe_> Krinkle: I suspect the latter [21:16:16] (we hope) [21:16:25] _joe_: If that is the case, there is likely a timeout being reached somewhere that shouldnt' exist. 
[21:16:28] #info _joe_: we should confirm then if the problem is the "wasting of time" on subjective unimportant jobs, or the waste on cycles checking/switching wikis. The former might be a hard sell. [21:16:49] deduplication should happen on push, no? [21:16:55] why doesn't it? [21:17:07] performance [21:17:09] When I last looked at this, the switching was nearly instant. The move from maintenance scripts to a separate PHP service with sub process and subsequently to HHVM sub-http requests, may have improved the overhead of this. [21:17:24] <_joe_> I do agree we don't want to de-prioritize small wikis, but since we never run at 100% capacity, we have room to boost jobs, and that's practically impossible in the current implementation [21:17:28] But afaik the aggregator in the jobchron (as opposed to job runner) is meant to check these ahead of time, so it will not create a run if there are 0 jobs. [21:17:34] the check is done before the spawn, not within. [21:18:07] <_joe_> Krinkle: I'm not sure that if you have 1 job in the queue, the runner won't still run for N seconds, I have to check [21:18:20] Krinkle: agreed, I think the move to HHVM eliminated a lot of the startup cost [21:18:31] if there are a lot of jobs being queued from a given wiki, it makes sense to defer those jobs for a while so that deduplication can take effect [21:18:42] there might also be the historical factor, where scripts were originally written per wiki [21:18:45] <_joe_> TimStarling: a while == weeks? [21:18:52] <_joe_> because that's what's happening right now [21:18:57] #info if there are a lot of jobs being queued from a given wiki, it makes sense to defer those jobs for a while so that deduplication can take effect [21:19:15] <_joe_> and I do have to run runjobs by hand on terbium, in order to reduce the backlog [21:19:26] TimStarling: yes, sure, deferring jobs helps deduping. But we dedupe on pop? that's strange... [21:19:30] _joe_: you want to talk about design or incident response? [21:19:51] deduping on push would be synchronous, and potentially slow, i guess... [21:19:55] bloom filters, anyone? [21:19:57] Pchelolo: will we have deduplication on push in the new thing? [21:20:01] Yeah, it used to happen every month or so that enwiki/dewiki/commons changes a popular template like Template:Information/Infobox etc. and spend days catching up in the backlog. [21:20:15] <_joe_> well the current implementation has some clear deficiencies I'd like to see addressed so that I don't have to do incident response [21:20:20] But meanwhile other wikis and other job types run fine (e.g. email jobs, cdn purge jobs, and refresh link jobs on other wikis) [21:20:24] TimStarling: what would the advantage be? [21:20:25] TimStarling: currently the dedupe in the new thing does exactly the same as the old one [21:20:34] <_joe_> one is not being able to be easily tuned for bursts in capacity in case of need [21:20:51] Krinkle: that's a bottleneck that can bring the whole infra down, right? [21:21:04] gwicke: smaller queues, with stuff that will actually run? [21:21:05] Daniel is complaining that it makes it hard to get queue stats [21:21:13] TimStarling: note that "push" in the kafka sense really means enqueuing a message [21:21:19] We even address this within a wiki by having refreshLinks gradually recurse in subjobs instead of all at once [21:21:32] so that we don't start post-edit with 5M jobs queued, but just a few, and others can also recurse alongside it. 
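The gradual recursion described just above (a root job expanding into a bounded slice of work plus a continuation, rather than fanning out everything post-edit) can be sketched roughly as follows. fetchBacklinkTitles(), refreshLinksForPage() and pushJob() are hypothetical placeholders, not the real RefreshLinksJob code.

    <?php
    // Sketch of progressive expansion: process one slice, then enqueue the next.
    function runRefreshSlice( $templateTitle, $offset, $batchSize ) {
        $titles = fetchBacklinkTitles( $templateTitle, $offset, $batchSize );
        foreach ( $titles as $title ) {
            refreshLinksForPage( $title ); // placeholder for the per-page work
        }
        if ( count( $titles ) === $batchSize ) {
            // More backlinks remain: enqueue exactly one continuation job,
            // so the queue never holds the whole fan-out at once.
            pushJob( 'refreshLinksSlice', [
                'template' => $templateTitle,
                'offset' => $offset + $batchSize,
                'batchSize' => $batchSize,
            ] );
        }
    }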
[21:21:35] mobrovac: can we (quickly) check for dupes before doing that? [21:21:46] also he didn't actually know that we had deduplication on pop, he thought it was just broken [21:21:57] nope DanielK_WMDE__, because that would mean going through the kafka queue [21:22:17] disk space is fairly cheap; re statistics, another option is to look at the delay [21:22:22] we can use indicators on the other side to know if a job has already been processed [21:22:25] Is the job queue growth from wikidata specifically in-scope for this discussion or separate? (e.g. better dependency tracking, finer-grained reparsing etc.) [21:22:26] which is less sensitive to bursts of quickly deduplicated events [21:22:54] mobrovac: so how about a quick pre-filter that allows for false negatives? [21:23:03] <_joe_> Krinkle: you mean how change propagation from wikidata works? Unsure, that's another can of worms IMHO [21:23:09] Krinkle: I think that would be a rather complex discussion; I doubt that we could do it justice here [21:23:27] from a wiki user POV, it should be possible to see the backlog for a wiki, so that you could know "all changes made before have taken effect" or "wait two weeks for all transclusions to update" [21:23:39] DanielK_WMDE__: i'd be interested to hear how can you deduce if a job should be executed beforehand when all you have is that job and nothing else? [21:23:40] Yeah, just making sure we're not using a new scheduler to optimise for that "problem". [21:23:43] mobrovac: a LRU list of job signatures recently seen locally, kept in memory. that would be quick-and-dirty dedupe before push. [21:23:51] (and of course, it should not be -much- affected by a completely different project) [21:23:55] <_joe_> Platonides: so you say you'd be more interested in the lag than in the queue size, right? [21:24:05] DanielK_WMDE__: oh sure, but that's local to each appserver [21:24:15] mobrovac: yea, but MUCH better than nothing [21:24:21] queue size as a raw number is pretty meta... lag is human-scaled :) [21:24:23] _joe_: queue size seems quite meaningless [21:24:31] mobrovac: it's not sufficient for full dedupe, but I bet it would catch >50%. Maybe >90%. [21:24:35] why instead of deduping not just use page_touched to skip jobs quickly? [21:24:41] <_joe_> Platonides: I agree, FWIW :) [21:24:55] #info queue size as a raw number is pretty meta... lag is human-scaled :) [21:25:01] DanielK_WMDE__: your percentages are tempting :) [21:25:08] #info mobrovac: a LRU list of job signatures recently seen locally, kept in memory. that would be quick-and-dirty dedupe before push. [21:25:14] we have been using lag as the primary indicator for changeprop, and it has been quite useful for us [21:25:16] lag might be a vague number in case of not fully FIFO queues though [21:25:26] mobrovac: maybe i'm wrong, and it wouldn't give us much. but i think it's worth investigating [21:25:41] the one-runner-per-type would probably make that complex, indeed [21:25:47] kafka is fifo, so it works well in that context [21:25:48] <_joe_> brion: you calculate by the Xth percentile of lag in jobs execution [21:26:01] _joe_: Replying to " one is not being able to be easily tuned for bursts in capacity in case of need" I've asked Aaron about this over the past years and usually got the same reply: Job execution bottleneck is not the job runner, but shared resources (e.g. db slave wait and such) [21:26:03] DanielK_WMDE__: kept in memory where? 
there will be multiple producers and ingress points to the queue [21:26:05] MaxSem|grrrr: you mean at insert time, by saying 'no need to queue this one'? or in the running phase [21:26:10] gwicke: yes, time-to-execution seems to be the metric of choice [21:26:24] <_joe_> Krinkle: I don't agree with that assessment [21:26:31] brion, at execution time [21:26:34] speaking of backlog, this is what we have for changeprop - https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&panelId=10&fullscreen&orgId=1 [21:26:37] regarding _joe_'s concern about scaling total throughput: one problem we've had in the past is that when you scale up throughput, you can DDoS the MySQL servers [21:26:40] Meaning, ideally speaking, the current job runner executes in reasonable batches with yielding in between to get everyone a turn, but it should never be idling outside job execution for any fairness, that eithe risn't happening or is a bug. [21:26:57] <_joe_> TimStarling: yes, of course you have to keep that in mind [21:26:58] brion, this sounds much easier than doing ninja-grade deduplication trickery [21:27:01] so we've had to tune throughput to be as high as possible while MySQL stays up [21:27:02] MaxSem|grrrr: page_touched would only work for RefreshLinksJobs. But these are heavy and frequent, so sure... [21:27:03] MaxSem|grrrr: hmmmm, i guess they're roughly equivalent. but might be more predictable yes [21:27:11] _joe_: So the job runners are maxing out one of their resources currently? (ram, cpu, ..) [21:27:14] but then that requires constant human attention [21:27:26] #info why instead of deduping not just use page_touched to skip [RefreshLinksJobs] quickly? [21:27:42] <_joe_> Krinkle: nope, but even when I raise the concurrency of a job type, that has actually little effect in practicce [21:27:52] maybe the wikis could be spread to affect different mysql groups in order to minimize mysql-DDOS? [21:27:55] _joe_: exactly. [21:28:07] <_joe_> Krinkle: not because of db starvation [21:28:23] can someone explain to me how *exactly* deduplication currently works? I couldn't find any documentation on this [21:28:26] there is some page_touched deduplication already, but there are also other options that make that less effective [21:28:29] <_joe_> I think throttling has to do with that [21:28:38] in particular the rootJobParam stuff is a bit mystifying [21:28:47] _joe_: starvation as in, it's not spending most time waiting for replag, or it's not spending most time writing/reading sql? [21:28:54] or neither? [21:28:57] DanielK_WMDE__: there are several levels [21:29:03] one is the root job [21:29:09] <_joe_> Krinkle: it's not spending most time writing/reading [21:29:24] which is basically "update all pages transcluding page X" [21:29:37] on second edit to that template, the first set of jobs is superseded [21:29:37] _joe_: right, but I believe it is spenidng most time waiting for replag. A job quue write is not complete until after we wait for all slaves to have replicated the write. [21:29:45] That is also by design. [21:29:56] And is part of the job execute, not the job runner. [21:30:01] gwicke: that's just for RefreshLinks, right? [21:30:18] (the new runner would experience the same, the RunSingleJob.php wont' exit until replag is ready) [21:30:23] DanielK_WMDE__: all backlink jobs [21:30:26] Krinkle: why can't they be batched? 
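A minimal sketch of the quick-and-dirty pre-push filter floated above: a small per-app-server LRU of recently pushed job signatures, which tolerates false negatives (another server, or a process restart, will miss duplicates) but cheaply catches repeats within one process. The class and method names are made up; the authoritative deduplication would still happen in the queue itself.

    <?php
    // Hypothetical per-app-server LRU of recently pushed job signatures.
    class RecentJobSignatureFilter {
        private $seen = [];   // signature => true, oldest first
        private $capacity;

        public function __construct( $capacity = 10000 ) {
            $this->capacity = $capacity;
        }

        /** Returns true if this signature was pushed recently (probable dupe). */
        public function checkAndAdd( $signature ) {
            if ( isset( $this->seen[$signature] ) ) {
                unset( $this->seen[$signature] );   // refresh LRU position
                $this->seen[$signature] = true;
                return true;
            }
            $this->seen[$signature] = true;
            if ( count( $this->seen ) > $this->capacity ) {
                array_shift( $this->seen );         // evict least recently used
            }
            return false;
        }
    }

    // Usage sketch: skip the push when the local filter has just seen this job.
    // $sig = sha1( json_encode( $jobSpec ) );
    // if ( !$filter->checkAndAdd( $sig ) ) { pushJobToQueue( $jobSpec ); }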
[21:30:29] <_joe_> Krinkle: uhm, I hope it just waits for the replicas in the same datacenter in our prod environment [21:30:29] so htmlcacheupdate as well [21:30:42] <_joe_> right, that's surely the case [21:30:43] #info I believe it is spending most time waiting for replag. A job queue write is not complete until after we wait for all slaves to have replicated the write. [21:30:48] _joe_: Sure, yeah, standard waitForSlaves logic (non-zero wait in local dc) [21:30:50] weight* [21:30:52] #info from a wiki user POV, it should be possible to see the backlog for a wiki, so that you could know "all changes made before have taken effect" or "wait two weeks for all transclusions to update" [21:30:53] second level is leaf jobs, and the final one is job specific, such as looking at page_touched [21:31:06] <_joe_> Krinkle: I am not sure about replag waiting, I'll run some numbers [21:31:13] Platonides: We do batch. 500 writes per query typically, there's various levels of batching. [21:31:22] per transaction* [21:31:31] Krinkle, _joe_: waiting for replication makes sense. DB throughput is a hard limit on job execution, and should be. batching can improve that. but batching kills deduplication [21:31:43] also, some jobs don't actually write to the db [21:31:51] do we still wait for replication when we run those [21:31:53] ? [21:31:54] Right batching of unrelated jobs doesn't happen right now, indeed. [21:32:22] DanielK_WMDE__: No. The method is called, but it's informed by seen master position on write, so will be a no-op. [21:32:32] #info waiting for replication makes sense. DB throughput is a hard limit on job execution, and should be. batching can improve that. but batching kills deduplication [21:32:47] Krinkle: ok, thanks [21:32:57] so... [21:32:59] half time mark [21:33:06] although again, this is worth verifying. [21:33:08] we have discussed scheduling fairness and deduplication [21:33:11] re: needless replag waiting. [21:33:15] other issues we should cover? [21:33:40] for TimedMediaHandler I have very _long_ jobs doing video transcodes [21:33:42] how about endless re-try? has that been fixed? [21:33:54] job sizes come to mind [21:33:57] i can break them into many small jobs instead, but not yet sure the implications of such floods :) [21:33:58] re "but does not trigger an UpdateHtmlCache, which it probably should" [21:34:00] what is the gerrit project name of the new job queue? [21:34:17] I thought the RefreshLinks would internally trigger it, *if* the html really changed [21:34:39] which many times it won't [21:34:48] TimStarling: change-propagation [21:34:55] TimStarling: most of the new queue is tracked under T157088 umbrella task [21:34:55] T157088: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088 [21:34:59] <_joe_> Krinkle: btw, it seems that the "maxtime" parameter we pass to RunJobs.php has little to no effect [21:34:59] hmm, originally i think we left the html regeneration to page-view-time (via page_touched being updated) [21:35:14] mobrovac, brion: i was wondering about that... the runner could keep track of the avg execution time per job type, and consider that value for scheduling fairness. so a large job on one wiki would count for many small jobs on another wiki [21:35:15] only refreshed links to fix the db tables [21:35:16] <_joe_> which I forgot to mention on the ticket and might be important [21:35:38] DanielK_WMDE__: hmm, could work [21:35:51] _joe_: yeah, that's for upper limit, not minimum.
[21:35:53] brion: maybe [21:36:05] _joe_: Re maxtime, the way it worked a few years ago it was very ineffective. Not sure if that's changed [21:36:07] <_joe_> Krinkle: I say it has no effect as an upper limit [21:36:09] Platonides: that's what I assumed, but apparently it does not. or i missed it somehow. [21:36:11] At the time, many jobs were huge [21:36:16] DanielK_WMDE__: the budgeteer limiter I just wrote is based on costs, so very much along those lines [21:36:27] The way it worked was it would look at the clock every time it finished a job [21:36:29] _joe_: Oh, well, it won't kill a running job, it should stop running jobs after it passes that. [21:36:30] #info maybe the runner could keep track of the avg execution time per job type, and consider that value for scheduling fairness. so a large job on one wiki would count for many small jobs on another wiki [21:36:37] (re long jobs) if i make the changes I'm planning to do video transcodes as many small jobs, then i get floods of lots of tiny jobs, which ideally "fan out" as necessary to many threads well and then finish. but that's dependent on the scheduler handling this case well :D [21:36:39] Until that, it will keep popping new jobs of the same type/wiki. [21:36:46] So if a few jobs take 15 minutes to finish, then setting maxtime to 3-5 minutes doesn't do anything really [21:36:46] until one of the resource limits is reached. [21:36:51] <_joe_> Krinkle: ok, that explains that [21:36:51] #link https://github.com/wikimedia/budgeteer [21:37:00] <_joe_> Krinkle: what if we don't get new jobs? [21:37:09] <_joe_> will it keep trying or just return? [21:37:13] _joe_: Just return. [21:37:24] my worst case scenario for TMH is a batch re-build of many files, which means millions of tiny jobs [21:37:28] <_joe_> ok, that matches my observations [21:37:30] brion: in eventbus, we have concurrency limiting, so these small jobs could happen to run concurrently [21:37:38] brion: ideally, this wouldn't be needed, and long jobs wouldn't be a problem... [21:37:42] hehe [21:37:50] _joe_: the typical way small-wiki admins fix job queue growth is to run runJobs.php once manually without limits until it exits. [21:38:04] e.g. for a one-time event or some such. [21:38:11] <_joe_> Krinkle: that's the way we're doing it here too [21:38:26] I mean on wikis that trigger jobs from http and don't have a separate runner, so it's 1 job per page view. [21:38:29] heh [21:38:34] And you e.g. want to catch up once for something special. [21:38:45] _joe_: right, but presumably in our case it would never exit in that case [21:38:51] unless you specify limits [21:39:00] given we'll always enqueue new jobs faster and never reach size 0 [21:39:31] that would mean the average queue size grows forever. which means we are doing it wrong [21:39:38] <_joe_> it does not [21:39:49] for third parties I think we should do like wordpress and have a cron.php which you hit from cron with curl [21:40:06] <_joe_> TimStarling: +1 [21:40:13] That's what runJobs.php is basically, right? [21:40:21] It already exits early if another process is still running [21:40:30] <_joe_> Krinkle: well, you need to do a POST to it [21:40:30] so you can safely do that (afaik translatewiki does that) [21:40:34] it only runs jobs though, you can't plug in other scheduled tasks [21:40:39] _joe_: no no, that's wmf specific. [21:40:42] TimStarling: why is hitting it via http a good thing? doesn't this cause issues with timeouts?
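To make the maxtime behaviour described above concrete: the runner only checks the wall clock between jobs, so a single long job can overshoot the limit, and the loop simply returns once the queue is drained. This is a simplified hypothetical loop, not the actual JobRunner code; popJob() and executeJob() are placeholders.

    <?php
    // Simplified sketch of a run loop with "maxtime" semantics as described above.
    function runJobsOfType( $type, $maxTime, $maxJobs ) {
        $start = microtime( true );
        $done = 0;
        while ( $done < $maxJobs ) {
            $job = popJob( $type );
            if ( $job === null ) {
                return; // queue drained: just return, do not keep polling
            }
            executeJob( $job ); // may itself take far longer than $maxTime
            $done++;
            // The clock is only consulted *between* jobs, so a 15-minute job
            // blows straight past a 3-minute maxtime.
            if ( microtime( true ) - $start >= $maxTime ) {
                return;
            }
        }
    }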
[21:40:43] runJobs.php the maintenance script [21:40:43] <_joe_> right [21:40:59] Which is the equivalent of the jobchron/jobrunner service, not the runJobs.php end point [21:41:02] <_joe_> so maintenance/runJobs.php [21:41:04] TimStarling: +1 for a cron hook for arbitrary tasks, though [21:41:05] Yeah [21:41:10] http is usually nice because you don't have to worry about a separate CLI php config [21:41:10] not rpc/runJobs.php [21:41:25] plus you get your caches and things if you have them [21:41:29] #info for third parties I think we should do like wordpress and have a cron.php which you hit from cron with curl [21:41:36] brion: and hhvm caches, as in our case. [21:41:39] yep [21:42:16] <_joe_> btw, wordpress gives the user the opportunity to run cron.php periodically when pages are hit, if they don't have access to a cron daemon, IIRC [21:42:22] ...and should we use that mechanism ourselves? [21:42:25] We could standardise runJobs.php to use Special:RunJobs as well. [21:42:25] (i have encountered problems with job queue over http in vagrant, but i think that's due to a very low connection limit somewhere in my config) [21:42:30] maintenance/runJobs.php [21:42:32] to match wmf-prod. [21:42:37] _joe_: MW has that too :) $wgRunJobRate [21:43:10] MW default for non-pageview run jobs is CLI maint/runJobs.php, which doesn't go back over HTTP afaik. [21:43:20] Krinkle: i didn't fully get that, can you #info? [21:44:15] #info Stock MediaWiki job runner (maintenance/runJobs.php) invokes JobRunner class directly, not over HTTP. For cache and config consistency, we should consider standardising on Special:RunJobs over http. [21:44:16] those HTTP self-calls seem ugly [21:44:37] and you may easily hit HTTP limits (memory, time…) [21:44:52] gwicke: so in the brave new world of kafka (omg, if my literature teacher saw that), how are very long running jobs handled? are they a problem? are they ok? [21:45:02] it's actually a very efficient way to implement a "service bus" model for a PHP app stack [21:45:07] Platonides: For the run-job-on-page-view thing self-calls are ugly, but mainly done afaik to avoid fatal errors and for async. [21:45:17] Not for the sake of http, given we're already there in that case. [21:45:23] #info MediaWiki by default will run one job per web request [21:45:25] and http timeouts are typically easy to control on the PHP code side [21:45:29] [15 minute warning] [21:45:40] <_joe_> DanielK_WMDE__: formally, the mechanism remains the same [21:45:52] DanielK_WMDE__: there is a combination of concurrency limiting, and cost-based rate limiting [21:45:53] <_joe_> we have an external service calling an endpoint via HTTP POST [21:45:55] But yeah, good point about limits. But presumably people using pageview job runner will not have CLI access anyway, so they don't have a choice. At that point, the best we can do is at least start from a fresh request to have dedicated quota [21:46:00] instead of sharing with the main page view request.
[21:46:03] Platonides: ^ [21:46:09] with cost being typically dominated by execution cost, but the cost function can be tuned to penalize failures etc [21:46:14] <_joe_> the concurrrency model is the same in the old and new world, AIUI [21:46:51] roughly, yeah [21:46:54] <_joe_> you get a certain number of concurrent threads per job type [21:47:08] <_joe_> where threads == job executions [21:47:21] Krinkle: I was thinking for people that can run a cli job [21:47:29] <_joe_> what would change, in the current implementation, is the scheduler [21:47:41] #info [with kafka] there is a combination of concurrency limiting, and cost-based rate limiting; with cost being typically dominated by execution cost [21:47:43] It sounds like the new stack performs an HTTP call to MediaWiki/rpc for each individual job, whereas the current model does it per batch (wiki+job type+batch limits) [21:47:45] nod, that's the concurrency limit part [21:47:54] That means significantly more overhead (HTTP + mw startup) [21:47:56] right? [21:48:19] #info It sounds like the new stack performs an HTTP call to MediaWiki/rpc for each individual job, whereas the current model does it per batch (wiki+job type+batch limits) [21:48:23] DanielK_WMDE__: should have said "dominated by execution *time*" [21:48:27] <_joe_> Krinkle: I don't honestly think that's an issue, but I might be proved wrong [21:48:34] Krinkle: we can switch the new implementation to do batching as well really easily [21:48:38] when I worked on StatusNet's job queue system we had a *lot* of overhead from context switching at first, had to consolidate/batch ... [21:48:40] <_joe_> and that ^^ [21:48:42] <_joe_> :) [21:48:48] <_joe_> Pchelolo beat me to say that [21:49:00] yay batching [21:49:22] I think web requests on HHVM aren't as expensive as starting cli scripts, though [21:49:23] Pchelolo: right. By post-ing the job specifiers to the endpoint you mean? Not by having rpc/runJobs read from kafka. [21:49:28] is there any interest in reducing the push rate? or are people happy with all the unconditional deferral we're doing these days? [21:49:50] brion: i saw that code! brrrrrr.... well, actually, not bad to read. but i could feel the inefficiency making my skin crawl :D [21:49:58] :) [21:49:59] Krinkle: just post an array of jobs instead of individual jobs and consolidate these arrays inside change-prop [21:50:08] ten minute warning [21:50:17] gwicke: Yeah, the HTTP/php-startup overhead is pretty good with HHVM, but there is still MW itself (Setup.php) which is like 300-900ms per request easily. [21:50:17] i have another issue. [21:50:23] one that i didn't put into the notes [21:50:37] i'd like to have more control over execution order. [21:50:47] Krinkle: wait, what? I thought this was one order of magnitude less [21:50:50] ~60ms [21:50:50] i want to be able to tell job Y that it should only run after job X has executed [21:50:54] used to be ~30 [21:51:00] <_joe_> I think gwicke is right [21:51:12] <_joe_> most php scripts run in less than 300 ms right now on appservers [21:51:24] <_joe_> as in rendering pages too [21:51:28] PHP API latency is around 50ms [21:51:34] median [21:51:36] gwicke: Hm.. yeah, you're right. It's 30-100ms, not 300-1000ms [21:51:44] Okay [21:51:54] 60ms [21:52:13] <_joe_> it's still not negligible, but this is an implementation detail tbh [21:52:13] and for session-less requests it'll be closer toward 30ms [21:52:35] <_joe_> DanielK_WMDE__: so after job X is executed successfully? [21:52:36] Yeah. 
As long as the job specifier is posted and not in query parameters, batching should be easy to add to the new system. [21:52:38] yeah, I think in the bigger picture dedup & scheduling are more important [21:52:41] is there any problem from the kafka/changeprop point of view of having a high insert rate? [21:52:47] we can add batching if needed [21:52:52] (gwicke: By Kafka, do you mean - https://kafka.apache.org/ - and - https://en.wikipedia.org/wiki/Apache_Kafka ? How would these, and Wikidata engineers, anticipate machine learning and machine translation questions, if at all? Thanks) [21:52:54] ideally without losing leaf dedup [21:52:54] Although it does complicate feedback with regards to error state and retry given this is now managed outside the runner. [21:52:58] _joe_: no, after it has completed successfully [21:53:17] TimStarling: not really [21:53:20] <_joe_> TimStarling: kafka handles our webrequest logs, the jobqueue is not remotely as big [21:53:25] Scott_WUaS: yes, these. and: they won't. [21:53:32] Thanks [21:53:59] past prod numbers from log processing are something like 150k events / second on three kafka nodes [21:54:00] my test wiki now unconditionally queues 6 jobs per edit, it used to be between 0 and 1 [21:54:06] the job queue won't be anywhere near that [21:54:23] <_joe_> if it is near that, we have a huge problem :P [21:54:47] nod, especially on the other end of that kafka pipe [21:55:02] <_joe_> we're around 1k/second of job insertion rate at the moment [21:55:50] one thing we haven't touched on yet is the ability to inspect queue contents [21:55:56] i still think we should investigate early dedupe [21:56:03] <_joe_> So regarding deduplication, I am unsure how effective it is, because there is actually no way to tell right now [21:56:09] with kafka, there are some nice tools (kafkacat | jq basically) that will make this easier [21:56:36] _joe_: would it be hard to track that? we can count how often we drop a job, and how often we don't? [21:56:40] <_joe_> how many jobs get dropped because of deduplication? Is it effective? [21:56:50] yea... [21:56:56] <_joe_> DanielK_WMDE__: I don't think it would be hard to compute those data [21:56:57] _joe_: compare that to the edit rate, which is 15-20 per second [21:57:00] <_joe_> we just don't [21:57:00] there are some metrics already for this [21:57:04] [3 minutes left] [21:57:35] good metrics on this are high on our list [21:57:43] #info <_joe_> So regarding deduplication, I am unsure how effective it is, because there is actually no way to tell right now <_joe_> I don't think it would be hard to compute those data <_joe_> we just don't [21:57:54] also for comparing the more traditional root jobs / leaf job approach vs. budgeteer before fully switching [21:58:10] <_joe_> gwicke: +1 [21:58:15] #info 14:54:00 my test wiki now unconditionally queues 6 jobs per edit, it used to be between 0 and 1 [21:58:18] odd that there is no way to tell [21:58:33] <_joe_> TimStarling: well not from logs or metrics I found [21:58:47] <_joe_> maybe it escaped me [21:58:57] it should be easy enough to instrument that if it is not there already [21:59:03] _joe_: it's really easy to add statsd tracking. have you liiked into that? 
[21:59:15] I think there are some very basic metrics already [21:59:52] <_joe_> TimStarling: that's my point, since we're talking about different approaches, I'd like us to run some numbers [22:00:30] https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=7&fullscreen&orgId=1 [22:00:39] "duplicate inserts" [22:00:44] abandon/error/retry has statsd and/or logstash afaik, and we also have a dashboard showing frequency of those methods overall. [22:01:17] <_joe_> yes [22:01:20] The runtime dupe is "superseded" (dupe_insert vs dupe_pop) [22:01:28] <_joe_> oh [22:01:33] yeah, in the same dashboard [22:01:34] first graph [22:01:39] gwicke: the very high percentage of dupes tells me that early dedupe would be effective [22:01:59] <_joe_> heh I didn't know "superseded" meant "deduplicated" [22:02:00] Aaron's recent fix to HTMLCacheUpdate jobs to share the rootTimestamp moved dupe_pop from near 0 to about 7 [22:02:10] _joe_: Aye, yeah, it's only labelled that way in the Grafana dashboard. [22:02:18] Worth renaming potentially [22:02:24] inside MW we call both deduplicated [22:02:48] We're over time BTW [22:02:53] Any last thoughts here? [22:02:56] DanielK_WMDE__: maybe, but it won't replace the late dedup [22:03:01] Krinkle: may be worth distinguishing them in stats [22:03:05] https://github.com/wikimedia/mediawiki/commit/cb7c910ba72bdf4c2c2f5fa7e7dd307f98e5138e [22:03:08] gwicke: of course [22:03:10] could be added in addition, though [22:03:16] that was the idea, yea [22:03:23] It seems like discussion about how to instrument deduplication can easily be continued or taken elsewhere [22:03:26] DanielK_WMDE__: We do, dupe_insert=Deduplicated dupe_pop=Superseded in the graphs [22:03:29] <_joe_> yes [22:03:38] RoanKattouw: yep [22:03:47] <_joe_> I didn't see dupe_pop clearly [22:03:53] someone want to name a channel for continued discussion? [22:03:58] We can talk here [22:03:59] Krinkle: yea, i meant a rename could be confusing, too [22:04:11] I just wanna wrap up, do #endmeeting and see if anyone has anything to say about other things than instrumentation [22:04:14] ok, but do #endmeeting [22:04:54] OK, well thanks for coming everyone [22:05:07] The afterparty will be right here and we'll be talking about instrumenting dedups [22:05:14] #endmeeting [22:05:17] Meeting ended Wed Sep 13 22:05:14 2017 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) [22:05:17] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-09-13-21.03.html [22:05:17] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-09-13-21.03.txt [22:05:17] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-09-13-21.03.wiki [22:05:17] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-09-13-21.03.log.html [22:06:01] <_joe_> Krinkle: I did see Aaron's fix for htmlCacheUpdate, but that begs the question [22:06:06] so, about dedupe... how about checking against jobs that have already run, not just the ones still in the queue? [22:06:12] <_joe_> is deduplication implemented for each djob? [22:06:16] runners could keep a LRU list of signatures they have seen [22:06:19] <_joe_> *job sorry [22:06:32] _joe_: afaik it is not [22:06:47] would should make that easier / more consistent [22:08:12] <_joe_> so potentially every job that has a deduplication mechanism can have a different logic? And thus bugs like the one Aaron fixed? [22:09:13] ifaik, yes. 
but i have not looked into this [22:09:53] <_joe_> Pchelolo, gwicke how would deduplication work in the new jobqueue implementation? still demanded to the individual jobs? [22:10:09] gwicke: is it possible to look up messages in kafka based on some key? [22:10:51] Deduplication on insertion and at run time is configurable separately afaik. [22:10:52] DanielK_WMDE__: you can use kafkacat and then filter with jq by any imaginable selector [22:11:04] gwicke, _joe_: fyi, Amir1 is deploying the fix for the gigantic wikibase jobs just now [22:11:39] Pchelolo: so, we could use that to dedupe based on a signature, using an independent cron job that hits the queue "from the side"? [22:12:08] <_joe_> DanielK_WMDE__: kafka is not really thought or optimized for scanning queues [22:12:19] <_joe_> I would recommend against such a pattern of usage [22:12:33] There is ignoreDuplicates() for insertion, and JobQueue::isRootJobOldDuplicate for run-time [22:12:35] DanielK_WMDE__: no, kafka is basically write-only, so we can't really use a side cron job [22:13:14] As for the new deduplication - currently it's implemented exactly the same as the old one based on rootJobSignature [22:13:29] <_joe_> Krinkle: ack [22:14:03] but that lacks an important dedupe oppertunity - if 2 same leaf jobs were created by 2 different template updates. In gwicke new dedupe that's addressed [22:14:22] Krinkle: returning true from ignoreDuplicates would scan the queue before insertion? [22:14:42] if that's expensive, we should really look into a cach-based pre-filter... [22:15:48] ah, getDeduplicationInfo()... [22:16:15] <_joe_> yes [22:16:20] DanielK_WMDE__: It just controls creation of job_sha1 [22:16:59] <_joe_> Pchelolo: ideally, deduplication should be if you ran this job on this page after I was inserted, drop me [22:17:14] This is another one of those things where the queue isn't really used as a pure queue (e.g. essentially consumed through a WHERE query) [22:17:15] <_joe_> but that supposes jobs are idempotent [22:17:24] InjectRCRecordsJob should probably override that [22:18:39] _joe_: they should be, in the sense that you could run the samew job twice, and it shouldn't matter, as long as there is no concurrent change to the system. [22:18:46] could deduplication happen when reading from the kafka queue? just discard the message if it's a duplicate [22:19:07] but there nearly always is concurrent change between two equivalent jobs being pushed. it's hard to tell if that is relevant... [22:19:08] <_joe_> volans: define "duplicate". That's the whole issue [22:19:16] I wanted to suggest using some sort of redis lock manager for job hash but that might be too slow [22:19:20] volans: sure - how do you know if it is? [22:19:43] volans: a duplicate of what - of something already executed, or oof something still in the queue? [22:19:56] DanielK_WMDE__: already executed [22:19:58] <_joe_> I would say of something already executed [22:20:43] <_joe_> say we have an edit to a very popular template, that will trigger a re-render of hundreds of thousands of pages [22:20:44] volans: yea, i was wondering about that. right now, jobs check whether they have become obsolete (supeprceded) by checking system state (such as page_touched) [22:21:02] but we could also cache the last x signatures seen. 
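One way to picture the getDeduplicationInfo()/job_sha1 mechanism referred to above: hash the job's identifying fields after stripping parameters that legitimately differ between logically identical jobs. The helper below is an illustrative reimplementation, not the actual MediaWiki method, and the particular keys stripped are assumptions.

    <?php
    // Illustrative sketch of building a deduplication signature for a job.
    // Which parameters to strip is job-specific; the keys below are examples only.
    function jobDedupSignature( $type, $wiki, $title, array $params ) {
        // Drop parameters that differ between logically identical jobs.
        unset( $params['rootJobTimestamp'], $params['jobReleaseTimestamp'], $params['requestId'] );
        ksort( $params ); // canonical key order so the encoding is stable
        return sha1( json_encode( [ $type, $wiki, $title, $params ] ) );
    }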
[22:21:17] and as how do you know it, well the same way you would know it before pushing I guess [22:21:26] <_joe_> all edits made from then on to say other smaller templates might be enqueued before the older job was executed, and thus be a duplicate [22:21:55] _joe_: yea, but the fact that a page was re-parsed an hour ago doesn't tell me that it doesn't need to be reparsed again now. i need to check page_touched to know that [22:21:58] <_joe_> so yeah you need a ledger where you keep "I last executed a job with this signature at this time" [22:22:17] you might need to keep a cache of signature+timestamp [22:22:24] <_joe_> DanielK_WMDE__: well if I enqueued my re-parsing request 1 hour 15 minutes ago [22:22:36] <_joe_> my request is in fact a duplicate [22:23:15] <_joe_> because I am asking to reparse a page because of a change happened before it was last reparsed [22:23:21] _joe_: unless there is a job in the queue that would, when executed, change the outcome of running your job [22:24:05] that it will be that job responsability to re-modify that page no? [22:24:11] <_joe_> uhm so you just said job ordering matters [22:24:19] jobs write into each other's views. that makes the entire system non-deterministic, as far as i can tell, since job execution order is not guaranteed between job types [22:24:29] <_joe_> yes, it is not [22:24:47] <_joe_> so the system is already non-deterministic, if what you say is the case :P [22:24:55] which is why i asked about job X and Y half an hour ago :) [22:25:04] yes. [22:25:05] <_joe_> actually, if what you said was true, we would need to have one queue per wiki :P [22:25:13] and it's causing problems. especialyl with deduplication. [22:25:48] <_joe_> that's the only solution to have a temporal consistency. And even then, with all the leaf job system, it won't work :P [22:25:56] i'm pretty sure it is true. but it only causes problems *sometimes*. separate queues for some kinds of jobs have big advantages [22:25:58] DanielK_WMDE__: are you asking for serializability? (in the databases sense) [22:26:09] of jobs? yes. [22:26:14] causal ordering [22:26:23] if we want X and then Y the only way I see we can guarantee that is to make X submit Y when executed... Otherwise we can't guarantee anything as it's all running concurrently [22:26:25] <_joe_> volans: no he's asking for being able at least to define dependencies between jobs [22:26:50] but perhaps only between job types that declare to be dependent on each other. [22:27:07] so we keep the option to have "fast lane" jobs that don't need to wait for the slower jobs [22:27:33] _joe_: i just realized that declaring dependnecies between job *types* may be sufficient. [22:27:33] <_joe_> Pchelolo: you could have job Y check if job X has executed, and if not, just resubmit itself [22:27:41] <_joe_> like a virus :P [22:28:00] dependent types have to go into the same queue (at least per wiki). [22:28:10] what if Y is discarded as duplicate _j? [22:28:21] ...but originally, that was indeed my idea, _joe_ [22:28:38] volans: well, if it's really a dupe, fine... 
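_joe_'s "ledger" idea above, sketched: record when a job with a given signature last completed, and drop an incoming job if an equivalent one already ran after the incoming job was enqueued. The $cache argument stands in for any shared store with memcached-like get()/set(); none of this is existing code.

    <?php
    // Hypothetical "ran after I was enqueued" check, as discussed above.
    // $cache is any shared key-value store where get() returns false on a miss.
    function isSupersededByPastRun( $cache, $signature, $enqueuedAt ) {
        $lastRun = $cache->get( "jobledger:$signature" );
        // A duplicate iff a job with the same signature completed *after* this
        // one was enqueued; otherwise it still needs to run.
        return $lastRun !== false && (float)$lastRun >= $enqueuedAt;
    }

    function recordRun( $cache, $signature ) {
        $cache->set( "jobledger:$signature", microtime( true ), 86400 /* TTL: 1 day */ );
    }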
[22:29:06] <_joe_> DanielK_WMDE__: this is a bit simplistic, remember jobs often expand to more jobs progressively, it's hard to guarantee true temporal ordering [22:29:12] _joe_: ...yet anouter way: job X had the job definitionn for your Y attached, and the runner automatically schedules Y once X is executed [22:29:16] <_joe_> and for a good reason Krinkle named earlier [22:29:17] no I mean how X can detect that Y has already run if it was discarded [22:29:40] (invert X with Y probably given your example) [22:29:43] volans: by the same mechanism that allowed Y to be discarded [22:30:09] <_joe_> volans: "A job with signature Y has run after I was submitted" [22:30:29] ok [22:31:08] <_joe_> anyways, I think the fairness of scheduling might be a probem with the current implementation on kafka [22:31:30] <_joe_> Pchelolo: did you get the arguments made in favour of separating queues per-wiki? [22:31:53] _joe_: you mean 1 queue per wiki? [22:32:05] <_joe_> volans: no 1 queue per wiki, per job type [22:32:23] <_joe_> that's done so that no one wiki can slow down the others [22:32:37] ah ok, then makes sense ;) I was worried for a second [22:32:44] i think it's per wiki per group-of-job-types [22:32:49] <_joe_> so if I have lag on refreshlinks jobs from commons, that doesn't affect say hewiki or itwiki [22:33:15] ...and a buildup of refreshlinks jobs wouldn't block other jobs from running. [22:33:16] <_joe_> DanielK_WMDE__: well at the queue level it's per job type [22:33:32] <_joe_> yes [22:33:37] oh, i didn't know. bad for causal ordering :) [22:33:47] _joe_: we can just partition each kafka topic by wiki if we really want that - it will be transparent for the clients [22:34:19] anyway, it's past midnight here. I'm having trouble spelling... [22:34:34] <_joe_> Pchelolo: well if you see a different way to obtain the end result of not blocking other wikis when one has a huge queue, those are ok too of course [22:34:57] <_joe_> we do have evidence that huge job submissions happen on individual wikis [22:35:23] <_joe_> and we don't want a template edit on one wiki to cause all cache purges on all wikis to be lagged by minutes or even hours [22:35:59] _joe_: other option is to make sure root jobs don't expand themselves completely right away, but instead the last leaf job processed posts a new batch of leaf jobs - we use that approach for template updates for restbase and it works pretty well [22:36:38] <_joe_> Pchelolo: that's something we might need to discuss in a followup [22:37:02] <_joe_> I'd try to extract a ticket and we can then think of trying to run some numbers on this [22:37:23] Pchelolo: as long as we track which job triggered which, so we can debug endless cascades... [22:37:38] <_joe_> Krinkle: one thing that would actually be useful to me is to have the wiki in the namespace for jobs. I might propose a patch to do that, if I ever have time [22:38:11] <_joe_> DanielK_WMDE__: changeprop, in my experience, is much, much better at that than what we have now [22:39:25] Pchelolo: in the option of partitioning the topic per wiki could you still have an easy way to track counters per wiki? [22:39:30] <_joe_> I mean in the namespace for stats for jobs [22:39:47] <_joe_> volans: that would be easier, actually [22:40:29] good [22:44:43] it's quite late here, I'm going to bed now [22:46:44] bloom filters, anyone? 
[22:47:24] after detailed study of the bloom filter and its applications, I have derived the following rule: [22:47:47] Tim's Golden Rule of Bloom Filters: No, you can't use a Bloom filter for that. [22:48:46] detailed version: [22:49:05] <_joe_> bloom filters don't guarantee against collisions IIRC? [22:49:18] Q1: is it OK if you sometimes get the wrong answer? No -> can't use bloom filter [22:49:26] <_joe_> that :P [22:49:40] Q2: do you want to be able to delete things? Yes -> can't use bloom filter [22:50:39] <_joe_> In our case, false negatives would be ok, but false positives would not. [22:50:52] TimStarling: yea, bloom filters work the wrong way for this. we'd really want an LRU ledger. [22:51:33] we did something like this for deduplicating chunks of rdf exported from wikidata [22:51:55] there's an adaptation of bloom filters where each bit in the original is replaced by a byte count, and then you can maybe delete things depending on how many hash collisions you get [22:52:35] but I think we sometimes forget how awesome hashtables are, usually when you think you want a bloom filter, you really want a hashtable [22:52:46] we only want to delete things which have the same hash. no need to change the bitmap [22:52:52] <_joe_> given the queue is at worst some millions of elements, even counting 1Kb for the actual job signature that amounts to a current memcached host, more or less [22:53:03] <_joe_> a full LRU ledger of still enqueued elements [22:53:12] actually, no need to delete anything. just don't add it if the filter bites [22:53:38] a bitmap would fill up with 1s over time [22:54:00] but an LRU ledger should work fine. for deduplication, it doesn't matter whether the job has run already or not (or it shouldn't) [22:54:12] <_joe_> again, given the scale of the problem, having a shared LRU ledger on memcached should be ok [22:54:47] if it's not too big. if it's big, you are creating noticeable network traffic [22:54:49] <_joe_> I'm not sure we can be sure we pop elements correctly once executed though [22:55:01] in ram per host could be good enough [22:55:23] <_joe_> I'm not a fan [22:55:25] _joe_: pop from the ledger? why? [22:55:47] or rather - that *should* not matter. [22:55:52] <_joe_> DanielK_WMDE__: because when you execute a job you want to ack that it's not in the queue anymore [22:56:01] <_joe_> so other jobs can be enqueued [22:56:04] but there are probably jobs that can only be deduplicated against future execution, not against past execution [22:56:07] need to think about that more [22:56:10] <_joe_> or you want to do full deduplication? [22:56:29] yes. no. maybe. ideally. not sure it's possible. [22:56:53] <_joe_> the only deduplication you can do on insertion is knowing there is another job of identical signature still waiting to be executed [22:57:01] <_joe_> so you don't need to insert it again [22:57:22] it may also be ok if it had already been executed [22:57:33] <_joe_> how can it be? [22:57:34] depends on the type of job, and the way the signature is constructed [22:57:44] <_joe_> ok, fair enough [22:57:50] well, a job that really only needs to be done once... [22:58:23] *ideally*, it would always be ok [22:58:48] i'll have to think about whether it's feasible to construct signatures that allow that. [22:59:27] <_joe_> DanielK_WMDE__: so looking at refreshlinks specifically, it doesn't do deduplication when multiple pages are in the same job [22:59:32] in practice, dedupe against "past" jobs would need much stricter criteria.
so we may actually want to do both: against future jobs with a short sig, and against past jobs with a long sig [22:59:36] <_joe_> which is of course inefficient :) [22:59:51] <_joe_> esp with wikidata that does a lot of batching [23:00:06] <_joe_> so I now suspect we should set batching for refreshLinks jobs to 1 [23:00:17] <_joe_> sorry, I derailed, but I was looking at the code right now [23:00:44] <_joe_> https://github.com/wikimedia/mediawiki/blob/master/includes/jobqueue/jobs/RefreshLinksJob.php#L50-L55 [23:01:35] yes. [23:01:35] i think i will kill batching for refreshlinks from wikipageupdate [23:01:35] it does more harm than good [23:02:16] wow, still discussing an hour later [23:02:19] * gwicke was away in meeting land [23:02:43] <_joe_> gwicke: and it's 1 am for me and DanielK_WMDE__ [23:03:22] yeah, you should really catch some sleep [23:03:35] <_joe_> same for htmlcacheupdate btw [23:03:37] <_joe_> https://github.com/wikimedia/mediawiki/blob/master/includes/jobqueue/jobs/HTMLCacheUpdateJob.php#L42 [23:03:39] and I should read the backscroll [23:04:05] <_joe_> sigh most jobs i see are for multiple pages [23:05:00] <_joe_> we would need to deduplicate at the page level, in these cases, but that's actually more complicated [23:07:53] we would get more dedup opportunities if we implemented opportunistic batching late in the pipeline [23:08:39] as in, dedup individual events, then combine the remaining events into batches, possibly using something like nagle's algorithm [23:08:55] https://en.wikipedia.org/wiki/Nagle%27s_algorithm [23:10:33] this is something for iteration 2 or later, though
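gwicke's opportunistic late batching with a Nagle-style delay might look like the sketch below: deduplicate individual events first (later duplicates simply overwrite earlier ones here), then flush a batch when it is either full or its oldest event has waited long enough. This is a hypothetical consumer-side buffer, not part of change-propagation.

    <?php
    // Hypothetical Nagle-style batcher for already-deduplicated job events.
    class JobBatcher {
        private $pending = [];    // signature => event, preserves insertion order
        private $oldestAt = null; // timestamp of the oldest buffered event
        private $maxBatch;
        private $maxDelay;

        public function __construct( $maxBatch = 100, $maxDelay = 0.5 ) {
            $this->maxBatch = $maxBatch;
            $this->maxDelay = $maxDelay; // seconds to wait for more events
        }

        /** Add an event; returns a batch to execute, or null if still buffering. */
        public function add( $signature, array $event ) {
            if ( !$this->pending ) {
                $this->oldestAt = microtime( true );
            }
            $this->pending[$signature] = $event; // later duplicates overwrite earlier ones
            return $this->maybeFlush();
        }

        public function maybeFlush() {
            $full = count( $this->pending ) >= $this->maxBatch;
            $stale = $this->pending && ( microtime( true ) - $this->oldestAt ) >= $this->maxDelay;
            if ( !$full && !$stale ) {
                return null;
            }
            $batch = array_values( $this->pending );
            $this->pending = [];
            $this->oldestAt = null;
            return $batch;
        }
    }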