[20:53:39] \o/
[20:59:46] o/
[21:00:17] #startmeeting RFC meeting
[21:00:17] Meeting started Wed Sep 5 21:00:17 2018 UTC and is due to finish in 60 minutes. The chair is kchapman. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:00:17] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:00:17] The meeting name has been set to 'rfc_meeting'
[21:00:24] #topic RfC: ParallelMaintenance helper class for multi-process maintenance scripts https://phabricator.wikimedia.org/T201970
[21:00:35] #link https://phabricator.wikimedia.org/T201970
[21:01:08] hey yall
[21:01:17] hi! who is here to discuss RFC T201970?
[21:01:18] T201970: RfC: ParallelMaintenance helper class for multi-process maintenance scripts - https://phabricator.wikimedia.org/T201970
[21:01:30] hello/
[21:01:48] o/
[21:02:09] brion: do you want to kick things off?
[21:02:17] kchapman: sure thing!
[21:02:36] ok so I noticed a while ago that a lot of maintenance scripts follow a similar pattern:
[21:03:02] reading a bunch of input data from some source (db/file/etc), and doing similar independent work on each item
[21:03:22] this can often be parallelized well over multiple CPU cores if you have them available
[21:03:42] there are several ad-hoc solutions in the MW codebase doing parallelization...
[21:04:06] the very basic ForkStreamController is used in several maint scripts to divide up multiple sets of input, but cannot transfer information back to the parent.
[21:04:39] OrderedForkStreamController, used by some CirrusSearch maint scripts, sends data from the parent to the children for work and then returns a line of data, and was a good jumping-off point
[21:04:40] so!
[21:05:00] the RFC adds a MediaWiki\Parallel namespace and a ParallelMaintenance class wrapping it, extending the standard Maintenance.
[21:05:10] I've ported a few scripts over, and it seems to work ok :)
[21:05:21] My main feedback questions are about compatibility and API...
[21:05:29] Any questions/comments so far?
[21:05:44] * s/ForkStreamController/ForkController/
[21:06:06] I don't see a ParallelMaintenance.php in your gerrit change
[21:06:18] TimStarling: it's hiding in Maintenance.php, which is awful
[21:06:29] can probably be moved out
[21:07:50] current API on ParallelMaintenance is that you have a 'loop' method to replace 'execute', followed by 'process' and 'result' methods which get called respectively on the child process and the parent process, with the data being sent back and forth
[21:08:06] it's sort of a micro version of map-reduce if that helps conceptually ;)
[21:08:39] brion: so loop() is called once, and result is called N times?
[21:08:41] but loop() is called once? you know arduino has a loop() function which is actually the loop body
[21:08:53] right, it's the loop body.
[21:09:00] naming is hard :D
[21:09:22] also, your example has a loop in the loop body that is named loop...
[21:09:30] heh
[21:09:56] i'm a bit confused about that, to be honest. so loop() is called N times, once for each thread?
[21:10:11] DanielK_WMDE: loop() is called once, on the main thread.
[21:10:31] it sends data items into the queue, which get farmed out to worker processes where they call process()
[21:10:35] so it's NOT the loop body, but expected to contain a loop.
[21:10:47] yeah, that's bad naming on my part I think
[21:11:02] one other place where this API model breaks down is if you have a multi-step loop process. in refreshLinks.php I ended up sending down data objects that have an 'action' property to route to the right behavior, and I don't quite like it.
[21:11:08] no, i was just confused by your response to tim earlier
[21:11:26] :)
[21:12:25] so loop() is expected to enqueue M inputs which get distributed to N children, and on each child, process is called roughly M/N times?
[21:12:29] I would call it dispatch() or something
[21:12:41] it's the IDispatcher method that dispatches, right?
[21:12:43] DanielK_WMDE: exactly
[21:12:44] So the subclass + abstract is one implementation, another would be using closures. I think abstract is better as it makes it easier to verify what needs to be implemented in terms of signatures etc. But I'm having trouble imagining what the script itself would be when using closures. What would the closure be passed to? Would there still be a ParallelMaintenance class?
[21:13:37] brion: ok. that interface seems fine to me. names can be bikeshedded.
[21:13:44] yeah :D
[21:14:23] Krinkle: the closure would have to be available before the forking happens
[21:14:32] obviously you can't pass a closure across IPC
[21:14:49] i'm not so sure about IController as an interface, though. Interfaces should be defined by what each caller needs to see, not by what is implemented together.
[21:15:03] So I'd suggest an interface just for queue() or enqueue().
[21:15:07] Krinkle: I think a model I was working on registered them something like $controller = new ForkController( [ $this, 'loop' ], [ $this, 'process' ], [ $this, 'result' ] ); etc
[21:15:14] brion: Right. Would the closure be a parameter to a method, or a constructor option to a class stored in a plain Maintenance subclass?
[21:15:23] because loop() should never call drain(). so it should not see drain().
[21:16:12] Krinkle: I think a method might return a closure, or a map of them. have to think more on this.
[21:16:45] does this also standardize the --procs/threads cli parameter name (that I always guess wrong for each script)?
[21:16:50] DanielK_WMDE: yeah exposing drain() was kinda a hack for a multi-step process
[21:17:08] The only reason for a closure approach in my mind is to make it easier to re-use outside maintenance scripts. Or to have multiple within a single script. So that it moves the abstraction from Maintenance to just an object created and called from Maintenance.
[21:17:08] oh so it does get used in loop() in some cases?
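The loop()/process()/result() split described above is, as brion says, a micro map-reduce: the parent enqueues M inputs, N children each handle roughly M/N of them, and results come back to the parent. A rough sketch of the same shape in Python (using `multiprocessing` as a stand-in for pcntl_fork; all names here are illustrative, not the RFC's actual PHP API):

```python
from multiprocessing import Pool

# "loop": the parent enqueues M work items (here, the numbers 0..9).
def loop():
    return range(10)

# "process": runs in each of N worker processes, once per item.
def process(item):
    return item * item

# "result": runs in the parent over the values sent back.
def collect(results):
    return sum(results)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # imap() yields results in dispatch order, mirroring the
        # re-ordering the RFC's controller does for result().
        total = collect(pool.imap(process, loop()))
    print(total)  # sum of squares 0..9 = 285
```

The analogy is loose (PHP workers talk over pipes rather than pickled queues), but the control flow is the same.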
[21:17:09] Nikerabbit: ParallelMaintenance exposes it as --threads, so it'd be nicely standardized :D
[21:17:15] But I think you have that already
[21:17:16] (i've seen --fork as well!)
[21:17:43] DanielK_WMDE: yeah, but i think maybe it should not :D
[21:18:07] Yeah, would be good to move away from "threads, T" to "processes, P"
[21:18:13] Krinkle: doParallel( callable $split, callable $map, callable $reduce );
[21:18:35] that could be offered as an option/alternative by the same class
[21:21:44] but I don't really see the point
[21:21:45] for multi-step processes where you have to collect output of one stage before starting another i think it makes sense to create a new set of forks on each step...
[21:21:46] potentially slow to make new processes though
[21:21:46] ...in which case maybe something like $this->parallel( .... work & result callbacks ... ) to get a controller, then queue into it and close when you're done
[21:21:46] yeah that's a possible concern though
[21:21:47] maybe you need to split the job control from the process control
[21:21:47] additional feedback I got is that pcntl_fork is unavailable on windows, because windows has a different process model.
[21:21:47] i don't think that's a big deal though
[21:21:47] I should not use the word job
[21:21:47] brion: the JS way would be to have your first loop() method return a closure, and that closure be the next loop(). It would get its input from a member variable, I guess. not so perfect
[21:21:47] DanielK_WMDE: i looked briefly at PHP's generators, might be useful here too
[21:21:47] split the work unit control from the process control
[21:21:48] (for the windows server thing, you can always run in-process without the threading, so the default behavior would not change.)
[21:22:04] brion: oh, you could make generators communicate via stdin/stdout! But that's quite different from the map/reduce approach.
[21:22:46] TimStarling: like spawning a bunch of threads once (process control), then adding multiple closures to them later (job control)?
[21:23:04] as long as all the worker-process closures you're going to use are defined before you pcntl_fork() i think that should work
[21:23:26] not entirely clear to me (didn't read the code yet) how the results are received? is it a call to result() in the parent for every return value returned from process()?
[21:23:53] yes, you would have a job (or something very like it called something different for disambiguation) with a producer and consumer
[21:23:58] SMalyshev: yes, exactly. they're also re-ordered to always return in the order dispatched
[21:24:07] also, what happens if a child process fails (either controllably, like an error condition, or uncontrolled, like an exception/fatal error)?
[21:24:16] when the producer is done, it returns, so drain() gets called
[21:24:45] SMalyshev: currently a failing child process will die, causing the parent process to fail to read and close out. this can be handled better, there's some suggestions on the changeset
[21:24:46] so when a maintenance script has multiple stages, you wait for the first job to complete, i.e. the producer will complete in the parent after drain() completes
[21:24:52] then start the second job
[21:25:07] TimStarling: ++
[21:25:22] I like that
[21:25:23] brion: yeah it would be helpful to at least know the context of failure (e.g. exception text, etc.)
[21:25:26] the advantage over having the producer call drain() is that you don't have to multiplex the result() function
[21:25:41] yeah, that's a huge improvement in api i think
[21:26:01] SMalyshev: definitely. exceptions can be caught and a backtrace reported over the line; fatals i think will just close the process :(
[21:27:03] ok :) any more thoughts?
[21:27:10] or any comments I missed?
[21:27:31] not many true fatals left these days, we even have a userspace OOM handler
[21:27:36] <3
[21:28:31] ok!
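On the failure question above, pool-style APIs illustrate the behavior SMalyshev is asking for: a worker exception is serialized back and re-raised in the parent, instead of the parent blocking on a dead pipe. A hedged Python sketch (illustrative names, not the RFC's code):

```python
from multiprocessing import Pool

def process(item):
    if item < 0:
        # A "controlled" failure: raise rather than dying silently, so
        # the parent gets the exception text instead of a hung read.
        raise ValueError("bad input: %d" % item)
    return item * 2

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        try:
            print(list(pool.map(process, [1, 2, -3])))
        except ValueError as e:
            # The worker's exception is pickled and re-raised here.
            print("worker failed:", e)
```

An uncontrolled fatal (segfault, OOM kill) still can't be reported this way, matching brion's caveat that fatals just close the process.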
so my rough todos based on discussion so far:
[21:28:49] * rename everything ;)
[21:28:49] (use the #info tag please to note in the minutes)
[21:28:55] whoops!
[21:28:56] brion: well, at least something in stderr that can be logged somewhere
[21:29:00] #info todo: rename everything! ;)
[21:29:10] #info todo: catch exceptions
[21:29:27] overall i think it sounds like a good move. One thought though is that perhaps maintenance and the "worker pool" should be separate concepts. I suppose i'm mostly thinking of this as similar to Python's multiprocessing.Pool.map. But we won't use this outside maintenance so maybe not worth the effort
[21:29:29] I think "info todo" is called #action :)
[21:29:50] #action refactor controller into separate process control and job control, allowing easier routing of work items in multi-stage maint scripts
[21:29:52] whoops
[21:30:01] #action catch exceptions and report cleanly
[21:30:35] the #action command actually takes a nickname as its first parameter, that way it organises action items by person
[21:30:37] ebernhardson: the base classes could in theory be used outside maint, as long as pcntl_fork works
[21:30:41] oh neat
[21:30:49] but it does give you an unassigned list if you don't use a nickname
[21:30:49] #action brion learn to use meetbot right
[21:31:15] #action brion use closures more aggressively
[21:31:23] #action brion drop the half-done proc_open mode
[21:31:42] Ok I think that covers the main items, and I'll go through the backscroll later for tidbits :D
[21:31:47] Any other concerns?
[21:31:59] brion: indeed it does, i should have looked back at the patch
[21:32:12] oh right, that's what ExecStreamController is
[21:32:40] why are you dropping it?
[21:32:43] yeah, i was half-working on a possible proc_open router that would call your maint script with a special worker mode option
[21:32:47] just not sure it's worth doing
[21:33:05] on the other hand that might work fine as long as you recreate the same closures :D
[21:33:42] the only real benefit of it is it would work on windows server or other places where pcntl_fork is unavailable but you can proc_open a php interpreter
[21:34:01] pcntl_fork() is quite fragile, even on linux
[21:34:02] imo for windows our "best effort" should be in-process without any workers
[21:34:14] that was my experience many years ago writing ForkController anyway
[21:34:19] ah that reminds me
[21:34:29] if we do keep using fork, we should make sure we're consistent about closing services
[21:35:00] I moved the function that does it out of ForkController into MediaWiki\Parallel\Utils or something, but it's still fragile ;)
[21:35:17] mt_srand( getmypid() );
[21:35:22] that's not nice
[21:35:23] *shudder*
[21:35:33] oh! and things like request IDs
[21:35:40] do those change on the fork or stay the same...?
[21:36:02] ideally it should stay the same?
[21:36:10] +1 for resetting services, it was not fun to debug some corruption issues with sql/http stuff when running multiple threads
[21:36:25] #action brion: make sure there's a consistent way to hook services-close-out on fork, if we keep using fork
[21:36:33] yeah, I think you should reset all services and remove the comment apologising for that
[21:36:41] :)
[21:36:50] looks like something like closeConnections() needs a hook...
[21:37:04] yep.
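The services-reset concern above generalizes to any state a forked child inherits: RNG seeds (hence the `mt_srand( getmypid() )` hack), open DB/HTTP handles, request IDs. A Python sketch of the "reset per worker at startup" shape, with a pool initializer standing in for a post-fork hook (names illustrative, not MediaWiki's):

```python
import os
import random
from multiprocessing import Pool

def reset_worker_state():
    # Post-fork reset: each child inherits the parent's RNG state (and
    # would inherit any open DB/HTTP connections), so re-seed and
    # re-open per process instead of sharing them.
    random.seed(os.urandom(16))

def process(item):
    # Without the reset, forked workers would all draw the same
    # "random" sequence from the inherited state.
    return item + random.random()

if __name__ == "__main__":
    with Pool(processes=2, initializer=reset_worker_state) as pool:
        print(pool.map(process, [10, 20]))
```

The equivalent in the RFC would be a single well-defined hook (something like closeConnections() plus service reset) run in every child right after pcntl_fork().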
[21:37:09] extensions could have permanent resources too
[21:37:29] these problems go away if you use proc_open()
[21:37:31] probably not many of them but might happen
[21:37:51] but it is more work since you have to write the special worker mode
[21:37:52] of course, yes, proc_open is a different story
[21:38:30] TimStarling: ah yes, that's where we get overlap with the RFC for the maintenance script runner
[21:38:46] if there's a single runner script, it's relatively easy to route to any maintenance script in the system
[21:39:04] unless you have a custom out-of-tree maintenance script
[21:39:24] which is probably a poor practice but I know I do it sometimes :)
[21:40:11] #info if using proc_open mode, need to route to the same script. easier to do this with a single end-point (see the maintenance runner script RFC)
[21:40:59] you could have runtime registration, but that would break your use case
[21:41:00] shouldn't be hard to use $argv[0], no?
[21:41:54] yes, $argv[0] should work
[21:41:58] SMalyshev: *ponders* ..... ah yes there's one failure case actually -- sub-invocations
[21:42:01] we even have PHP_BINARY
[21:42:10] it's possible for one Maintenance to call another in-process
[21:42:21] hmm that would be trickier
[21:42:30] in which case argv[0] is the parent script
[21:42:33] * brion sobs
[21:43:12] have some kind of variable that is initially set to $argv[0] and changed when the other script is called?
[21:43:15] #info routing scripts via argv[0] fails on sub-invocations of one Maintenance class by another
[21:43:30] by variable I mean a property of the Maintenance object
[21:43:34] SMalyshev: that might work, some kind of maintenance state stack
[21:44:00] #info consider a child-invocation state stack in Maintenance->runChildScript or whatever it is that runs the sub-invocations
[21:44:00] or just __FILE__ in the maintenance class...
[21:44:28] yeah, that can work, have it return its own preferred location.
[21:44:49] then new classes that don't self-invoke (after the other RFC) will just return the generic runner script
[21:45:07] and old classes using the traditional Maintenance invoker will return their __FILE__
[21:45:11] ok that can work :D
[21:45:30] how about changing the default from not using threads to (trying to) use them automatically?
[21:45:48] #action brion: for proc_open mode, have ParallelMaintenance classes return their location as __FILE__ and (after the other rfc) use the generic invoker as default
[21:46:02] i'm not super familiar with the php threading model, but how much of mediawiki is threadsafe?
[21:46:15] it seems a large undertaking to consider threads instead of fork/proc_open
[21:46:44] ebernhardson: none of php is threadsafe ;) every time we say "threads" we really mean "processes" here
[21:46:54] it might be confusing terminology to do that though!
[21:46:56] there is PECL pthreads now
[21:47:01] haven't played with it yet
[21:47:03] oh yeah there's that but it looks .... weird
[21:47:12] it divides the world into threadsafe and non-threadsafe areas
[21:47:29] huh, interesting
[21:47:37] ok we've got a few minutes left. final questions and comments!
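The sub-invocation problem discussed above — argv[0] pointing at the parent script when one Maintenance class runs another in-process — is what SMalyshev's proposed state stack would solve. A minimal sketch of that idea in Python (`ScriptStack` and `run_child` are hypothetical names, not MediaWiki API):

```python
class ScriptStack:
    """Tracks which maintenance script is logically running."""

    def __init__(self, initial):
        # Seeded from argv[0] (or __FILE__ / the generic runner).
        self._stack = [initial]

    def current(self):
        # What a proc_open-based worker launcher would re-invoke.
        return self._stack[-1]

    def run_child(self, path, func):
        # Push around a sub-invocation so current() stays accurate
        # even if the child itself spawns workers, then pop on exit.
        self._stack.append(path)
        try:
            return func()
        finally:
            self._stack.pop()
```

For example, a parent script invoking a child in-process would call `stack.run_child("child.php", body)`, and any workers forked inside `body` would see the child's path rather than the parent's.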
[21:47:56] Nikerabbit: ooh might do yeah
[21:48:07] hi
[21:48:07] have to add a way to detect the number of processors (unless overridden)
[21:48:14] which is easy enough, just slightly different for each os
[21:48:31] #action brion: add CPU count detection and consider using it by default
[21:48:54] hi :)
[21:49:25] I imagine most cases where an accurate processor count is important would be linux anyway
[21:49:40] yep
[21:50:44] we actually have a processor count in mediawiki-config/w/health-check.php, I wrote it a long time ago for LVS load balancing weights
[21:51:00] ah nice
[21:51:30] IIRC it excludes hyperthreaded cores, hence the complicated associative array thing
[21:51:30] #info there's a CPU counter in health-check.php
[21:51:34] heh
[21:52:52] just 8 minutes left
[21:54:39] ok sounds like we're about done :)
[21:54:57] ebernhardson: the PHP threading model is one request per thread
[21:55:15] almost nothing can be safely shared between threads
[21:57:21] each PHP request has its own allocator; there is a shared allocator (just a malloc() wrapper) which needs to be explicitly used if you want something to be shareable across requests/threads
[21:58:11] would be fun to do a side exploration of limited shared buffers for transferring ownership of strings
[21:58:30] last chance to add to the minutes :)
[21:58:36] \o/
[21:58:49] #info thanks everybody! good directions on next steps, ready to go
[21:59:04] well, APC is shared, and that can include shared strings
[21:59:16] oooooh true
[22:00:44] good night :)
[22:00:50] night night Nikerabbit :)
[22:02:02] and we are done! thanks all!
[22:02:09] #endmeeting
[22:02:09] Meeting ended Wed Sep 5 22:02:08 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot .
(v 0.1.4)
[22:02:09] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-09-05-21.00.html
[22:02:09] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-09-05-21.00.txt
[22:02:09] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-09-05-21.00.wiki
[22:02:09] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-09-05-21.00.log.html
[22:02:10] thanks kchapman ! :)
[22:03:51] * brion wanders off to collect notes and todos
[22:04:41] did we not have a #endmeeting?
[22:05:32] yep
[22:05:42] TimStarling [23:02:09] #endmeeting
[22:06:42] ok, my network dropped out for 3 minutes or so, I must have missed it, thanks
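As a footnote to the CPU-count action item ([21:48:31] above): detecting the processor count while honoring an explicit --threads/--procs style override might look like this Python sketch (`detect_cpu_count` is an illustrative name; the affinity call is Linux-specific, hence the fallback):

```python
import os

def detect_cpu_count(override=None):
    # An explicit --threads/--procs value always wins.
    if override is not None:
        return max(1, int(override))
    try:
        # On Linux, respect the process's CPU affinity mask
        # (so cgroup/taskset limits are honored).
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other OSes: fall back to the raw logical-core count.
        return os.cpu_count() or 1
```

Note this counts logical cores; excluding hyperthreaded siblings, as the health-check.php counter reportedly does, would need extra platform-specific inspection.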