[00:00:17] TimStarling: can you look over https://gerrit.wikimedia.org/r/#/c/188503/ and merge if it's good? there's some users on enwiki that really want it, and i was almost there with anomie, but he's on vacation now
[00:01:24] this is not the sort of thing I would usually merge
[00:02:12] hmm, okay. do you know anyone else that that is their department, or should i just wait for Brad to get back?
[00:02:41] I don't really agree with the idea of adding dynamic, cache breaking APIs to Scribunto
[00:03:31] cache breaking in what way?
[00:03:49] just that it doesn't trigger a reparse when e.g. a user gets promoted?
[00:05:35] well, I imagine users don't get promoted very often
[00:05:44] but blocks expire every day
[00:06:14] not sure what the point of putting user block information into parser output is
[00:06:30] unless you had a "break this cache" button next to it, but in that case, why not just load it via XHR?
[00:06:58] that way you don't need the user to click "break this cache" to get the correct info
[00:07:50] it just seems like the wrong UI
[00:08:49] for the use-case enwiki wants, they just want the group info. i just threw in the block info since we have to fetch it anyway to make sure hideuser isn't set
[00:09:06] (though others have asked for block info in the past, and there's a core change laying around somewhere to add an is-blocked check to core)
[00:09:57] are caches invalidated if/when the user is suppressed?
[00:10:04] autoconfirm presumably changes pretty often
[00:10:22] legoktm: Not a big issue, one would need to know the name, so the name would already need to be part of the Wikitext
[00:10:30] legoktm: they're not
[00:10:33] As people can't do queries in some way
[00:10:46] still, having block info in there sounds awry to me
[00:10:48] would people really mind the stale results once in a while though?
[00:11:02] jackmcbarn: these are Wikipedians we're talking about ;)
[00:11:29] just looking at the userrights log on enwiki
[00:11:30] legoktm: exactly. look at all the purge buttons everywhere on enwiki
[00:11:42] we have a *lot* of groups don't we?
[00:12:04] jackmcbarn: so yes, they do mind stale results
[00:12:39] TimStarling: Up to 20, maybe
[00:12:39] i think they'd prefer stale results to none at all though
[00:13:22] jackmcbarn, don't forget: parser cache lives for 30 days. people will start complaining immediately
[00:13:38] i don't think they'll complain though
[00:13:46] since they're used to just about everything else being stale already
[00:13:57] they'll just stick a purge button in whatever templates call this
[00:17:06] that's what I'm saying, I don't like purge buttons
[00:17:31] they're bad for performance, bad for user experience
[00:17:52] they expose implementation details, they ask the user a stupid question "do you want current information or stale information?"
[00:17:58] how are they worse for performance than if the cache did update properly?
[00:18:00] how is a user meant to answer that?
[00:18:39] why do we provide things like {{NUMBEROFPAGES}} then?
[00:18:53] I'm saying you should not put dynamic information together in the same cache object with expensive, relatively static parser cache results
[00:21:02] you know that when {{NUMBEROFPAGES}} was added to MW, on enwiki it returned less than 100,000
[00:21:32] the whole site, database and web, was running on a single server loaned to us by bomis
[00:21:35] so is it something you wouldn't merge today?
[00:22:00] and/or that we'd take out if it wouldn't lead to bloodshed?
[00:22:41] I would have to think pretty hard about it if it were proposed today
[00:23:07] it is probably OK, it doesn't really matter if it is updated weekly or so
[00:23:48] as long as you don't need to put a purge button next to it, it's probably OK
[00:24:49] fwiw, the exact use case this change is wanted for is a template that tells you whether or not a user is an admin
[00:25:12] so in theory i could rip out implicitGroups and block info
[00:25:17] and manual groups hardly ever change
[00:25:45] or you could wait until anomie gets back
[00:26:16] so you don't object to him merging it if he likes it?
[00:26:21] if not, then i guess i will just do that
[00:28:31] I won't give it -1
[00:29:16] ok
[00:58:34] jackmcbarn: you haven't addressed the most salient point (IMO) that Tim brought up which is that this can be done better using JavaScript. Why not do it that way?
[01:12:04] ori: I made a deploy window for us to test --restart tomorrow. 16:00Z (09:00 PDT)
[01:12:28] I'm too tired and hungry to try now that we have the trebuchet problems fixed
[01:17:34] bd808: np, thanks very much once again
[15:39:22] ori's work with pygments reminded me that I once wrote a lexer for them -- https://bitbucket.org/birkenfeld/pygments-main/pull-request/193/lexer-for-iso-iec-14977-ebnf-grammars/diff -- I can't remember why I wanted syntax highlighting for EBNF so badly though
[15:46:18] what happened to Max's fork?
[15:53:15] dunno
[16:00:30] he abandoned it
[16:01:03] greg-g: if you remember, during my review, i mentioned that there had been another incident where I thoughtlessly stepped on somebody's toes
[16:01:39] that was it. he was understandably upset about it and I apologized profusely. I feel pretty awful about it. I offered to abandon my patch sets but he said we might as well just go with it at this point.
[16:02:42] ori: I'm ready to try the scap with restart changes
[16:03:00] do you have a good idea of how to tell if the depool part actually works?
[16:03:42] what do you mean? how to test that pybal actually depools the hosts? if apache goes down, then pybal will depool them; we're not testing pybal, after all
[16:04:13] true. I guess I mean verify that apache actually goes down
[16:04:37] we should see that in logs somewhere. Are they root only?
[16:05:06] drwxr-x--- 2 root adm 4096 Jun 22 06:27 /var/log/apache2/
[16:05:09] yup
[16:05:46] bd808: open a telnet to port 80
[16:06:47] bd808: i can chmod them on a particular host
[16:07:07] or you could just come watch
[16:07:17] yeah
[16:07:20] which host?
[16:07:25] i guess "any"
[16:07:44] yeah it should happen on all of them in the group
[16:09:24] i'm ready, running watch -d -n0.1 service apache2 status
[16:10:11] k. here goes
[16:23:26] ori: gotcha :/
[18:29:05] legoktm: Do you want to do the honors on https://gerrit.wikimedia.org/r/#/c/160223/ ?
[18:29:13] bd808: with the hosts pooled, it took a lot longer (4m vs 1m). will that grow linearly with the number of hosts?
[18:31:08] bd808: sure
[18:32:12] lemme also run a final grep...
[18:33:52] ori: the batch size is at 1 right now for the restart. we should probably raise that
[18:34:57] bd808: can it be specified as a percentage?
[18:35:13] we can compute that
[18:36:52] In theory Apache is configured to only wait 5 seconds for the processes to drain but in practice I'm not sure how long it will take for graceful-stop to actually kill off the parent process
[18:40:58] ori: what percentage were you thinking? Remember that this will be a percentage of the total number of scap targets in the cluster
[18:41:48] 10%
[18:44:30] asked _joe_ on other channel
[18:44:34] let's see what he says
[19:07:29] bd808: so, 5%
[19:10:04] and you convinced yourself that a bad shuffle won't kill us right?
[19:10:47] the 5% will be 5% of all mw hosts
[19:11:01] s/mw hosts/scap targets/
[19:59:48] legoktm: btw, for the xpp => java compat patch, I could have written up the patch in the time it took to explain it in the email. I left it as an excuse to get someone involved in some light-weight patch contribution.
[20:01:00] ori: oh :P john is one of the main pywikibot devs
[20:01:15] bd808: https://dpaste.de/t7ku/raw
[20:02:15] bd808: seems like a race condition if ssh completes before you reach the os.waitpid line
[20:03:22] ori: hmmm... I've never seen that happen before
[20:03:44] bd808: probably because we haven't regularly been scapping to a group of just 10 hosts
[20:03:47] Keegan: can you hear me? my internet has been weird all morning
[20:05:52] bd808: If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately and the state of the siginfo_t structure pointed to by infop is unspecified. To distinguish this case from that where a child was in a waitable state, zero out the si_pid field before the call and check for a nonzero value in this field after the call returns.
[20:05:54] http://linux.die.net/man/2/waitpid
[20:09:08] ori: you want to patch for it or should I give it a shot?
[20:09:41] go for it
[20:11:27] bd808: i'm not sure how to fix it exactly; i think we might want to pass waitpid WSTOPPED or WEXITED (or both)
[20:12:13] hm, no
[20:12:16] we don't want to do that
[20:12:22] we just want to try / except
[20:12:45] and only reraise if the errno is other than 10
[20:15:17] bd808: http://stackoverflow.com/a/1609031/582542
[20:17:58] bd808: more fun: it's a python bug http://bugs.python.org/issue1731717
[20:19:34] i think try / except, and if errno == ECHILD, then procs.clear()
[20:19:46] if errno != ECHILD, raise
[20:20:30] or instead of procs.clear, just break
[20:20:37] there's nothing useful left to do in the loop anyway.
[20:30:29] ori: should it really break on ECHILD? Won't that leave undrained pipes?
[20:36:54] legoktm: is it possible some of my edits disappeared from my user contribution?
[20:37:15] matanya: umm, more context?
[20:37:47] bd808: i have an idea -- can i run it by you via a patch?
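[editor's note] The try / except idea from 20:12 to 20:20 could look roughly like the sketch below. It is not the scap code or the patch that was later uploaded: the function name reap_children and the shape of procs (a dict mapping pid to a process object) are assumptions made for illustration. It re-raises any OSError other than ECHILD and breaks out of the loop on ECHILD, as proposed; it does not address the 20:30 question about undrained pipes.

```python
import errno
import os
import time


def _exit_code(status):
    """Convert a raw waitpid() status into an exit code (negative for signals)."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def reap_children(procs):
    """Reap the children listed in `procs` (hypothetical dict: pid -> process).

    Entries are removed from `procs` as they are reaped. Returns a dict of
    pid -> exit code for the children reaped here. Pids that were already
    reaped elsewhere are left in `procs` for the caller to deal with.
    """
    statuses = {}
    while procs:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno != errno.ECHILD:
                raise
            # ECHILD: no children left to wait for, so whatever is still in
            # `procs` was reaped by something else. Nothing useful remains.
            break
        if pid == 0:
            # Children exist but none has exited yet; avoid a busy loop.
            time.sleep(0.01)
            continue
        if pid in procs:
            del procs[pid]
            statuses[pid] = _exit_code(status)
    return statuses
```

Breaking rather than calling procs.clear() leaves the stale pids visible to the caller, which may help when deciding what to do about their pipes.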
[20:38:10] Sure.
[20:38:10] legoktm: i edited this page https://commons.wikimedia.org/wiki/Commons:Deletion_requests/File:Jimmy_Wales_by_Pricasso_%28the_making_of%29.ogv
[20:38:18] on 11:55, 26 December 2013 (UTC)
[20:38:56] doesn't show in user contribution
[20:39:15] something is sneaky here
[20:40:00] matanya: https://commons.wikimedia.org/w/index.php?title=Commons%3ADeletion_requests%2FFile%3AJimmy_Wales_by_Pricasso_%28the_making_of%29.ogv&type=revision&diff=112591544&oldid=112515160
[20:40:34] man, sighing as me!
[20:41:02] thanks legoktm didn't think of that
[20:41:25] :)
[20:41:39] * matanya punches himself in the face
[20:49:40] bd808: https://gerrit.wikimedia.org/r/#/c/220301/
[21:01:59] ori: testing locally, not ignoring you :)
[21:02:20] no worries, no rush
[21:05:35] heh. I think my local test is revealing a separate bug that I introduced
[21:06:02] 5% rounded down of 1 is 0 which is not a great batch size
[21:06:57] right, so max(len(target_hosts) // 20, 1)
[21:07:09] yeah
[21:07:19] is that the cause of this bug, too?
[21:07:22] see if that fixes things for me
[21:07:38] i.e. did we possibly have a batch of 0?
[21:08:21] I think one way of possibly testing my patch is to add something like this right above line 151 in ssh.py: if random.random() <= 0.30: continue
[21:08:36] this will simulate the bug by leaving a pid in procs that has already died
[21:10:03] "did we possibly have a batch of 0?" I don't think so. For me this just hangs the whole process
[21:31:26] ori: when you have a moment, please comment on https://gerrit.wikimedia.org/r/#/c/218905/
[21:56:04] matanya: commented
[21:57:54] thanks ori
[22:17:15] csteipp, legoktm: any time to review https://gerrit.wikimedia.org/r/#/c/204059/ ?
[22:19:35] I left a comment
[22:21:16] legoktm: thanks. I'm not sure how the job "knows" that though
[22:21:51] the code the patch introduces to create the jobs knows, but not the job
[22:25:44] bd808: it's checked in linke 31
[22:25:48] line*
[22:26:34] * bd808 smacks forehead
[22:29:22] is there any extension hook that allows extension to add contect into <head> section?
[22:29:31] s/contect/content/
[22:29:53] BeforePageDisplay probably
[22:30:18] and OutputPage::addHeadItem()
[22:30:23] https://github.com/wikimedia/mediawiki-extensions-TwitterCards/blob/master/TwitterCards.hooks.php#L51
[22:31:20] ahh, addHeadItem looks like what I need, thanks
[23:04:16] bd808: i think the next thing i'm going to test with --restart is: add a 100 hosts to scap-test
[23:04:29] i won't depool them or remove them from mediawiki-installation or anything
[23:05:09] did you figure out the icinga thing that Faidon pointed out?
[23:05:15] yes
[23:05:20] added a retry to the check
[23:05:20] cool
[23:05:32] it will check twice before reporting
[23:05:40] and there's a ~10s gap between checks
[23:06:06] if it takes 20s to restart hhvm we probably want to know that
[23:07:02] yeah
[23:08:23] * bd808 is trying to find a runaway process on his work laptop by ssh'ing in from his personal laptop
[23:08:40] * bd808 blames virtualbox
[23:10:55] apparently trying to provision a 4th VM was one too many
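[editor's note] The batch-size bug from 21:06 (5% of a single host rounds down to 0) and the max(len(target_hosts) // 20, 1) fix can be sketched as follows. Only target_hosts and the max(...) expression come from the log; the function names batch_size and batches and the percent parameter are hypothetical, and this is not the scap implementation.

```python
def batch_size(target_hosts, percent=5):
    """Hosts to restart at once: `percent` of all targets, but never fewer than 1.

    With percent=5 this equals max(len(target_hosts) // 20, 1).
    """
    return max(len(target_hosts) * percent // 100, 1)


def batches(target_hosts, size):
    """Yield consecutive slices of `target_hosts`, each at most `size` long."""
    for start in range(0, len(target_hosts), size):
        yield target_hosts[start:start + size]


if __name__ == "__main__":
    hosts = ["host%03d" % n for n in range(10)]  # made-up host names
    size = batch_size(hosts)
    print(size, [len(b) for b in batches(hosts, size)])
```

Flooring and then clamping to 1 keeps small groups (such as a 10-host test group) restarting one host at a time instead of ending up with an empty batch.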