[00:20:23] I give up on the return to for now [00:20:46] Ryan_Lane: Done otherwise [00:20:58] sweet [00:21:05] Change on mediawiki a page Wikimedia Labs/Tool Labs/TODO was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=658287 edit summary: [+159] New use case (connectivity) [00:22:01] There's probably some code cleanup to do, but I'll do it later on my local machine for eeease [00:22:48] openstack-foobar? :) [00:23:30] lol [00:23:45] I presumed I would've reworked that by now... [00:24:01] Though, it doesn't actually make any difference [00:24:14] * Ryan_Lane nods [00:24:33] fixed [00:57:11] Ryan_Lane: That OSM revisions should be good to go now [00:57:20] For the OATHAuth one - https://bugzilla.wikimedia.org/show_bug.cgi?id=40091 [00:57:25] Does it actually need it's own tab? [00:57:33] I don't think so [00:57:41] I'd like to get the return-to links working, though [00:57:48] otherwise it's confusing for the end-user [00:57:57] Would it make sense just to move the links from Special:OATH to Special:Preferences directly [00:58:04] ie the Enable/Disable/Reset links as appropriate? [00:58:06] probably so, yeah [00:58:14] Which removes one click level [00:58:18] yep [00:58:24] die clicks die [00:58:42] Indeed [00:58:49] And will possibly fix the return to there.. [00:58:52] https://wikitech-test.wmflabs.org/w/index.php?title=Special:OATH&returnto=Special%3APreferences [01:00:09] as it's in the first url, it's just not carried on further on [01:00:14] * Reedy starts poking that [01:00:58] hm. maybe rather than manage oath, it should show enable or reset/disable [01:01:12] that's another click gone as well [01:01:25] and it makes it easier to return to the preferences pane [01:12:06] Ryan_Lane: That's what I was meaning [01:12:11] ah [01:12:12] yeah [01:12:12] https://wikitech-test.wmflabs.org/wiki/Special:Preferences [01:12:16] ^ Done, I think ;) [01:12:38] * Ryan_Lane tries [01:12:50] Damn it. 
return to is still broken [01:13:09] heh [01:13:09] yep [01:13:24] one step in the right direction [01:19:19] Ryan_Lane: https://gerrit.wikimedia.org/r/53304 [01:19:37] That improves on it, the returnto is another thing I'll have to poke at seperately and see how we normally do it [01:19:50] * Ryan_Lane nods [01:20:31] oh. I'm explicitly setting the returnto somewhere [01:20:43] that's why this isn't working [01:21:38] Oh, I nuked it. Pffft [01:22:17] for instance: $out = Linker::link( $this->getTitle(), wfMsgHtml( 'oathauth-backtodisplay' ) ); [01:22:25] it doesn't use returnto at all [01:22:46] and that parameter isn't being passed through to the htmlform callback, either [01:24:06] https://wikitech-test.wmflabs.org/w/index.php?title=Special:OATH&action=enable&returnto=Special%3APreferences [01:24:29] $returnToTitle = Title::newFromText( $this->mReturnTo ); [01:24:54] should just be a matter of passing it through in htmlform, and using the parameter to make the link [01:25:51] let me see if I can do it really quick [01:38:28] that may do the trick [01:39:48] heh [01:39:51] that sure as hell didn't work [01:41:00] ah. right. it's missing the parameter [01:47:07] almost there [01:48:32] We should probably make it an unlisted special page, and remove the group on Special:SpecialPages [01:49:25] yep [01:49:26] extends SpecialPage => extends UnlistedSpecialPage [01:50:06] 'oathauth-prefs-manage' is unused again too [01:51:52] let me delete that [01:52:43] oh [01:52:45] did you push stuff in? [01:53:56] Where? [01:53:57] needed to rebase [01:55:36] grr [01:56:11] reset isn't returning properly [01:56:14] everything else is [01:58:28] Slightly weird when all the code looks the same [01:59:43] yeah [01:59:52] it's the damn DerivativeRequest, I'm betting [02:02:04] ah [02:02:05] I see [02:02:40] I really need to use variables more consistently [02:04:35] that did it [02:05:52] ok. 
I think that change is good to go, now [02:05:55] aha [02:06:05] I'll just skim over it [02:06:50] Yeah, LGTM [02:06:54] cool [02:06:56] I'll merge it in [02:07:58] I may as well deploy it, too [02:08:08] yeah [02:10:49] The OSM change should be good to go too [02:12:10] am I a reviewer on that? [02:12:20] it's missing from my queue [02:12:35] found it [02:12:55] doing a quick review [02:13:19] https://gerrit.wikimedia.org/r/#/c/53248/ does a bit of cleanup [02:14:12] AFK back later [02:14:30] ok [04:01:59] addshore [04:02:07] 2030 [04:02:08] load [04:02:10] :D [07:05:40] petan, you around? [07:06:36] I've started linkwatcher again, using your script and "qsub -u long /data/project/beetstra/linkwatcher/linkwatcher.sh -o /data/project/beetstra/linkwatcher/syslog.output -e /data/project/beetstra/linkwatcher/syslog.errors" .. it is in the queue already for at least 30 minutes without starting .. how long does that normally take? [07:07:39] depends on what else is running [07:07:40] I've also made the files and started the other 3 bots .. which are also waiting in the queue - 'state -> 'qw' in qstat [07:08:00] hmm .. what other bots are running? [07:08:25] well mine have been running since yesterday [07:08:28] in SGE [07:08:43] idk if anyone else is using it [07:09:29] I was suggested to start using that as well .. since linkwatcher is needin more resources [07:09:41] But those will be running continuously ... [07:09:55] oh these scripts im running are one time [07:10:05] they're the interwiki link removers [07:10:10] actually, linkwatcher was running last night on sg, but I adapted it, and was then suggested to put them in the long queue [07:10:14] Yeah, I know [07:10:27] I think we are doomed to tease each other with our bots :-) [07:12:14] :P [08:13:36] legoktm .. they sometimes get to 'status 't'' .. 
but get returned to qw, even in the normal queue [08:13:50] i forget what t stands for [08:13:52] * legoktm looks it up [08:13:59] t = transfer [08:14:15] t(ransfering) [08:14:15] T(hreshold) [08:14:17] ah [08:14:38] is there a way to see all the jobs that are being run? or do you need to be root for that? [08:14:50] did not figure that out yet [08:15:11] new to the qxxx commands [08:16:05] OMG [08:16:09] its addshore's fault [08:16:17] hold on a sec [08:16:38] http://bots.wmflabs.org/~legoktm/all.txt [08:17:03] * Beetstra loves it when it is someone elses fault :-) [08:17:08] though [08:17:12] i dont see your jobs? [08:17:24] legoktm@bots-gs:~$ qstat -u '*' > ~/public_html/all.txt [08:17:58] I submitted 4, killed 2 of them (hoping the other 2 came through) [08:19:28] http://bots.wmflabs.org/~legoktm/beetstra.txt [08:20:29] that is curious [08:20:43] legoktm@bots-gs:~$ qacct -o beetstra -j > ~/public_html/beetstra.txt [08:20:54] what does that mean? Those jobs are old and I gdel-d them [08:21:05] thats just your job history [08:21:18] yeah, I see [08:21:21] (now) [08:21:49] qstat -f says now: [08:21:53] main.q@bots-bnr1.pmtpa.wmflabs BIP 0/7972/20000 176.83 lx26-amd64 a [08:21:53] 18439 0.26278 unblockbot beetstra t 03/12/2013 08:21:12 1 [08:21:53] 18440 0.26273 xlinkbot.s beetstra t 03/12/2013 08:21:12 1 [08:22:06] ah [08:22:08] those should be the two 'lighter' bots [08:22:18] but they likely are back in qw in 5 minutes [08:22:54] earlier it was: [08:22:55] 18439 0.26052 unblockbot beetstra qw 03/12/2013 07:34:37 1 [08:22:56] 18440 0.26048 xlinkbot.s beetstra qw 03/12/2013 07:34:47 1 [08:23:02] hmmm [08:24:57] * Beetstra wants to try something .. [08:30:16] OK, that test works - if I log into bots-bnr1, and start the bots using my the script, they run [08:30:35] But submission of the same script into the queue .. stalls. 
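A minimal sketch of summarising the `qstat -u '*'` listing by job state, assuming the column layout of the job lines pasted above (job ID first, state fifth). The helper name `count_states` is hypothetical; it reads qstat-style text on stdin, so it can be tried without access to the grid.

```shell
#!/bin/sh
# count_states: hypothetical helper, not part of the grid engine tools.
# Reads qstat-style output on stdin and prints "<state> <count>" per job
# state, sorted by state name. Only lines whose first field is a numeric
# job ID are counted, so header, separator, and queue lines are skipped.
# Assumes the state is the fifth whitespace-separated field.
count_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Possible use on the submit host (not run here):
#   qstat -u '*' | count_states
```

A listing dominated by `qw` with few `r` entries would match what the log describes: jobs accepted by the scheduler but not yet placed on a node.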
[08:39:38] nah [08:41:19] anyway, they're back in 'qw'-state [08:44:59] hmm [08:45:01] the bot is don [08:45:02] down* [08:45:03] but [08:45:04] petrb to SAL: disabling addshore's cron for a while [08:47:56] wow [08:48:24] addshore: the job i submitted 2 days ago was at 30. now we're at 20262. [08:49:06] hmm [08:49:21] petan: if i schedule a job with the same name, isnt it just supposed to ignore it? [09:07:55] hi [09:07:57] I am here [09:08:32] legoktm no [09:08:34] it will submit it again [09:08:42] Beetstra, legoktm patience [09:08:46] :P [09:08:47] load is not some 8000+ [09:08:50] :P [09:08:51] * legoktm waits patiently [09:09:04] we need to wait for that burst of addshore jobs to finish somehow [09:09:08] I don't want to kill hem [09:09:14] addshore ping [09:09:17] iirc on the toolserver jobs dont get resubmitted [09:09:20] do something with that pls [09:09:27] he should be sleeping now [09:12:48] !log bots deleting all qw jobs of addshore from queu [09:12:52] Logged the message, Master [09:30:01] 20336 0.25679 unblockbot beetstra r 03/12/2013 09:28:23 main.q@bots-bnr1.pmtpa.wmflabs 1 [09:30:06] yay! [09:30:26] :P [10:01:37] Change on mediawiki a page Wikimedia Labs/Migration Of Toolserver Tools was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=658587 edit summary: [+287] /* When can I migrate my software to Labs? */ info on db replicas and user dbs [10:12:13] Change on mediawiki a page Wikimedia Labs/Migration Of Toolserver Tools was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=658588 edit summary: [+514] /* List of important questions/FAQ */ added section about permissions [10:14:42] oh man [10:14:46] "Failed to add jenkins-bot to deployment-prep. This needs the "loginviashell" right." [10:14:48] seriously [10:15:10] petan: is OG overflowing with me? 
:< [10:15:29] Change on mediawiki a page Wikimedia Labs/Migration Of Toolserver Tools was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=658589 edit summary: [+394] /* List of important questions/FAQ */ added section about stewards [10:16:58] Beetstra: legoktm remember you can set prioritys for tasks, I imagine that would affect how quickly they get picked up in the queue :) [10:17:07] oh how? [10:17:14] * addshore goes to find the parameter [10:17:31] * legoktm sets priority to addshore+1 ;) [10:18:28] qsub -p (priority which is The qsub utility shall accept a value for the priority option-argument that conforms to the syntax for signed decimal integers, and which is not less than -1024 and not greater than 1023.) [10:19:27] Change on mediawiki a page Wikimedia Labs/Migration Of Toolserver Tools was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=658590 edit summary: [+302] /* Table of features needed for current tools */ added link to the list of tools [10:21:17] Change on mediawiki a page Wikimedia Labs/Migration Of Toolserver Tools was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=658591 edit summary: [+13] /* Wikimedia Germany */ completed WMDE's staff list [10:23:06] Change on mediawiki a page Wikimedia Labs/Migration Of Toolserver Tools was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=658592 edit summary: [-3] changed disclaimer on top of the page [10:27:01] Whee!! All 4 running [10:27:16] And even better, linkwatcher is slowly munching away old backlogs [10:28:59] hmm .. but bot is not very, very responsive [10:29:06] Bleh .. lecture .. have to go again [10:29:59] heh .. cancelled .. :-D [11:12:03] hashar need some rights on wikitech [11:12:04] 
[11:12:11] maybe I can help you [11:15:25] addshore yes it was [11:15:41] addshore like 20 000 of your jobs waiting in queue :P [11:15:54] HAH [11:16:02] I have restructured my cron a bit :) [11:16:05] silly OG [11:16:05] ok [11:17:57] haha, i would say it is working looking at the loads but I think I just broke my cron ;p [11:19:35] so, how is this now. Is the box just working on 'full capacity', and the bots have to share that, or do individual bots get a predetermined fraction of 'load' assigned, and they are forced to stay within that? [11:20:41] its currently anything goes as long as the load stays below (i think 15) [11:21:30] did you see my message above about priority Beetstra ? [11:21:46] yeah .. I did not use that [11:21:54] It just started now [11:22:10] And I thought I added '-u long' to the list, but I see they are in the main.q [11:22:47] isnt it -q long? [11:23:07] * Beetstra copied petan's commandline :-) [11:23:14] * Beetstra is new to the 'qxxx' commands [11:23:19] the documentation is so messy [11:23:52] oh yes it's -q [11:23:55] [-u user_list] [11:23:58] hehe [11:24:00] if I told you -u it was not that [11:24:00] [-q destination] [11:24:05] * Beetstra changes script [11:24:08] xD [11:24:40] petan: are my jobs being denied from the queues atm or are they just breaking in some other way? [11:26:05] they are breaking [11:26:17] :D [11:26:20] interesting xD [11:28:16] well at least that cleared out the queue ;p [11:28:16] OK, 3 of 4 transferred to long queue [11:34:55] right, just let it be known that OG really cant handle lots of little jobs very well at all :/ [11:42:58] Beetstra do you need instance bots-liwa now? [11:59:58] okay I wrote some documentation Darkdadaah - https://wikitech.wikimedia.org/wiki/Nova_Resource:Bots/Documentation#Using_OGE [12:00:05] it's not much but better than nothing :P [12:01:09] petan, let me clean up there first [12:01:19] Beetstra cleanup what? 
:o [12:01:20] but it looks like I am finished with both -liwa and -nr1 [12:01:24] ok [12:01:37] :-) .. there may be a residual backlog file there .. have to check [12:01:41] though I think all is clean [12:01:43] sure just let me know when you wouldn't need them anymore so we can delete them [12:02:20] bots-liwa can go [12:02:27] ok [12:03:03] and I am not using bots-nr1 either anymore [12:03:06] !log bots deleted bots-liwa [12:03:08] Logged the message, Master [12:03:18] (maybe someone else is using bots-nr1?) [12:03:27] I don't know but I will figure out :) [12:03:45] :-) [12:04:14] Beetstra when you have a lot of time, you might consider moving all databases from bots-sql2 to bots-bsql01 - but that doesn't need to be done asap [12:04:27] it's just bigger and faster sql [12:04:29] OK [12:04:36] that is going to take time, indeed [12:04:41] probably yes [12:04:52] But may take a couple of weeks before I have serious time for that [12:04:59] no problem [12:05:06] Unless you have a quick way of just copying them instead of transferring them [12:05:46] I was using mysqldump for that, so probably nothing really fast, but... maybe we can figure out some better solution [12:06:18] Beetstra you still have processes running on -nr1 [12:06:24] huh [12:06:47] perl LinkSaver [12:07:04] petan, the servers are configured to use one file per db? [12:07:12] Platonides only the new one [12:07:21] Platonides these old sql servers were not I think [12:07:29] killed them .. 
sometimes the modules of my bots don't autodie [12:07:33] I was to suggest rsyncing the innodb files [12:07:36] Platonides one file per table, not db [12:07:47] but if it's in one big block, you may not be able to do that properly [12:07:48] Platonides not possible in this case :( [12:08:19] but maybe I find a way to quickly convert the one-file to multiple [12:08:24] like recreating the db online or something like that [12:08:30] who knows [12:08:36] maybe there is some tool for that [12:08:50] but it can wait now [12:09:32] !log bots deleting -nr1 [12:09:33] Logged the message, Master [12:11:19] Platonides I fixed that * bug [12:11:21] :P [12:11:22] petan: About OGE: -o -e and -j don't work? [12:11:26] now it tell you invalid name [12:11:37] Darkdadaah they should but they don't [12:11:40] and I have no idea why [12:11:52] everytime when I used them, nothing happened [12:12:16] I mean - the job started, finished and no output :/ [12:12:48] If people want to reuse their scripts, we should try to make those work. [12:12:58] hm... indeed [12:13:06] the script I wrote is a workaround for this bug [12:13:17] not a final solution :o [12:13:48] Writing >> $logfile at every line is troublesome :( [12:13:58] I know it's a workaround. [12:14:22] well, heh it doesn't need to be at every line [12:14:35] if you want to launch a huge script, you could create a second shell for that [12:14:50] but it needs to be at 1 line at least [12:15:07] If you don't add it to one line, Murphy's law predict that this is the line that will fail. [12:19:26] petan, which server was a submit host? [12:20:04] -gs [12:20:18] gs (grid scheduler) [12:20:19] *bots-gs [12:20:28] weird name :P [12:20:35] but it's short and I <3 short names [12:22:14] The load seems to be really high on both nodes. [12:23:16] this is weird, why would it suddenly show the waiting job in a queue and later in none again? [12:24:08] Because it ended really fast? 
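A rough sketch that pulls together the qsub flags mentioned so far in the log: `-q` for the queue (not `-u`, which takes a user list), `-p` for priority, and `-o`/`-e` for the output and error files. The function names `priority_ok` and `qsub_cmd` and the `.out`/`.err` file naming are hypothetical. The priority bounds are the ones quoted in the log (-1024 to 1023); the installed grid engine's own man page may state a different range, so treat the check as an assumption. The command line is only printed, never run.

```shell
#!/bin/sh
# priority_ok: hypothetical check that $1 is a decimal integer (optional
# leading minus) inside the range quoted in the log, -1024..1023.
priority_ok() {
    case "$1" in
        ''|-) return 1 ;;
        -*) digits=${1#-} ;;
        *) digits=$1 ;;
    esac
    case "$digits" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge -1024 ] && [ "$1" -le 1023 ]
}

# qsub_cmd: hypothetical helper that prints (does not execute) a qsub
# command line for the given queue, priority, and script.
qsub_cmd() {
    queue=$1
    prio=$2
    script=$3
    priority_ok "$prio" || { echo "invalid priority: $prio" >&2; return 1; }
    echo "qsub -q $queue -p $prio -o $script.out -e $script.err $script"
}

# Example (prints the command only):
#   qsub_cmd long -10 linkwatcher.sh
```

Printing the command first makes it easy to spot a `-u long` versus `-q long` mix-up before anything is submitted.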
[12:24:39] !mail [12:24:39] we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe [12:24:42] ignore me [12:25:11] Darkdadaah: sorry thats me, just fixing it now [12:25:20] Platonides: it probably broke [12:25:39] Darkdadaah. I don't think so [12:25:44] it's still in qw state [12:25:59] ahh, thats again me causing things to go slowly :/ [12:26:04] it will go through shortly dont worry! [12:26:11] Oh. Ok. [12:26:39] w -> waiting [12:26:50] not sure what's the q for [12:26:56] "waiting for a queue" ? [12:27:29] qw = queued/waiting [12:27:43] r = running :) [12:28:58] queued and waiting for resource on a node [12:28:58] it has been waiting for 7 minutes :( [12:29:13] ok I completely rewrote the docs, have fun [12:29:14] all the nodes must be quite overloaded, then [12:29:21] don't overload box :) [12:30:31] given that we removed some boxes - we might consider creation of bnr3 but... [12:30:34] do we need it? :o [12:30:44] petan: I think with everyone moving onto these yes :/ [12:30:55] * addshore is still cutting back his cron :p [12:31:12] addshore what about just submitting these jobs less often ;( [12:31:14] ;) [12:31:21] thats what I am doing xD [12:31:24] :) [12:31:28] petan, just to have an idea .. if you compare bnr1 with -liwa or -nr1, how much bigger is bnr1? [12:31:30] but i have hundreds of lines of cron xD [12:31:48] -nr1 was 2gb or ram + swap and 1 cpu [12:32:01] -bnr1 is 8gb of ram + 20gb of swap and 4 cpu [12:32:21] Beetstra ^ [12:32:28] addshore I know [12:32:29] :P [12:32:31] do note that the linkwatcher was munching quite a bit of -liwa .. [12:32:43] ok -liwa was just as -nr1 [12:32:44] small [12:33:05] yeah, but roughly spoken, -bnr1 is 16 times -nr1 ... [12:33:11] petan: is there a way to clear all waiting jobs for a user? [12:33:21] addshore yes but hard [12:33:28] qdel is boring [12:33:31] addshore I can remove them if you want [12:33:33] OK ... so we have that x 32 (bnr1 & bnr2) .. hmm .. 
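The question above (clearing only one user's waiting jobs) comes down to filtering the qstat listing by user and by the `qw` state, then handing the job IDs to qdel. A minimal sketch, assuming the column layout seen in the pasted qstat lines (job ID first, user fourth, state fifth); the helper name `qw_job_ids` is hypothetical, and it only prints IDs so the result can be checked before anything is deleted.

```shell
#!/bin/sh
# qw_job_ids: hypothetical helper. Reads qstat-style output on stdin and
# prints the IDs of jobs owned by user $1 that are in state "qw".
# Running jobs and other users' jobs are left out.
# Assumes user is field 4 and state is field 5.
qw_job_ids() {
    awk -v u="$1" '$1 ~ /^[0-9]+$/ && $4 == u && $5 == "qw" { print $1 }'
}

# Dry run on the submit host (not run here):
#   qstat -u "$USER" | qw_job_ids "$USER"
# Only after checking the list:
#   qstat -u "$USER" | qw_job_ids "$USER" | xargs qdel
# (with GNU xargs, adding -r skips qdel when the list is empty)
```

Matching on exact fields rather than a pattern such as `user\s*qw` avoids catching a job whose name happens to contain the user name.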
[12:33:37] it requires some shell magic [12:34:04] go for it petan as long as its only the ones waiting ;p [12:34:39] addshore qdel `qstat -u $user | sed 's/^\s*//' | sed 's/\s.*//'` [12:34:45] :D [12:34:48] wait [12:34:49] no [12:34:53] that will delete all [12:35:18] xD [12:35:56] addshore qdel `qstat -u addshore | grep -E 'addshore\s*qw' | sed 's/^\s*//' | sed 's/\s.*//'` [12:36:09] you sure? ;p [12:36:14] addshore no [12:36:21] addshore echo `qstat -u addshore | grep -E 'addshore\s*qw' | sed 's/^\s*//' | sed 's/\s.*//'` [12:36:22] :P [12:36:25] that's safe version [12:36:32] ill just wait then :P they should have made it to the queue in 1 or 2 more mins [12:36:33] check what it produces [12:36:42] I hope it is not possible to delete other people's jobs :/ [12:36:50] Darkdadaah I don't know? :D [12:36:53] qdel -u user_name doesn't work ? [12:36:53] but it shouldn't be [12:37:06] phe: yes it does but he wants to delete only some [12:37:06] qdel does seem to have -u as a param [12:37:07] not all [12:37:58] Platonides: is it running yet? ;p [12:38:01] iirc you can do also qdel 133-192,257 to del job id 133 to 192 + 257 [12:38:18] phe: !!!!! :O [12:38:25] i did not know you could do ranges [12:38:42] addshore or: qdel "please delete all my jobs in qw status" [12:38:55] but probably not so smart :P [12:39:09] btw why there is no limit through setrlimit at login on shared ? [12:39:15] hmm phe ranges dont work [12:39:15] *shared box [12:39:31] phe: because bots is work in progress [12:39:53] there are almost no limits atm [12:39:58] petan: I think we should create wrappers for each of these commands :/ [12:40:04] addshore yup [12:40:10] addshore feel free to do so + docs [12:40:11] they are just ... shit xd [12:43:34] petan: -o and -e work for me. [12:44:05] ^those have always worked for me [12:45:15] -o -e? 
[12:45:41] -o output.log -e errors.log [12:46:39] ahh :) [12:47:11] * addshore now has 3 waiting jobs :d [12:47:14] 1 [12:47:27] all gone :) [12:48:10] Even with no parameters, the files are created in ones home as script_name.e{JobID} and script_name.o{JobID} [12:48:42] (not sure about the path) [12:49:52] yeah [12:49:55] e for errors [12:49:57] o for output [12:59:54] :-/ [13:00:08] linkwatcher is getting it harder again :-( [13:10:03] petan and/or beetstra… still having trouble with sudo? [13:10:36] andrewbogott .. no, moved bots to another instance, not necessary anymore [13:10:47] Ok, so… shall I close out https://bugzilla.wikimedia.org/show_bug.cgi?id=45985? [13:10:50] That bug can be closed, bots-liwa does not exist anymore [13:11:03] And you're able to sudo elsewhere? [13:11:05] (forgot to test before the instance was deleted) [13:11:16] I am now running all bots from bots-sg [13:11:26] or -gs .. whatever [13:12:01] thanks for following up [13:13:03] [bz] (RESOLVED - created by: Peter Bena, priority: High - major) [Bug 45985] https://wikitech.wikimedia.org/wiki/Special:NovaSudoer doesn't work in case of beetstra - https://bugzilla.wikimedia.org/show_bug.cgi?id=45985 [13:18:29] Reedy, ping [13:22:36] andrewbogott I forgot about that bug but it didn't work before [13:22:41] no idea why [13:23:33] [bz] (NEW - created by: Peter Bena, priority: High - major) [Bug 45768] console doesn't show proper errors - https://bugzilla.wikimedia.org/show_bug.cgi?id=45768 [13:23:45] petan: Ok, let me know if you see something like that happening again. [13:23:52] k [13:40:32] are cpu hot pluggable on instance ? and can disk space increased w/o rebotting an instance ? [13:40:40] phe: nop [13:40:46] phe: you need to create a new instance :/ [13:46:21] andrewbogott: hello :-] Do you have any slight idea why Icinga is not able to monitor some instances? 
http://icinga.wmflabs.org/cgi-bin/icinga/status.cgi?hostgroup=deployment-prep&style=detail [13:46:50] I guess Icinga can't connect to the nrpe daemon running on the instance, maybe cause of some firewall filter [13:48:08] hashar: No idea offhand. Doesn't the daemon contact the server? In which case the firewall shouldn't matter... [13:48:52] hashar, even increasing disk space can't be done with an instance reboot ? [13:49:07] phe: I am pretty sure we can't :( [13:49:46] andrewbogott: that is Icinga which connect to the nrpe daemon on the instance and then ask it to start some commands [13:49:52] andrewbogott: I will try to figure it out :-] [13:50:44] hashar: May well be a firewall issue then, esp. if icinga uses different ports than nagios [13:51:29] I guess the icinga instance has a different IP address :D [13:52:25] hmm the default is to accept 5666 (nrpe port) for 10.4.0.0/21 [13:53:27] filling a bug :-] [13:55:16] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 46026] icinga: cant monitor some instances - https://bugzilla.wikimedia.org/show_bug.cgi?id=46026 [13:57:38] petan: the gird is actually stable now xD [14:00:00] ok [14:51:54] petan: if you see the number 200 anywhere in OG settings lemme know ;p [14:54:06] hashar, volume have fixed size but it look like you can create/attach new volume to an instance and use lvm in the instance to increase storage space, it's a recent feature of openstack [15:03:14] Coren: are you there? [15:03:35] Darkdadaah: I am /always/ there. Watching from the shadows. Muahaha! [15:03:43] :) [15:04:23] What can I do to you? [15:04:39] Coren: I'm trying to import data in my db on Tools but I can't use DATA LOCAL INFILE. [15:05:40] "The used command is not allowed with this MariaDB version" [15:06:00] Is there another way? [15:06:58] Hmm. Lemme check on something. [15:07:55] I believe most MySQL infile/outfile requires privileges on the server side. 
[15:08:46] scfc_de: Not local data infile, but that might be because it's the default mysql client and not the mariadb that's on the grid itself. [15:09:05] Sorry I need to step away for a bit. I'll be back a bit later. [15:09:44] Coren: Ah, missed "LOCAL". Must try this on Toolserver some time where this always was a major hassle. [15:11:19] phe: maybe we do not support it yet [15:11:41] Darkdadaah: Give me a minute, I think this can be solved by switching the client to the MariaDB one. [15:12:54] what? [15:20:45] Darkdadaah: Should work now. That was it: it requires the mariadb client [15:23:49] PETAN! I FOUND IT! [15:24:01] o.o [15:25:26] i figured out that OG was hitting a limit of 200ish processes across all instances, spent about half an hour looking for everything that could be causing it, anything set at 100 as we have 2 instances, or 200, or 0.blar. I just found gid_range which had a range of 200, I just added an extra 100 to the range and all pending tasks jumped into queues! [15:26:51] !log bots increased gid_range by 100 to allow more simultaneous jobs [15:26:54] Logged the message, Master [15:27:31] legoktm: http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&c=bots&h=bots-bnr1&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [15:27:42] notice how the cpu wait and load has gone down and mem up [15:27:51] :o [15:27:56] thats as now the jobs are acutlaly runnning rather than just having space reserved for them... 
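A small sketch of the arithmetic behind the gid_range fix, assuming (as the effect described above suggests) that the number of group IDs in `gid_range` bounds how many jobs can run at once. The helper name `gid_range_size` is hypothetical, the example values are made up rather than taken from the bots configuration, and only a single `low-high` range is handled.

```shell
#!/bin/sh
# gid_range_size: hypothetical helper. Given one range written as
# "low-high", prints how many IDs it contains (both ends inclusive).
# Comma-separated lists of ranges are not handled here.
gid_range_size() {
    lo=${1%-*}
    hi=${1#*-}
    echo $((hi - lo + 1))
}

# Example with made-up values:
#   gid_range_size 20000-20199   # 200 IDs
#   gid_range_size 20000-20299   # 300 IDs after adding 100 more
```

Comparing this number with the count of running jobs is a quick way to tell whether a cap like the one described here is being hit.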
[15:27:56] thats nice [15:28:28] :D [15:36:43] !log bots reset some crazy load thresholds that were used back to more normal values [15:36:46] Logged the message, Master [15:37:04] addhappy nice [15:37:05] :D [15:37:10] yes :) [15:37:19] it only otok me like, hours xD [15:37:43] addhappy btw what are normal values [15:37:46] it was the moment when I saw the number of pending jobs jump from 150 to 0 xD [15:37:51] I suppose load like 10 is still OK [15:37:59] yee 10 and 9 [15:38:04] ok [15:38:25] Im going to make a little script to output lovely monitoring info for the whole grid in a second [15:38:46] cool [15:38:59] addhappy don't forget to put it into docs :P [15:40:04] Coren: I'm back. It seems to work now, thanks :-) [15:42:06] Darkdadaah: No problem. I wish all troubles were always this easy to solve. :-P [15:54:37] petan: how do I start tasks in OG that when they die they pick themselves back up again? :D [15:54:57] I was hoping that restart option will do that [15:55:00] in queue conf [15:55:08] but it doesn't (for some reason) [16:36:20] !log bots increase gid_range to 1000, bring avg loads down to 4 on all queues [16:36:22] Logged the message, Master [17:02:33] petan put wmbot on OG? ;p [17:02:37] andrewbogott: Hi [17:03:07] Reedy: I was pinging regarding the changes to user prefs on labsconsole... [17:03:15] You did that, or was it mostly Ryan? [17:03:48] I did most of it, Ryan did some of it [17:04:08] benestar have you moved to OGE yet? [17:04:29] Reedy, pm me your email? I don't seem to have it [17:05:13] Or, well, I guess I don't need to forward you this. I had two questions about the changes. [17:05:48] also petan we could probably get rid of bots-labs [17:05:55] ? 
[17:05:58] ah [17:06:16] purpose of that box was to host labs related bots like morebots, so they never get affected by run of other bots [17:06:22] for stability purposes [17:06:23] cant really see a need for a seperate instance for it [17:06:27] I'd like to keep it :P [17:06:30] I can [17:06:31] petan, run with a high priority ;p [17:06:37] labs-morebots, where are you? [17:06:37] I am a logbot running on i-0000015e. [17:06:37] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [17:06:38] To log a message, type !log . [17:06:46] for example this morning load on both nodes was over 2000 [17:06:50] petan, addshore, labs-morebots is running there. [17:07:04] petan: that was a phantom load :P [17:07:09] andrewbogott yes I know, but not on bnr [17:07:14] using SGE [17:07:19] true [17:07:21] neither server was actually loaded, infact they are more loaded now with a load of 1.2... [17:07:23] addshore I really want to keep it for now :P [17:07:39] addshore I don't believe that OGE is stable yet [17:07:39] Reedy, first of all, I'm dumb and can't find the new place to enter/view a public ssh key [17:07:42] * addshore slaps petan with a fish [17:07:56] addshore we could delete bots-N instances, and keep like only 1 of them [17:08:16] Reedy: and, secondly, I'm pretty sure that the sidebar link to the page that alters user permissions is now missing. 
[17:08:25] I think lightweight irc bots could be running on some dedicated box, I don't like idea of mixing them with these heavy load jobs [17:08:50] petan yep, I just went through and had a look at ps aux, looks like 3 is empty, 2 has you running something on it, 1 has wmib, and 4 has bene 2809 and drtrigon [17:09:05] petan naaah, just need seperate queues [17:09:06] 4 is only one being used [17:09:18] I want to move wm-bot to bots-labs [17:09:29] so that we could recreate bots-1 as a testing instance [17:09:36] bots -2 -3 and -4 can die [17:09:40] * addshore is going to look at making some new queues [17:10:24] andrewbogott: For the second one, you'd have to ask Ryan. I didn't touch any of that config - https://wikitech.wikimedia.org/w/index.php?title=MediaWiki:Sidebar&action=history [17:10:28] addshore give it some time, if I see that OGE really works, we might consider moving morebots there, but wm-bot can't use OGE now [17:10:31] it's too complex for it [17:10:45] mhhhm, i remember you saying before [17:11:11] andrewbogott: For the first, it's under the "OpenStack" tab. Being shown via preferences is more normal in mediawiki world. I would agree that the tab probably needs renaming [17:11:13] it's actually so complex I didn't create a startup script for it yet and there is one thing - IT MUST NOT run on gluster [17:11:18] and there is no other option atm [17:11:25] that's why it runs in /mnt/share [17:11:59] actually wm-bot is so complex I was thinking of getting own project on labs for it :P [17:12:35] but as long as I can run it on own box in bots project, I don't care [17:12:46] Though, that sidebar history doesn't show it being removed.. [17:13:00] I just don't like idea of mixing it with other tools, especially these which can eat all RAM, or CPU and break it [17:13:28] Reedy: Ok, I see it. I guess I clicked on every tab but that one [17:13:33] :D [17:13:44] Is there ever anything under that tab besides keys? 
[17:13:57] Not for the moment at least [17:14:09] Heh, then it should definitely be renamed! [17:14:19] addshore: you pinged me? [17:14:31] I'm not sure it should be called "SSH Keys" or something [17:14:40] Maybe even call it "Labs" or something [17:16:02] andrewbogott: For a change on wikitech only, we can just override the value in https://wikitech.wikimedia.org/wiki/MediaWiki:Prefs-openstack [17:17:29] https://wikitech.wikimedia.org/wiki/Help:Getting_Started#to_Wikimedia_Labs [17:17:47] https://wikitech.wikimedia.org/wiki/Help:Access#Connection_closed_by_remote_host [17:19:17] I'm reading the source, trying to understand where that default comes from [17:19:44] look at prefs-openstack in the i18n file [17:20:06] MediaWiki takes the value you used and prepends prefs- [17:20:25] Ah, what I mean is -- trying to understand why that would go under an 'openstack' tab, and what the 'openstack' tab is intended for generally. [17:20:31] In nova/OpenStackNovaUser, line 665 down [17:21:54] https://bugzilla.wikimedia.org/show_bug.cgi?id=40092 [17:22:52] Was there already a tab called 'Openstack' or did you create it? [17:24:44] I created it [17:26:55] Why do you think it shouldn't be called 'SSH Keys'? [17:28:27] It's a little specific if we need/want to use it for other stuff in the future [17:28:32] atm, I've no idea what [17:34:22] hi, how can i change email in gerrit? [17:34:28] Something terrible is happening to my laptop, gotta reboot [17:34:30] brb [18:01:47] Ryan_Lane: Can I have editinterface rights on labsconsole? I know that and why they are taken out of wikicontentadmin by default, but I found myself hitting various issues with gadgets and stylesheets on labsconsole. Would be easier to handle it myself. [18:02:11] Krinkle: can you enable two-factor authentication first? 
[18:02:19] Sure [18:02:48] my only real issue with giving those rights is that it gives the ability to XSS [18:02:54] yeah [18:03:27] at the moment various templates such as https://wikitech.wikimedia.org/wiki/Template:DocumentationPage are unstyled as MediaWiki:Common.css was overwritten by wikitech-old import (I see now you did that manually actually). should probably be merged instead. [18:04:17] Ryan_Lane: I'm looking at https://wikitech.wikimedia.org/w/index.php?title=Special:OATH&action=enable&returnto=Special%3APreferences but I'm nonethewiser. [18:04:38] I'm familiar with Google's 2-way auth, but how does this work? I don't see a phone number involved here. [18:05:09] What do I do with token? [18:07:56] Aside from it apparently lacking usabilty, even when trying hard I have no clue what to do with this page. Not one clue. [18:08:48] Ryan_Lane, how are you feeling about instance-proxy today? Are you soured on the whole idea thanks to the phpmyadmin incident? [18:09:02] And, if you are not soured… can you suggest a production box that it can run on? [18:11:07] andrewbogott: oh. no. totally think we should be doing instance-proxy [18:11:36] let's ask mark about this [18:12:26] Ooo. Nice if a little scary. New disk has 4K physical sectors. [18:15:03] Ryan_Lane, what happened to the List Instances/List Projects items on the sidebar? [18:15:16] Also can you add Nova Resource to the default search namespaces? [18:16:49] Krenair: I removed Nova Resource from default because most of the time people want to search for docs and not instances [18:17:00] we should likely move the project pages out of Nova Resource [18:18:11] Ryan_Lane: Okay, I've stared at this page, tried some search keywords on wikitech with no results and looked at the extensions source code. Still no clue. [18:19:06] I don't remember anything about a phone number [18:19:19] I just installed the google authenticator app on my phone [18:20:08] Krinkle: do you have android or iphone? 
[18:20:12] I should really make some docs [18:20:21] [bz] (NEW - created by: Fabrice Florin, priority: Highest - major) [Bug 46035] EE-Prototype is down on WMFLabs - https://bugzilla.wikimedia.org/show_bug.cgi?id=46035 [18:20:23] For Google I entered a phone number and I get text messages with codes. [18:20:30] yeah, that's not how this works [18:20:44] Krenair: You installed a google app for wikimedia labs or for google? [18:20:44] google has another option, which uses the same solution for this [18:20:46] Ryan_Lane: Is there an app I'm supposed to install? [18:20:47] Hi guys, does anyone know how we could get access on EE-Prototype on WMFLabs? (http://ee-prototype.wmflabs.org/) We need the site for a critical deployment this week, led by matthiasmullie. Here's the Bugzilla ticket: https://bugzilla.wikimedia.org/show_bug.cgi?id=46035 [18:20:50] I have an iPhone. [18:20:50] Krinkle: yes [18:20:55] Krinkle: google authenticator [18:21:45] Krinkle, for wikimedia labs, to do the phone side of the two-factor auth. The same app handles gmail, dropbox, etc. [18:21:48] WFM yesterday [18:21:57] Time to rejigger hard disks and stuff. [18:21:59] RNG [18:22:03] BBL [18:22:09] * Reedy gives Coren|Away a large hammer [18:22:18] I dislike the SMS two-factor that google has [18:22:21] I doubt the "app" handles it, it is just a generic implementation for authentication. [18:22:38] that is, assuming Dropbox and Wikimedia aren't in bed with Google. [18:22:42] it requires a network connection, it costs money when you're in other countries, it's less secure, etc. [18:22:47] OK, got the app. 
[18:22:53] google authenticator is an OATH implementation [18:22:59] Ryan_Lane: it's a fallback if you don't have a smartphone [18:23:04] paravoid: yeah [18:23:08] it's not bad to have the option [18:23:14] paravoid: though it seems to be the default option that google shows [18:23:22] I don't mean for us [18:23:25] * Ryan_Lane nods [18:23:26] I know :) [18:23:27] Ryan_Lane: You might want to link it on the wiki if you haven't already. https://play.google.com/store/apps/details?id=com.google.android.apps.authenticator2 [18:23:35] although it wouldn't be very hard to implement :) [18:23:35] Reedy: ah. indeed [18:23:38] paravoid: yeah [18:23:47] considering we already have a sms mail gateway :) [18:23:50] paravoid: and realistically, it might be good to have a "fallback" [18:23:59] in case of totally lost credentials [18:24:02] https://itunes.apple.com/gb/app/google-authenticator/id388497605 [18:24:08] Your request produced an error. [18:24:08] [newNullResponse] [18:24:09] it's just the matter of storing a phone number and sending a mail [18:24:10] gj Apple [18:24:12] yep [18:24:18] hahaha apple [18:24:20] fail [18:24:38] Krinkle: so, the app should be mostly straight forward [18:24:43] there are two-factor systems that just have the phone generate a random value [18:24:49] Krinkle: make sure to store your scratch token [18:24:56] Platonides: that's what this does [18:24:57] well, not really "random" :P [18:25:01] yeah, I know the drill. [18:25:06] Got a boatload of them. [18:25:10] I mean, not needing network [18:25:17] Platonides: yeah, that's what this does :) [18:25:30] it's a shared secret, where the token is generated based on the time [18:25:36] using the shared secret [18:25:40] ah, ok [18:26:15] I thought you said above that it used OAUTH [18:26:21] heh [18:26:32] yeah. it's annoying that they are so similarly named [18:26:41] we're using TOTP, which is OATH [18:26:41] indeed [18:27:16] hm. or is it TOTP? 
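The scheme described at [18:25:30] (a shared secret, with the token derived from the current time) can be sketched as follows. This is a minimal illustration of the OATH algorithms as specified in RFC 4226 (HOTP) and RFC 6238 (TOTP), not the OATHAuth extension's code; the function names and the raw-bytes secret are assumptions made for the example (authenticator apps normally receive the secret base32-encoded).

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Counter-based one-time password (RFC 4226)."""
    # HMAC-SHA1 over the counter encoded as an 8-byte big-endian integer.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP with counter = time // step."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)


if __name__ == "__main__":
    # Hypothetical shared secret; both the phone and the server would hold it.
    print(totp(b"12345678901234567890"))
```

Because both sides compute the code from the secret and the clock, the phone needs no network connection, which matches the point made at [18:25:10].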
[18:27:18] err [18:27:19] HOTP [18:27:46] google supports both. we only support one of the two [18:44:33] Hey Ryan_Lane : Any suggestions on how to restore EE-Prototype on WMFLabs? (http://ee-prototype.wmflabs.org/) What do you recommend we do to solve this problem quickly? We rely on this site to test all our editor engagement features: Echo, Article Feedback and Page Curation. Here's the Bugzilla ticket: https://bugzilla.wikimedia.org/show_bug.cgi?id=46035 [18:44:47] fabriceflorin: I just responded [18:44:52] the instance OOM'd [18:45:11] someone rebooted it [18:45:15] that's the proper solution [18:45:57] Ryan_Lane: sorry, doing different things at the same time; I got 2-step authentication set up a few minutes ago. Logged in and out a few times to try it out, works nice [18:46:51] Krinkle: cool [18:48:32] Krinkle: ok, let me promote you [18:48:39] I think sysop privileges are fine here [18:48:50] Ryan_Lane: Thanks so much for getting back to us! I'm glad that you that the server is being rebooted. I still cannot see it on my end yet, but will wait a while and check it again. What does OOM mean? [18:48:59] out of memory [18:49:14] Cool, thanks again. Fingers crossed ... [18:49:18] so the kernel started killing processes [18:49:49] it's up [18:49:57] ... but failed? [18:50:10] well, it succeeded :) [18:50:21] but it killed processes that are needed for it to stay alive [18:50:28] ... [18:50:33] the OOM killer isn't exactly smart [18:50:47] you can prioritize its killing, of course [18:50:58] we should likely have defaults [18:51:23] What did it kill which it shouldn't've? I couldn't even ping that instance.... [18:53:24] Thanks for the explanations. The server seems to be working now, but veeery slooooowly . matthiasmullie may be able to provide more info about what's going on from his perspective ... [18:54:23] Krenair, you should be able to see it from the console page [18:54:43] I don't have access to the editor-engagement instances. 
Just openstack [18:55:11] matthiasmullie just posted this on our Bugzilla ticket: "I just attempted to SSH into ee-prototype, but that failed: mlitn@bastion1:~$ ssh ee-prototype | ssh: connect to host ee-prototype port 22: No route to host | Rebooting the instance through Special:NovaInstance didn't seem to change anything either. [18:55:29] that was just prior to it being back up [18:58:43] fabriceflorin: is it still slow for you? is pretty snappy for me [18:59:34] (aft is currently broken on there though - will need to re-merge some stuff) [19:03:55] Well, I wouldn't call it 'snappy': it takes me between 15 seconds and 30 seconds to open an article, no matter which browser I use. Lots of status messages about 'waiting for wmf labs' or 'transferring data' -- and the AFT is not showing up on articles. [19:08:32] I'm aware of AFT not showing up; I'll first need to re-merge some of it's code again [19:51:18] * Damianz pokes Ryan_Lane [19:52:54] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Normal - normal) [Bug 46026] icinga: cant monitor some instances - https://bugzilla.wikimedia.org/show_bug.cgi?id=46026 [19:56:44] * Damianz notes that also emailed ryan so he knows what the poke is for so is probably avoiding him :D [19:59:28] matthiasmullie: are you running things off of project storage? [19:59:38] or are you using /mnt or /srv? [19:59:49] /mnt and /srv will be much faster [19:59:57] the code is on /srv/mediawiki/w [20:00:22] ah ok [20:00:26] are you using apc and memcache? [20:00:30] !log bastion killing and restarting nrpe-server on all instances (bug 46026) [20:00:32] Logged the message, Master [20:00:44] memcache, yes [20:00:53] * Damianz gives mutante a cookie [20:02:21] matthiasmullie: install php-apc [20:04:05] that will help with speed [20:07:53] Ryan_Lane: how should I install it on there? (pecl, build, anything else?) 
[20:08:04] apt-get install php-apc [20:08:10] well [20:08:11] okay [20:08:14] sudo apt-get install php-apc [20:08:16] then restart apache [20:09:48] Errors were encountered while processing: [20:09:48] ganglia-monitor [20:09:50] php5-memcached [20:09:50] E: Sub-process /usr/bin/dpkg returned an error code (1) [20:10:27] this is ee-prototype? [20:10:27] indeed [20:11:22] ah ee [20:11:49] php-apc installed [20:12:01] I installed php5-memcached explicitly [20:12:10] using apt-get install php5-memcached [20:12:22] I'm purging and reinstalling ganglia-monitor [20:13:06] restarting apache [20:13:28] thanks to have taken care of it ryan :-D [20:13:41] !log editor-engagement installed php-apc php5-memcached and reinstalled ganglia-monitor on ee-prototype [20:13:43] Logged the message, Master [20:13:49] ok cool :) [20:13:57] bleh. what's wrong with ganglia-monitor? [20:14:02] so you suggest that we'd switch from memcached to apc on there? [20:14:13] ah. there we go [20:14:16] mlitn: no [20:14:29] apc does opcode caching [20:14:51] it caches the bytecode version of the php code in memory [20:15:04] oh - right, ok :) [20:15:27] alternative: eaccelerator [20:15:45] you could eventually use APC as a memory cache [20:15:56] but wmf uses memcached :-] [20:15:56] mutante: yeah, but we use apc [20:15:56] which is working nicely [20:20:46] yep [20:21:07] man ee-prototype is incredibly slow [20:21:26] memcache is running on 11000 [20:21:39] Anyone expect these to be working: dumps-2, orgcharts-dev, stackfarm-sql2, tstarling-puppet, vumi-metrics, wsexport? [20:21:55] Ryan_Lane and hashar : thanks so much for taking the time to help solve these issues with mlitn -- your good advice is much, much appreciated ! [20:21:57] most of those keep OOMing [20:22:24] there's really no reason why it should be this slow [20:22:31] lol [20:22:43] could it be one of the real server being overloaded? 
[20:22:52] there's like a billion php processes running as mlitn [20:22:56] fabriceflorin: Ryan did it all :-] I am having dinner still [20:23:25] about a billion of these: mlitn 4429 4428 0 19:49 ? 00:00:00 /bin/sh -c php /srv/mediawiki/w/extensions/ArticleFeedbackv5/maintenance/archiveFeedback.php [20:23:25] mlitn 4430 4429 11 19:49 ? 00:03:51 php /srv/mediawiki/w/extensions/ArticleFeedbackv5/maintenance/archiveFeedback.php [20:23:31] mmh [20:23:43] * Damianz wonders what the French have for dinner [20:23:47] that's likely the reason for the OOMs too [20:24:03] probably [20:24:16] disabled that cronjob [20:24:30] (which was set to run every minute for testing purposes) [20:24:47] (which apparantly does not seem to work out too well) [20:25:23] mlitn: mind if I killall mlitn? [20:25:34] not at all, please do :) [20:26:09] mlitn: I think you can restore the cron job to its normal settings, now that we have confirmed that Auto-archive is working as intended. I would even be OK not deploying with Auto-archive this week, and adding it later, in order to make our release date. [20:28:41] alright! [20:32:30] <^demon> I vaguely remember something about if you adjust security groups, you have to rebuild the instance? [20:32:37] <^demon> Or can they be applied to existing instances? [20:36:27] and now ee-prototype is fast [20:36:29] ;) [20:36:46] ^demon: no. you can modify security group rules on the fly [20:37:00] you can't modify the list of security groups for an instance, though [20:39:19] <^demon> !log deployment-prep added port 1099 to search engine security group to allow RMI messaging to go through [20:39:21] Logged the message, Master [20:42:56] Ryan_Lane: Yes, ee-prototype is definitely much faster, thank you! mlitn : let me know when AFT is back up on that server. [20:45:35] fabriceflorin: yw [20:48:05] fabriceflorin: it's back up [20:48:12] Ryan_Lane: thanks for the help indeed! 
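The pile-up described at [20:23:25] and [20:24:30] (a cron job started every minute while earlier runs were still going) is commonly avoided with a lock file. The sketch below is one such guard, not what was done in the log, where the cron job was simply disabled; the lock path and the `try_lock`/`run_once` names are hypothetical, and `fcntl.flock` is Unix-only.

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if another run holds it."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(path, job):
    """Run job() only if no other run holds the lock; report whether it ran."""
    lock = try_lock(path)
    if lock is None:
        return False  # a previous run is still going, so skip this one
    try:
        job()
        return True
    finally:
        lock.close()  # closing the file releases the lock


if __name__ == "__main__":
    # Hypothetical lock path; the job body stands in for the real maintenance script.
    ran = run_once("/tmp/archive-feedback.lock", lambda: print("archiving..."))
    print("ran" if ran else "skipped: already running")
```

With a guard like this, a slow run causes later invocations to exit early rather than accumulate until the instance runs out of memory.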
[20:48:17] yw [20:50:05] Thanks, guys, all's well that ends well :) [20:51:13] mlitn: Let me know if there is anything in particular you would like me test first. I'm now updating the help pages and taking screenshots, but am happy to work on whatever is most helpful to you. [20:52:58] Ryan_Lane: Getting an error 500 on https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=51f3a377-90b5-4f46-8590-5624c50c5dbf&project=openstack&region=pmtpa (bug 45649) [20:53:06] Can you see what the http errors logs say? [21:00:41] yeah. one sec [21:01:08] Reedy: Call to a member function getInstanceId() on a non-object in /srv/org/wikimedia/controller/wikis/slot0/extensions/OpenStackManager/special/SpecialNovaInstance.php on line 276 [21:02:01] Ok [21:02:01] $instance = $this->userNova->getInstance( $instanceosid ); [21:02:01] $instanceid = $instance->getInstanceId(); [21:02:01] $instancename = $instance->getInstanceName(); [21:02:12] So the null is expected, as said instanceid isn't in said project [21:04:16] yeah, but a 500 isn't likely good [21:04:19] Oh... That's what it was [21:04:25] I had a patch in the works to fix that one [21:04:42] Got rid of it when I wanted to try out one of yesterday's commits [21:06:28] if ( !$instance ) { [21:06:28] return false; [21:06:28] } [21:06:31] That's the simplest fix [21:06:37] The better fix is to include an error message ;) [21:06:59] A server version of SpecialNova::notInProject() [21:07:54] Krenair: Do you want to fix it? Or shall I? [21:18:38] Reedy, I'll fix it [21:18:53] Ook [21:35:23] wtf bugzilla. conflict with someone only adding a comment, it says I can overwrite which 'This will cause all of the above changes to be overwritten, except for the added comment(s).' [21:35:47] lol