[00:28:00] [bz] (NEW - created by: spage, priority: High - normal) [Bug 53778] [Regression] Echo notification emails from wikitech are empty - https://bugzilla.wikimedia.org/show_bug.cgi?id=53778
[02:04:36] [bz] (NEW - created by: jeremyb, priority: Unprioritized - enhancement) [Bug 53935] install ExpandTemplates mediawiki extension @ wikitech - https://bugzilla.wikimedia.org/show_bug.cgi?id=53935
[05:58:54] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 53978] setup labs project for continuous integration jobs - https://bugzilla.wikimedia.org/show_bug.cgi?id=53978
[07:11:52] [bz] (ASSIGNED - created by: spage, priority: High - normal) [Bug 53778] [Regression] Echo notification emails from wikitech are empty - https://bugzilla.wikimedia.org/show_bug.cgi?id=53778
[07:41:26] (CR) Yuvipanda: [C: 2] "Sorry about the delay!" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83349 (owner: Jeroen De Dauw)
[07:56:36] valhallasw: i saw your pull req
[07:56:42] valhallasw: i'll try to set it up tomorrow
[08:01:39] (CR) Yuvipanda: "(Testing)" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83349 (owner: Jeroen De Dauw)
[08:01:41] (CR) Yuvipanda: "(Testing again)" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83349 (owner: Jeroen De Dauw)
[08:01:45] good enough ;)
[08:43:14] YuviPanda: cool!
[08:43:29] YuviPanda: this was from github for windows, so I think that's going to make life easier for windows devvers
[08:43:38] wonderfuol
[08:43:40] *wonderful
[08:44:02] basically it uses the triangular pull/push option in the most recent git version
[08:44:05] 1.8.4, I think
[08:44:11] ooooo
[08:44:13] that's nice
[08:44:39] it would help to have something that prevents master branch commits, though, but that's again something we can set up
[08:44:49] (because then you get merge commits, etc)
[08:45:34] yeah
[08:45:39] okay, i need to go sleep now
[08:45:45] i'll try to get this working tomorrow
[08:45:46] bye
[08:45:50] and thanks for doing this, valhallasw :)
[08:46:05] you're welcome :-) and good night
[08:46:08] sorry I can't help though :( (No windows, etc)
[08:46:09] night
[08:55:35] [bz] (NEW - created by: Nemo, priority: Unprioritized - major) [Bug 53987] sulinfo is unusable (takes tens of seconds) - https://bugzilla.wikimedia.org/show_bug.cgi?id=53987
[09:21:00] ssh -A "Nicolas Raoul"@bastion.wmflabs.org does not work... I used to connect a few months ago, but I forgot how... maybe my username is converted to a more unix-like username?
[09:24:48] Nicolas: you should use your unix username instead.
[09:25:04] Nicolas: when creating an account on wikitech, you have chosen one
[09:25:59] valhallasw: thanks! Where can this unix username be found? I unfortunately forgot it, and all I can see on Gerrit and wikitech is "Nicolas Raoul"
[09:26:35] Nicolas: check https://gerrit.wikimedia.org/r/#/settings/
[09:26:41] 'username' is your unix username
[09:27:09] Found! It is nicolas-raoul
[09:27:11] also available at https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-personal under 'Instance shell account name'
[09:27:29] thanks for your help valhallasw, sorry for the trouble!
[09:27:43] you're welcome!
[10:44:09] Anybody know of a reason why the login to tools-login is taking a significant amount of time?
[10:49:17] Hrm...
If I'm reading this correct the tools cluseter is in full FUBAR mode http://ganglia.wmflabs.org/latest/?c=tools
[11:04:58] Any labs admins: Ping
[11:18:53] Hasteur: hah!
[11:22:48] Is labs breaking down or something?
[11:23:08] heh, we will see :P
[11:23:13] petan: ?
[11:23:16] around? :D
[11:23:30] tools is completely overloaded.
[11:23:50] its primarily -login
[11:24:15] Who's responsible for that?
[11:25:40] Coren: who I doubt is awake yet :0
[11:25:59] Cyberpower678: that depends on what you mean with 'responsible' ;-)
[11:26:09] Coren, is responsible for labs overloading?
[11:26:10] its something to do with nfs :/
[11:26:11] it's Coren's job to take care of issues, but I doubt he was the one who caused it.
[11:26:12] tools-dev and tools-login -- any known issue?
[11:26:16] valhallasw: indeed ;p
[11:26:27] liangent: something to do with nfs again it would seem
[11:57:02] addshore: when will it be back?
[11:57:16] I can only guess in a few hours
[12:02:13] addshore: automatically?
[12:02:24] heh, highly unlikely
[12:02:49] so how are you sure it's "in a few hours"
[12:03:15] I didn't say I was sure, I said 'I guess' in a few hours :p
[12:03:38] and probably then as someone will be awake to fix it :)
[12:47:12] I just requested creation of my second project, looking forward to it getting approved :-) https://wikitech.wikimedia.org/wiki/New_Project_Request/PoiMap2
[12:49:19] Coren, wake up.
[12:49:47] cyberbot node is failing.
[12:51:06] Cyberpower678: tools-login is down...
[12:57:41] Cyberpower678: Hasteur its being worked on now :)
[12:57:59] and yes, Cyberpower678, nfs is down
[12:58:05] Cyberbot is dying. It's stats are low.
[12:58:08] :p
[13:10:25] Cyberpower678: I had bots die 3 hours ago
[13:10:53] My bots are still running, but barely. It's dying off very slowly.
[13:11:26] * Betacommand grumbles about the toolserver being far more stable
[13:14:52] and yet labs is supposed to be better?...
[13:21:47] :/
[13:22:26] Betacommand, !newlabs
[13:22:53] Cyberpower678: what?
[13:22:58] Type that.
[13:23:32] Betacommand, ^
[13:24:03] !newlabs Cyberpower678
[13:24:03] This is labs. It's another version of toolserver. Because people wanted it just like toolserver, an effort was made to create an almost identical environment. Now users can enjoy replication, similar commands, and bear the burden of instabilities, just like Toolserver.
[13:24:04] My bot just pinged out :p
[13:24:44] Its not just tools labs that's effected btw
[13:24:53] Cyberpower678: if this was just like the toolserver we wouldnt have nfs issues
[13:24:55] *affected
[13:25:12] addshore, ^
[13:25:42] Cyberpower678: the toolserver is actually fairly stable, the most un-stable component was the databases
[13:26:18] Betacommand, well at least scripts that are already executed remain operational as well as the databases.
[13:26:29] Cyberpower678: thats not true
[13:26:40] what's the point of having databases if you can't connect to them because tools-login is down?
[13:26:42] Ive had half my bots die
[13:26:45] Err. My bot is a good example of that.
[13:26:52] it's nice if you're on one of the other lab instances, but other than that....
[13:27:06] Cyberpower678: as long as it doesn't to any disk access it'll probably be fine, yes
[13:27:24] Indeed
[13:27:26] Cyberpower678: most bots use disk access
[13:28:29] Betacommand, clarify?
[13:29:12] writing cookies, log files, other temp files...
[13:29:18] Nope.
[13:29:19] Cyberpower678: work with files on the hard drives
[13:29:25] Cyberpower678: most bots do
[13:29:34] caching requests
[13:29:37] yours may not, but 95% do
[13:29:37] Not mine. Except RfX reporter which has crashed
[13:29:55] And spambot.
[13:30:11] But spambot writes at the end of it's task.
[13:31:30] Cyberpower678: most of the time its best to log durring the run, in case something causes it to crash you have logs
[13:32:07] It does log, but that's generated by the server.
My script doesn't write it.
[13:32:31] The logs just don't write when NFS is down.
[13:32:37] what do you mean " generated by the server"
[13:33:02] Submitted to jsub and is logged using the -o and -e parameters.
[13:33:11] Grid task console output?
[13:33:20] Yes.
[13:35:13] It generates tons of information that always have helped me to debug my scripts. Also all of my scripts can recover from a crash.
[13:36:10] Back to my class.
[13:37:06] addshore, any progress on NFS? Cyberbot's memory is slowly overflowing.
[13:38:06] labstore3, the various mount lines for /dev/mapper/store-xx are commented out in fstab from aug 15, so naturally there's o nfs etc, not sure if it's ok to just put 'em back in, need someone to take a look-see
[13:39:01] Probably a few hours more till coren is around :/
[13:39:31] * Cyberpower678 blows an airhorn at Coren.
[13:40:33] addshore, can you lend me your foghorn?
[13:42:36] "Ryan's point is that if something is really critical and user
[13:42:36] facing for a project, then looking into moving it to production should
[13:42:37] be on the roadmap. If something can survive being down a couple of
[13:42:37] hours now and then (as most bots or web tools could), then Tool Labs
[13:42:37] suffices."
[13:42:42] Cyberpower678: ^
[13:43:07] aka 'just wait until Coren gets out of bed and have some patience, please'
[13:43:23] valhallasw, no thanks. :p
[13:43:30] Just kidding.
[13:43:40] I'm being humourous.
[13:44:32] is it just me being stupid again, or is there something wrong with tools-login?
[13:44:36] ~> ssh jkroll@tools-login.wmflabs.org
[13:44:36] ssh_exchange_identification: Connection closed by remote host
[13:44:55] JohannesK_WMDE, you're late. :p
[13:45:03] Welcome to the club.
[13:45:20] ok, so i'm not stupid. just verfying. ;)
[13:45:41] Nope.
[13:46:43] Valhallasw ! Nice idea
[13:47:16] ...
[13:47:24] how did I do that?
[14:03:37] is tools-login down?
[14:03:49] yeah it is
[14:06:41] Amir1: its not down!
its just inaccessible :P
[14:07:36] addshore: I rewrote the whole codes of harvesting data from WP It was working and boom!
[14:08:11] petan: What's wrong with labs?
[14:08:29] zhuyifei1999_: Tool Labs is DOWN (NFS issues)
[14:08:37] zhuyifei1999_: the NFS server is currently down
[14:10:24] addshore: When is it going to be fixed?
[14:10:34] hopfully when Coren gets here :)
[14:11:06] Coren: we need you! Wake up!
[14:12:36] haha!, I wonder how many pings he will have when he gets here
[14:13:29] addshore: somehow
[14:14:04] there's still some cpu usage as user logged in gangla
[14:14:32] *ganglia
[14:14:33] and...? :P
[14:14:49] so how's it down?
[14:16:03] !log
[14:16:19] http://bots.wmflabs.org/~wm-bot/dump/%23wikimedia-labs.htm
[14:17:07] zhuyifei1999_: as I said, it is not down, it is just inaccessible
[14:17:17] the only thing that is not working is NFS
[14:17:43] addshore: same difference
[14:18:18] well, not for zhuyifei1999_s example.. gangla doesnt use nfs, hence it is using cpu and its still working as expected
[14:19:19] tools-login and webproxy's last heartbeat is one and a half hours ago
[14:19:49] Hi everybody :)
[14:20:40] hi renoirb
[14:21:15] heh zhuyifei1999_ they indeed semm to have actually gone down now
[14:21:17] *seem
[14:21:18] It's my first time playing with wikitech today :)
[14:21:31] renoirb: which bit of wikitech? ;p
[14:21:41] webplatform project
[14:22:09] Ryan Lane gave me access to a project where I can practice some stuff in your infra
[14:22:51] addshore: Why can't petan solve it?
[14:22:56] petan: ping
[14:23:10] zhuyifei1999_: I dont think petan has access to the NFS which is where the issue lies.
[14:23:20] also petan isn't here
[14:25:26] addshore: I remember NFS had issues for several times
[14:25:39] so is it a fixed procedure to fix it?
[14:25:58] liangent: from the looks of things this is something different
[14:30:24] addshore: about when does Coren wake up?
[14:30:40] from when he got on irc yesterday I would say in the next 2 hours
[14:31:25] tool-dev will fail in the next two hours
[14:33:41] zhuyifei1999_: fyi https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH
[14:36:27] hmmmm what is SSH Key?
[14:36:35] how can I make it? :P
[14:37:30] addshore: why not labstore 1, 2 or 4
[14:37:51] zhuyifei1999_: because there is only a labstore3 :P
[14:38:49] Revi: I hope there's a doc
[14:39:22] !help
[14:39:22] !documentation for labs !wm-bot for bot
[14:39:28] pah
[14:39:34] !keys
[14:39:34] http://bots.wmflabs.org/~petrb/db/ list of infobot keys
[14:39:35] eeeek
[14:39:50] http://bots.wmflabs.org/~wm-bot/dump/%23wikimedia-labs.htm
[14:39:51] addshore: why not install more?
[14:40:05] !addshore
[14:40:05] addshore is no longer fail!
[14:40:08] ahh Revi
[14:40:10] !docs
[14:40:10] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[14:40:22] in case one fails
[14:40:29] Revi: go there ^^ it has all of the infomation you need
[14:40:30] let' see....
[14:41:24] eeek....I need PC.....
[14:41:30] I am mobile now....abusive....
[14:41:30] ..?
[14:41:35] hah
[14:42:07] why I can't make it on phone? bad phone!
[14:44:59] addshore: ^^
[14:45:33] zhuyifei1999_: well the labstore nfs is redundant to a degree
[14:45:39] but yes, there is only one of them
[14:46:34] its probably the largest current possible point of failure in the tools infrastructure
[14:49:44] addshore: what happened before it fails?
[14:50:00] what happened before it failed this morning?
[14:50:20] I guess so
[14:50:36] https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH is all I know
[14:51:23] result is I have to wait until 3:00 AM UTC :0
[14:59:07] Cyberpower678: ping
[14:59:56] Okay, Revi - you had a question?
[15:00:29] T13|needsCoffee: Solved
[15:00:39] !addshore del
[15:00:39] Successfully removed addshore
[15:00:43] but I have to wait 12 hrs ;P
[15:00:50] !addshore is fail... still...
[15:00:50] Key was added
[15:01:19] :D
[15:01:25] addshore: ^^
[15:02:20] Icinga seems to be down too: http://icinga.wmflabs.org/icinga/
[15:02:55] pietrodn: it probably also uses labstore3 :/
[15:03:31] Did the NFS drives die?
[15:04:45] the nfsd wont come up, see https://etherpad.wikimedia.org/p/addshore
[15:04:47] opps
[15:04:52] see https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH ;p
[15:05:18] !addshore
[15:05:18] <(^.^)>
[15:12:08] Hm. tools-dev doesn't refuse the connection like tools-login, but it fails to start the shell.
[15:13:17] yup, as tools-login has now gone down :) so no connection there
[15:13:43] and tool-dev is up, but cant load any shell prefs etc and might not even be able to get at the keys to auth you
[15:17:34] Well, I'll just watch the Apple keynote instead of doing queries on the replicated DBs :P
[15:19:03] I was wondering, assuming I already have setup my keys in Gerrit, that I already tried SSHd to it, that I am in bastion project.
[15:19:15] what I am missing to connect to bastion.wmflabs.org ?
[15:20:32] pietrodn: good plan!
[15:20:57] renoirb: what message do you get?
[15:21:11] $ ssh renoirb@bastion.wmflabs.org
[15:21:11] Permission denied (publickey).
[15:21:39] mhhm, when you say you have setup your keys in gerrit, have you set them up on wikitech? ;p
[15:21:48] addshore: but connection got gerrit.wikimedia.org:29418, I got: **** Welcome to Gerrit Code Review ****
[15:22:05] addshore: by the way I discovered that the WMF Tool Labs replicas are (were?) very fast, once I figured out I should use the revision_userindex table and not revision. :P
[15:22:05] oh, so you have two gerrit then?
[15:22:29] I can login to Bastion so it's not broken
[15:22:40] renoirb: no, but I am aware of a place in the wikitech preferences where you have to put your key!
[15:23:01] renoirb: https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack%7COpenStack
[15:23:08] oh, ok
[15:23:19] pietrodn: bastion is not tools ;)
[15:23:22] I was following https://wikitech.wikimedia.org/wiki/Help:Access
[15:26:07] ok, i wasn't aware of that one
[15:26:23] worked after a minute of waiting :)
[15:29:26] addshore: I adjusted https://wikitech.wikimedia.org/wiki/Help:Access#Prerequisites
[15:29:31] :>
[15:29:46] Am I wrong with the place?
[15:30:03] looks fine
[15:32:44] hi
[15:32:54] hi petan ;p
[15:33:45] !ping
[15:33:45] !pong
[15:33:50] :/
[15:34:21] ? :O
[15:34:28] testing if wm-bot work
[15:34:37] nfs is down
[15:35:09] addshore: saw a new log?
[15:35:14] that php interface
[15:35:26] new log?
[15:35:29] http://tools.wmflabs.org/wm-bot/logs (but doesn't work because nfs is borked)
[15:35:34] * logs
[15:35:35] irc logs
[15:35:36] hah!
[15:35:36] :D
[15:35:45] I saw the email :P
[15:35:49] im a bit behind on everything atm
[15:35:59] and ya nfs is down
[15:37:04] T13|needsCoffee, pong
[15:38:06] I have some requests for rfa/b stat bot
[15:38:07] Ok. Now I'm getting a little impatient.
[15:38:10] !newlabs
[15:38:10] This is labs. It's another version of toolserver. Because people wanted it just like toolserver, an effort was made to create an almost identical environment. Now users can enjoy replication, similar commands, and bear the burden of instabilities, just like Toolserver.
[15:38:33] LOL that's funny :'D
[15:38:39] If there is no rfa/b can there be no header?
[15:38:44] pietrodn: the toolserver doesnt have nfs issues
[15:38:46] Come on. I'm starting to get emails on why RfX reporter isn't updating.
[15:39:09] T13|needsCoffee, request denied.
[15:39:16] @infobot-detail newlabs
[15:39:16] Info for newlabs: this key was created at 8/11/2013 2:29:36 PM by Cyberpower678, this key was displayed 6 time(s), last time at 9/10/2013 3:38:10 PM (00:01:06.3147010 ago) this key is normal
[15:39:25] Cyberpower678 :D
[15:39:25] Also, can there be some parameters to hide certain columns?
[15:39:29] Betacommand: ok, but I can remember that on one day the logins were broken :P
[15:39:39] T13|needsCoffee, already done
[15:39:50] to an extent
[15:40:01] petan, I thought it might come in handy.
[15:40:07] Cool, just need to wait for it to update then.. xD
[15:40:26] T13|needsCoffee, what did you want to hide?
[15:40:38] pietrodn: labs uptime vs ts uptime isnt even near equal
[15:40:55] up 122 days 19:33
[15:40:58] !newlabs-rl is newlabs in real: This is labs. It's another version of toolserver. Because people wanted it just like toolserver, it's just as fucked up and broken most of time.
[15:40:58] Key was added
[15:40:59] Betacommand, no. Toolserver's is clearly higher. :p
[15:41:17] Cyberpower678: I know
[15:41:30] Betacommand: yes, the Toolserver is more stable, but slower
[15:41:31] :p
[15:41:49] And it has the friggin archive table that labs promised by the end of August.
[15:41:52] willows uptime is over 4 months without issue
[15:41:58] don't say labs are not stable, just "tool labs" are not :P
[15:42:04] wm-bot lives on labs as well
[15:42:15] I want a refund. :p
[15:42:26] petan: why isnt tool labs stable?
[15:42:29] key to make a stable project: get rid of nfs & gluster
[15:42:34] Betacommand: because of nfs
[15:42:46] Betacommand, because everybody is running their crap on the wrong node.
[15:42:54] But Toolserver has NFS too, doesn't it?
[15:43:00] Dups?
[15:43:02] Causing labs to fail.
[15:43:10] T13|needsCoffee, request denied
[15:43:16] petan: if that is the issue, why not use alternate?
[15:43:25] nfs is alternate
[15:43:29] to gluster :P
[15:43:46] in past we were using gluster it was almost worse
[15:43:50] petan: what is the toolserver running?
[15:43:54] no idea
[15:43:56] Maybe offer alternative format for Ending so I can see short date instead
[15:44:03] but toolserver is surely being used by far less people than labs
[15:44:07] I imagine ts is also on nfs
[15:44:11] T13|needsCoffee, ?
[15:44:15] Which is super eaay to do.
[15:44:26] labs have hundreds of virtual machines running
[15:44:33] petan: TS has a LOT of users
[15:44:41] Betacommand: incomparably less than less
[15:44:42] * labs
[15:44:45] less than labs
[15:44:48] 00:00, 16 September 2013 is kind of long.
[15:45:29] Betacommand: labs have approx 1900 users
[15:45:29] petan: instead of using virt machines why not just dedicate a few boxes?
[15:45:33] Can I create a sandbox version of what the output could look like in your userspace?
[15:45:45] precisely 1994 ldap entries
[15:45:45] I think that would make it easiest.
[15:45:56] fyi toolserver has hemlock which is its nfs server
[15:46:07] T13|needsCoffee, why my userspace?
[15:46:19] Betacommand: there isn't problem with that, nfs is server is dedicated physical machine
[15:46:43] petan: any backup units?
[15:46:49] virtual machines are fine, it's just... there is too huge IO demands
[15:46:50] Wa going to put it as /sandbox of the current one to make it eay to find.
[15:46:58] well, it has hemlock as the userstore nfs and then 2 head nodes ( turnera and damiana )
[15:47:07] Betacommand: apparently nothing like that is supported by nfs...
[15:47:12] petan: virt machines are probably causing most of the IO
[15:47:17] T13|needsCoffee, ok.
[15:47:19] Doesn't really matte where, just trying to make it easy.
[15:47:22] :)
[15:47:34] Betacommand: but virtual machines themselve have hdd images on separate physical servers
[15:47:37] I'll ping you when fone.
[15:48:01] nfs is just a shared mount for a number of projects including tools
[15:48:14] Make your proposal. Propose the short date parameter by creating a duplicate of the current one and modifying it to what it should be.
[15:49:09] petan: why not find ways of reducing the IO or split the nfs over several hosts?
[15:50:06] Betacommand: good idea... I have no powers nor access to do that
[15:50:10] the IO for the nfs is no more than 2M IN and 1M out generally
[15:50:32] petan, let me guess. Coren?
[15:50:37] at a guess I would say that is not the issue here
[15:50:43] yes he has access there
[15:51:04] petan, do you think he'll do it?
[15:51:08] where is he?
[15:51:10] do what
[15:51:16] petan, split nfs?
[15:51:21] I think according to Coren problem isn't IO so he won't split it
[15:51:26] Betacommand, sleeping
[15:51:27] Cyberpower678: there isnt really any need for it
[15:51:27] he believes it's software bug
[15:51:48] despite that might be true, splitting it would help as well
[15:51:50] Cyberpower678: he shouldnt be given the current time
[15:51:57] if one of nfs servers got fucked, other half would work
[15:52:19] he could even install multiple nfs instances on 1 host
[15:52:23] Betacommand, he's probably at the foundation office sitting in one of his jacuzzzis.
[15:52:34] Cyberpower678: he works remotely AFAIK :P
[15:53:42] petan, it should be split to several instances for tools. That way everyone's stuff goes to hell.
[15:53:45] petan: all we probably need it to make labstore not the single point of failure and have some way to failover onto another host, but there is no point in splitting the nfs
[15:54:05] petan: doing some quick googling setting up a fallover nfs server should be do-able
[15:54:18] Betacommand: yup :P
[15:54:41] I imagine after today that will be on the todo list
[15:54:52] * Betacommand grumbles
[15:55:04] petan, can you explain the different aspects of the CPU usage here?
[15:55:11] someone should have coren's phone number
[15:55:25] petan, User/Nice/System/Wait/Idle
[15:55:28] Betacommand: read Ryan's mail on 'production vs not production'
[15:55:43] valhallasw: it really pissed me off
[15:55:53] Welcome to the club.
[15:56:06] I obviously know what User/System/Idle are but the other two I'd like to know. petan
[15:56:07] * Betacommand is thinking of going over his head
[15:56:27] Cyberpower678: Nice >> http://serverfault.com/questions/116950/what-does-nice-mean-on-cpu-utilization-graphs
[15:56:28] * aude hopes my stuff is not corrupt
[15:56:30] valhallasw, link?
[15:56:58] Cyberpower678: the entire thread http://lists.wikimedia.org/pipermail/labs-l/2013-September/001594.html
[15:57:17] Betacommand: All of staff has it, in emergencies. What's up?
[15:57:26] morning Coren :)
[15:57:33] Coren: catch up >> https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH
[15:57:40] my bot!!!
[15:57:46] I am back
[15:57:46]