[00:28:00] [bz] (NEW - created by: spage, priority: High - normal) [Bug 53778] [Regression] Echo notification emails from wikitech are empty - https://bugzilla.wikimedia.org/show_bug.cgi?id=53778 [02:04:36] [bz] (NEW - created by: jeremyb, priority: Unprioritized - enhancement) [Bug 53935] install ExpandTemplates mediawiki extension @ wikitech - https://bugzilla.wikimedia.org/show_bug.cgi?id=53935 [05:58:54] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 53978] setup labs project for continuous integration jobs - https://bugzilla.wikimedia.org/show_bug.cgi?id=53978 [07:11:52] [bz] (ASSIGNED - created by: spage, priority: High - normal) [Bug 53778] [Regression] Echo notification emails from wikitech are empty - https://bugzilla.wikimedia.org/show_bug.cgi?id=53778 [07:41:26] (CR) Yuvipanda: [C: 2] "Sorry about the delay!" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83349 (owner: Jeroen De Dauw) [07:56:36] valhallasw: i saw your pull req [07:56:42] valhallasw: i'll try to set it up tomorrow [08:01:39] (CR) Yuvipanda: "(Testing)" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83349 (owner: Jeroen De Dauw) [08:01:41] (CR) Yuvipanda: "(Testing again)" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83349 (owner: Jeroen De Dauw) [08:01:45] good enough ;) [08:43:14] YuviPanda: cool!
[08:43:29] YuviPanda: this was from github for windows, so I think that's going to make life easier for windows devvers [08:43:38] wonderfuol [08:43:40] *wonderful [08:44:02] basically it uses the triangular pull/push option in the most recent git version [08:44:05] 1.8.4, I think [08:44:11] ooooo [08:44:13] that's nice [08:44:39] it would help to have something that prevents master branch commits, though, but that's again something we can set up [08:44:49] (because then you get merge commits, etc) [08:45:34] yeah [08:45:39] okay, i need to go sleep now [08:45:45] i'll try to get this working tomorrow [08:45:46] bye [08:45:50] and thanks for doing this, valhallasw :) [08:46:05] you're welcome :-) and good night [08:46:08] sorry I can't help though :( (No windows, etc) [08:46:09] night [08:55:35] [bz] (NEW - created by: Nemo, priority: Unprioritized - major) [Bug 53987] sulinfo is unusable (takes tens of seconds) - https://bugzilla.wikimedia.org/show_bug.cgi?id=53987 [09:21:00] ssh -A "Nicolas Raoul"@bastion.wmflabs.org does not work... I used to connect a few months ago, but I forgot how... maybe my username is converted to a more unix-like username? [09:24:48] Nicolas: you should use your unix username instead. [09:25:04] Nicolas: when creating an account on wikitech, you have chosen one [09:25:59] valhallasw: thanks! Where can this unix username be found? I unfortunately forgot it, and all I can see on Gerrit and wikitech is "Nicolas Raoul" [09:26:35] Nicolas: check https://gerrit.wikimedia.org/r/#/settings/ [09:26:41] 'username' is your unix username [09:27:09] Found! It is nicolas-raoul [09:27:11] also available at https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-personal under 'Instance shell account name' [09:27:29] thanks for your help valhallasw, sorry for the trouble! [09:27:43] you're welcome! [10:44:09] Anybody know of a reason why the login to tools-login is taking a significant amount of time? [10:49:17] Hrm...
If I'm reading this correct the tools cluseter is in full FUBAR mode http://ganglia.wmflabs.org/latest/?c=tools [11:04:58] Any labs admins: Ping [11:18:53] Hasteur: hah! [11:22:48] Is labs breaking down or something? [11:23:08] heh, we will see :P [11:23:13] petan: ? [11:23:16] around? :D [11:23:30] tools is completely overloaded. [11:23:50] its primarily -login [11:24:15] Who's responsible for that? [11:25:40] Coren: who I doubt is awake yet :0 [11:25:59] Cyberpower678: that depends on what you mean with 'responsible' ;-) [11:26:09] Coren, is responsible for labs overloading? [11:26:10] its something to do with nfs :/ [11:26:11] it's Coren's job to take care of issues, but I doubt he was the one who caused it. [11:26:12] tools-dev and tools-login -- any known issue? [11:26:16] valhallasw: indeed ;p [11:26:27] liangent: something to do with nfs again it would seem [11:57:02] addshore: when will it be back? [11:57:16] I can only guess in a few hours [12:02:13] addshore: automatically? [12:02:24] heh, highly unlikely [12:02:49] so how are you sure it's "in a few hours" [12:03:15] I didn't say I was sure, I said 'I guess' in a few hours :p [12:03:38] and probably then as someone will be awake to fix it :) [12:47:12] I just requested creation of my second project, looking forward to it getting approved :-) https://wikitech.wikimedia.org/wiki/New_Project_Request/PoiMap2 [12:49:19] Coren, wake up. [12:49:47] cyberbot node is failing. [12:51:06] Cyberpower678: tools-login is down... [12:57:41] Cyberpower678: Hasteur its being worked on now :) [12:57:59] and yes, Cyberpower678, nfs is down [12:58:05] Cyberbot is dying. It's stats are low. [12:58:08] :p [13:10:25] Cyberpower678: I had bots die 3 hours ago [13:10:53] My bots are still running, but barely. It's dying off very slowly. [13:11:26] * Betacommand grumbles about the toolserver being far more stable [13:14:52] and yet labs is supposed to be better?... 
[13:21:47] :/ [13:22:26] Betacommand, !newlabs [13:22:53] Cyberpower678: what? [13:22:58] Type that. [13:23:32] Betacommand, ^ [13:24:03] !newlabs Cyberpower678 [13:24:03] This is labs. It's another version of toolserver. Because people wanted it just like toolserver, an effort was made to create an almost identical environment. Now users can enjoy replication, similar commands, and bear the burden of instabilities, just like Toolserver. [13:24:04] My bot just pinged out :p [13:24:44] Its not just tools labs that's effected btw [13:24:53] Cyberpower678: if this was just like the toolserver we wouldnt have nfs issues [13:24:55] *affected [13:25:12] addshore, ^ [13:25:42] Cyberpower678: the toolserver is actually fairly stable, the most un-stable component was the databases [13:26:18] Betacommand, well at least scripts that are already executed remain operational as well as the databases. [13:26:29] Cyberpower678: thats not true [13:26:40] what's the point of having databases if you can't connect to them because tools-login is down? [13:26:42] Ive had half my bots die [13:26:45] Err. My bot is a good example of that. [13:26:52] it's nice if you're on one of the other lab instances, but other than that.... [13:27:06] Cyberpower678: as long as it doesn't to any disk access it'll probably be fine, yes [13:27:24] Indeed [13:27:26] Cyberpower678: most bots use disk access [13:28:29] Betacommand, clarify? [13:29:12] writing cookies, log files, other temp files... [13:29:18] Nope. [13:29:19] Cyberpower678: work with files on the hard drives [13:29:25] Cyberpower678: most bots do [13:29:34] caching requests [13:29:37] yours may not, but 95% do [13:29:37] Not mine. Except RfX reporter which has crashed [13:29:55] And spambot. [13:30:11] But spambot writes at the end of it's task. [13:31:30] Cyberpower678: most of the time its best to log durring the run, in case something causes it to crash you have logs [13:32:07] It does log, but that's generated by the server. 
My script doesn't write it. [13:32:31] The logs just don't write when NFS is down. [13:32:37] what do you mean " generated by the server" [13:33:02] Submitted to jsub and is logged using the -o and -e parameters. [13:33:11] Grid task console output? [13:33:20] Yes. [13:35:13] It generates tons of information that always have helped me to debug my scripts. Also all of my scripts can recover from a crash. [13:36:10] Back to my class. [13:37:06] addshore, any progress on NFS? Cyberbot's memory is slowly overflowing. [13:38:06] labstore3, the various mount lines for /dev/mapper/store-xx are commented out in fstab from aug 15, so naturally there's o nfs etc, not sure if it's ok to just put 'em back in, need someone to take a look-see [13:39:01] Probably a few hours more till coren is around :/ [13:39:31] * Cyberpower678 blows an airhorn at Coren. [13:40:33] addshore, can you lend me your foghorn? [13:42:36] "Ryan's point is that if something is really critical and user [13:42:36] facing for a project, then looking into moving it to production should [13:42:37] be on the roadmap. If something can survive being down a couple of [13:42:37] hours now and then (as most bots or web tools could), then Tool Labs [13:42:37] suffices." [13:42:42] Cyberpower678: ^ [13:43:07] aka 'just wait until Coren gets out of bed and have some patience, please' [13:43:23] valhallasw, no thanks. :p [13:43:30] Just kidding. [13:43:40] I'm being humourous. [13:44:32] is it just me being stupid again, or is there something wrong with tools-login? [13:44:36] ~> ssh jkroll@tools-login.wmflabs.org [13:44:36] ssh_exchange_identification: Connection closed by remote host [13:44:55] JohannesK_WMDE, you're late. :p [13:45:03] Welcome to the club. [13:45:20] ok, so i'm not stupid. just verfying. ;) [13:45:41] Nope. [13:46:43] Valhallasw ! Nice idea [13:47:16] ... [13:47:24] how did I do that? [14:03:37] is tools-login down? [14:03:49] yeah it is [14:06:41] Amir1: its not down! 
its just inaccessible :P [14:07:36] addshore: I rewrote the whole codes of harvesting data from WP It was working and boom! [14:08:11] petan: What's wrong with labs? [14:08:29] zhuyifei1999_: Tool Labs is DOWN (NFS issues) [14:08:37] zhuyifei1999_: the NFS server is currently down [14:10:24] addshore: When is it going to be fixed? [14:10:34] hopfully when Coren gets here :) [14:11:06] Coren: we need you! Wake up! [14:12:36] haha!, I wonder how many pings he will have when he gets here [14:13:29] addshore: somehow [14:14:04] there's still some cpu usage as user logged in gangla [14:14:32] *ganglia [14:14:33] and...? :P [14:14:49] so how's it down? [14:16:03] !log [14:16:19] http://bots.wmflabs.org/~wm-bot/dump/%23wikimedia-labs.htm [14:17:07] zhuyifei1999_: as I said, it is not down, it is just inaccessible [14:17:17] the only thing that is not working is NFS [14:17:43] addshore: same difference [14:18:18] well, not for zhuyifei1999_s example.. gangla doesnt use nfs, hence it is using cpu and its still working as expected [14:19:19] tools-login and webproxy's last heartbeat is one and a half hours ago [14:19:49] Hi everybody :) [14:20:40] hi renoirb [14:21:15] heh zhuyifei1999_ they indeed semm to have actually gone down now [14:21:17] *seem [14:21:18] It's my first time playing with wikitech today :) [14:21:31] renoirb: which bit of wikitech? ;p [14:21:41] webplatform project [14:22:09] Ryan Lane gave me access to a project where I can practice some stuff in your infra [14:22:51] addshore: Why can't petan solve it? [14:22:56] petan: ping [14:23:10] zhuyifei1999_: I dont think petan has access to the NFS which is where the issue lies. [14:23:20] also petan isn't here [14:25:26] addshore: I remember NFS had issues for several times [14:25:39] so is it a fixed procedure to fix it? [14:25:58] liangent: from the looks of things this is something different [14:30:24] addshore: about when does Coren wake up? 
[14:30:40] from when he got on irc yesterday I would say in the next 2 hours [14:31:25] tool-dev will fail in the next two hours [14:33:41] zhuyifei1999_: fyi https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH [14:36:27] hmmmm what is SSH Key? [14:36:35] how can I make it? :P [14:37:30] addshore: why not labstore 1, 2 or 4 [14:37:51] zhuyifei1999_: because there is only a labstore3 :P [14:38:49] Revi: I hope there's a doc [14:39:22] !help [14:39:22] !documentation for labs !wm-bot for bot [14:39:28] pah [14:39:34] !keys [14:39:34] http://bots.wmflabs.org/~petrb/db/ list of infobot keys [14:39:35] eeeek [14:39:50] http://bots.wmflabs.org/~wm-bot/dump/%23wikimedia-labs.htm [14:39:51] addshore: why not install more? [14:40:05] !addshore [14:40:05] addshore is no longer fail! [14:40:08] ahh Revi [14:40:10] !docs [14:40:10] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help [14:40:22] in case one fails [14:40:29] Revi: go there ^^ it has all of the infomation you need [14:40:30] let' see.... [14:41:24] eeek....I need PC..... [14:41:30] I am mobile now....abusive.... [14:41:30] ..? [14:41:35] hah [14:42:07] why I can't make it on phone? bad phone! [14:44:59] addshore: ^^ [14:45:33] zhuyifei1999_: well the labstore nfs is redundant to a degree [14:45:39] but yes, there is only one of them [14:46:34] its probably the largest current possible point of failure in the tools infrastructure [14:49:44] addshore: what happened before it fails? [14:50:00] what happened before it failed this morning? [14:50:20] I guess so [14:50:36] https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH is all I know [14:51:23] result is I have to wait until 3:00 AM UTC :0 [14:59:07] Cyberpower678: ping [14:59:56] Okay, Revi - you had a question? [15:00:29] T13|needsCoffee: Solved [15:00:39] !addshore del [15:00:39] Successfully removed addshore [15:00:43] but I have to wait 12 hrs ;P [15:00:50] !addshore is fail... still... 
[15:00:50] Key was added [15:01:19] :D [15:01:25] addshore: ^^ [15:02:20] Icinga seems to be down too: http://icinga.wmflabs.org/icinga/ [15:02:55] pietrodn: it probably also uses labstore3 :/ [15:03:31] Did the NFS drives die? [15:04:45] the nfsd wont come up, see https://etherpad.wikimedia.org/p/addshore [15:04:47] opps [15:04:52] see https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH ;p [15:05:18] !addshore [15:05:18] <(^.^)> [15:12:08] Hm. tools-dev doesn't refuse the connection like tools-login, but it fails to start the shell. [15:13:17] yup, as tools-login has now gone down :) so no connection there [15:13:43] and tool-dev is up, but cant load any shell prefs etc and might not even be able to get at the keys to auth you [15:17:34] Well, I'll just watch the Apple keynote instead of doing queries on the replicated DBs :P [15:19:03] I was wondering, assuming I already have setup my keys in Gerrit, that I already tried SSHd to it, that I am in bastion project. [15:19:15] what I am missing to connect to bastion.wmflabs.org ? [15:20:32] pietrodn: good plan! [15:20:57] renoirb: what message do you get? [15:21:11] $ ssh renoirb@bastion.wmflabs.org [15:21:11] Permission denied (publickey). [15:21:39] mhhm, when you say you have setup your keys in gerrit, have you set them up on wikitech? ;p [15:21:48] addshore: but connection got gerrit.wikimedia.org:29418, I got: **** Welcome to Gerrit Code Review **** [15:22:05] addshore: by the way I discovered that the WMF Tool Labs replicas are (were?) very fast, once I figured out I should use the revision_userindex table and not revision. :P [15:22:05] oh, so you have two gerrit then? [15:22:29] I can login to Bastion so it's not broken [15:22:40] renoirb: no, but I am aware of a place in the wikitech preferences where you have to put your key! 
[15:23:01] renoirb: https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack%7COpenStack [15:23:08] oh, ok [15:23:19] pietrodn: bastion is not tools ;) [15:23:22] I was following https://wikitech.wikimedia.org/wiki/Help:Access [15:26:07] ok, i wasn't aware of that one [15:26:23] worked after a minute of waiting :) [15:29:26] addshore: I adjusted https://wikitech.wikimedia.org/wiki/Help:Access#Prerequisites [15:29:31] :> [15:29:46] Am I wrong with the place? [15:30:03] looks fine [15:32:44] hi [15:32:54] hi petan ;p [15:33:45] !ping [15:33:45] !pong [15:33:50] :/ [15:34:21] ? :O [15:34:28] testing if wm-bot work [15:34:37] nfs is down [15:35:09] addshore: saw a new log? [15:35:14] that php interface [15:35:26] new log? [15:35:29] http://tools.wmflabs.org/wm-bot/logs (but doesn't work because nfs is borked) [15:35:34] * logs [15:35:35] irc logs [15:35:36] hah! [15:35:36] :D [15:35:45] I saw the email :P [15:35:49] im a bit behind on everything atm [15:35:59] and ya nfs is down [15:37:04] T13|needsCoffee, pong [15:38:06] I have some requests for rfa/b stat bot [15:38:07] Ok. Now I'm getting a little impatient. [15:38:10] !newlabs [15:38:10] This is labs. It's another version of toolserver. Because people wanted it just like toolserver, an effort was made to create an almost identical environment. Now users can enjoy replication, similar commands, and bear the burden of instabilities, just like Toolserver. [15:38:33] LOL that's funny :'D [15:38:39] If there is no rfa/b can there be no header? [15:38:44] pietrodn: the toolserver doesnt have nfs issues [15:38:46] Come on. I'm starting to get emails on why RfX reporter isn't updating. [15:39:09] T13|needsCoffee, request denied. 
[15:39:16] @infobot-detail newlabs [15:39:16] Info for newlabs: this key was created at 8/11/2013 2:29:36 PM by Cyberpower678, this key was displayed 6 time(s), last time at 9/10/2013 3:38:10 PM (00:01:06.3147010 ago) this key is normal [15:39:25] Cyberpower678 :D [15:39:25] Also, can there be some parameters to hide certain columns? [15:39:29] Betacommand: ok, but I can remember that on one day the logins were broken :P [15:39:39] T13|needsCoffee, already done [15:39:50] to an extent [15:40:01] petan, I thought it might come in handy. [15:40:07] Cool, just need to wait for it to update then.. xD [15:40:26] T13|needsCoffee, what did you want to hide? [15:40:38] pietrodn: labs uptime vs ts uptime isnt even near equal [15:40:55] up 122 days 19:33 [15:40:58] !newlabs-rl is newlabs in real: This is labs. It's another version of toolserver. Because people wanted it just like toolserver, it's just as fucked up and broken most of time. [15:40:58] Key was added [15:40:59] Betacommand, no. Toolserver's is clearly higher. :p [15:41:17] Cyberpower678: I know [15:41:30] Betacommand: yes, the Toolserver is more stable, but slower [15:41:31] :p [15:41:49] And it has the friggin archive table that labs promised by the end of August. [15:41:52] willows uptime is over 4 months without issue [15:41:58] don't say labs are not stable, just "tool labs" are not :P [15:42:04] wm-bot lives on labs as well [15:42:15] I want a refund. :p [15:42:26] petan: why isnt tool labs stable? [15:42:29] key to make a stable project: get rid of nfs & gluster [15:42:34] Betacommand: because of nfs [15:42:46] Betacommand, because everybody is running their crap on the wrong node. [15:42:54] But Toolserver has NFS too, doesn't it? [15:43:00] Dups? [15:43:02] Causing labs to fail. [15:43:10] T13|needsCoffee, request denied [15:43:16] petan: if that is the issue, why not use alternate? 
[15:43:25] nfs is alternate [15:43:29] to gluster :P [15:43:46] in past we were using gluster it was almost worse [15:43:50] petan: what is the toolserver running? [15:43:54] no idea [15:43:56] Maybe offer alternative format for Ending so I can see short date instead [15:44:03] but toolserver is surely being used by far less people than labs [15:44:07] I imagine ts is also on nfs [15:44:11] T13|needsCoffee, ? [15:44:15] Which is super eaay to do. [15:44:26] labs have hundreds of virtual machines running [15:44:33] petan: TS has a LOT of users [15:44:41] Betacommand: incomparably less than less [15:44:42] * labs [15:44:45] less than labs [15:44:48] 00:00, 16 September 2013 is kind of long. [15:45:29] Betacommand: labs have approx 1900 users [15:45:29] petan: instead of using virt machines why not just dedicate a few boxes? [15:45:33] Can I create a sandbox version of what the output could look like in your userspace? [15:45:45] precisely 1994 ldap entries [15:45:45] I think that would make it easiest. [15:45:56] fyi toolserver has hemlock which is its nfs server [15:46:07] T13|needsCoffee, why my userspace? [15:46:19] Betacommand: there isn't problem with that, nfs is server is dedicated physical machine [15:46:43] petan: any backup units? [15:46:49] virtual machines are fine, it's just... there is too huge IO demands [15:46:50] Wa going to put it as /sandbox of the current one to make it eay to find. [15:46:58] well, it has hemlock as the userstore nfs and then 2 head nodes ( turnera and damiana ) [15:47:07] Betacommand: apparently nothing like that is supported by nfs... [15:47:12] petan: virt machines are probably causing most of the IO [15:47:17] T13|needsCoffee, ok. [15:47:19] Doesn't really matte where, just trying to make it easy. [15:47:22] :) [15:47:34] Betacommand: but virtual machines themselve have hdd images on separate physical servers [15:47:37] I'll ping you when fone. 
[15:48:01] nfs is just a shared mount for a number of projects including tools [15:48:14] Make your proposal. Propose the short date parameter by creating a duplicate of the current one and modifying it to what it should be. [15:49:09] petan: why not find ways of reducing the IO or split the nfs over several hosts? [15:50:06] Betacommand: good idea... I have no powers nor access to do that [15:50:10] the IO for the nfs is no more than 2M IN and 1M out generally [15:50:32] petan, let me guess. Coren? [15:50:37] at a guess I would say that is not the issue here [15:50:43] yes he has access there [15:51:04] petan, do you think he'll do it? [15:51:08] where is he? [15:51:10] do what [15:51:16] petan, split nfs? [15:51:21] I think according to Coren problem isn't IO so he won't split it [15:51:26] Betacommand, sleeping [15:51:27] Cyberpower678: there isnt really any need for it [15:51:27] he believes it's software bug [15:51:48] despite that might be true, splitting it would help as well [15:51:50] Cyberpower678: he shouldnt be given the current time [15:51:57] if one of nfs servers got fucked, other half would work [15:52:19] he could even install multiple nfs instances on 1 host [15:52:23] Betacommand, he's probably at the foundation office sitting in one of his jacuzzzis. [15:52:34] Cyberpower678: he works remotely AFAIK :P [15:53:42] petan, it should be split to several instances for tools. That way everyone's stuff goes to hell. [15:53:45] petan: all we probably need it to make labstore not the single point of failure and have some way to failover onto another host, but there is no point in splitting the nfs [15:54:05] petan: doing some quick googling setting up a fallover nfs server should be do-able [15:54:18] Betacommand: yup :P [15:54:41] I imagine after today that will be on the todo list [15:54:52] * Betacommand grumbles [15:55:04] petan, can you explain the different aspects of the CPU usage here? 
[15:55:11] someone should have coren's phone number [15:55:25] petan, User/Nice/System/Wait/Idle [15:55:28] Betacommand: read Ryan's mail on 'production vs not production' [15:55:43] valhallasw: it really pissed me off [15:55:53] Welcome to the club. [15:56:06] I obviously know what User/System/Idle are but the other two I'd like to know. petan [15:56:07] * Betacommand is thinking of going over his head [15:56:27] Cyberpower678: Nice >> http://serverfault.com/questions/116950/what-does-nice-mean-on-cpu-utilization-graphs [15:56:28] * aude hopes my stuff is not corrupt [15:56:30] valhallasw, link? [15:56:58] Cyberpower678: the entire thread http://lists.wikimedia.org/pipermail/labs-l/2013-September/001594.html [15:57:17] Betacommand: All of staff has it, in emergencies. What's up? [15:57:26] morning Coren :) [15:57:33] Coren: catch up >> https://etherpad.wikimedia.org/ro/r.B0XlRhOdKRWT6xuH [15:57:40] my bot!!! [15:57:46] I am back [15:57:46] my tools are unavailable [15:57:54] Coren: tools has been down for 6+ hours [15:58:02] Ohcrap! [15:58:06] gerrit bot is gone! [15:58:07] !coffee Coren [15:58:28] * Cyberpower678 gives Coren some coffee with 15 Turbo Shots. [15:58:32] Coren: I think this would qualify for a call :P [15:58:33] :) [15:58:33] Cyberpower678: wait = no operation (99% wait == 0% cpu usage in windows) [15:58:36] Why was the NFS server rebooted? [15:58:55] Coren: I didn't reboot it XD [15:59:00] not that I could [15:59:02] Yes, that should have gotten me paged. I'm not sure why it didn't and I'm going to make damn sure it will in the future. [15:59:12] * Coren curses [15:59:13] * Cyberpower678 also gives Coren beer and Red Bull [15:59:40] as long as things are not corrupt, then i am happy [15:59:50] aude, probably not [15:59:51] although i have backups / stuff in git [16:00:10] Yeah, the server was rebooted by someone who didn't know how to restart the service. Joy. 
[16:00:12] * Betacommand grabs some popcorn and watches coren [16:00:14] petan, so what does it mean when nice is consuming 98% of my CPU? [16:00:34] petan, and what does it mean when wait is consuming 98% of my cpu? [16:00:57] Cyberpower678: the "nice" CPU percentage is the % of CPU time occupied by processes with a positive nice value [16:01:42] * Coren his going to hunt down whoever rebooted this without pinging me first. [16:01:42] Basically it's CPU time that's currently "in use", but if a normal (nice value 0) or high-priority (negative nice value) process comes along those programs will be kicked off the CPU. [16:02:02] addshore, you essentially just defined that with the term. Means nothing to me. [16:02:48] addshore, no define wait. [16:02:51] *now [16:03:07] * Cyberpower678 hands Coren a torch and a pitchfork. [16:03:16] Cyberpower678: have you tried googling any of these? :P [16:03:29] must I say gtfq? ;p [16:03:52] addshore, no. I'm too stupid to know how. :p [16:03:52] NFS server is coming back online, the instances should start waking up shortly. [16:04:13] thanks Coren [16:04:22] Cyberpower678: wait is percentage of time that the CPU were idle during which the system had an outstanding disk I/O request [16:04:27] I'm sticking around. [16:04:32] Memory usage on Cyberbot is growing critical. [16:05:00] addshore, NFS caused the wait. :p [16:05:08] Cyberpower678: yes... [16:05:29] cause everything is a waiting for nfs to come back ;p [16:05:55] Current Load Avg (15, 5, 1m): [16:05:55] 3220%, 3307%, 3325% [16:05:55] Avg Utilization (last hour): [16:05:55] 1680% [16:06:00] Basically. I can still be 3-4 minutes before NFS proper comes back, then the backlog will slowly catch up. [16:06:08] Load at 1k [16:06:28] LOL hashar you changed topic just when it's coming back :P [16:06:30] Cyberpower678: everything is just waiting, dont worry :P [16:06:56] &#^@%. I should have been paged for this. 
[16:07:27] Coren, go slap the paging machine ;p [16:09:45] -login seems to be the worse off of the bunch; all the cron job piled up. [16:09:54] So why is the load still rising? [16:10:09] Coren, who's crontab is causing that? [16:10:20] Because NFS has exponential backoff; the longer it's been down the slower it is to recover. [16:10:47] Nobody in particular; cron just keeps starting jobs at the appointed time; but the jobs can't complete so they pile up. [16:11:04] Yeah, but why aren't they jsubbed? [16:11:38] Cyberpower678: they may be cron still has to submit them [16:12:08] Ah. I see nodes beginning to talk to NFS again. [16:13:39] heh * watches IO go up :D* [16:14:16] * Cyberpower678 watches tools recover [16:14:52] gerrit bot is back ;) [16:14:54] :D [16:14:58] :) [16:15:17] Go easy on the bastions people, if everyone tries to log in at once it's just going to make their recovery slower. [16:15:19] AHHhhh [16:15:22] Tools is spiking [16:15:38] Load just went from 1.0 to 1.5 [16:15:41] k [16:18:02] tools-dev has recovered enough to allow logins. -login might take another 10-15 before it's really usable. [16:20:20] Load is crawling back down on -login. [16:21:08] I'm going to have words with the rest of ops. They have my cell number, this deserved a call. [16:21:26] Please accept my apologies anyone; this'd have been fixed much earlier if I had known about it. [16:22:15] My bot never crashed. They were suspended. [16:22:28] Cyberpower678: perfect ;p [16:22:38] *sigh* And today is full of "team building" stuff. The postmortem on labs-l might only come later today; but the short of it: I have no idea why the NFS server was rebooted, but when it was the "start the actual NFS service" script wasn't invoked. [16:22:49] Spambot was still in operation this entire time. [16:22:57] Cyberpower678: Yes, that's the expected behaviour for a bot that doesn't have a hard timeout. 
[16:23:13] perhaps the "reach out and get a dev" trigger needs to be better advertised [16:23:30] Hasteur: I did reach out and poke the ops :p [16:23:58] Coren: did you read my pasted log? [16:24:15] Hasteur: That was a communication problem; normally I, Ryan or Andrew should have been pinged the minute it was known to be a labs thing, and all three of us would have been able to do this. [16:24:36] addshore: I did, that tells my "what" not "why". :-) [16:24:56] hehe, well, everything went down hours before that conversation took place [16:25:35] Clearly. I'm going to make damn sure there is a nagios somewhere that know to make all my stuff go beep if that happens again. [16:26:01] that was my attempt to poke anyone that could possible do anything :P Ialso stalking yours ryans and tims nick :p [16:26:26] -login is back to reasonable state. As far as I can tell, the grid didn't break; it just patiently waited for the filesystem to come back. [16:26:37] Coren: How do you handle such things in the ops team? Do you call each other in the middle of the night or is there some documentation that other ops (awake ones) can look at and act? [16:26:38] Coren: I think you mean icinga ;p [16:26:58] Maintainers may need to brace themselves for a small flood of email from cron telling them how their stuff wedged for a while. [16:27:55] Silke_WMDE: Most of the time, documentation is clear enough to avoid the phonecalls; the NFS server is kinda odd because of the hardware issue the doc doesn't quite match (but everyone is aware of it). [16:28:10] i see [16:28:11] But yeah, if we're stuck we normally will call each other up. [16:29:12] !log deployment-prep rebooted bastion after some nfs outage. Stopped udp2log, started udp2log-mw [16:29:15] Logged the message, Master [16:29:34] I should have been called hours ago; that I wasn't is an error I'm gong to make sure doesn't happen again. 
:-) [16:29:50] Coren: give me your phone number ;p [16:30:32] addshore: I'd rather not spread it around /too/ much, but any WMF staffer has access to my phone number. [16:31:28] I'm going to make sure apergos knows that it's okay to call me, timezone notwithstanding. :-) [16:31:58] Coren: what timezone are you actually in right now? [16:31:59] I'll make sure to include the words "Call coren" in my next request if this ever happens again ;p [16:32:30] Coren: thank you :-] [16:32:30] Silke_WMDE: SF I think. [16:32:32] Silke_WMDE: PDT; I'm in SF [16:32:51] ah, nice! [16:32:56] Coren: hihi [16:33:12] Speaking of "in SF", I've got a meeting starting. Things seem to be back up; I'm going to keep an eye on them for the next few hours regardless. [16:33:34] thanks and see you soon! [16:33:51] Coren|Away: beta is fully back up :-) enjoy your meeting [16:35:49] Coren|Away: ping [16:36:19] petan: ping [16:36:25] CP678|iPhone: he is going to a meeting :P [16:36:40] Petan too? [16:37:08] no.. [16:40:25] petan: do you have access to Ganglia? [16:42:43] CP678|iPhone: ... everyone does... [16:43:08] To modify it I [16:43:12] Mean. [16:43:16] also just a note the load on the NFS has dropped back to a normal level of IO so I guess everything has caught up :) [16:43:21] CP678|iPhone: to modify what about it? [16:44:15] The graphs [16:44:27] modify what about the graphs.... >.< [16:44:54] Change them to SVG. They'd look so much better that way. [16:46:00] is there an option to do so? [16:46:32] personally I think they look fine ;p http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xxlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Labs+NFS+cluster+pmtpa [16:47:08] Not that I see. I think they look like shit right. They're blurry and unclear. [16:47:32] I think you might need a new screen CP678|iPhone ;p [16:47:40] or maybe use something that isnt your iphone :P [16:47:57] Erm. Hello? Retina display here. 
[16:48:30] It looks like shit on my MacBook Pro retina as well. [16:48:54] all I see if a crisp 1247x788 graph ;p [16:49:49] I run it on windows and it's super tiny and on Mac it scales it to look more normal but it's looks like shit. [16:50:35] Can't they be switched to SVG? It's a simple fix. [16:53:36] addshore, can you adjust ganglia? [16:54:14] CP678|iPhone: well, if ganglia has a configurable option to switch to svg then sure it probably can be don (file a bug) [16:54:16] *done [16:54:48] It uses that graph tool thingy which I know can be switched to SVG. [16:54:53] but I really dont see a problem with a 1247x788 pixel graph... thats bigger than some peoples screens... [16:55:33] addshore, here's an idea. Set zoom to 200% and then tell me what you see? [16:56:16] Cyberpower678: that makes the graph 4 times bigger than my screen... why would I want to? [16:56:20] what are you trying to look at? [16:57:04] addshore, it will give you an idea of what I'm looking at with normal zoom aside from the fact that I have a vastly superior screen than you do. [16:57:45] morning Ryan_Lane :) [16:57:57] Cyberpower678: what will give me an idea of what your looking at? [16:58:00] addshore, take a smaller graph and zoom it. [16:58:07] howdy [16:58:08] The blurriness. [16:58:28] Cyberpower678, why would I deliberately pick a smaller graph though if I wanted to zoom in >.< where is the login in that? [16:58:31] *logic [16:58:46] addshore, you're not getting me. [16:58:52] Ryan_Lane: you missed an eventful 6 hours or so ;p [16:59:56] addshore, my screen has twice the resolution of a normal screen. I need a zoom of 200% on my browser to see exactly what people with 100% zoom see. [17:00:21] So a picture that looks crisp on your screen looks like shit on mine. [17:00:26] Cyberpower678: what resolution? 
[17:00:33] addshore: yep [17:00:54] Cyberpower678: a picture that is on your screen will look identical, if not better on your screen at the same actual size [17:01:46] 2880 x 1800 on a 15 inch screen [17:02:29] addshore, zoom the graph into 200% and you'll see what I'm seeing. [17:02:40] Just orders of magnitude larger on yours. [17:02:44] Cyberpower678: thats not how it works >.< [17:02:51] Yes it does. [17:02:54] if you have twice the resolution [17:03:04] and you zoom to 200% [17:03:09] we will see the same image [17:03:11] roughly [17:03:34] same size, yes. Same quality no [17:03:44] I see a blurry image and you see a crisp image. [17:03:52] no, yours will have twice as many pixels and the quality should appear the same [17:04:01] Cyberpower678: send me a screenshot of what you see? [17:04:02] No.\ [17:04:50] The resolution of the picture is the same and what happens when you take a 100x100 image and resize it 200x200? [17:04:58] will it: [17:05:03] a. Increase in detail [17:05:09] b. become grainy [17:06:13] addshore, ^ [17:06:51] but what are you trying to look at that means you have to zoom in..? [17:07:07] Ganglia [17:07:32] WHAT ON GANGLIA? [17:07:44] The freakin graphs. :| [17:08:05] WHICH FRIGGIN GRAPH... (the same question I asked like half an hour ago) [17:08:19] and which tiny bit of detail do you need that you cannot see without zomming in? [17:08:49] They are PNG images rendered in 397px × 239px which gets resized to 794px x 478px through the zoom. [17:09:09] addshore, any graph. [17:09:23] Cyberpower678: why not look at the 1247x788 graphs... (like I said about half an hour ago) [17:09:28] such as http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xxlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Labs+NFS+cluster+pmtpa [17:09:37] which I also linked to.. [17:09:48] Just as crappy [17:10:00] They are blurred. [17:10:05] just as crappy, its 3 times the size...
[17:10:18] and its not blured or zoomed, it is rendered at that resolution [17:10:31] addshore, you don't get it. [17:11:19] I do, which is why I also said "well, if ganglia has a configurable option to switch to svg then sure it probably can be done (file a bug)" [17:11:33] My browser zooms everything at 200% including pictures. It will look tiny if I zoomed back to 100%. It looks crisp at 100% and grain 200% [17:12:10] change your resolution then? :P thats not really a problem with ganglia.. just the way your choosing to view it! [17:12:24] /facepalm [17:12:30] I give [17:13:27] It's not a problem with Ganglia. I never said it was, but it can certainly support browsers that have ultra high resolution screens. [17:13:45] Cyberpower678: so make a bug in the relevant place :P [17:13:48] Wikipedia made the switch a while back [17:14:03] addshore, link? [17:14:06] after a quick google ganglia doesnt support svgs.. [17:14:10] Oy vei [17:14:18] its currently on the wishlist [17:14:19] http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_wish-list [17:14:33] It uses the RRDTool no? [17:14:40] yes [17:14:50] That can be set to generate SVG [17:14:55] It's used on ACC. [17:15:00] So I know. [17:15:13] And that graphs look amazing when switched to SVG. [17:15:18] yes but thats rrdtool itself, this is rrdtool through ganglia [17:15:43] Add an option to allow switching to SVG in-line RRDtool graphs. [17:15:43] This should be pretty easy to add as a config option. I think support for SVG in current browsers is now "good enough". A half-way modern version of RRDTool can generate SVG versions of the graphs, which should look much better. [17:15:53] so I gather its not yet possible in ganglia [17:18:01] -.- [17:18:27] Shouldn't be too hard. Just change what ever is submitting the RRDTool configuration. [17:18:49] Cyberpower678: so make a bug! 
[17:18:57] Quite a scrollback [17:18:59] :p [17:20:03] xD [17:20:09] now I must get back to work ;p [18:09:32] (PS1) Yuvipanda: Fix stupid syntax error on config YAML file [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83671 [18:09:46] (CR) Yuvipanda: [C: 2] Fix stupid syntax error on config YAML file [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/83671 (owner: Yuvipanda) [18:09:54] addshore, I've discovered some major bugs in every script. [18:10:06] ?? [18:10:25] addshore, incredible memory leaks [18:10:27] in what script where and what bugs... [18:11:24] Looks, like I need to work on memory conservation [18:12:48] heh [18:15:02] So far I discovered that when I restarted rfx tally, status reporter, and the task checker, it consumed less memory after it started than it did when I tossed it out. [18:22:01] anyone know how to set the email target for cron notifications ? [18:35:49] Betacommand, not available yet. [18:37:45] why not? [18:37:56] Ask Coren. [19:00:55] boom! [19:00:57] back [19:00:57] :D [19:00:58] http://blue-dragon.wmflabs.org/wiki/Main_Page [19:01:40] Ryan_Lane: so how do we setup https for the proxy? [19:03:52] YuviPanda, make me a crat. :p [19:04:00] hmm? [19:04:18] for the blue dragon wiki [19:04:39] why? [19:04:41] it's a testwiki [19:04:42] for stuff [19:04:47] Just because. [19:06:27] hi everyoby [19:06:30] YuviPanda: start with a self-signed cert [19:06:40] YuviPanda: we'll work on adding in a real cert [19:06:45] we need to buy a * cert [19:06:47] Ryan_Lane: makes sense. [19:07:13] Ryan_Lane: do we already have some nginx https work we can use? [19:07:16] I have a problem with become command on tool lab, can anybody help [19:07:17] or do we have to write it again? [19:07:47] Ryan_Lane: I am hoping I don't have to setup ssl myself, mostly because I don't really trust myself to do that right... [19:07:48] pouyana_, shoot [19:07:58] YuviPanda: puppet does it for you [19:08:07] is there a class / role? [19:08:14] oh.
I see what you mean [19:08:15] Cyberpower678: I added new too; [19:08:18] tool [19:08:20] look at the protoproxy work [19:08:26] but I can not become my tool [19:08:29] * YuviPanda greps [19:08:37] my username is a group member of the tool [19:08:45] wikidev project-bastion project-tools local-fatg [19:08:51] What's the tool? [19:08:52] is my groups output [19:08:59] local-fatg [19:09:07] created it a while ago [19:09:44] output of become: become: no such tool 'fatg' [19:10:11] local-fatg also dosn't work [19:11:13] where are you SSHing into? [19:11:28] tools-login [19:11:30] petan, why can't I see all existing tools on Special:NovaProject? [19:12:01] fatg doesn't exist [19:12:19] https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup [19:12:25] but I see it here [19:13:01] maybe I have created it in a wrong place :D? [19:13:23] the folder for it is non-existent [19:13:41] When did you make it? [19:14:01] 20-30 min ago or so [19:15:51] Give it some time to create. [19:21:22] @replag [19:29:00] Coren|Away, ping [19:30:08] petan, ping [19:31:23] addshore, ping [19:45:25] Cyberpower678: please stop pinging, just drop an email [19:46:42] Betacommand, I'm not pinging you. [19:47:05] Cyberpower678: its annoying to see my screen fill up with your pings [19:47:28] Betacommand, I'm not pinging you. [19:47:45] Cyberpower678: it also pisses off petan [19:48:12] But I'm not pinging you. [19:48:57] WTF?? [19:49:08] Riley just tried to access my IRC account [20:21:09] Cyberpower678: None of use opsen actually have IRC magic bits. [20:30:36] Did tools web servers go down today? [20:30:42] Also sad times bots sql server going away :( Better get around to moving the bots [20:35:14] YuviPanda: could you manually convert my pull request to a gerrit patchset, so I can make a writeup on using github desktop to contribute? 
:-) [20:35:28] step 8) poke YuviPanda and offer him stroopwafels [21:06:28] Damianz: The tools cluster went down due to NFS mount issues about 10 hours ago [22:03:09] mutante: poke me when you have it removed, and I'll flip the switch :) [22:09:11] YuviPanda: oh noes.. i can't do 2-factor because my phone is broken :p [22:09:20] darn [22:09:28] looks for backup codes [22:33:29] mutante: ow [23:52:02] mutante: SSL! https://blue-dragon.wmflabs.org/wiki/Main_Page [23:52:07] Ryan_Lane: SSL for the proxy :) https://blue-dragon.wmflabs.org/wiki/Main_Page [23:52:13] (self signed, need to puppetize) [23:52:32] i broke http but. fixing [23:55:07] okay, that's fixed too [23:55:09] woohoo [23:55:13] now to puppetize this