[00:58:52] PROBLEM Current Load is now: CRITICAL on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: Connection refused by host [00:59:32] PROBLEM Current Users is now: CRITICAL on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: Connection refused by host [01:00:13] PROBLEM Disk Space is now: CRITICAL on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: Connection refused by host [01:01:02] PROBLEM Free ram is now: CRITICAL on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: Connection refused by host [01:02:22] PROBLEM Total processes is now: CRITICAL on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: Connection refused by host [01:02:52] PROBLEM dpkg-check is now: CRITICAL on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: Connection refused by host [01:04:32] RECOVERY Current Users is now: OK on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: USERS OK - 0 users currently logged in [01:05:12] RECOVERY Disk Space is now: OK on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: DISK OK [01:06:02] RECOVERY Free ram is now: OK on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: OK: 670% free memory [01:06:32] PROBLEM Total processes is now: WARNING on bots-salebot i-00000457.pmtpa.wmflabs output: PROCS WARNING: 173 processes [01:07:22] RECOVERY Total processes is now: OK on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: PROCS OK: 88 processes [01:07:52] RECOVERY dpkg-check is now: OK on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: All packages OK [01:08:52] RECOVERY Current Load is now: OK on mwreview-abogott-dev2 i-00000547.pmtpa.wmflabs output: OK - load average: 0.09, 0.58, 0.53 [01:11:24] RECOVERY Total processes is now: OK on bots-salebot i-00000457.pmtpa.wmflabs output: PROCS OK: 99 processes [01:39:54] PROBLEM host: i-00000548.pmtpa.wmflabs is DOWN address: i-00000548.pmtpa.wmflabs CRITICAL - Host Unreachable (i-00000548.pmtpa.wmflabs) [01:43:52] RECOVERY host: i-00000548.pmtpa.wmflabs is UP address: i-00000548.pmtpa.wmflabs PING OK - Packet loss = 0%, RTA = 6.77 ms [01:44:22] PROBLEM Total processes is now: CRITICAL on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: Connection refused by host [01:45:54] PROBLEM Current Load is now: CRITICAL on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: Connection refused by host [01:45:54] PROBLEM dpkg-check is now: CRITICAL on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: Connection refused by host [01:46:34] PROBLEM Current Users is now: CRITICAL on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: Connection refused by host [01:47:14] PROBLEM Disk Space is now: CRITICAL on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: Connection refused by host [01:48:04] PROBLEM Free ram is now: CRITICAL on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: Connection refused by host [01:53:02] RECOVERY Free ram is now: OK on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: OK: 927% free memory [01:54:22] RECOVERY Total processes is now: OK on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: PROCS OK: 84 processes [01:55:54] RECOVERY Current Load is now: OK on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: OK - load average: 0.07, 0.62, 0.53 [01:55:54] RECOVERY dpkg-check is now: OK on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: All packages OK [01:56:34] RECOVERY Current Users is now: OK on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: USERS OK - 0 users currently logged in [01:57:14] RECOVERY Disk Space 
is now: OK on mwreview-abogott-dev3 i-00000548.pmtpa.wmflabs output: DISK OK [02:29:09] !log wikidata-dev disable memcached on test-repo, in local settings [02:29:11] Logged the message, Master [02:41:33] RECOVERY Free ram is now: OK on bots-sql2 i-000000af.pmtpa.wmflabs output: OK: 21% free memory [02:59:32] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af.pmtpa.wmflabs output: Warning: 15% free memory [03:10:24] PROBLEM Free ram is now: WARNING on wikistream-1 i-0000016e.pmtpa.wmflabs output: Warning: 11% free memory [04:05:31] 12/20/2012 - 04:05:31 - Updating keys for ant at /export/keys/ant [04:17:44] !log wikidata-dev re-enable memcached on test repo + clients, setup CDB file localisation caching [04:17:47] Logged the message, Master [04:19:32] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 152 processes [04:24:33] RECOVERY Total processes is now: OK on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS OK: 150 processes [06:28:33] PROBLEM Total processes is now: WARNING on vumi-metrics i-000004ba.pmtpa.wmflabs output: PROCS WARNING: 151 processes [06:35:22] RECOVERY Free ram is now: OK on wikistream-1 i-0000016e.pmtpa.wmflabs output: OK: 37% free memory [06:48:33] RECOVERY Total processes is now: OK on vumi-metrics i-000004ba.pmtpa.wmflabs output: PROCS OK: 147 processes [08:16:40] Change on 12mediawiki a page Developer access was modified, changed by Edinwiki link https://www.mediawiki.org/w/index.php?diff=618622 edit summary: /* User:Edinwiki */ [08:32:32] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 152 processes [08:37:33] RECOVERY Total processes is now: OK on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS OK: 150 processes [08:57:10] !log manually updated git puppet repo on deployment-video05 [08:57:10] manually is not a valid project. 
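The failed !log above and the retry just below show the shape of the admin-log command: the first word after !log must be a valid project name, or the bot rejects the message. A minimal Python sketch of that dispatch (KNOWN_PROJECTS and log_to_sal are hypothetical stand-ins, not the real labs-morebots internals):

```python
# Hypothetical sketch of the "!log <project> <message>" dispatch.
KNOWN_PROJECTS = {"deployment-prep", "wikidata-dev", "analytics", "bots"}

def handle_log(line, nick):
    # Expected form: "!log <project> <free-form message>"
    parts = line.split(None, 2)
    if len(parts) < 3 or parts[0] != "!log":
        return "To log a message, type !log <project> <message>."
    project, message = parts[1], parts[2]
    if project not in KNOWN_PROJECTS:
        # The failure mode seen above with "!log manually updated ..."
        return "%s is not a valid project." % project
    log_to_sal(project, message, nick)  # append to the Server Admin Log page
    return "Logged the message, Master"

def log_to_sal(project, message, nick):
    pass  # the wiki edit itself is omitted in this sketch
```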
[08:58:53] !beta manually updated git puppet repo on deployment-video05 [08:58:54] !log deployment-prep manually updated git puppet repo on deployment-video05 [08:58:57] Logged the message, Master [08:59:33] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core i-000004f9.pmtpa.wmflabs output: WARNING - load average: 5.16, 5.27, 5.10 [09:03:59] on the labsconsole main page there is a dot behind access and privileges in section General [09:11:51] checking out [09:12:32] fixed [09:24:34] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core i-000004f9.pmtpa.wmflabs output: OK - load average: 4.81, 4.87, 4.96 [09:25:39] thx [10:28:22] RECOVERY Total processes is now: OK on dumps-bot1 i-000003ed.pmtpa.wmflabs output: PROCS OK: 99 processes [10:30:32] RECOVERY Current Users is now: OK on dumps-bot2 i-000003f4.pmtpa.wmflabs output: USERS OK - 0 users currently logged in [10:31:12] RECOVERY Disk Space is now: OK on dumps-bot2 i-000003f4.pmtpa.wmflabs output: DISK OK [10:31:32] RECOVERY Disk Space is now: OK on dumps-bot3 i-00000503.pmtpa.wmflabs output: DISK OK [10:32:02] RECOVERY Free ram is now: OK on dumps-bot2 i-000003f4.pmtpa.wmflabs output: OK: 3968% free memory [10:32:12] RECOVERY Free ram is now: OK on dumps-bot3 i-00000503.pmtpa.wmflabs output: OK: 3963% free memory [10:32:42] RECOVERY SSH is now: OK on dumps-bot2 i-000003f4.pmtpa.wmflabs output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:33:22] RECOVERY Total processes is now: OK on dumps-bot2 i-000003f4.pmtpa.wmflabs output: PROCS OK: 107 processes [10:33:32] RECOVERY Total processes is now: OK on dumps-bot3 i-00000503.pmtpa.wmflabs output: PROCS OK: 104 processes [10:34:12] RECOVERY dpkg-check is now: OK on dumps-bot2 i-000003f4.pmtpa.wmflabs output: All packages OK [10:34:22] RECOVERY dpkg-check is now: OK on dumps-bot3 i-00000503.pmtpa.wmflabs output: All packages OK [10:34:22] RECOVERY Current Load is now: OK on dumps-bot2 i-000003f4.pmtpa.wmflabs output: OK - load average: 0.08, 0.11, 0.05 [10:34:22] RECOVERY Current Load is now: OK on dumps-bot3 i-00000503.pmtpa.wmflabs output: OK - load average: 0.03, 0.08, 0.05 [10:35:43] RECOVERY Current Users is now: OK on dumps-bot3 i-00000503.pmtpa.wmflabs output: USERS OK - 1 users currently logged in [10:35:53] RECOVERY SSH is now: OK on dumps-bot3 i-00000503.pmtpa.wmflabs output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:46:26] hmm, anyone had success getting the new /home storage activated for lucid instances? 
[10:48:30] Hydriz: you have to reboot [10:48:39] you don't say :P [10:48:49] but after reboot, I got back the old share [10:49:11] k [10:52:44] oh wait, I got it already haha [10:53:13] RECOVERY Total processes is now: OK on incubator-apache i-00000211.pmtpa.wmflabs output: PROCS OK: 136 processes [11:56:08] !cmds @labs-resolve @labs-info @labs-project-info @labs-project-users @labs-project-instances @labs-user [11:56:17] !cmds is special commands in this channel: @labs-resolve @labs-info @labs-project-info @labs-project-users @labs-project-instances @labs-user [11:56:18] Key was added [11:56:26] !cmds del [11:56:26] Successfully removed cmds [11:58:06] !log wikidata-dev disable memcached on wikidata test repo and clients [11:58:08] Logged the message, Master [11:58:38] @labs-info [11:58:43] !cmds is @labs-resolve @labs-info @labs-project-info @labs-project-users @labs-project-instances @labs-user [11:58:43] Key was added [11:58:59] @labs-info deployment-apache32 [11:59:00] [Name deployment-apache32 doesn't exist but resolves to I-0000031a] I-0000031a is Nova Instance with name: deployment-apache32, host: virt8, IP: 10.4.0.166 of type: m1.large, with number of CPUs: 4, RAM of this size: 8192M, member of project: deployment-prep, size of storage: 90 and with image ID: ubuntu-12.04-precise [12:33:31] hey addshore [12:33:42] heya petan :) [12:34:56] @labs-project-info hugglewa [12:34:56] The project Hugglewa has 2 instances and 4 members, description: The Huggle project web interface edition. [12:35:01] mm [12:37:13] gonna re write my bots, but cant decide on a language :/ [12:37:18] c# [12:37:20] :) [12:37:34] can't think of a better one, except for c++ [12:37:56] depends if u need performance or if you need to save time working on it [12:38:05] a bit of both :p [12:38:06] stuff you make in c++ in 1 month, you can do in 1 day in c# :P [12:38:21] and then dotnetiwki classes, awb classes, or my own classes xD [12:38:41] dotnetwikibot is kind of working [12:38:48] I use it for all my wiki bots [12:39:50] does it have any links to irc? [12:39:59] ? [12:40:04] irc feeds? [12:40:07] ah [12:40:08] no [12:40:12] you need to make your own :P [12:40:16] thats my job today then ;p [12:40:17] which is pretty easy [12:40:25] you can copy paste code of wm-bot which use it [12:40:28] :D [12:40:31] Type @commands for list of commands. This bot is running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 
1.10.4.60 source code licensed under GPL and located at https://github.com/benapetr/wikimedia-bot [12:40:32] if i dont get distracted, which usually happens when i trya nd sit down and do these things >.< [12:40:39] true xD [12:41:33] https://github.com/benapetr/wikimedia-bot/blob/master/plugins/wmib_rc/wmib_rc/RC.cs [12:41:44] though it's hard to read :D :D [12:42:09] * addshore slaps petan with a large trout for making it hard to read [12:43:34] start reading at line 400 :P [12:43:42] that's what you are interested in [12:44:11] Recent Changes :P [12:44:19] yoink [12:45:16] hmm, your bot looks rather, substantial [12:45:21] mmmmm [12:45:23] in fact [12:45:31] there are 2 function which are interesting only [12:45:47] wmib.RegularModule.Load() and RecentChanges.Connect() [12:45:57] these 2 do connection to irc + parsing of feed [12:46:04] and are quite short [12:46:10] rest of that is for bot only [12:49:55] HAha [13:56:33] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 152 processes [14:11:34] RECOVERY Total processes is now: OK on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS OK: 150 processes [14:45:30] 12/20/2012 - 14:45:30 - Updating keys for mwang at /export/keys/mwang [15:28:34] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core i-000004f9.pmtpa.wmflabs output: WARNING - load average: 4.98, 5.09, 5.04 [15:38:32] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core i-000004f9.pmtpa.wmflabs output: OK - load average: 4.46, 4.83, 4.96 [15:46:32] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core i-000004f9.pmtpa.wmflabs output: WARNING - load average: 4.79, 5.23, 5.16 [16:46:32] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core i-000004f9.pmtpa.wmflabs output: OK - load average: 4.12, 4.32, 4.96 [17:11:36] !logs [17:11:53] !htmllogs [17:11:53] experimental: http://bots.wmflabs.org/~wm-bot/html/%23wikimedia-labs [18:35:53] PROBLEM Current Load is now: CRITICAL on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: Connection refused by host [18:36:33] PROBLEM Current Users is now: CRITICAL on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: Connection refused by host [18:37:13] PROBLEM Disk Space is now: CRITICAL on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: Connection refused by host [18:37:23] PROBLEM Total processes is now: CRITICAL on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: Connection refused by host [18:37:53] PROBLEM dpkg-check is now: CRITICAL on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: Connection refused by host [18:38:03] PROBLEM Free ram is now: CRITICAL on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: Connection refused by host [18:40:14] !log [18:40:18] !logs [18:40:22] !htmllogs [18:40:22] experimental: http://bots.wmflabs.org/~wm-bot/html/%23wikimedia-labs [18:40:31] 12/20/2012 - 18:40:31 - Updating keys for kipod at /export/keys/kipod [18:50:54] RECOVERY Current Load is now: OK on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: OK - load average: 1.41, 1.32, 0.96 [18:51:34] RECOVERY Current Users is now: OK on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: USERS OK - 0 users currently logged in [18:52:14] RECOVERY Disk Space is now: OK on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: DISK OK [18:52:24] RECOVERY Total processes is now: OK on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: PROCS OK: 84 processes [18:52:54] RECOVERY dpkg-check is now: OK on 
mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: All packages OK [18:52:54] RECOVERY Free ram is now: OK on mwreview-abogott-dev4 i-00000549.pmtpa.wmflabs output: OK: 926% free memory [18:55:17] labs-morebots, I think benestar_ was wondering about you. [18:55:17] I am a logbot running on i-0000015e. [18:55:17] Messages are logged to labsconsole.wikimedia.org/wiki/Server_Admin_Log. [18:55:17] To log a message, type !log . [18:59:14] andrewbogott: thanks ;) [18:59:43] benestar: I just added that feature so I never miss a chance to show it off [19:00:18] xD [19:40:30] 12/20/2012 - 19:40:30 - Updating keys for mwang at /export/keys/mwang [19:41:33] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 151 processes [19:46:32] RECOVERY Total processes is now: OK on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS OK: 149 processes [20:22:36] <^demon> Ryan_Lane: I have some simple ldap questions. [20:23:16] shoot [20:23:21] <^demon> a) Can groups be made to inherit members from other groups? [20:23:52] <^demon> b) What are your thoughts about moving all gerrit groups out of gerrit and into ldap? [20:24:00] a) difficult [20:24:17] <^demon> saper! [20:24:17] a) yes [20:24:19] <^demon> :) [20:24:22] b) good [20:24:25] :) [20:24:33] question is, of course, if gerrit *supports* it [20:24:53] and yes, it's fine if we move groups into ldap [20:24:56] <^demon> Hmm, well would ldap return a flattened list, or would it expect gerrit to dive into it? [20:24:56] will be easier to implement groups in LDAP than to fight UUID support [20:25:01] it's more than fine, it's great [20:25:05] ldap is just a database [20:25:11] it will not flatten for you [20:25:20] ^demon: I am not sure you can write recursive ldap query [20:25:35] group inheritance is done in the client implementation [20:25:43] <^demon> Blargh. Well that's the sticking point then. [20:25:46] you have to implement the recursion yourself [20:25:51] (that's what I meant by difficult -> dive yourself in all implementations using the LDAP groups for anything) [20:26:01] <^demon> Group inheritance in gerrit is completely fubar'd for external groups. [20:26:15] Ryan_Lane: seen my mail wrt ldap btw? [20:26:17] <^demon> So we come back to the original problem :\ [20:26:21] simplistic PAM modules won't be able to understand inheritance [20:26:49] paravoid: not yet [20:27:05] indeed [20:27:12] we aren't using those groups in pam, though [20:27:19] we're using the openstack groups, but they are flat [20:27:22] <^demon> saper: btw, Shawn merged that "fake ldap" script someone wrote and posted to repo-discuss. It's in contrib/ [20:27:42] contrib hosts all the random crap that's useful to 1 in every few hundred people [20:28:31] paravoid: which ldap email is this? [20:28:41] petan's issue [20:30:35] paravoid: what's the subject? [20:30:55] EU based admins [20:31:18] oh, yeah, I saw you replied that you lived in the EU :D [20:31:22] not that [20:31:35] oh [20:31:40] I somehow missed that part of the reply [20:31:59] hm [20:31:59] ^demon: will have a look at it, thanks [20:32:02] there's a second mail, found it? [20:32:06] yes [20:32:09] so... [20:32:13] * Damianz jumps up and down and waves at paravoid.... long time no see [20:32:25] ottomatta was nice enough to write a change to openstackmanager [20:32:33] we needed to change how the groups worked [20:32:36] oh really? 
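An aside on the bot-rewrite chat earlier (addshore and petan, around 12:37-12:46): the two steps petan points at in wm-bot's RC.cs are connecting to the recent-changes IRC feed and parsing each line. A rough Python equivalent, assuming the public Wikimedia feed server and a per-wiki channel; the parsing is deliberately simplified compared to RC.cs:

```python
# Minimal sketch: connect to the Wikimedia recent-changes IRC feed and
# yield each change line with the mIRC colour codes stripped.
import re
import socket

def rc_feed(server="irc.wikimedia.org", channel="#en.wikipedia"):
    sock = socket.create_connection((server, 6667))
    sock.sendall(b"NICK rc-reader\r\nUSER rc-reader 0 * :rc reader\r\n")
    buf = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break
        buf += data
        *lines, buf = buf.split(b"\r\n")
        for raw in lines:
            line = raw.decode("utf-8", "replace")
            if line.startswith("PING"):
                sock.sendall(b"PONG" + raw[4:] + b"\r\n")
            elif " 376 " in line:  # end of MOTD: safe to join the channel
                sock.sendall(("JOIN %s\r\n" % channel).encode())
            elif "PRIVMSG" in line:
                text = line.split(" :", 1)[1]
                # strip mIRC colour/bold codes the feed uses for formatting
                yield re.sub(r"\x03\d{0,2}(,\d{1,2})?|\x02", "", text)

for change in rc_feed():
    print(change)  # e.g. "[[Some page]] diff-url * editor * (+123) summary"
```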
[20:32:40] yes [20:32:49] Ryan_Lane: we might be tempted to use LDAP one day, not sure if it's worth making all other apps more difficult because of gerrit requirements... [20:32:49] cool! [20:32:57] change how? [20:32:57] basically we deleted all the groups and recreated them [20:33:02] I guess this one failed [20:33:13] what did you change? [20:33:16] before I was using a feature of opendj to replicate the member list [20:33:24] I remember [20:33:26] turns out that feature wasn't meant to be used that way [20:33:32] and it was actually a bug that it worked :) [20:33:36] ahaha! [20:33:49] so, now openstackmanager will manage the group and project membership directly [20:33:55] and keep them in sync? [20:34:08] My theory is gluster has no features and anything that works is a bug [20:34:08] and we have a maintenance script that will keep them in sync, if they are wrong for some reason [20:34:21] aha [20:34:25] <^demon> Well, with 2.5+ we won't actually add the ldap groups in gerrit. [20:34:40] I'm guessing something went wrong with the maintenance script [20:34:43] <^demon> We just have to change all the acls to use ldap/foo as the group names (which we're going to have to do anyway for the existing ldap groups) [20:34:47] I did have to ctrl-c it [20:34:55] and I've only ran it once [20:35:06] heh [20:35:09] I wasn't aware of all that [20:35:14] so there was some head scratching [20:35:22] trying to find out how a group might be missing [20:35:24] yeah. I should have written an email about it [20:35:30] no, that's okay [20:35:38] 12/20/2012 - 20:35:38 - Updating keys for mwang at /export/keys/mwang [20:35:42] just saying [20:35:50] makes much more sense now [20:35:54] yeah [20:36:01] <^demon> I wish there was an ssh command to adjust acls. [20:36:05] <^demon> Would make updating all this crap easier. [20:36:08] <^demon> *sigh* [20:36:12] labs-home-wm_: why do you keep updating mwang's key, eh? [20:36:19] ^demon: are you aware of ldapvi? [20:36:29] <^demon> I mean gerrit acls. [20:36:32] ah [20:36:59] <^demon> They're all stored in a file called project.config in the refs/meta/config branch on each repo. [20:37:11] ^demon: are acls in the database or git branch? [20:37:22] <^demon> branch, refs/meta/config. [20:37:23] <^demon> See ^ [20:37:29] <^demon> Which is cool for versioning and being able to submit config for review, but a pita for mass changes. [20:38:03] right; maybe it wouldn't be that hard to implement, but gerrit currently uses only simplest stuff from JGit [20:38:42] <^demon> At least jgit supports gc now. Can't wait for gerrit to use it. [20:39:09] do we collect a lot of garbage? [20:39:38] most objects should be referenced from refs/changes anyway? [20:40:02] * saper wonders if JGit supports funny non-branch refs [20:40:20] for gc [20:41:05] * Damianz rawrs and bats at fly [20:53:50] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by DaB. link https://www.mediawiki.org/w/index.php?diff=618819 edit summary: /* Various */ [20:59:06] I question how many items from the Toolserver features needed in Tool Labs page are actually needed and how many are just 'make this like toolserver' .... [21:00:47] Home directories, SVN repos [21:01:18] Wasn't there some disagreement over user databases as well? [21:01:39] <^demon> We're not setting up any new svn repos. [21:01:43] and wasn't there another list like this? [21:02:06] the not allowing user tables on the prod replica for direct joins because tis app logic? 
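On saper's and Ryan_Lane's point that LDAP will not flatten nested groups for you: a client-side sketch using the python-ldap module, assuming groupOfNames-style entries whose member values may themselves be group DNs. The server URI and base DN are placeholders:

```python
import ldap

def flatten_group(conn, group_dn, seen=None):
    """Recursively resolve nested group members, guarding against cycles."""
    seen = set() if seen is None else seen
    if group_dn in seen:  # protect against A -> B -> A loops
        return set()
    seen.add(group_dn)
    members = set()
    for _dn, attrs in conn.search_s(group_dn, ldap.SCOPE_BASE,
                                    attrlist=["member"]):
        for value in attrs.get("member", []):
            dn = value.decode() if isinstance(value, bytes) else value
            entry = conn.search_s(dn, ldap.SCOPE_BASE,
                                  attrlist=["objectClass"])
            classes = [c.decode().lower() if isinstance(c, bytes) else c.lower()
                       for c in entry[0][1].get("objectClass", [])]
            if "groupofnames" in classes:  # nested group: recurse into it
                members |= flatten_group(conn, dn, seen)
            else:                          # leaf entry, i.e. a user
                members.add(dn)
    return members

conn = ldap.initialize("ldap://ldap.example.org")  # placeholder URI
print(flatten_group(conn, "cn=project-nagios,ou=groups,dc=example,dc=org"))
```

This recursion is the cost saper describes: every consumer (Gerrit, PAM modules, ad-hoc scripts) has to repeat it, which is why flat groups kept in sync by OpenStackManager are the simpler design.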
[21:02:18] yeah there was [21:02:25] svn repos aren't going to happen [21:02:30] https://www.mediawiki.org/wiki/Wikimedia_Labs/Toolserver_features_wanted_in_Tool_Labs [21:02:32] and then [21:02:40] https://www.mediawiki.org/wiki/Wikimedia_Labs/Toolserver_features_wanted_in_Tool_Labs [21:02:44] * Damianz gives sumanah 1 toffee apple cookie [21:02:45] sumanah is too quick :) [21:02:46] https://www.mediawiki.org/wiki/Talk:Admin_tools_development#Toolserver_termination ? [21:03:04] *Free license-choice of the user for his/her tools. <— that's fine as long as it's open source [21:03:23] I argue that wtfpl is opensource! [21:03:24] wait, so there's https://www.mediawiki.org/wiki/Wikimedia_Labs/Toolserver_features_needed_in_Tool_Labs and https://www.mediawiki.org/wiki/Wikimedia_Labs/Toolserver_features_wanted_in_Tool_Labs ? needed AND wanted? [21:03:40] Yes. DaB just started the needed one [21:03:41] :) [21:04:29] <^demon> Ryan_Lane: I suppose there's nothing preventing someone from setting up a svn project in labs, but yeah, we're not going to keep svn.wm.o going for this. [21:04:31] http://lists.wikimedia.org/pipermail/toolserver-l/2012-December/005574.html [21:04:42] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618825 edit summary: /* Filesystem */ [21:04:48] "As you know the general meeting of WMDE decided that the WMF has to guarantee that the features of the toolserver exists in WikiLabs within 6 months (otherwise WMDE has to look for a way to continue the toolserver)." [21:05:30] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Krenair link https://www.mediawiki.org/w/index.php?diff=618827 edit summary: /* Web */ svn? [21:06:10] Damianz, so yes, it seems to be a 'make this like toolserver' page [21:06:10] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618829 edit summary: /* Languages */ [21:06:20] That's stupid though [21:06:27] "stupid" seems strong [21:06:30] and unnecessary [21:06:33] not really [21:06:48] it's effort and work for no benefits [21:06:52] none? [21:07:06] it seems like a good user experience for Toolserver users is a benefit [21:07:08] and means labs just becomes full of old tech.... though yes you could run ts on labs without much effort [21:07:09] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618831 edit summary: /* Web */ [21:07:18] so, you should revise your "no benefits" statement [21:07:31] no benefits technology wise [21:07:49] not so concerned with users - currently there's little benefit, new users and community there may be some...
long way down the road [21:08:03] yeah, your lack of concern for users doesn't really cut it :) [21:08:20] we have a bunch of tradeoffs to work through re: Labs tech, user experience for incoming users, etc etc [21:08:24] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618832 edit summary: /* OSM */ [21:08:37] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618833 edit summary: /* OSM */ [21:09:44] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618834 edit summary: /* Various */ [21:09:46] that reminds me that I need to prep for a meeting about this tomorrow [21:09:47] Well there's no point having good user ex with 0 decent technology... really you should have a good foundation technically to bring users onto, otherwise they see crap, get annoyed, docs don't work and eventually it's a bad ux [21:10:24] you're acting like you should spend one hundred percent of your time on technology and give zero thought to UX until and unless some kind of magical UX-less tech exists [21:10:29] in toto [21:10:31] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618836 edit summary: /* End-user-support */ [21:10:48] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618838 edit summary: /* OSM */ [21:11:22] my impatience with that kind of thinking is probably going to get in the way of me hearing what your actual opinions are, which are probably more reasonable once you articulate them properly [21:12:22] to be fair, most of the things DaB is asking for are actually needed in some way [21:12:34] just not the exact same way toolserver is doing it [21:13:03] nod [21:13:52] Ryan_Lane: ok to update https://www.mediawiki.org/wiki/Wikimedia_Labs#TODO to change "Enable database replication - Ryan hopes to get this done by the end of November or December 2012 " to a new date? I believe you said Jan/Feb? [21:13:52] "Access to (anonymized) web- and web-error-logs." [21:13:55] Damianz 2 things you have to do now :) [21:14:04] swap host and uname in nagios [21:14:12] So I'm not part of labs but I think the problem here is that only project roots can access web server logs? [21:14:12] that's first one [21:14:16] Krenair: it won't actually need to be anonymized [21:14:28] Damianz PINGG [21:14:29] our privacy policy makes that possible [21:14:46] we can easily make the web server logs readable to non-roots [21:14:58] Damianz - fix ram checks [21:15:02] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Krenair link https://www.mediawiki.org/w/index.php?diff=618841 edit summary: /* Web */ [21:15:07] they show like 43532632952% free ram [21:15:19] Ryan_Lane: and post-Kraken relevant data might even be more available to toolmakers [21:15:30] Can people already request git repos in gerrit for their labs stuff? [21:15:41] What do you mean swap uname? 
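On petan's 21:15 complaint about the ram checks (they report values like 43532632952%, and elsewhere in this log "670% free memory", which usually means the check divides by the wrong total): a sketch of a bounded check in the same output format, reading /proc/meminfo directly. The warning and critical thresholds are illustrative, not the project's real ones:

```python
#!/usr/bin/env python
import sys

def meminfo():
    info = {}
    with open("/proc/meminfo") as fh:
        for line in fh:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    return info

def main(warn=20, crit=10):
    mem = meminfo()
    # Count buffers/cache as reclaimable; dividing by MemTotal keeps the
    # result inside 0-100, unlike the readings quoted above.
    free = mem["MemFree"] + mem.get("Buffers", 0) + mem.get("Cached", 0)
    pct = 100.0 * free / mem["MemTotal"]
    if pct < crit:
        print("CRITICAL: %.0f%% free memory" % pct)
        return 2
    if pct < warn:
        print("Warning: %.0f%% free memory" % pct)
        return 1
    print("OK: %.0f%% free memory" % pct)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```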
[21:15:46] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618842 edit summary: /* Various */ [21:15:48] And I didn't write the ram check but I can make a new one [21:15:54] sumanah: yeah, that's the plan [21:15:59] Ryan_Lane: ok, editing now [21:16:01] Damianz swap I-000000blah with "nicename" [21:16:03] sumanah: we are working on the ldap integration now, in fact. [21:16:05] sumanah: Focus 100% on ux and you just move ts to labs and fix nothing [21:16:10] Damianz example: [21:16:12] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by DaB. link https://www.mediawiki.org/w/index.php?diff=618843 edit summary: /* End-user-support */ [21:16:13] petan: talk to Ryan_Lane, it's pending him fixing it [21:16:21] @labs-resolve bastion [21:16:21] I don't know this instance - aren't you are looking for: I-000000ba (bastion1), I-0000019b (bastion-restricted1), I-00000390 (deployment-bastion), [21:16:27] Krenair: yes, people can request repos for anything they want [21:16:30] Change on 12mediawiki a page Wikimedia Labs was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=618844 edit summary: /* TODO */ Database replication delayed. [21:16:32] you see bastion1 is I-00000sth [21:16:35] I'd really like to make it easier for folks to create repos [21:16:41] Damianz I need you to swap these 2 thing in nagios [21:16:41] * sumanah continues to wait for Damianz to display thoughtfulness :-) [21:16:42] they're wrong in ldap because openstack changed it so *shrug*... [21:16:53] no they are not wrong they are right [21:16:59] Damianz: what's wrong in ldap? [21:17:05] <^demon> Ryan_Lane: I would too. I brought it up at the hackathon, but we got bogged down bikeshedding. [21:17:06] your python script has 2 variables host and uname [21:17:12] I need you to swap them [21:17:17] ^demon: :( [21:17:21] I tried it but it resulted in errors in groups [21:17:24] ^demon: well, it'll likely happen at some point [21:17:30] <^demon> We ended up with something that looked cool on the board, but wasn't really all that easy, defeating the point. [21:17:32] because groups are filled with these I-things [21:17:35] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Krenair link https://www.mediawiki.org/w/index.php?diff=618845 edit summary: /* Web */ [21:17:42] i-xxx is the fwdn [21:17:45] fqdn* [21:17:53] ok, when u open nagios [21:17:55] yeah. we need to fix the host name crap [21:18:01] that I-thing is where uname should have benn [21:18:06] I want to move completely away from i-xxx [21:18:07] it can't be [21:18:08] that's why you should swap it [21:18:11] why? [21:18:13] it used to be [21:18:14] the other is not a fqdn [21:18:15] before [21:18:15] <^demon> Ryan_Lane: We could write a dummy extension to live on labsconsole. Someone requests, someone approves, presses "Create" [21:18:19] So we can't support 2 regions [21:18:19] <^demon> It sets up default acls. [21:18:20] who cares [21:18:23] <^demon> And groups. [21:18:25] It breaks if we have any vms in equiad [21:18:25] make it alias [21:18:28] ^demon: yeah, that we could [21:18:33] what? 
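On ^demon's point that per-repo ACLs live in project.config on the refs/meta/config branch, which makes mass changes a pain: the usual scripted route is to fetch that ref, edit the file, and push it back. A sketch with plain subprocess calls; the repo URL, working directory, and the foo to ldap/foo rename are illustrative:

```python
import os
import subprocess

def run(*cmd, cwd=None):
    subprocess.check_call(cmd, cwd=cwd)

def update_acl(repo_url, workdir, old="group foo", new="group ldap/foo"):
    run("git", "clone", repo_url, workdir)
    # refs/meta/config is not fetched by a normal clone
    run("git", "fetch", "origin", "refs/meta/config:config", cwd=workdir)
    run("git", "checkout", "config", cwd=workdir)
    path = os.path.join(workdir, "project.config")
    with open(path) as fh:
        text = fh.read()
    with open(path, "w") as fh:
        # e.g. repoint every ACL at the ldap/ groups, per the 20:34 plan
        fh.write(text.replace(old, new))
    run("git", "commit", "-am", "Point ACLs at ldap/ groups", cwd=workdir)
    run("git", "push", "origin", "HEAD:refs/meta/config", cwd=workdir)
```

Repeated over every repo, this is the mass-edit loop; the versioning ^demon likes is preserved because each change is an ordinary commit on refs/meta/config.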
[21:18:33] ^demon: would likely be easy to make [21:18:42] Change on 12mediawiki a page Wikimedia Labs/status was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=618846 edit summary: 2012-12-20 [21:18:44] it was exactly same before and it worked [21:18:56] <^demon> Ryan_Lane: Can labsconsole ssh to gerrit? That's the only real requirement. [21:19:00] ^demon: yep [21:19:17] Damianz howcome my c++ parser could do that and ur python thing can't?? [21:19:18] I can look at it but if I remember right the fqdn needs changing to ..wmflabs before it can work properly with nice names... [21:19:36] Damianz: that's how it works [21:19:40] The c++ parser broke... but that's unrelated [21:19:41] but my c++ parser were using it as ALIAS [21:19:45] only public DNS doesn't work that way [21:19:54] that made it look nice in nagios but working fine [21:19:59] bastion1.pmtpa.wmflabs <— nice name [21:20:05] <^demon> Ryan_Lane: There's always going to be weird repos that are like "Copy history" or "We need a weird acl" or "Convert from svn," but for the majority of cases (new extensions) we can script the whole thing. [21:20:14] ^demon: yeah [21:21:24] Damianz when I open nagios it's almost useless because I can't recognize a single host without resolving it using wm-bot [21:21:36] that complicates things [21:21:46] all old hosts are like that in ldap I swear [21:21:55] looking in a sec [21:22:03] only public dns [21:22:07] when I open nagios there is not a single server with nice name [21:22:13] hostname have always used region name [21:22:29] I am not talking about region but I-thing [21:22:33] I-000000fsg03 [21:22:35] etc. [21:22:43] That /is/the hostname [21:22:49] I know [21:22:50] So DaB added "Documentation for beginners and non-technicals." - doesn't labs already have docs for beginners? [21:22:54] Server won't let me login anyway [21:22:58] but it's not the "fancy hostname" [21:23:06] Damianz because something broke [21:23:13] I don't think non-technicals though... Sounds like a wontfix to me [21:23:15] Krenair: yes. it's not wonderful though [21:23:16] paravoid was fixing it [21:23:20] something with ldap [21:23:23] I think [21:23:33] which host can you not log into? [21:23:36] that's why u can't login [21:23:40] he can't login to nagios [21:23:41] I broke some group stuff recently [21:23:42] A server can have many 'fancy' hostnames though, it only has one f1dn [21:23:44] same problem I had [21:23:45] one sec [21:23:46] fqdn* [21:23:50] so dunno how that works [21:23:52] * Damianz shrug [21:24:05] http://lists.wikimedia.org/pipermail/toolserver-l/2012-December/005575.html [21:24:06] Damianz ok but why we can't use fancy hostname as internal name in nagios? [21:24:14] so in web interface you see these nice names as we used to [21:24:18] I see a project-nagios group [21:24:21] but damianz isn't in it [21:24:28] I was [21:24:31] @labs-project-users nagios [21:24:31] Following users are in this project (showing all 8 members): Dzahn, Ryan Lane, Lcarr, Novaadmin, Petrb, DamianZaremba, Faidon, Mwang, [21:24:36] he is [21:24:41] he's not a sysadmin [21:24:50] so he can't ssh? o.O [21:24:59] hm [21:25:04] I don't see the project in openstackmanager [21:25:04] why? [21:25:05] weird [21:25:17] hm? [21:25:22] I definitely saw the project the other day [21:25:41] yeah, it's in ldap [21:25:53] Problem is how you figure out which is the 'right' fancy name... since they are just stored as aditional records IIRC... need to run in debug to see really though. 
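The swap petan keeps asking for amounts to using the nice name as the Nagios host_name (which is what the web UI displays) and demoting the i-xxx instance ID to the alias field, with the raw IP as the check address (compare the "Moving to nice fqdn" and "IP not fqdn" changes merged later in this log). A sketch of the generated block, mirroring standard Nagios host syntax rather than the actual labs/nagios-builder output:

```python
HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {fqdn}
    alias      {instance_id}
    address    {ip}
}}
"""

def render_host(instance_id, fqdn, ip):
    # host_name drives what the Nagios UI shows; the instance ID survives
    # as the alias so it can still be searched for.
    return HOST_TEMPLATE.format(instance_id=instance_id, fqdn=fqdn, ip=ip)

# Values taken from the @labs-info output earlier in this log.
print(render_host("i-0000031a", "deployment-apache32.pmtpa.wmflabs",
                  "10.4.0.166"))
```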
[21:25:56] I see it in osm [21:26:12] Damianz the right fancy name is what you have as uname [21:26:12] ah, now I see it [21:26:15] in your script [21:26:19] let me re-run the mainteance script [21:26:40] variable $uname or whatever python has [21:27:23] Failed syncing members for project nagios and group project-nagios146 project groups were synced, 1 changed, 0 failed. [21:27:32] not sure what that means [21:27:35] but it's not working :D [21:28:14] I wonder why that is failing [21:28:19] now chanserv is gone :D [21:28:21] it's the only one failing, thankfully [21:28:26] everything crashes [21:28:28] :D [21:28:30] one sec, I'll manually sync it [21:28:55] hm [21:28:59] actually, it sync'd [21:29:04] mm [21:29:04] I wonder why it said it failed [21:29:09] So intancename could be used as alias... but since it's not unique cross region it could also be ruddy confusing.... what might be an idea is taking the instance name then using the associateddomain looking like /^instancename\.(pmtpa|equiad)\.wmflabs$/ I guess... I'll look when I can login [21:29:25] it is unique [21:29:35] you can't create 2 fancy same [21:29:40] RAWR [21:29:49] I *must* fix this session bug once and for all [21:29:53] it's driving me fucking insane [21:30:05] :D [21:30:11] I thought you could... [21:30:18] Ryan_Lane: ? [21:30:20] I thought you did [21:30:29] Ryan_Lane: note that I recreated the group by hand [21:30:31] well, the session bug is fixed [21:30:36] paravoid: just now? [21:30:42] no, the other day [21:30:43] ah [21:30:44] Ryan_Lane don't believe you :P [21:30:45] yeah [21:30:45] not sure how the script syncs them [21:30:49] it was missing some members [21:31:02] hm? [21:31:03] I re-ran the sync-script [21:31:06] now it's correct [21:31:12] ok [21:31:14] ah, okay [21:31:21] so Ryan_Lane is it possible to create 2 fancy hosts? [21:31:24] nscd cache needs to be purged on that instance, though [21:31:27] like bastion and bastion [21:31:34] what do you mean? [21:31:36] or it's unique name [21:31:44] if fancy name is unique or it's not [21:31:45] in eqiad and pmtpa? [21:31:50] unique [21:31:53] Can you create 2 servers with the same short name in 2 regions [21:31:53] in labs [21:31:54] ok [21:32:00] ah [21:32:03] in two regions? [21:32:04] yes [21:32:07] see [21:32:09] mm [21:32:10] so I'll do regex shit [21:32:13] brb [21:32:21] bastion.eqiad.wmflabs and bastion.pmtpa.wmflabs [21:32:23] ok can you replace I-blabla with fqdn? [21:32:29] Damianz [21:32:29] Change on 12mediawiki a page Developer access was modified, changed by Nemo bis link https://www.mediawiki.org/w/index.php?diff=618858 edit summary: I don't understand. If this page is useless, why leave all the text? The warning on it being obsolete is way less readable than the blinking wall of text below. [21:32:34] that would be still better [21:32:36] I'll figure out the right fqdn form instance name and use that [21:32:40] also [21:32:47] IF IT'S IN GIT, DON"T COPY IT TO 2 [21:32:55] ? [21:32:55] the end-goal is to completely kill off the use of i-xxx [21:33:06] ok [21:33:15] I'm sure that'll make paravoid happy :) [21:33:25] hehe [21:33:26] many people will be happy [21:33:36] maybe even wars will stop in world... [21:33:41] hahaha [21:33:43] who knows [21:33:52] I'm never happy so I'll ensure a balance is maintained [21:33:54] it's going to take some engineering effort to do that [21:34:03] but I think it's worth it [21:34:36] k [21:34:46] btw Damianz I enabled nlogin again [21:34:53] so, yeah, the session bug is fixed. 
now there's a problem in how I'm handling the keystone tokens [21:34:53] nlogin? [21:34:56] I needed it to stop nagios from spamming [21:35:00] that login in browser [21:35:04] so you can control nagios from web [21:35:12] nagios.wmflabs.org/nlogin [21:35:13] Just use the cgi stuff? [21:35:19] ? [21:35:24] this is cgi stuff [21:35:33] but you can't have anon and login in one url [21:35:50] nagios3 login you as guest with no password [21:35:55] nlogin ask you for password [21:35:56] ah I see [21:36:13] basically it's just creepy hack [21:36:14] in apache [21:36:42] hashar: chrismcmahon: did you see the post about a database error in beta? [21:36:59] Ryan_Lane: hop sorry. Maybe the db is dead ? :( [21:37:02] Ryan_Lane: I don't think so, looking [21:37:04] looks like an extension may be missing a table [21:37:22] echo [21:37:22] hashar did you run update.php? :D [21:37:33] ewww [21:37:40] we don't use update.php in production :) [21:37:44] I know [21:37:49] I meant its alternative [21:37:51] so we don't run update.php on beta :-) [21:37:53] ah [21:38:00] so beta breaks whenever someone submit a SQL change :-) [21:38:00] hmmm I allready have fqdn stuff in place it seems [21:38:10] this way we can detect such SQL change and make sure production will not have that issue [21:38:11] I don't remember name of it [21:38:12] then [21:38:21] we do not have any process to actually report the issue :) [21:38:24] nor to update it in beta [21:38:31] Damianz but wrong place! :P [21:38:35] so I end up running update.php from time to time to fix beta [21:38:39] not really [21:38:46] beta and db updates: https://bugzilla.wikimedia.org/show_bug.cgi?id=36228 [21:39:05] hashar: and we still don't have all the right contents in the db either. [21:40:48] fucking gluster [21:41:42] public datasets aren't accessible [21:42:06] Ceph! [21:44:42] chrismcmahon: what do you mean by "right contents" ? [21:45:13] PROBLEM host: analytics.pmtpa.wmflabs is DOWN address: analytics.pmtpa.wmflabs CRITICAL - Host Unreachable (analytics.pmtpa.wmflabs) [21:46:24] PROBLEM host: fawikitest.pmtpa.wmflabs is DOWN address: fawikitest.pmtpa.wmflabs PING CRITICAL - Packet loss = 100% [21:47:04] New patchset: DamianZaremba; "Moving to nice fqdn" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39686 [21:47:31] Damianz: cephfs isn't ready for use [21:47:42] Neither is gluster :P [21:48:00] New review: DamianZaremba; "Meeps" [labs/nagios-builder] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/39686 [21:48:16] New review: DamianZaremba; "Meep" [labs/nagios-builder] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/39686 [21:48:17] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39686 [21:48:30] Damianz: wanna get a pep8 run on labs/nagios-builder ? [21:48:33] Hippy now? [21:48:39] PROBLEM Current Load is now: CRITICAL on aggregator2.pmtpa.wmflabs aggregator2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:48:39] PROBLEM Disk Space is now: CRITICAL on conventionextension-trial.pmtpa.wmflabs conventionextension-trial.pmtpa.wmflabs output: DISK CRITICAL - free space: / 0 MB (0% inode=47%): [21:48:39] PROBLEM host: wlm-mysql-master.pmtpa.wmflabs is DOWN address: wlm-mysql-master.pmtpa.wmflabs PING CRITICAL - Packet loss = 100% [21:48:39] PROBLEM host: wlm-apache1.pmtpa.wmflabs is DOWN address: wlm-apache1.pmtpa.wmflabs PING CRITICAL - Packet loss = 100% [21:48:39] PROBLEM Current Load is now: CRITICAL on aggregator-test1.pmtpa.wmflabs aggregator-test1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:48:41] hashar: You know how to make a dude happy [21:48:49] PROBLEM Disk Space is now: CRITICAL on ipv6test1.pmtpa.wmflabs ipv6test1.pmtpa.wmflabs output: DISK CRITICAL - free space: / 0 MB (0% inode=53%): [21:48:49] PROBLEM Disk Space is now: CRITICAL on labs-nfs1.pmtpa.wmflabs labs-nfs1.pmtpa.wmflabs output: DISK CRITICAL - free space: /export 140 MB (0% inode=49%): /home 140 MB (0% inode=49%): /public/keys 140 MB (0% inode=49%): [21:48:49] PROBLEM Disk Space is now: CRITICAL on maps-test2.pmtpa.wmflabs maps-test2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:49] PROBLEM Disk Space is now: CRITICAL on mw1-21beta-lucid.pmtpa.wmflabs mw1-21beta-lucid.pmtpa.wmflabs output: DISK CRITICAL - free space: / 0 MB (0% inode=47%): [21:48:59] PROBLEM Disk Space is now: CRITICAL on ganglia-test2.pmtpa.wmflabs ganglia-test2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:48:59] PROBLEM Disk Space is now: WARNING on scribunto.pmtpa.wmflabs scribunto.pmtpa.wmflabs output: DISK WARNING - free space: / 525 MB (5% inode=81%): [21:49:03] Though tbf it has a make file already [21:49:09] PROBLEM Disk Space is now: CRITICAL on sube.pmtpa.wmflabs sube.pmtpa.wmflabs output: DISK CRITICAL - free space: / 0 MB (0% inode=38%): [21:49:09] PROBLEM Disk Space is now: CRITICAL on patchtest.pmtpa.wmflabs patchtest.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:09] PROBLEM Disk Space is now: CRITICAL on patchtest2.pmtpa.wmflabs patchtest2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:09] PROBLEM Disk Space is now: CRITICAL on testing-arky.pmtpa.wmflabs testing-arky.pmtpa.wmflabs output: DISK CRITICAL - free space: / 0 MB (0% inode=47%): [21:49:19] PROBLEM Current Users is now: CRITICAL on aggregator2.pmtpa.wmflabs aggregator2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:49:19] PROBLEM Free ram is now: WARNING on bots-sql2.pmtpa.wmflabs bots-sql2.pmtpa.wmflabs output: Warning: 15% free memory [21:49:29] PROBLEM Current Users is now: CRITICAL on aggregator-test1.pmtpa.wmflabs aggregator-test1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:29] PROBLEM Free ram is now: CRITICAL on maps-test2.pmtpa.wmflabs maps-test2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:49:39] PROBLEM Free ram is now: CRITICAL on ganglia-test2.pmtpa.wmflabs ganglia-test2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:44] * Damianz looks at nagios [21:49:49] PROBLEM Free ram is now: WARNING on swift-be1.pmtpa.wmflabs swift-be1.pmtpa.wmflabs output: Warning: 18% free memory [21:49:49] PROBLEM Free ram is now: CRITICAL on patchtest.pmtpa.wmflabs patchtest.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:49:49] PROBLEM Free ram is now: CRITICAL on patchtest2.pmtpa.wmflabs patchtest2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:51] hashar: update.php does not take care of db updates like this one: https://gerrit.wikimedia.org/r/#/c/23382/ [21:49:59] PROBLEM Disk Space is now: CRITICAL on aggregator2.pmtpa.wmflabs aggregator2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:50:00] (I don't think it does at least) [21:50:13] PROBLEM Disk Space is now: CRITICAL on aggregator-test1.pmtpa.wmflabs aggregator-test1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:50:13] PROBLEM host: utrsweb.pmtpa.wmflabs is DOWN address: utrsweb.pmtpa.wmflabs PING CRITICAL - Packet loss = 100% [21:50:23] PROBLEM SSH is now: CRITICAL on ganglia-test2.pmtpa.wmflabs ganglia-test2.pmtpa.wmflabs output: CRITICAL - Socket timeout after 10 seconds [21:50:43] PROBLEM Free ram is now: CRITICAL on aggregator2.pmtpa.wmflabs aggregator2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:50:53] PROBLEM Free ram is now: CRITICAL on aggregator-test1.pmtpa.wmflabs aggregator-test1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:03] PROBLEM Total processes is now: CRITICAL on kripke.pmtpa.wmflabs kripke.pmtpa.wmflabs output: PROCS CRITICAL: 238 processes [21:51:03] PROBLEM Total processes is now: CRITICAL on maps-test2.pmtpa.wmflabs maps-test2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:03] PROBLEM Total processes is now: CRITICAL on patchtest2.pmtpa.wmflabs patchtest2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:03] PROBLEM Total processes is now: CRITICAL on ganglia-test2.pmtpa.wmflabs ganglia-test2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:13] PROBLEM Total processes is now: CRITICAL on su-be1.pmtpa.wmflabs su-be1.pmtpa.wmflabs output: PROCS CRITICAL: 230 processes [21:51:13] PROBLEM Total processes is now: CRITICAL on su-be2.pmtpa.wmflabs su-be2.pmtpa.wmflabs output: PROCS CRITICAL: 236 processes [21:51:13] PROBLEM Total processes is now: CRITICAL on su-be3.pmtpa.wmflabs su-be3.pmtpa.wmflabs output: PROCS CRITICAL: 236 processes [21:51:13] PROBLEM Total processes is now: CRITICAL on swift-be2.pmtpa.wmflabs swift-be2.pmtpa.wmflabs output: PROCS CRITICAL: 208 processes [21:51:13] PROBLEM Total processes is now: CRITICAL on swift-be1.pmtpa.wmflabs swift-be1.pmtpa.wmflabs output: PROCS CRITICAL: 208 processes [21:51:14] PROBLEM Total processes is now: CRITICAL on swift-be3.pmtpa.wmflabs swift-be3.pmtpa.wmflabs output: PROCS CRITICAL: 207 processes [21:51:14] PROBLEM Total processes is now: CRITICAL on swift-be4.pmtpa.wmflabs swift-be4.pmtpa.wmflabs output: PROCS CRITICAL: 209 processes [21:51:15] PROBLEM Total processes is now: CRITICAL on patchtest.pmtpa.wmflabs patchtest.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:51:23] PROBLEM SSH is now: CRITICAL on aggregator2.pmtpa.wmflabs aggregator2.pmtpa.wmflabs output: Server answer: [21:51:33] PROBLEM dpkg-check is now: CRITICAL on centralauth-puppet.pmtpa.wmflabs centralauth-puppet.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:51:33] PROBLEM SSH is now: CRITICAL on aggregator-test1.pmtpa.wmflabs aggregator-test1.pmtpa.wmflabs output: CRITICAL - Socket timeout after 10 seconds [21:51:33] PROBLEM dpkg-check is now: CRITICAL on ee-prototype.pmtpa.wmflabs ee-prototype.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:51:33] PROBLEM dpkg-check is now: CRITICAL on gerrit-db.pmtpa.wmflabs gerrit-db.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:51:43] well, this looks fairly bad [21:51:43] PROBLEM dpkg-check is now: CRITICAL on integration-jenkins2.pmtpa.wmflabs integration-jenkins2.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:51:43] PROBLEM dpkg-check is now: CRITICAL on kubo.pmtpa.wmflabs kubo.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:51:43] PROBLEM dpkg-check is now: CRITICAL on maps-test2.pmtpa.wmflabs maps-test2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:43] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2.pmtpa.wmflabs ganglia-test2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:53] PROBLEM dpkg-check is now: CRITICAL on rds.pmtpa.wmflabs rds.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:51:53] PROBLEM dpkg-check is now: CRITICAL on sultest1.pmtpa.wmflabs sultest1.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:51:53] PROBLEM dpkg-check is now: CRITICAL on sultest2.pmtpa.wmflabs sultest2.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:52:03] PROBLEM dpkg-check is now: CRITICAL on patchtest.pmtpa.wmflabs patchtest.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:03] PROBLEM dpkg-check is now: CRITICAL on patchtest2.pmtpa.wmflabs patchtest2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:03] PROBLEM dpkg-check is now: CRITICAL on upload-wizard.pmtpa.wmflabs upload-wizard.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:52:03] PROBLEM dpkg-check is now: CRITICAL on worker1.pmtpa.wmflabs worker1.pmtpa.wmflabs output: DPKG CRITICAL dpkg reports broken packages [21:52:03] PROBLEM Total processes is now: CRITICAL on aggregator2.pmtpa.wmflabs aggregator2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:04] PROBLEM Total processes is now: CRITICAL on aggregator1.pmtpa.wmflabs aggregator1.pmtpa.wmflabs output: PROCS CRITICAL: 253 processes [21:52:13] PROBLEM Total processes is now: CRITICAL on aggregator-test1.pmtpa.wmflabs aggregator-test1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:15] New patchset: DamianZaremba; "IP not fqdn" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39688 [21:52:23] PROBLEM host: bugzillatesting.pmtpa.wmflabs is DOWN address: bugzillatesting.pmtpa.wmflabs PING CRITICAL - Packet loss = 100% [21:52:23] PROBLEM Current Load is now: CRITICAL on maps-test2.pmtpa.wmflabs maps-test2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:33] PROBLEM Current Load is now: CRITICAL on ganglia-test2.pmtpa.wmflabs ganglia-test2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:52:33] PROBLEM Current Load is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs parsoid-roundtrip4-8core.pmtpa.wmflabs output: WARNING - load average: 5.27, 5.17, 5.14 [21:52:33] PROBLEM Current Load is now: WARNING on parsoid-roundtrip5-8core.pmtpa.wmflabs parsoid-roundtrip5-8core.pmtpa.wmflabs output: WARNING - load average: 5.05, 5.07, 5.06 [21:52:42] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39688 [21:52:43] PROBLEM Current Load is now: CRITICAL on patchtest.pmtpa.wmflabs patchtest.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:43] PROBLEM Current Load is now: CRITICAL on patchtest2.pmtpa.wmflabs patchtest2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:53] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=618876 edit summary: /* Web */ [21:53:08] PROBLEM dpkg-check is now: CRITICAL on 10.4.0.193 aggregator2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:53:08] PROBLEM dpkg-check is now: CRITICAL on 10.4.0.192 aggregator-test1.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:53:08] PROBLEM Current Users is now: CRITICAL on 10.4.0.84 maps-test2.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:53:08] PROBLEM Current Users is now: CRITICAL on 10.4.0.157 ganglia-test2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:53:12] Damianz: https://gerrit.wikimedia.org/r/#/c/39690/ + https://gerrit.wikimedia.org/r/39689 :-D [21:53:14] merging that [21:53:27] it will break since it needs re-indenting [21:53:28] PROBLEM Current Users is now: CRITICAL on 10.4.0.69 patchtest.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. [21:53:28] PROBLEM Current Users is now: CRITICAL on 10.4.0.74 patchtest2.pmtpa.wmflabs output: CHECK_NRPE: Socket timeout after 10 seconds. 
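The alert flood above is the rename taking effect: every host flips from its i-xxx ID to a nice fqdn, and then to an IP address plus fqdn. The selection rule Damianz sketched at 21:29, matching the instance name against the entry's associatedDomain values, could look like this (his regex says "equiad"; the region is spelled eqiad):

```python
import re

def nice_fqdn(instance_name, associated_domains):
    # The "nice" name is the associatedDomain value of the form
    # <instance name>.<region>.wmflabs; everything else (the i-xxx
    # record, public DNS) is left alone.
    pattern = re.compile(
        r"^%s\.(pmtpa|eqiad)\.wmflabs$" % re.escape(instance_name))
    for domain in associated_domains:
        if pattern.match(domain):
            return domain
    return None

print(nice_fqdn("bastion1", ["i-000000ba.pmtpa.wmflabs",
                             "bastion1.pmtpa.wmflabs"]))
# -> bastion1.pmtpa.wmflabs
```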
[21:54:48] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by CBM link https://www.mediawiki.org/w/index.php?diff=618877 edit summary: /* Database */ [21:54:59] So my cleanup crap doesn't work so well damn [21:55:21] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by CBM link https://www.mediawiki.org/w/index.php?diff=618878 edit summary: /* Filesystem */ [21:56:36] New patchset: Hashar; "typos in README.md" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39691 [21:56:37] lets see if that lints [21:56:51] it won't [21:56:54] Damianz: need more pep8 fixes https://integration.mediawiki.org/ci/job/labs-nagios-builder-pep8/1/console ;) [21:57:01] I know [21:57:09] Damianz: whenever you are happy with it, I can make jenkins to vote verify -1 on lint failure [21:57:16] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by CBM link https://www.mediawiki.org/w/index.php?diff=618879 edit summary: /* End-user-support */ [21:57:22] we have pep8 v1.3.3 iirc [21:57:24] in a few min ;) [22:04:59] New patchset: DamianZaremba; "Fixing cleanup" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39692 [22:05:00] New patchset: DamianZaremba; "pep8" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39693 [22:05:18] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39692 [22:05:30] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39693 [22:06:29] Damianz: should I make it a voter (aka block on pep8 failure) ? [22:08:05] Damianz: pep8 jobs are not blocking by default. https://gerrit.wikimedia.org/r/39694 would make it block (vote v-1) [22:08:46] Change on 12mediawiki a page Wikimedia Labs was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=618884 edit summary: /* TODO */ whom to ask [22:08:50] Cool [22:08:59] I like yaml [22:09:18] New patchset: DamianZaremba; "MOAR COFFEE" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39697 [22:09:30] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39697 [22:10:47] hashar: You know, I can actually get rid of travis now :D [22:11:09] petan: How do you do a hostname search in here? [22:11:15] Change on 12mediawiki a page Wikimedia Labs was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=618886 edit summary: /* TODO */ request queue [22:15:14] PROBLEM host: analytics.pmtpa.wmflabs is DOWN address: 10.4.0.63 CRITICAL - Host Unreachable (10.4.0.63) [22:15:36] you lie, nagios. [22:15:36] http://analytics.wmflabs.org/ [22:16:12] analytics.pmtpa.wmflabs doesn't respond to ping [22:16:18] Damianz: we can talk about it in january. 
I am on vacation starting saturday :D [22:16:27] lucky :P [22:16:48] * Damianz goes to port his ignore list [22:17:43] PROBLEM host: fawikitest.pmtpa.wmflabs is DOWN address: 10.4.1.43 PING CRITICAL - Packet loss = 100% [22:18:43] PROBLEM host: wlm-mysql-master.pmtpa.wmflabs is DOWN address: 10.4.0.159 PING CRITICAL - Packet loss = 100% [22:18:43] PROBLEM host: wlm-apache1.pmtpa.wmflabs is DOWN address: 10.4.0.160 PING CRITICAL - Packet loss = 100% [22:20:17] New patchset: DamianZaremba; "Typo fix and porting ignore list" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39698 [22:20:24] PROBLEM host: utrsweb.pmtpa.wmflabs is DOWN address: 10.4.0.244 PING CRITICAL - Packet loss = 100% [22:20:40] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39698 [22:21:02] Damianz, petan: thanks for the work on nagios! [22:21:35] anyone have a good reason for me not to ignore bugzillatesting, fawikitest, home-migrate-lucid and nova-precise1? [22:21:45] nope [22:21:50] nova-precise1 is dead right now [22:21:54] some kernel issue [22:22:02] I really need to get that instance back up [22:22:09] or build a new one [22:22:23] PROBLEM host: bugzillatesting.pmtpa.wmflabs is DOWN address: 10.4.0.230 PING CRITICAL - Packet loss = 100% [22:22:29] maybe I should build a new one to ensure everything is puppetized properly :D [22:22:32] I had it working right up to the homedir reboot [22:22:35] New patchset: DamianZaremba; "Ignore all the things" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39699 [22:22:45] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39699 [22:22:48] yeah, no clue why the kernel has an issue [22:24:28] New patchset: DamianZaremba; "Fixing tests" [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39700 [22:24:39] Change merged: DamianZaremba; [labs/nagios-builder] (master) - https://gerrit.wikimedia.org/r/39700 [22:25:25] Could actually make jenkins do the python tests run and point it at ldap [22:26:45] Damianz: which tests? [22:26:52] mine? [22:27:03] oh, for nagios-builder? [22:27:11] yeah it has tests [22:27:13] they're crap [22:27:34] as tests tend to be :) [22:30:07] BOOM [22:30:08] see email [22:30:15] we need a bot that says the list has mail [22:30:21] cause I totally don't get enough notifications already [22:30:46] * Damianz waits for someone to point out it's broken and he's a douche [22:32:07] hahaha [22:34:02] Hmmm do I print off the 6 pages of insurance claim forms and fill them in now or do them in work time tomorrow... I think tomorrow since it looks boring as hell [22:34:42] paperwork sucks [22:35:17] indeed [22:35:48] It sucks really hard when it's christmas week and you need to get an insurance claim though and have your motorbike fixed :( [22:37:31] ugh, yeah [22:37:37] that's not going to be fun [22:38:03] hmm, i just added Rfaulk to project analytics [22:38:24] and then when i look at the diff in RC i see i supposedly also added Tomasz..but i didnt...hmm [22:38:31] https://labsconsole.wikimedia.org/w/index.php?title=Nova_Resource:Analytics&curid=1032&diff=9424&oldid=8526 [22:39:51] it didn't add it? [22:40:12] maybe he was already in it? [22:40:16] it did, but i believe i just added Rfaulk and not Tomasz [22:40:25] it's possible that the page update failed last time [22:40:28] and worked this time [22:40:28] yea..hmm.. just looked like i did it in RC [22:40:32] gotcha [22:41:24] heh.
"Number of virtual CPUs in use:" stat on the main page is very, very, wrong [22:41:25] !log analytics adding Rfaulk as member and sysadmin [22:41:26] Logged the message, Master [22:45:33] no bot currently that was talking about powering up new instances and creating home dirs? [22:49:01] mutante: what do you mean? [22:49:15] homedirs are no longer created by a bot [22:49:18] there used to be a bot that told me here when it was adding new users [22:49:22] we use pam_mkhomedir [22:49:26] and brought up new instances [22:49:38] I don't think we had one for new instances [22:49:40] RFaulkner sits next to me [22:49:41] I do want that, though [22:49:51] and i am trying to show stuff, how to bring up his instance and stuff [22:50:34] ah, I see [22:52:43] so i added him to bastion too to be able to ssh to his new instance [22:53:08] want those logged in bastion itself too or is RC enough [22:53:53] PROBLEM Current Load is now: CRITICAL on babbage0.pmtpa.wmflabs 10.4.1.51 output: Connection refused by host [22:54:33] PROBLEM Current Users is now: CRITICAL on babbage0.pmtpa.wmflabs 10.4.1.51 output: Connection refused by host [22:55:12] PROBLEM Disk Space is now: CRITICAL on babbage0.pmtpa.wmflabs 10.4.1.51 output: Connection refused by host [22:55:52] PROBLEM Free ram is now: CRITICAL on babbage0.pmtpa.wmflabs 10.4.1.51 output: Connection refused by host [22:56:15] mutante: you don't need to add people to bastion anymore [22:56:18] rfaulkner: <-- and that is your new instance there, already has monitoring before its up:) [22:56:27] Ryan_Lane: i also checked for the "shell" flag and he had it [22:56:30] oh [22:56:30] ok [22:56:31] i wasnt sure if i need both [22:56:36] then you did :) [22:56:55] there are some people who were given shell that weren't already in bastion [22:57:05] those people need to be manually added in [22:57:06] i think he was one of them [22:57:12] and should be good now [22:57:13] I should really compile a list and fix all of them [22:57:22] PROBLEM Total processes is now: CRITICAL on babbage0.pmtpa.wmflabs 10.4.1.51 output: Connection refused by host [22:57:39] you know what I need? a todo bot :) [22:57:50] @todo [22:57:57] and it would add it to the todo list. 
[22:58:12] hell, one for bugs would be even better [22:58:12] PROBLEM dpkg-check is now: CRITICAL on babbage0.pmtpa.wmflabs 10.4.1.51 output: Connection refused by host [22:58:49] @add-bug Infrastructure - fix bastion/shell users - blah blah blah blah [22:58:49] Invalid name [22:58:56] :D [22:59:32] RECOVERY Current Users is now: OK on babbage0.pmtpa.wmflabs 10.4.1.51 output: USERS OK - 0 users currently logged in [22:59:34] rfaulkner: and here's your monitoring deep link http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?host=babbage0.pmtpa.wmflabs [23:00:01] Ryan_Lane: you want to create bugzilla tickets or rt tickets?:) [23:00:07] bugzilla [23:00:09] or lines on labs wiki pages [23:00:13] labs uses bugzilla [23:00:13] RECOVERY Disk Space is now: OK on babbage0.pmtpa.wmflabs 10.4.1.51 output: DISK OK [23:00:21] we can ask andre probably to check that out [23:00:32] that would be nice :) [23:00:34] we already have open stuff to allow ticket creation via mail [23:00:45] and then of course people pointed out there is an API [23:00:53] RECOVERY Free ram is now: OK on babbage0.pmtpa.wmflabs 10.4.1.51 output: OK: 1676% free memory [23:02:23] RECOVERY Total processes is now: OK on babbage0.pmtpa.wmflabs 10.4.1.51 output: PROCS OK: 90 processes [23:03:13] RECOVERY dpkg-check is now: OK on babbage0.pmtpa.wmflabs 10.4.1.51 output: All packages OK [23:03:48] Login is broken on http://test.wikimedia.beta.wmflabs.org/ [23:03:53] RECOVERY Current Load is now: OK on babbage0.pmtpa.wmflabs 10.4.1.51 output: OK - load average: 0.02, 0.42, 0.42 [23:03:59] Needs Echo DB changes applied
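The @add-bug variant floated above has an obvious backend: Bugzilla's stock XML-RPC API and its Bug.create method, presumably what the 23:00 remark about "an API" refers to. A sketch; the product and component names and the per-call auth fields are assumptions, not the real Wikimedia Bugzilla configuration:

```python
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("https://bugzilla.wikimedia.org/xmlrpc.cgi")

def add_bug(summary, description, login, password):
    # Bug.create is part of Bugzilla's stock XML-RPC WebService API.
    return proxy.Bug.create({
        "product": "Wikimedia Labs",     # assumed product name
        "component": "Infrastructure",   # assumed component name
        "summary": summary,
        "description": description,
        "version": "unspecified",
        "op_sys": "All",
        "platform": "All",
        "Bugzilla_login": login,         # per-call auth, as Bugzilla 4.x allows
        "Bugzilla_password": password,
    })

bug = add_bug("fix bastion/shell users",
              "Compile a list of shell users missing from bastion "
              "and add them.",
              "bot@example.org", "secret")
print(bug["id"])
```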