[00:00:44] ah [00:00:46] here we go [00:00:49] it works now [00:00:53] I hate glusterfs [00:01:36] it seems the gluster processes were down for that mount on two of the servers [00:02:01] lol [00:03:15] thankfully it was only that one volume [00:03:18] I checked the rest [00:03:24] gluster volume status | grep 'Brick' | grep 'N' [00:04:14] ori-l: works now? [00:04:55] ori-l: and btw, I think kubo is fine, it was just gluster being stupid [00:07:17] bleh. I forgot the -c option to qemu-img to compress the roundtrip instance's disk [00:14:02] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [00:29:33] PROBLEM Free ram is now: CRITICAL on dumps-bot3.pmtpa.wmflabs 10.4.0.118 output: Critical: 5% free memory [00:31:32] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: WARNING - load average: 4.92, 5.15, 5.06 [00:36:53] PROBLEM Free ram is now: WARNING on bots-3.pmtpa.wmflabs 10.4.0.59 output: Warning: 15% free memory [00:41:23] 12/29/2012 - 00:41:23 - Creating a home directory for smccandlish at /export/keys/smccandlish [00:41:34] can we not kill that bot yet [00:43:36] heh [00:44:03] oh good, SMcCandlish is making a Labs account! [00:44:03] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [00:44:13] can probably kill that bot, yeah [00:44:17] we need to run it on the new host [00:44:25] there's still a bot that creates keys [00:47:11] 12/29/2012 - 00:47:11 - Updating keys for smccandlish at /export/keys/smccandlish [00:47:56] keys are harder though [00:48:02] stupid opensshd [00:51:52] PROBLEM Free ram is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: Critical: 3% free memory [00:52:34] you know what you really should do? [00:52:44] make labsconsole talk to salt, make salt dump the key [00:52:48] insant and KILL BOTTAGE [00:53:23] Damianz: that would be ideal. yes [00:54:05] we need salt-api [00:54:57] I need to figure out talking to salt from djano in a non-blocking way heh [00:56:53] RECOVERY Free ram is now: OK on bots-3.pmtpa.wmflabs 10.4.0.59 output: OK: 114% free memory [00:57:36] if you had salt-api then doing stuff as the user integrated into osm just got easy heh [00:58:00] could totally write a php client though ;P [00:58:00] yes [00:58:02] that's the idea [00:58:15] yeah, could write a php client, but I'd prefer not to [00:58:24] I'd like to have keystone integration with salt-api [00:59:12] Personally I'm just going to pipe everything though a rest api I control so I'm not too fussed about salt auth... anything automagic can be minion delegated access [00:59:37] what do you mean? [00:59:48] you're going to write your own rest api for it? [01:00:21] grrrrr…. Ryan_Lane, do you mind logging into nova-precise2 and telling me what I've screwed up with my paths/aliases/etc? I predict it will take you about 45 seconds to diagnose. [01:00:31] sure :) [01:02:17] Well I'm going to expose an api for the asset management thing I've half written, so that will just 'proxy' to salt in the same way their api works.... 
it's all abstracted ish so if you do like power off it will figure out of it's a vmware guest, ec2 instance etc, tied together with enc glue, pxe installs, ad dns configs and soon salt magic [01:02:20] I think this is a matter of missing something php related [01:02:51] andrewbogott: libapache2-mod-php5 [01:02:58] Ryan_Lane, I thought that too, but… if I comment out wgarticlepath then… https://nova-precise2.pmtpa.wmflabs/w/index.php?title=Main_Page [01:02:59] it works [01:03:05] Which convinced me that php was functioning [01:03:30] hm [01:03:32] interesting [01:03:41] I'd still try installing that package [01:03:46] and see if that fixes it [01:03:49] yep, ok [01:05:17] Ryan_Lane: Yeah, that seems to help. [01:05:23] heh [01:05:28] annoying, right [01:05:28] ? [01:05:31] But, crap, I still feel like my evidence for php being fine is convincing [01:06:23] Once I get OSM working properly will the sidebar magically appear, or do I need to do something specific for that? [01:07:22] nope. you need to modify the sidebar config [01:07:34] which is in mediawiki's MediaWiki namespace [01:07:52] https://labsconsole.wikimedia.org/wiki/MediaWiki:Sidebar [01:08:15] ok -- that explains why I couldn't find it in the config. [01:08:16] thanks [01:09:35] also... [01:09:35] https://labsconsole.wikimedia.org/wiki/MediaWiki:Sidebar/Group:sysadmin [01:09:43] https://labsconsole.wikimedia.org/wiki/MediaWiki:Sidebar/Group:cloudadmin [01:09:53] https://labsconsole.wikimedia.org/wiki/MediaWiki:Sidebar/Group:netadmin [01:09:53] PROBLEM Total processes is now: WARNING on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS WARNING: 179 processes [01:10:02] https://labsconsole.wikimedia.org/wiki/MediaWiki:Sidebar/Group:user [01:14:52] RECOVERY Total processes is now: OK on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS OK: 99 processes [01:15:32] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [01:16:33] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: OK - load average: 4.66, 4.81, 4.94 [01:24:34] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: WARNING - load average: 4.93, 5.17, 5.09 [01:36:38] 12/29/2012 - 01:36:37 - Updating keys for saper at /export/keys/saper [01:45:33] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [02:15:34] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [02:37:32] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: WARNING - load average: 4.89, 4.95, 5.01 [02:48:02] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [02:56:40] 12/29/2012 - 02:56:40 - Updating keys for gifti at /export/keys/gifti [03:02:32] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: OK - load average: 4.85, 4.92, 4.99 [03:18:02] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [03:32:32] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: WARNING - load average: 5.34, 5.21, 5.15 [03:36:42] PROBLEM Free ram is now: UNKNOWN on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: NRPE: Call to fork() failed [03:41:22] PROBLEM Current Users is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 
output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:41:42] PROBLEM Free ram is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:42:02] PROBLEM Disk Space is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:42:02] PROBLEM dpkg-check is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:43:23] PROBLEM Total processes is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:44:32] PROBLEM SSH is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: Server answer: [03:44:42] PROBLEM Current Load is now: CRITICAL on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:46:43] RECOVERY Free ram is now: OK on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: OK: 33% free memory [03:47:02] RECOVERY Disk Space is now: OK on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: DISK OK [03:47:03] RECOVERY dpkg-check is now: OK on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: All packages OK [03:48:02] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [03:48:22] RECOVERY Total processes is now: OK on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: PROCS OK: 120 processes [03:49:32] RECOVERY SSH is now: OK on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:49:42] RECOVERY Current Load is now: OK on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: OK - load average: 0.16, 1.14, 1.20 [03:51:22] RECOVERY Current Users is now: OK on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: USERS OK - 0 users currently logged in [03:57:03] RECOVERY Free ram is now: OK on patchtest.pmtpa.wmflabs 10.4.0.69 output: OK: 411% free memory [03:57:03] RECOVERY dpkg-check is now: OK on patchtest.pmtpa.wmflabs 10.4.0.69 output: All packages OK [03:57:03] RECOVERY dpkg-check is now: OK on patchtest2.pmtpa.wmflabs 10.4.0.74 output: All packages OK [03:57:03] RECOVERY Free ram is now: OK on patchtest2.pmtpa.wmflabs 10.4.0.74 output: OK: 479% free memory [03:57:43] RECOVERY Total processes is now: OK on patchtest.pmtpa.wmflabs 10.4.0.69 output: PROCS OK: 85 processes [03:58:14] RECOVERY Current Users is now: OK on patchtest.pmtpa.wmflabs 10.4.0.69 output: USERS OK - 0 users currently logged in [03:58:14] RECOVERY Current Users is now: OK on patchtest2.pmtpa.wmflabs 10.4.0.74 output: USERS OK - 0 users currently logged in [03:58:33] RECOVERY Current Load is now: OK on patchtest.pmtpa.wmflabs 10.4.0.69 output: OK - load average: 0.00, 0.01, 0.00 [03:58:54] RECOVERY Total processes is now: OK on patchtest2.pmtpa.wmflabs 10.4.0.74 output: PROCS OK: 83 processes [03:59:03] RECOVERY Disk Space is now: OK on patchtest.pmtpa.wmflabs 10.4.0.69 output: DISK OK [03:59:04] RECOVERY Disk Space is now: OK on patchtest2.pmtpa.wmflabs 10.4.0.74 output: DISK OK [03:59:13] RECOVERY Current Load is now: OK on patchtest2.pmtpa.wmflabs 10.4.0.74 output: OK - load average: 0.00, 0.01, 0.00 [04:05:02] PROBLEM dpkg-check is now: CRITICAL on patchtest.pmtpa.wmflabs 10.4.0.69 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:05:03] PROBLEM dpkg-check is now: CRITICAL on patchtest2.pmtpa.wmflabs 10.4.0.74 output: CHECK_NRPE: Socket timeout after 10 seconds. 
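A note on the brick check quoted at 00:03 above: `gluster volume status` prints an Online column per brick, so grepping for 'N' is a quick way to spot dead brick processes. A fuller version of that triage might look like the sketch below; the volume name is a placeholder, and the `start ... force` / self-heal steps are assumptions about how this cluster is managed (and about the gluster version), not something stated in the log.

    # list bricks whose Online column is N (the check quoted at 00:03)
    gluster volume status | grep 'Brick' | grep 'N'
    # inspect the affected volume in detail (volume name is a placeholder)
    gluster volume status some-volume detail
    # respawn only the dead brick processes, leaving healthy ones alone
    gluster volume start some-volume force
    # then let self-heal catch the restarted bricks up (gluster >= 3.3)
    gluster volume heal some-volume info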
[04:05:03] PROBLEM Free ram is now: CRITICAL on patchtest2.pmtpa.wmflabs 10.4.0.74 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:05:13] PROBLEM Free ram is now: CRITICAL on patchtest.pmtpa.wmflabs 10.4.0.69 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:05:53] PROBLEM Total processes is now: CRITICAL on patchtest.pmtpa.wmflabs 10.4.0.69 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:06:23] PROBLEM Current Users is now: CRITICAL on patchtest.pmtpa.wmflabs 10.4.0.69 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:06:23] PROBLEM Current Users is now: CRITICAL on patchtest2.pmtpa.wmflabs 10.4.0.74 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:06:43] PROBLEM Current Load is now: CRITICAL on patchtest.pmtpa.wmflabs 10.4.0.69 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:07:03] PROBLEM Disk Space is now: CRITICAL on patchtest.pmtpa.wmflabs 10.4.0.69 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:07:04] PROBLEM Total processes is now: CRITICAL on patchtest2.pmtpa.wmflabs 10.4.0.74 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:07:13] PROBLEM Disk Space is now: CRITICAL on patchtest2.pmtpa.wmflabs 10.4.0.74 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:07:23] PROBLEM Current Load is now: CRITICAL on patchtest2.pmtpa.wmflabs 10.4.0.74 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:18:02] PROBLEM host: ve-roundtrip2.pmtpa.wmflabs is DOWN address: 10.4.0.162 CRITICAL - Host Unreachable (10.4.0.162) [04:20:32] RECOVERY host: ve-roundtrip2.pmtpa.wmflabs is UP address: 10.4.0.162 PING OK - Packet loss = 0%, RTA = 0.67 ms [04:26:28] 12/29/2012 - 04:26:27 - Updating keys for smccandlish at /export/keys/smccandlish [04:31:11] 12/29/2012 - 04:31:10 - Updating keys for smccandlish at /export/keys/smccandlish [04:35:55] 12/29/2012 - 04:35:54 - Updating keys for smccandlish at /export/keys/smccandlish [04:41:31] 12/29/2012 - 04:41:31 - Updating keys for smccandlish at /export/keys/smccandlish [04:47:08] 12/29/2012 - 04:47:07 - Updating keys for smccandlish at /export/keys/smccandlish [04:47:32] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: OK - load average: 4.58, 4.80, 4.94 [04:51:43] 12/29/2012 - 04:51:43 - Updating keys for smccandlish at /export/keys/smccandlish [04:55:32] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: WARNING - load average: 5.45, 5.27, 5.10 [04:56:47] 12/29/2012 - 04:56:46 - Updating keys for smccandlish at /export/keys/smccandlish [05:05:33] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: OK - load average: 4.71, 4.88, 4.99 [05:09:48] Anyone home? [05:11:49] Gerrit won't let me log in. [05:12:17] I did logout of labsconsole and back in. [05:13:34] Hello, hello? [05:15:46] Any Gerrit admins around? [05:21:54] 12/29/2012 - 05:21:53 - Updating keys for smccandlish at /export/keys/smccandlish [05:22:50] Thanks. [05:24:46] It still always says "Incorrect username or password." I've tried smccandlish, Smccandlish, SMcCandlish and my shell ID, mech, and they all fail. [05:26:14] 12/29/2012 - 05:26:13 - Updating keys for smccandlish at /export/keys/smccandlish [05:32:45] Still doesn't work. I'll try again tomorrow. [05:34:00] I can't log into Gerrit, either. [05:36:34] https://dl.dropbox.com/u/11458013/Screenshot%20from%202012-12-29%2000%3A31%3A02.png [05:37:02] Looks like a different error than SMcCandlish was having. 
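The CHECK_NRPE noise above ("Could not complete SSL handshake", "Socket timeout", "Call to fork() failed") almost always means the nrpe agent on the instance is dead, blocking, or starved for resources rather than a real service failure. A minimal triage sketch, using an instance IP taken from the log; paths and service name are the stock Ubuntu ones and should be treated as assumptions about this setup:

    # from the monitoring host: can we reach the agent at all?
    /usr/lib/nagios/plugins/check_nrpe -H 10.4.0.69
    # on the instance itself:
    service nagios-nrpe-server status        # is the daemon even running?
    grep allowed_hosts /etc/nagios/nrpe.cfg  # must include the monitoring host's IP
    free -m; uptime                          # fork() failures often mean the box is out of memory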
[06:21:44] So I can log into kubo now, but not Gerrit. Zero-sum accessibility? :) [06:22:13] well, I fixed kubo for you earlier ;) [06:22:25] I think you're going to need ^demon for gerrit [06:22:52] franny: try to log in for me? [06:22:56] I'm tailing the logs [06:23:45] Just tried again, Ryan_Lane. [06:23:57] and the world imploded [06:25:13] -_- [06:25:19] it's trying to look you up by fran [06:25:22] for some reason [06:25:33] That's my shell account name. [06:25:45] Caused by: java.sql.BatchUpdateException: Duplicate entry 'username:fran' for key 'PRIMARY' [06:25:48] hooray [06:25:58] But "fran" gives me "incorrect username or password." [06:25:59] did you initially log in with an email address? [06:26:09] Not that I'm aware. [06:26:15] did you ever log in with fran? [06:26:26] I think I might have. [06:26:31] that's the problem [06:26:41] Why would it not work anymore? [06:26:43] you were renamed after you had logged in with fran [06:26:51] By whom? [06:26:53] gerrit doesn't handle that well [06:27:04] no clue [06:27:17] you know what [06:27:30] you could write a web interface for renaming and use salt to do remote app changes [06:27:41] could. yes. [06:27:42] * Damianz finds salty uses for all Ryan's problems :P [06:27:52] this again assumes salt-api [06:28:05] franny: I think you're going to need ^demon to solve this problem [06:28:11] shame I dislike the code :P [06:28:18] though you know what they say; fuck it, ship it [06:28:36] which code? salt-api? [06:28:44] All code :P [06:28:48] it's considered alpha right now, help them write it ;) [06:28:49] ah [06:28:49] heh [06:29:02] franny: you seem to have two accounts [06:29:02] I don't like the way they've done modules... but I might [06:29:06] in labsconsole [06:29:09] Would be interesting to make ESSO work with it [06:29:12] Fran and https://labsconsole.wikimedia.org/w/index.php?title=User:Fran_McCrory&action=edit&redlink=1 [06:29:14] 'Aborted' < The most helpful error ever seen from an app [06:29:49] so, yeah, indeed you were renamed at some point [06:29:54] PROBLEM Total processes is now: WARNING on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS WARNING: 156 processes [06:30:01] did you have a svn account? [06:30:08] I remember someone saying about svn not working [06:30:12] it's possible this happened when the real name bug was on labsconsole [06:30:24] Huh. [06:30:36] (if you set your real name in labsconsole, it would rename your account) [06:30:46] I fixed that at some point last week [06:30:51] ROFL [06:31:02] That was probably it. [06:31:05] how does it even have ad access to do that [06:31:05] likely [06:31:06] -.- [06:31:23] Well, "Fran McCrory" is consistent with my username on everything else Wikimedia. 
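For context, the "Duplicate entry 'username:fran'" stack trace at 06:25 is Gerrit tripping over a stale entry in its external-IDs table after the account rename. A look-but-don't-touch sketch of how an admin could confirm that is below; the database name, credentials, and schema are assumptions based on the Gerrit 2.x-era ReviewDB layout, and any actual cleanup is ^demon's call, as noted in the log.

    # which Gerrit accounts still claim the old and new usernames?
    # ('reviewdb' is an assumed database name; credentials omitted)
    mysql reviewdb -e "
      SELECT account_id, external_id, email_address
      FROM   account_external_ids
      WHERE  external_id IN ('username:fran', 'username:franny', 'gerrit:fran');"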
[06:31:42] Damianz: because real name is hardcoded in the ldap plugin to be cn [06:31:52] and I fixed mediawiki core for updating preferences [06:31:53] PROBLEM Total processes is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS WARNING: 151 processes [06:31:57] we use cn for the username [06:32:00] so, yeah, fail [06:32:05] wtf [06:32:11] so many attrubites to choose from [06:32:29] guess it's not sAMAccountName at least [06:32:32] to be fair, the ldap extension is really old and preference syncing is one of the earlier features ;) [06:33:10] I don't like to make backwards incompatible changes because it makes support a nightmare [06:33:13] so I never changed this [06:33:23] I'll be making that backwards incompatible change soon :) [06:33:45] (it'll be configurable, so you can set it to any attribute you want) [06:33:56] if $env == 'labs' # sekrit code [06:46:52] RECOVERY Total processes is now: OK on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS OK: 146 processes [06:54:52] RECOVERY Total processes is now: OK on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS OK: 150 processes [07:14:32] PROBLEM Current Load is now: WARNING on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: WARNING - load average: 8.85, 6.26, 5.45 [07:54:33] RECOVERY Free ram is now: OK on dumps-bot3.pmtpa.wmflabs 10.4.0.118 output: OK: 32% free memory [08:30:22] PROBLEM Free ram is now: CRITICAL on dumps-bot2.pmtpa.wmflabs 10.4.0.60 output: Critical: 5% free memory [12:07:12] PROBLEM Total processes is now: WARNING on bots-3.pmtpa.wmflabs 10.4.0.59 output: PROCS WARNING: 151 processes [12:09:52] PROBLEM Free ram is now: WARNING on bots-3.pmtpa.wmflabs 10.4.0.59 output: Warning: 8% free memory [12:14:53] PROBLEM Free ram is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: Critical: 3% free memory [12:17:12] RECOVERY Total processes is now: OK on bots-3.pmtpa.wmflabs 10.4.0.59 output: PROCS OK: 150 processes [12:29:03] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [12:59:52] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [13:29:53] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [13:56:02] * Beetstra looks sadly at bots-3 [13:56:20] What has changed there? Why do my bots suddenly bring that box down so fast ... ? 
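A quick illustration of the cn problem described at 06:31: the wiki username and the "real name" preference are both backed by the same LDAP attribute, so syncing one rewrites the other. The sketch below only reads from the directory; the server, base DN and entry are placeholders, not the real labs LDAP layout.

    # what the extension sees: cn doubles as wiki username and real name
    ldapsearch -x -H ldap://ldap.example.org \
      -b "ou=people,dc=example,dc=org" "(cn=Fran McCrory)" cn sn uid mail
    # the fix described above is making the real-name attribute configurable
    # (e.g. displayName) instead of hardcoding it to cn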
[13:58:48] afaik it was rebooted [13:59:53] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [14:03:02] giftpflanze - I see complaints of low memory above [14:03:23] well, i can't tell you why :) [14:03:58] * Beetstra tickles petan with a USB3-cable [14:29:53] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [15:00:13] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [15:30:12] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [15:35:23] RECOVERY Free ram is now: OK on dumps-bot2.pmtpa.wmflabs 10.4.0.60 output: OK: 32% free memory [16:00:12] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [16:30:12] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [17:00:13] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [17:30:32] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [17:49:42] PROBLEM Free ram is now: WARNING on dumps-bot1.pmtpa.wmflabs 10.4.0.4 output: Warning: 19% free memory [18:00:32] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [18:30:33] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [19:00:43] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [19:30:53] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [20:02:12] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [20:32:12] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [21:02:13] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [21:32:14] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [21:51:41] 12/29/2012 - 21:51:41 - Updating keys for mwang at /export/keys/mwang [22:02:22] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [22:21:29] @labs-info bots-3 [22:21:29] [Name bots-3 doesn't exist but resolves to I-000000e5] I-000000e5 is Nova Instance with name: bots-3, host: virt8, IP: 10.4.0.59 of type: m1.small, with number of CPUs: 1, RAM of this size: 2048M, member of project: bots, size of storage: 30 and with image ID: lucid-server-cloudimg-amd64.img [22:21:42] @labs-info bots-nr1 [22:21:42] [Name bots-nr1 doesn't exist but resolves to I-0000049e] I-0000049e is Nova Instance with name: bots-nr1, host: virt5, IP: 10.4.1.2 of type: m1.small, with number of CPUs: 1, RAM of this size: 2048M, member of project: bots, size of storage: 30 and with image ID: ubuntu-12.04-precise [22:22:10] @labs-project-users bots [22:22:10] Following users are in this project (displaying 19 of 40 total): Addshore, Alejrb, Andrew Bogott, Aude, Beetstra, DamianZaremba, DeltaQuad, Dzahn, Fastily, Hashar, Hydriz, Hyperon, Jasonspriggs, Jeremyb, Johnduhart, Kaldari, Krinkle, Ryan Lane, Lcarr, [22:22:34] * jeremyb spies a european [22:22:38] :o [22:24:11] @labs-info bots-2 [22:24:11] [Name bots-2 doesn't exist but resolves to I-0000009c] I-0000009c is Nova Instance with 
name: bots-2, host: virt6, IP: 10.4.0.42 of type: m1.small, with number of CPUs: 1, RAM of this size: 2048M, member of project: bots, size of storage: 30 and with image ID: lucid-server-cloudimg-amd64.img [22:24:13] @labs-info bots-4 [22:24:14] [Name bots-4 doesn't exist but resolves to I-000000e8] I-000000e8 is Nova Instance with name: bots-4, host: virt6, IP: 10.4.0.64 of type: m1.small, with number of CPUs: 1, RAM of this size: 2048M, member of project: bots, size of storage: 30 and with image ID: lucid-server-cloudimg-amd64.img [22:24:25] interesting [22:24:33] talking to a bot? [22:24:37] no [22:24:41] checking images [22:24:45] we use precise on nr1 [22:24:49] didn't even notice [22:24:59] images of a bot? ew! [22:25:05] of ubuntu [22:25:06] :P [22:25:39] you even named it???? ok ok, i will stop [22:28:33] petan: where is the user directory stuff configured on bots-apache01? [22:29:12] Damianz: also, you set up cluebot incorrectly on bots-apache01 ;) [22:29:13] /etc/apache2/mods-enabled/ [22:29:22] I think there [22:29:25] not sure [22:29:42] don't really care :P [22:29:48] someone broke it, I hacky fixed it [22:29:50] Damianz: you added cluebot into sites-enabled. it should be in sites-available with a link to enabled ;) [22:30:02] * petan slaps Damianz for being hacky [22:30:06] if people didn't borke things it wouldn't be a mess [22:30:25] heh [22:30:29] it would be no fun [22:30:43] no ops would be needed [22:30:49] to fix it [22:30:49] than they wouldn't be people [22:31:22] ops should make structured, tested changes in a documented workflow [22:31:25] otherwise it's mayhem [22:32:09] reminds me I wanted to make a documentation for wm-bot [22:32:23] PROBLEM host: bots-3.pmtpa.wmflabs is DOWN address: 10.4.0.59 CRITICAL - Host Unreachable (10.4.0.59) [22:32:38] wut [22:32:42] bots-3 down? [22:32:54] weren't bots there [22:33:50] petan: /data/project/petrb/logs doesn't exist [22:33:58] re: https://bugzilla.wikimedia.org/show_bug.cgi?id=42578 [22:34:16] but it should be /data/project/public_html or not? [22:34:23] ah [22:34:24] right [22:34:24] sorry [22:35:02] OOM freaking python [22:35:03] that directory isn't owned by ou [22:35:04] *you [22:35:13] um, who owns it? [22:35:17] no one [22:35:17] root? [22:35:19] 1002 [22:35:21] aha [22:35:23] weird [22:35:28] that's why it's being denied [22:35:29] why is that problem for apache? [22:35:33] oh [22:35:36] ok [22:35:48] for userdir the owner must match [22:36:12] I think it's possible to disable that [22:36:15] !logs bots bots-3 died OOM because of some python [22:36:26] !log bots bots-3 died OOM because of some python [22:36:27] but it's likely best to not change that [22:36:28] Logged the message, Master [22:36:39] ok, cool [22:36:49] !logs alias htmllogs [22:36:50] Created new alias for this key [22:36:53] !logs [22:36:53] experimental: http://bots.wmflabs.org/~wm-bot/html/%23wikimedia-labs [22:36:53] can you also fix the directory permissions to not be 777? [22:37:05] which one [22:37:10] public_html wasn't [22:37:13] db and logs [22:37:15] when it was on nfs [22:37:19] and wm-bot [22:37:22] ah [22:37:38] that one I likely can do that but problem is that there is no wm-bot user in ldap [22:37:44] maybe that was 1002 user... [22:37:48] ah [22:37:54] wm-bot is running using own unix user [22:37:57] petan: add it as a user via labsconsole [22:38:05] how? 
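The vhost nit at 22:29 ("it should be in sites-available with a link to enabled") is the standard Debian Apache layout, where a2ensite manages the symlink. A sketch of that plus the userdir ownership fix discussed afterwards; the cluebot file name and the petrb paths come from the log, but the exact commands are illustrative rather than a record of what was run:

    # move the vhost to sites-available and let a2ensite create the symlink
    sudo mv /etc/apache2/sites-enabled/cluebot /etc/apache2/sites-available/cluebot
    sudo a2ensite cluebot
    sudo apache2ctl configtest && sudo service apache2 reload

    # userdir was refusing the directory because it was owned by uid 1002
    ls -ld /data/project/petrb/public_html
    sudo chown -R petrb /data/project/petrb/public_html   # once the matching account exists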
[22:38:09] create an account [22:38:14] it's open registration [22:38:16] oh [22:38:22] ok [22:38:32] this is one of the benefits of open registration, you can make system accounts now :) [22:38:33] but can I change default home? [22:38:38] like in passwd [22:38:45] not that it would matter [22:38:46] hm [22:38:52] I could [22:38:53] but I don't like services to have /home [22:39:05] right [22:39:38] ok anyway keep the rights 777 until I do that otherwise logging won't work [22:39:43] * Ryan_Lane nods [22:40:09] btw, the user doesn't need to be added to any projects or anything like that [22:40:11] just needs to exist [22:40:24] aha ok [22:40:37] project membership is only needed to ssh in [22:40:54] it's to allow cross-project service accounts [22:41:28] ok [22:42:42] another place puppet sucks in our usage and making things re-usable is impossible [22:42:57] Damianz: what do you mean? [22:43:14] puppet service accounts in ldap rather than declaring local uids [22:43:22] we could do that too [22:43:32] as long as they are marked as system users [22:43:39] so that the uid range doesn't conflict [22:45:17] Creating directory '/home/wmib'. [22:45:18] Unable to create and initialize directory '/home/wmib'. [22:45:27] it needs to be member of project :P [22:45:52] Failed to add Wmib to bots. This needs the "loginviashell" right. [22:46:02] RECOVERY host: bots-3.pmtpa.wmflabs is UP address: 10.4.0.59 PING OK - Packet loss = 0%, RTA = 0.69 ms [22:48:29] that's because /home/ is fake [22:49:52] RECOVERY Free ram is now: OK on bots-3.pmtpa.wmflabs 10.4.0.59 output: OK: 874% free memory [22:52:36] Damianz I know but how could I create a home for a user on project storage? [22:52:41] that's what bot is doing or not? [22:53:18] does it need a home? you could just reference everything to the username in ldap and it will work cross instance [22:53:44] of course it doesn't need a home [22:54:07] but user without proper $HOME might not work ok [22:54:17] petan: it does? [22:54:22] even system users in system have some home specified [22:54:31] /home isn't fake [22:54:35] -.- [22:54:43] the dir isn't created unless you're a project member [22:54:50] can't be a member as it doesn't have shell rights [22:54:54] pam_mkhomedir makes it [22:54:58] yes [22:55:01] it's just a directory [22:55:09] Unable to create and initialize directory '/home/wmib'. [22:55:10] ah [22:55:10] I see [22:55:12] yeah but you can't ssh in for pam to make it [22:55:13] I know wy [22:55:14] *why [22:55:16] so bleh [22:55:28] labs-nfs1:/export/home/bots on /home type nfs (rw,addr=10.4.0.13) [22:55:45] the instance needs to be rebooted [22:55:52] the nfs mounts are read-only [22:55:57] um [22:56:01] ok let's try another one [22:56:10] we switched to glusterfs [22:56:24] is it ok for me to reboot bots-apache01? 
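Context for the "Unable to create and initialize directory '/home/wmib'" error above: two things were in play. pam_mkhomedir only creates home directories at login time (and a service account without shell rights never logs in), and /home on this instance was still the old labs-nfs1 export, which had gone read-only, hence the reboot onto gluster below. A small sketch of the checks implied here, to be run as root; the test filename is arbitrary:

    grep pam_mkhomedir /etc/pam.d/common-session     # confirm what actually creates homes
    mount | grep ' /home '                           # labs-nfs1 vs. glusterfs, as quoted at 22:55
    touch /home/.rw-test && rm /home/.rw-test        # fails with a read-only error if the export is ro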
[22:56:25] ur right [22:56:35] yes it's ok for ME [22:56:38] :D [22:56:41] dunno if it's ok for others [22:56:43] :P [22:56:49] well, fuck em :) [22:56:53] hehe [22:57:06] * Ryan_Lane really hopes it doesn't have some problem [22:57:16] it would be lame if I rebooted it and it didn't come back up :D [22:57:41] yes it would be but it shouldn't be big problem, apache has low priority on bots :P [22:57:47] heh [22:57:47] application servers are important here [22:57:50] * Ryan_Lane nods [22:58:13] @labs-info apache01 [22:58:13] I don't know this instance, sorry, try browsing the list by hand, but I can guarantee there is no such instance matching this name, host or Nova ID unless it was created less than 7 seconds ago [22:58:19] @labs-info bots-apache01 [22:58:20] [Name bots-apache01 doesn't exist but resolves to I-000004fc] I-000004fc is Nova Instance with name: bots-apache01, host: virt6, IP: 10.4.0.141 of type: m1.medium, with number of CPUs: 2, RAM of this size: 4096M, member of project: bots, size of storage: 50 and with image ID: ubuntu-12.04-precise [22:58:47] can I tell you how useful that bot is? [22:59:11] depends :D [22:59:20] I'm serious. it's very useful [22:59:26] heh I hope [22:59:29] It's only useful because your interface is shit [22:59:35] heh [22:59:37] lol [22:59:52] bleh. it isn't rebooting [23:00:01] it's stuck [23:00:01] reason why I invented this thing was to be able to do [23:00:04] @labs-info bob [23:00:04] [Name bob doesn't exist but resolves to I-0000012d] I-0000012d is Nova Instance with name: bob, host: virt6, IP: 10.4.0.90 of type: m1.small, with number of CPUs: 1, RAM of this size: 2048M, member of project: pediapress, size of storage: 30 and with image ID: lucid-server-cloudimg-amd64.img [23:00:05] gonna need to kill it [23:00:10] I really wanted to know what is bob [23:00:14] so I made this tool [23:00:31] You know if this was in puppet we could just re-install the instance easier than rebooting it [23:00:32] PROBLEM SSH is now: CRITICAL on bots-apache01.pmtpa.wmflabs 10.4.0.141 output: CRITICAL - Socket timeout after 10 seconds [23:00:49] Damianz nope rebooting will be always easier :D [23:00:52] lol [23:00:55] yeah, rebooting is easier ;) [23:01:00] I think it was hung on a filesystem [23:01:07] if rebooting is easier something sucks [23:01:12] if I knew 10 years ago that there will be day when reinstalling system is faster than rebooting it... hehe [23:01:12] it would have eventually rebooted [23:01:32] I always had to reinstall my windows 95 after few months [23:01:39] virt6 is a pain in my ass [23:01:46] Personally I don't even bother updating servers, just make a new one and throw it into prod after automated burn in [23:01:55] * Damianz hands Ryan the lube [23:01:59] wait a moment, that wasn't 10 years [23:02:02] that was more :D [23:02:21] Running periodic task ComputeManager.update_available_resource <— that [23:02:29] that hangs for ages [23:03:13] I can't wait till they move periodic tasks out of the daemons [23:03:21] .... [23:03:31] why are blocking tasks in any thread near important stuff [23:03:33] fucking shard [23:03:40] yay unix model [23:03:44] yes. it's stupid [23:04:03] thankfully I think they are fixing this in grizzly [23:04:31] obviously running 100 instances per node is too much [23:05:23] RECOVERY SSH is now: OK on bots-apache01.pmtpa.wmflabs 10.4.0.141 output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:05:54] hooray. 
home directories are on gluster on apache01 now [23:06:07] I can't wait till I can kill that nfs instance [23:06:25] that would make it fast :) [23:06:50] I don't think there's many hosts still mounted on it [23:07:02] you can verify it somehow [23:07:05] I think [23:07:12] showmount -a [23:07:13] just look at who's mounted [23:07:48] 10.4.0.41 [23:07:50] 10.4.0.42 [23:07:54] 10.4.0.44 [23:07:55] 10.4.0.48 [23:07:57] all bots [23:08:01] 10.4.0.59 [23:08:25] I expected bots to keep them mounted [23:08:30] a bunch of deployment prep [23:08:31] @labs-resolve 10.4.0.42 [23:08:31] I don't know this instance, sorry, try browsing the list by hand, but I can guarantee there is no such instance matching this name, host or Nova ID unless it was created less than 56 seconds ago [23:08:36] meh [23:08:44] I didn't implement teh IP resolving [23:08:59] I think it's lying to me, though [23:09:10] mount or bot [23:09:14] mount [23:09:36] @labs-resolve bots [23:09:36] I don't know this instance - aren't you are looking for: I-0000009c (bots-2), I-0000009e (bots-cb), I-000000a9 (bots-1), I-000000af (bots-sql2), I-000000b4 (bots-sql3), I-000000b5 (bots-sql1), I-000000e5 (bots-3), I-000000e8 (bots-4), I-0000015e (bots-labs), I-00000190 (bots-dev), [23:09:54] and this thing doesn't show IP's :/ [23:09:59] damn it [23:10:30] I just checked one of the instances it listed [23:10:38] and it's not mounting anything from labs-nfs1 [23:13:53] !log hugglewa rebooted all instances [23:13:54] Logged the message, Master [23:17:44] someone need to invent a tool to bypass sudo password request, like you enter password once to execute sudo on multiple machines... etc, so that I could create a script to run a command in multiple terminals [23:18:09] I'm planning on using passwordless sudo [23:18:21] well, but that's also not the best solution [23:18:26] it's less secure [23:18:41] I need to fix all current sudo policies to remove ALL and use the project group for the users [23:18:45] it would be cool if sudo could share the information across the network [23:19:01] have you seen pam_url? [23:19:14] like you would login on one machine and it would not ask you for password on all other machines for certain time [23:19:24] it would be possible using pam_url [23:19:27] hmm [23:20:14] passwordless is easier, though :) [23:20:56] yeah [23:21:14] if I was to do it with pam_url, though, I'd use the system's puppet certificate for authentication to the service [23:21:23] then the web server can know which host you are coming from [23:21:46] if you are coming from a host in the same project, grant access if the user has authenticated in the last x amount of time [23:21:57] yup that's it [23:22:06] not very hard to implement [23:22:43] * Damianz hugs his krb [23:22:59] and you could use two-factor auth with it ;) [23:23:13] krb is too much of a pain in the ass to deal with [23:23:23] I love my new router :D I spent last two days flashing my custom built linux into it :D :D [23:23:30] :D [23:23:31] nice [23:23:31] so much fun with such a simple thing [23:24:23] finally I can have traffic reports for every mac address on network. I found out my sister is biggest leecher :P [23:25:08] hahaha [23:25:54] ah, sweet, I can remount home on almost all of these instances [23:32:22] <3 my cisco router that netflow gives me cool stuff off [23:32:33] PROBLEM Free ram is now: WARNING on dumps-bot3.pmtpa.wmflabs 10.4.0.118 output: Warning: 19% free memory [23:33:08] 'Password can only contain alpha-numeric characters: "(" not allowed.' 
you gotta be fucking kidding me [23:34:11] hahahaha [23:35:01] 'ERROR: "Table Prefix" must not be empty.' < WHAT IF I DON'T WANT A PREFIX [23:35:05] -.- [23:35:28] I hate shitty programs [23:36:03] Though I actually wrote 60lines of ruby yesterday, that caused a whole other level of rage [23:52:10] * Damianz wtf [23:53:24] Damianz: ? [23:53:37] -tech like 2min ago [23:53:55] hahaha [23:55:26] hm. you know, I could manage ssh known_hosts files via a mount [23:55:34] like I do with keys [23:55:36] and have salt manage them [23:55:44] that would be slightly insane [23:55:49] why? [23:55:59] much rather just use ssh fp records then it's verifiable externally easily also [23:56:02] it's better than every user having a wrong one [23:56:13] ssh fp? [23:56:16] dns records? [23:56:33] you have to configure the clients for that [23:56:38] so it actually wouldn't work [23:56:45] who doesn't have that turned on by default?! [23:56:57] it's not on by default [23:57:09] Is on my boxen :D [23:57:12] heh [23:57:28] only problem would be... [23:57:33] pmtpa.wmflabs also isn't a real dns zone [23:57:35] if I want to ssh to outside labs, like github [23:57:41] need to write to file [23:57:47] ah [23:57:47] true [23:57:48] though if you can split into global and user that's ok [23:57:49] good poin [23:57:53] *point
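To make that last exchange concrete: SSHFP is a DNS record type carrying host-key fingerprints, OpenSSH only honours it when the client opts in with VerifyHostKeyDNS (hence "you have to configure the clients"), and it wants a real, ideally DNSSEC-signed zone, which pmtpa.wmflabs isn't. The "global and user" split at the end maps onto ssh's two known-hosts files. A sketch, with a host name taken from the log purely as an example:

    # publish: generate SSHFP records from a host key (paste into the zone)
    ssh-keygen -r bots-apache01.pmtpa.wmflabs -f /etc/ssh/ssh_host_rsa_key.pub
    # consume: the client has to ask for DNS verification explicitly
    ssh -o VerifyHostKeyDNS=yes bots-apache01.pmtpa.wmflabs
    # the managed-file alternative: salt/puppet writes the global file,
    # users keep their own for outside hosts like github (ssh_config keywords):
    #   GlobalKnownHostsFile /etc/ssh/ssh_known_hosts
    #   UserKnownHostsFile   ~/.ssh/known_hosts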