[00:25:33] Ryan_Lane: Why can't you delete wikis? Is that a legal suppressing of free speech shit thing? [00:37:46] PROBLEM Disk Space is now: WARNING on aggregator1 aggregator1 output: DISK WARNING - free space: / 496 MB (5% inode=94%): [00:39:18] Damianz: no, because we have no technical method of doing so cleanly [00:42:40] Damianz: We can delete things at leisure and it's not a free speech issue. The First Amendment protects people against the *government* restricting their speech. Private non-govt parties can do whatever the hell they want [00:43:22] Comparison: if the New York Times refuses to publish your article, that's not a free speech issue. If you start your own newspaper and authorities try to prevent you from distributing it, that is a free speech issue [00:47:10] wikimedia is the government :P [00:47:18] {{cn}} [00:47:29] Technical way of doing it - remove dns :D [00:51:30] I could likely do that [00:51:36] that may be a good idea [00:52:46] PROBLEM Disk Space is now: CRITICAL on aggregator1 aggregator1 output: DISK CRITICAL - free space: / 236 MB (2% inode=94%): [00:56:00] On another note, tech journalists really need to go diaf [01:02:36] <^demon> Ryan_Lane: I seem to have sudo privs on manganese, but I can't login. Shouldn't it use LDAP? Or does it need to be manually added to site.pp? [01:02:57] Change on mediawiki a page Wikimedia Labs/Account creation text was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513560 edit summary: [01:03:20] ^demon: needs to be added to site.pp [01:03:30] ^demon: you need it for git stuff? [01:03:45] hm [01:03:48] how did it work on formey? [01:03:59] oh. 
you had a home directory because of svn [01:04:02] and a real shell [01:04:31] <^demon> Well my ldap entry has my default shell as /bin/bash instead of sillyshell ;-) [01:04:47] yeah [01:05:39] let's see if this works [01:06:16] <^demon> Well I was trying to resolve https://gerrit.wikimedia.org/r/#change,3024 :) but it wasn't merging cleanly [01:06:47] <^demon> But we can just abandon mine [01:07:29] of course that doesn't work [01:07:49] * Damianz wonders how long his finger will take to re-appear with skin [01:07:50] since your account already exists [01:08:00] I'll just create a home directory for you, I guess [01:10:37] ^demon: try now [01:11:43] <^demon> Prompting for password :\ [01:11:46] hm [01:11:52] oh [01:11:58] poor ownership :) [01:12:09] try now [01:12:18] <^demon> Score :) [01:12:20] <^demon> ty [01:12:28] yw [01:12:48] puppet won't be able to fix your key on this system [02:50:13] Hello. Beta has problems. [02:50:31] Recent changes is giving a db error. [02:50:54] not just labs, but other beta wikis too [02:52:01] New patchset: Andrew Bogott; "Added the openstack tip as an optional repo." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3308 [02:52:12] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3308 [02:52:40] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3308 [02:52:43] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3308 [02:55:54] * Hazard-SJ sees status [02:56:06] PROBLEM dpkg-check is now: CRITICAL on essex-2 essex-2 output: DPKG CRITICAL dpkg reports broken packages [02:57:20] Hazard-SJ: what's the name of the project it is in? 
[02:57:35] just cause i'd suggest logint it [02:57:39] logging [03:01:59] mutante: http://labs.wikimedia.beta.wmflabs.org/wiki/Special:RecentChanges is one, and it says "Can't contact the database server: Lost connection to MySQL server at 'reading initial communication packet', system error: 111 (deployment-sql)" [03:03:13] RECOVERY dpkg-check is now: OK on salt salt output: All packages OK [03:03:51] Seems the CSS was down for a few seconds, but is working now :) [03:03:53] RECOVERY Current Load is now: OK on salt salt output: OK - load average: 0.18, 0.18, 0.09 [03:04:15] 03/21/2012 - 03:04:14 - Creating a home directory for dzahn at /export/home/deployment-prep/dzahn [03:04:43] RECOVERY Current Users is now: OK on salt salt output: USERS OK - 1 users currently logged in [03:04:51] mutante: It seems to be all special pages on the beta cluster globally [03:04:52] we know [03:05:09] Hazard-SJ: i was wondering if that server "deployment-sql" is inside "deployment-prep" (name of the project in labs) [03:05:11] Not just special pages, anything not in cache [03:05:13] 03/21/2012 - 03:05:13 - Updating keys for dzahn [03:05:26] Some of the dbs are corrupted which petan was fixing [03:05:43] RECOVERY Disk Space is now: OK on salt salt output: DISK OK [03:05:48] * Hazard-SJ hates corrupted dbs [03:06:03] RECOVERY Free ram is now: OK on salt salt output: OK: 86% free memory [03:06:07] !log deployment-prep added myself as a member just to see the instance names and check for the sql server... [03:06:08] Logged the message, Master [03:07:33] RECOVERY Total Processes is now: OK on salt salt output: PROCS OK: 92 processes [03:08:47] Damianz: so the mysqld is shutdown on purpose because the db is corrupted? 'cause it's just not running. should not be started then? [03:09:01] It is /probably/ being worked on [03:09:05] ok [03:09:14] I wouldn't start it as like 400 dbs were being fixed. 
[03:09:28] Last time I heard he was trying some stuff before just giving up and restoring shit loads of sql dumps. [03:09:28] ok, i won't. thx [03:18:18] PROBLEM dpkg-check is now: CRITICAL on essex-3 essex-3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:18:58] PROBLEM Current Load is now: CRITICAL on essex-3 essex-3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:19:38] PROBLEM Current Users is now: CRITICAL on essex-3 essex-3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:20:18] PROBLEM Disk Space is now: CRITICAL on essex-3 essex-3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:20:58] PROBLEM Free ram is now: CRITICAL on essex-3 essex-3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [03:22:28] !log deployment-prep mysqld on deployment-sql is stopped - did not start it though after i heard petan is working on corrupted db's [03:22:28] PROBLEM Total Processes is now: CRITICAL on essex-3 essex-3 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[03:22:29] Logged the message, Master [03:25:58] RECOVERY Free ram is now: OK on essex-3 essex-3 output: OK: 82% free memory [03:27:28] RECOVERY Total Processes is now: OK on essex-3 essex-3 output: PROCS OK: 105 processes [03:28:18] RECOVERY dpkg-check is now: OK on essex-3 essex-3 output: All packages OK [03:28:58] RECOVERY Current Load is now: OK on essex-3 essex-3 output: OK - load average: 1.36, 1.37, 0.90 [03:29:38] RECOVERY Current Users is now: OK on essex-3 essex-3 output: USERS OK - 2 users currently logged in [03:30:18] RECOVERY Disk Space is now: OK on essex-3 essex-3 output: DISK OK [03:36:18] PROBLEM dpkg-check is now: CRITICAL on essex-3 essex-3 output: DPKG CRITICAL dpkg reports broken packages [04:11:12] 03/21/2012 - 04:11:11 - Updating keys for edsu [04:38:12] 03/21/2012 - 04:38:12 - Updating keys for edsu [04:42:12] 03/21/2012 - 04:42:12 - Updating keys for edsu [04:44:12] 03/21/2012 - 04:44:12 - Updating keys for edsu [07:55:02] hello [07:55:13] THO|Cloud: hi [08:01:24] Change on mediawiki a page Wikimedia Labs/Things to fix in beta was modified, changed by Petrb link https://www.mediawiki.org/w/index.php?diff=513729 edit summary: [14:33:58] PROBLEM Current Load is now: CRITICAL on essex-4 essex-4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:34:38] PROBLEM Current Users is now: CRITICAL on essex-4 essex-4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:35:18] PROBLEM Disk Space is now: CRITICAL on essex-4 essex-4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:36:08] PROBLEM Free ram is now: CRITICAL on essex-4 essex-4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:37:28] PROBLEM Total Processes is now: CRITICAL on essex-4 essex-4 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:38:18] PROBLEM dpkg-check is now: CRITICAL on essex-4 essex-4 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:01:15] petan, petan|wk did you manage to fix the SQL error for the abuse filter patch? [15:10:21] RECOVERY Disk Space is now: OK on essex-4 essex-4 output: DISK OK [15:11:01] RECOVERY Free ram is now: OK on essex-4 essex-4 output: OK: 82% free memory [15:12:31] RECOVERY Total Processes is now: OK on essex-4 essex-4 output: PROCS OK: 103 processes [15:13:11] RECOVERY dpkg-check is now: OK on essex-4 essex-4 output: All packages OK [15:19:01] PROBLEM Current Load is now: WARNING on essex-4 essex-4 output: WARNING - load average: 24.36, 13.81, 6.84 [15:19:41] RECOVERY Current Users is now: OK on essex-4 essex-4 output: USERS OK - 1 users currently logged in [15:28:51] RECOVERY Current Load is now: OK on essex-4 essex-4 output: OK - load average: 2.46, 3.89, 4.91 [15:36:58] Thehelpfulone: not yet [15:37:44] ok, please poke me when you do :) [16:01:14] Still having troubles in labs? [16:01:47] yes [16:04:08] Sorry to hear it [16:24:17] New patchset: Andrew Bogott; "s/T/True in nova.conf. Essex/Nova is choking on 'T'" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3312 [16:24:29] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/3312 [16:24:47] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3312 [16:24:50] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/3312 [18:43:54] I need a new project. anybody here have rights to make one? (I don't think I do.) [18:51:20] <^demon> In gerrit? [18:51:22] <^demon> Or labs? [18:53:25] 03/21/2012 - 18:53:25 - Updating keys for edsu [18:54:12] 03/21/2012 - 18:54:11 - Updating keys for edsu [18:55:11] 03/21/2012 - 18:55:11 - Updating keys for edsu [18:56:01] in labs. [18:56:13] 03/21/2012 - 18:56:12 - Updating keys for edsu [18:56:19] it's finally time to build a swift cluster in labs. [18:56:59] <^demon> Ah. 
I'm not much help there :) [18:59:28] maplebed: I think I can. What do you want it called? [18:59:53] andrewbogott: 'swift' if that's ok. [19:01:13] 03/21/2012 - 19:01:12 - Updating keys for edsu [19:01:30] 03/21/2012 - 19:01:30 - Creating a project directory for swift [19:01:31] 03/21/2012 - 19:01:30 - Creating a home directory for ben at /export/home/swift/ben [19:02:07] maplebed: Do you see it listed on this page now? https://labsconsole.wikimedia.org/wiki/Special:NovaInstance [19:02:12] 03/21/2012 - 19:02:11 - Updating keys for edsu [19:02:15] (I don't, but that doesn't shock me since I'm not a member.) [19:02:30] 03/21/2012 - 19:02:30 - Updating keys for ben [19:02:45] andrewbogott: "No Nova credentials found for your account." [19:02:59] but I do see it in https://labsconsole.wikimedia.org/wiki/Special:NovaProject [19:03:19] wait... it just went away. [19:03:32] oh, now it's back. [19:03:32] weird. [19:03:49] You get that credential warning when you try to add an instance? [19:04:11] I just clicked your link. is that trying to create an instance? [19:04:25] I'm gonna log out and log back in again. [19:04:33] Oh... probably that was just trying to access that page as me. [19:04:41] It was just the 'manage instances' page. [19:04:43] Or, meant to be. [19:05:13] 03/21/2012 - 19:05:12 - Updating keys for edsu [19:06:18] andrewbogott: logging out and back in again did it. [19:06:36] (and I actually do have the ability to make projects, so ... well, sorry for the ping. :P ) [19:06:36] maplebed: So you're set? [19:06:42] I wondered about that :) [19:06:51] I didn't until logging out and back in again. [19:06:55] huh [19:07:55] Ryan_Lane: you wanted to know when folks had to log out/back in again? I just did. 
[19:08:08] * Ryan_Lane nods [19:08:25] unfortunately, it seems that whatever triggers the bug happens at some point well before it shows up [19:08:55] mediawiki seems to be invalidating sessions, and it's doing it even more now that I switched to using memcache [19:09:17] I'm going to have to audit all of the authentication code in mediawiki where it hooks into authentication plugins to track this down [19:09:20] anyone have a time for a socks proxy question re: https://labsconsole.wikimedia.org/wiki/Help:Access ? [19:09:31] s/a time/time/ [19:09:38] ryan_lane: any suggestion for the instance type I should use for swift backends and swift frontends? [19:09:47] probably small [19:09:49] edsu: I'll try and help if you like. what's your question? [19:09:56] or medium [19:10:02] we're really low on ram right now [19:10:13] actually, we're going to replace the bad ram in virt3 today [19:10:22] m1.small or x1.small? [19:10:38] hm [19:10:42] m1.small [19:10:48] I'd recommend using /data/project for the storage [19:10:51] wait [19:10:55] that's not going to work [19:11:02] maplebed: thanks, i am able to ssh connect to bastion and directly to my instance wikistream-1 w/ ssh [19:11:08] swift needs disks, right? [19:11:18] swift needs "local" disks, yeah. [19:11:22] or does it expect a formatted filesystem? [19:11:22] at least one per host [19:11:33] maplebed: so on my workstation I can: ssh -D 9876 edsu@bastion.wmflabs.org [19:11:34] it's only an object store, not a block store. doesn't export a filesystem. [19:11:39] ah ok. [19:11:52] either way you are hitting gluster storage [19:11:54] maplebed: and i've configured my browser to use a socks proxy of localhost:9876 [19:12:01] each node is isolated, communicating with others via rsync and http. [19:12:12] does it take a directory and stick files into it? [19:12:19] yeah. [19:12:36] it may be good to use /data/project/ [19:12:38] I'd like the "directory" formatted xfs if I can. 
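An aside on the Swift layout maplebed describes above: each backend keeps its own local disks, and object placement comes from hashing the object name against a ring of storage nodes. The toy Python sketch below only illustrates that idea; it is not Swift's actual ring code, and the node names and replica count are hypothetical stand-ins for the swift-be* instances.

```python
import hashlib

# Toy illustration of ring-style placement: hash the object name,
# then pick N replica nodes from a fixed node list. Real Swift uses
# weighted partitions and zones; this only shows the hashing idea.
NODES = ["swift-be1", "swift-be2", "swift-be3", "swift-be4"]  # hypothetical
REPLICAS = 3

def place(obj_name, nodes=NODES, replicas=REPLICAS):
    """Return the nodes that would hold replicas of obj_name."""
    digest = hashlib.md5(obj_name.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    # Walk the ring from the hashed starting point for each replica.
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]
```

The useful property is that placement is deterministic: any proxy that knows the ring can compute the same node list for a given object name without asking a central database.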
[19:12:48] maplebed: and i can see http://google.com just fine in my browser [19:12:50] either way you are going to hit gluster storage [19:13:09] alternatively, you can use /mnt [19:13:11] edsu: from your browser, load http://whatismyipaddress.com/ [19:13:15] and reformat as xfs [19:13:20] maplebed: but when i point it at http://wikistream-1.pmtpa.wmflabs I get a server not found error [19:13:28] if you want to use /mnt, then use s1.small or medium [19:13:43] /mnt will be slower than /data/project [19:13:45] maplebed: i see 208.80.153.194 [19:13:50] but you can make a filesystem there [19:14:10] though really, labs isn't for performance testing anyway, so /mnt is likely more consistent [19:14:30] edsu: good; that confirms that the proxy part is working correctly. [19:14:38] btw, it doesn't have to be /mnt, you can remove that from the fstab and remount it elsewhere [19:14:57] ok, I'll use /mnt (and probably ping you again when it comes time to actually mount something) [19:15:04] and you can reformat the disk however you like. it comes as ext3, created and mounted for you on startup [19:15:26] edsu: did you set remote dns lookups in your browser? wikistream-1.pmtpa.wmflabs is not a globally resolvable DNS name. [19:15:27] maplebed: on bastion i can `telnet wikistream-1.pmtpa.wmflabs 80` [19:15:37] in 30 minutes I'm going to take down virt3 to replace dimms [19:15:42] some instances will come down with it [19:15:51] bastion knows how to do DNS for .wmflabs, but your computer probably doesn't. [19:16:15] if you use a socks proxy, it'll handle the DNS too [19:16:20] Ryan_Lane: is the content in /mnt persistent? across reboots or something? [19:16:24] maplebed: ye[ [19:16:26] *yep [19:16:35] everything in labs is persistent [19:16:37] Ryan_Lane: at least in firefox, local or remote DNS lookups is a configuration option. [19:16:38] until you delete it [19:16:48] ok. cool. 
[19:17:04] maplebed: ahh, this is firefox [19:17:08] though I'm not sure if the default is local or remote DNS. [19:17:16] I know it used to be local, but maybe it's not anymore. [19:17:30] edsu: http://kb.mozillazine.org/Network.proxy.socks_remote_dns [19:18:55] at least in my installation of firefox, Network.proxy.socks_remote_dns is set to a default of false. (firefox 11.0) [19:19:18] maplebed: yup it's default to false in firefox 11 at least [19:19:44] maplebed++ # bingo, thanks very much! [19:20:11] edsu: you are now required to update the docs you were following so the next person following them doesn't have the same problem. [19:20:14] :D [19:20:40] maplebed: i can do that :) [19:20:49] heh [19:20:50] (I'm kidding about the required part, but it'd be awesome if you do.) [19:20:51] sweet [19:21:01] agreed. would be awesome. [19:21:06] always nice to have good docs :) [19:21:55] the way i changed it was to go to about:config and look for network.proxy.socks_remote_dns [19:21:58] would that suffice? [19:22:20] sure [19:22:23] i guess there might be a menu somewhere with a checkbox [19:22:33] ok, cool [19:22:33] I'm not sure there is. [19:22:34] if you can write up a little about the situation that causes the need for it, it would be good [19:30:10] maplebed: https://labsconsole.wikimedia.org/w/index.php?title=Help%3AAccess&action=historysubmit&diff=2790&oldid=2777 that work ok? [19:30:37] Ryan_Lane: oops meant to ask you the same question :) [19:31:16] looks good to me. [19:33:31] * Ryan_Lane nods [19:33:33] looks good [19:35:43] since it's always easier to edit content than write, I've done some formatting changes to that. https://labsconsole.wikimedia.org/w/index.php?title=Help:Access&diff=2791&oldid=2790 [19:37:19] edsu: fwiw, if you're going to be doing work on a private instance on labs a lot, you should take a look at foxyproxy. Using it instead means you won't have to enable / disable the proxy every time you want to work on your instance. 
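For anyone hitting the same wall as edsu: network.proxy.socks_remote_dns works because SOCKS5 (RFC 1928) lets the client hand the proxy an unresolved hostname, so the bastion end of the tunnel does the .wmflabs lookup instead of your workstation. A minimal Python sketch of the CONNECT request Firefox sends in that mode (the hostname here just reuses the instance from the conversation):

```python
import struct

def socks5_connect_request(hostname, port):
    """Build a SOCKS5 CONNECT request carrying a domain name (RFC 1928).

    With ATYP=0x03 the hostname travels to the proxy as-is, so DNS
    resolution happens on the proxy side, where .wmflabs resolves.
    """
    host = hostname.encode("ascii")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain), then len, name, port.
    return (struct.pack("!BBBB", 5, 1, 0, 3)
            + bytes([len(host)]) + host
            + struct.pack("!H", port))

req = socks5_connect_request("wikistream-1.pmtpa.wmflabs", 80)
```

With remote DNS off, the browser instead resolves the name itself and sends an IP address (ATYP=0x01), which fails for names only bastion can resolve; that is exactly the "server not found" symptom above.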
[19:37:54] it also means all your non-labs web browsing won't be slowed down by having to go through your ssh tunnel. [19:38:47] Ryan_Lane: also m1.small for the swift frontend proxies? they need no local storage; just enough to run one wsgi web service. [19:40:03] maplebed: thanks [19:40:59] yeah, that seems appropriate [19:41:04] ok. [19:41:09] I *really* need to get this cisco box in to the compute cluster [19:41:12] one more question - [19:41:15] we're basically our of ram [19:41:18] *out [19:41:20] to enable any labs instance to hit swift on port 80 [19:41:22] i can't type today [19:41:44] oh, you should make security groups before you create instances [19:41:48] I create a security group with a cidr range of 0/0 or 10.4.0.0/24 or something different? [19:41:56] 0.0.0.0/0 [19:42:09] ok. [19:42:11] 10.4.0.0/24 works if you only want labs to access it [19:42:24] if you want the outside world to be able to hit it, when using a public IP, it needs to be 0.0.0.0/0 [19:42:38] hm. ok. [19:42:48] I'll put in 0/0 but not give it a public IP at the moment. [19:42:52] the good thing is, you can change the security group as much as you want [19:43:02] does that mean that any host in our network can hit it? (eg fenari) [19:43:08] yes [19:43:18] well [19:43:19] maybe [19:43:24] we may filter both ways [19:43:32] ok. so the public IP is only for non-wmf servers to hit it, and the 10.4 vs. 0/0 is for labs vs. all wmf hosts to hit it [19:43:46] I think we actually filter both ways [19:43:56] so you likely need a public IP for wmf hosts too [19:44:13] ok. I won't do that yet, but it's good to know. [19:44:27] * Ryan_Lane nods [19:44:31] we're slightly paranoid :) [19:45:40] Ryan_Lane: where is our instance? 
(hugglewa-w1) [19:49:41] PROBLEM Current Load is now: WARNING on bots-cb bots-cb output: WARNING - load average: 4.51, 15.70, 8.90 [19:53:59] PROBLEM dpkg-check is now: CRITICAL on swift-be3 swift-be3 output: Connection refused by host [19:53:59] PROBLEM dpkg-check is now: CRITICAL on swift-be1 swift-be1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:54:39] PROBLEM Current Load is now: CRITICAL on swift-be1 swift-be1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:54:39] PROBLEM Current Load is now: CRITICAL on swift-be2 swift-be2 output: Connection refused by host [19:54:39] PROBLEM Current Load is now: CRITICAL on swift-be3 swift-be3 output: Connection refused by host [19:55:35] PROBLEM Current Users is now: CRITICAL on swift-be1 swift-be1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:55:35] PROBLEM Current Users is now: CRITICAL on swift-be2 swift-be2 output: Connection refused by host [19:55:35] PROBLEM Current Users is now: CRITICAL on swift-be3 swift-be3 output: Connection refused by host [19:56:04] PROBLEM Disk Space is now: CRITICAL on swift-be1 swift-be1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:56:04] PROBLEM Disk Space is now: CRITICAL on swift-be2 swift-be2 output: Connection refused by host [19:56:14] PROBLEM Disk Space is now: CRITICAL on swift-be3 swift-be3 output: Connection refused by host [19:56:49] PROBLEM Free ram is now: CRITICAL on swift-be1 swift-be1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:56:49] PROBLEM Free ram is now: CRITICAL on swift-be2 swift-be2 output: Connection refused by host [19:56:59] PROBLEM Free ram is now: CRITICAL on swift-be3 swift-be3 output: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:18] PROBLEM Total Processes is now: CRITICAL on swift-be1 swift-be1 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:58:23] PROBLEM Total Processes is now: CRITICAL on swift-be2 swift-be2 output: Connection refused by host [19:58:28] PROBLEM Total Processes is now: CRITICAL on swift-be3 swift-be3 output: Connection refused by host [19:59:02] PROBLEM dpkg-check is now: CRITICAL on swift-be2 swift-be2 output: Connection refused by host [20:04:01] PROBLEM Current Load is now: CRITICAL on swift-fe1 swift-fe1 output: Connection refused by host [20:04:01] PROBLEM Current Load is now: CRITICAL on swift-be4 swift-be4 output: Connection refused by host [20:04:51] PROBLEM Current Users is now: CRITICAL on swift-be4 swift-be4 output: Connection refused by host [20:04:51] PROBLEM Current Users is now: CRITICAL on swift-fe1 swift-fe1 output: Connection refused by host [20:04:51] RECOVERY Current Load is now: OK on bots-cb bots-cb output: OK - load average: 0.71, 1.60, 4.01 [20:05:21] PROBLEM Disk Space is now: CRITICAL on swift-be4 swift-be4 output: Connection refused by host [20:05:21] PROBLEM Disk Space is now: CRITICAL on swift-fe1 swift-fe1 output: Connection refused by host [20:06:11] PROBLEM Free ram is now: CRITICAL on swift-fe1 swift-fe1 output: Connection refused by host [20:06:11] PROBLEM Free ram is now: CRITICAL on swift-be4 swift-be4 output: Connection refused by host [20:07:31] PROBLEM Total Processes is now: CRITICAL on swift-be4 swift-be4 output: Connection refused by host [20:07:36] PROBLEM Total Processes is now: CRITICAL on swift-fe1 swift-fe1 output: Connection refused by host [20:08:21] PROBLEM dpkg-check is now: CRITICAL on swift-fe1 swift-fe1 output: Connection refused by host [20:08:21] PROBLEM dpkg-check is now: CRITICAL on swift-be4 swift-be4 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
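To make the security-group choice from the conversation above concrete, Python's ipaddress module shows what each source range admits: 10.4.0.0/24 only matches labs-internal addresses, while 0.0.0.0/0 matches any source. The instance address below is hypothetical; 208.80.153.194 is the proxy address edsu saw earlier.

```python
import ipaddress

# The two source ranges Ryan_Lane mentioned for the security group.
labs_only = ipaddress.ip_network("10.4.0.0/24")
everyone = ipaddress.ip_network("0.0.0.0/0")

instance_ip = ipaddress.ip_address("10.4.0.42")       # hypothetical labs instance
public_ip = ipaddress.ip_address("208.80.153.194")    # public-facing address

print(instance_ip in labs_only)  # True  - labs traffic is allowed
print(public_ip in labs_only)    # False - outside traffic is not
print(public_ip in everyone)     # True  - 0.0.0.0/0 admits anything
```

As noted in the chat, the group can be edited later, so starting with 0.0.0.0/0 but no public IP keeps the instance reachable only from inside until a floating IP is attached.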
[20:09:32] * andrewbogott hands over his King of Nagios Warnings crown to maplebed [20:45:40] if anyone has their browser set up to view labs vms, I got wikistream running at http://wikistream-1.pmtpa.wmflabs/ [20:46:25] would be interested to know if it works for anyone else :) [20:49:51] PROBLEM Free ram is now: CRITICAL on deployment-web2 deployment-web2 output: Critical: 3% free memory [20:50:01] PROBLEM Free ram is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:51:01] PROBLEM Current Load is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [20:51:21] PROBLEM SSH is now: CRITICAL on deployment-web deployment-web output: CRITICAL - Socket timeout after 10 seconds [20:51:21] PROBLEM Current Load is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:51:22] PROBLEM Current Users is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [20:51:22] PROBLEM Current Users is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:51:22] PROBLEM dpkg-check is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:01] PROBLEM Current Load is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:21] PROBLEM Disk Space is now: CRITICAL on deployment-web4 deployment-web4 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:21] PROBLEM Disk Space is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:21] PROBLEM Current Load is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:52:21] PROBLEM SSH is now: CRITICAL on deployment-web4 deployment-web4 output: CRITICAL - Socket timeout after 10 seconds [20:56:17] PROBLEM Free ram is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:22] PROBLEM Current Load is now: CRITICAL on deployment-web2 deployment-web2 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:22] PROBLEM Total Processes is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:27] PROBLEM Total Processes is now: CRITICAL on deployment-web2 deployment-web2 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:32] PROBLEM dpkg-check is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:34] PROBLEM dpkg-check is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:34] PROBLEM dpkg-check is now: CRITICAL on deployment-web2 deployment-web2 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:22] PROBLEM Free ram is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:23] PROBLEM SSH is now: CRITICAL on deployment-web3 deployment-web3 output: CRITICAL - Socket timeout after 10 seconds [20:57:23] PROBLEM SSH is now: CRITICAL on deployment-web5 deployment-web5 output: CRITICAL - Socket timeout after 10 seconds [20:57:23] PROBLEM Total Processes is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:59:34] wtf happened to deployment? heh [20:59:45] did someone OOM it again? [21:09:20] Ryan_Lane: Are there puppet classes for mediawiki + OSM hiding someplace? [21:09:59] ummm [21:10:05] no, mediawiki is mostly manually set up [21:10:48] ok [21:11:47] Regarding the task of having openstack write instance info to mediawiki... 
[21:12:17] I don't know enough about mediawiki to know how that would happen. Is it just a matter of dropping a file of markup text into the right place? [21:16:23] andrewbogott: don't we already have instance info in mediawiki? [21:17:16] andrewbogott: it would be best to write to pages using the API [21:17:21] jeremyb: OSM does it now [21:17:27] we'd prefer nova to do it [21:17:31] jeremyb: Yes. The current setup uses a 'pull' model which sometimes gets out of sync. The idea is to hook nova/openstack events to push out details. [21:17:41] though I guess OSM could watch nova's update queue [21:17:57] is that how you were going to handle it? [21:17:59] damnit [21:18:17] i spent the last 60+ secs thinking OSM was openstreetmap [21:27:05] heh [21:27:05] (a chronic issue) [21:27:05] context ;) [21:27:05] aude|away: ^^^^^ [21:27:05] Ryan_Lane: That, or write a custom notification driver that writes to the wiki. [21:27:05] * Ryan_Lane nods [21:27:05] Doing it as a notification driver is probably slightly easier. [21:27:05] andrewbogott: so, also push out RCs and edits in page history? [21:27:05] andrewbogott: is there something I'm supposed to do so that nagios doesn't flip out about my swift instances? [21:27:05] RCs? [21:27:06] jeremyb: We're talking about the state of Labs VMs. [21:27:06] recent changes [21:27:06] And statistics about same. [21:27:06] maplebed: there's a bug with the nagios puppet manifests [21:27:06] maplebed: on first run, they don't work [21:27:06] you need to force run puppet again [21:27:06] and nagios will stop complaining [21:27:06] on which hosts - mine or nagios? [21:27:06] no one has bothered to fix the nagios manifests [21:27:06] yours [21:27:06] but statistics that rarely change like instance size not RAM currently in use. right? [21:27:06] it's likely missing the nrpe config [21:27:06] Ryan_Lane: I implemented gluster auto-attachment as a notification driver. I'm sorta waiting to see if other nova devs like or hate that. 
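A rough sketch of what "write to pages using the API" could look like from a nova notification handler. This only builds the parameters for a MediaWiki action=edit POST against api.php; a real client would log in and fetch an edit token first, and the page title and template below are made-up placeholders, not an agreed-on format.

```python
def build_edit_params(instance, host, state, token):
    """Assemble the form fields for a MediaWiki action=edit request.

    The Nova_Resource: title and {{Instance}} template are hypothetical
    examples of how instance state could be pushed onto a wiki page;
    `token` would come from the API's token-fetching step.
    """
    title = "Nova_Resource:%s" % instance
    text = "{{Instance|host=%s|state=%s}}" % (host, state)
    return {
        "action": "edit",
        "title": title,
        "text": text,
        "summary": "Updating instance state from nova notification",
        "token": token,
        "format": "json",
    }

# Example: a notification saying wikistream-1 is ACTIVE on virt3.
params = build_edit_params("wikistream-1", "virt3", "ACTIVE", "dummytoken")
```

Pushing edits like this on each state-change event would also fix the live-migration staleness Ryan_Lane mentions: the page gets rewritten whenever the host actually changes, instead of waiting for a periodic pull.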
[21:27:06] andrewbogott: ah. cool [21:27:06] jeremyb: vm state changes, also where the vm is hosted [21:27:06] so, when a live-migration occurs, the wiki is wrong [21:27:06] right now the wiki is very wrong in regards to where instances are running [21:27:06] but is that the only reason reboots don't work after migrate? [21:36:49] Ryan_Lane: I take it there are test instances of mediawiki that I can use to learn about the APIs? [21:36:49] hm [21:36:49] you can use beta [21:36:49] if it's alive right now [21:36:49] ok [21:36:49] or test wikipedia [21:36:49] andrewbogott: http://test.wikipedia.org/w/api.php [21:37:03] Ryan_Lane: I have an instance that won't respond to pings or reboot from labsconsole, something borked in labs atm? [21:37:12] we just replaced the ram in virt3 [21:37:23] I'm bringing the instances back up right now [21:37:26] ok [21:37:37] should be back soon [21:37:54] Ryan_Lane: What kind of integration with OSM are you imagining? Will I just be generating a page, or actually inserting data into a DB so you can run queries and such? [21:37:54] which instance is it? [21:38:01] "outreacheval" [21:38:11] andrewbogott: just creating/editing pages [21:38:18] ok [21:38:47] Anything else you'd rather I worked on first? [21:39:10] hm. no, that would be good [21:39:30] it's one of the few things nova doesn't do that OSM does [21:39:42] nimish_g: it should be booting [21:39:54] k, thx [21:40:07] the systems are really loaded right now since I just booted like 35 instances [21:40:10] so it may take a little bit [21:40:19] see: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:40:20] heh [21:42:38] hm. 
some may need fsck run :( [21:42:43] that's annoying [21:42:51] I'll give them a little bit to boot first [21:43:06] if they don't I'll bring them down, fsck, then bring them up [21:43:13] nimish_g: one of yours looks like it may need an fsck [21:45:50] oh. it didn't boot properly [21:46:22] 03/21/2012 - 21:46:22 - Updating keys for ben [21:46:30] 03/21/2012 - 21:46:29 - Updating keys for ben [21:46:39] 03/21/2012 - 21:46:39 - Updating keys for ben [21:47:14] 03/21/2012 - 21:47:13 - Updating keys for ben [21:47:18] hm. boots are queued up [21:47:24] 03/21/2012 - 21:47:24 - Updating keys for ben [21:48:38] I'll give it more time before troubleshooting [21:48:38] 03/21/2012 - 21:47:28 - Updating keys for ben [21:48:38] 03/21/2012 - 21:47:34 - Updating keys for ben [21:48:38] 03/21/2012 - 21:47:42 - Updating keys for ben [21:48:38] labs-home-wm: I need to fix you, spammer [21:48:39] ...are you talking to a bot? [21:48:47] yep :) [21:49:51] haha ok. do tell me when you get a chance to fsck my instance [21:49:57] * Ryan_Lane nods [21:50:07] I think it may be waiting in line to boot [21:50:21] Ryan_Lane: when you talk about fixing the bot, I hope you're talking about it in the same way you fix a cat. [21:50:27] heh [21:50:34] well, it's the script that feeds the bot ;) [21:51:05] it should enumerate all the projects where the key is being updated and spit it out in one log message [21:51:09] I'm getting the "If you are having access problems..." when I try and connect to bastion. [21:51:48] bastion, or bastion-restricted? [21:51:48] despite that, it does eventually connect. [21:51:48] (bastion-restricted) [21:51:48] and I have a shell, though it took ~2min. [21:51:48] that message always shows up [21:51:48] yeah [21:51:48] oh. 
[21:51:48] it's all totally overloaded right now
[21:51:51] because I booted about 35 instances
[21:52:07] when I move home directories to the project storage that problem will partially go away
[21:52:32] the instance storage hits high wait-io when too many instances boot
[21:53:37] when we have more compute nodes, and more network nodes, this should also be less of a problem
[21:53:42] though we'll then have more instances too
[21:53:44] so, we'll see
[21:54:01] this is one of the reasons I'm pushing people to use project storage more than instance storage, where possible
[21:54:06] separate the IO
[21:54:57] I'm going to get the advice of the openstack people about scaling some of the services. they get severely overloaded when too many instances are booted
[21:56:52] nimish_g: ok, your instance should be up now. may be a little slow for another 10 or so minutes
[21:57:43] …. let the spam begin….
[21:58:10] labs-nagios-wm: welcome to the party
[21:58:29] RECOVERY Disk Space is now: OK on nova-daas-1 nova-daas-1 output: DISK OK
[21:58:29] RECOVERY Free ram is now: OK on nova-daas-1 nova-daas-1 output: OK: 83% free memory
[21:59:09] RECOVERY dpkg-check is now: OK on feeds feeds output: All packages OK
[21:59:50] RECOVERY HTTP is now: OK on search-test search-test output: HTTP OK: HTTP/1.1 200 OK - 1102 bytes in 0.006 second response time
[22:01:09] PROBLEM Current Load is now: CRITICAL on nova-daas-1 nova-daas-1 output: CHECK_NRPE: Socket timeout after 10 seconds.
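[Editor's note: booting ~35 instances simultaneously is what drove the wait-io spike described above. One common mitigation — not necessarily what was done here — is to throttle boots into small batches with a pause between them. A hypothetical sketch:]

```python
# Sketch of batching instance boots so instance storage isn't hit by
# dozens of simultaneous boots. The boot callable and batch size are
# illustrative assumptions, not the actual labs tooling.
import time


def boot_in_batches(instances, boot, batch_size=5, delay=0.0):
    """Boot instances a few at a time, sleeping between batches to let
    wait-io settle. Returns the instances in the order they were booted."""
    started = []
    for i in range(0, len(instances), batch_size):
        for name in instances[i:i + batch_size]:
            boot(name)  # e.g. a nova boot/reboot call per instance
            started.append(name)
        if i + batch_size < len(instances):
            time.sleep(delay)  # pause between batches
    return started
```

With `batch_size=5` the 35-instance restart above would proceed as seven small waves instead of one storm.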
[22:03:17] !log deployment-prep rebooting -web2 and -web4, they were OOM'd
[22:04:54] RECOVERY dpkg-check is now: OK on nova-daas-1 nova-daas-1 output: All packages OK
[22:06:05] RECOVERY host: deployment-web4 is UP address: deployment-web4 PING OK - Packet loss = 0%, RTA = 0.84 ms
[22:07:54] RECOVERY host: deployment-web2 is UP address: deployment-web2 PING OK - Packet loss = 0%, RTA = 0.47 ms
[22:09:54] RECOVERY Current Load is now: OK on bots-3 bots-3 output: OK - load average: 1.97, 3.02, 4.58
[22:09:54] PROBLEM Current Load is now: WARNING on test-oneiric test-oneiric output: WARNING - load average: 0.21, 2.33, 5.50
[22:09:54] RECOVERY Free ram is now: OK on test-oneiric test-oneiric output: OK: 91% free memory
[22:09:54] RECOVERY Current Users is now: OK on test-oneiric test-oneiric output: USERS OK - 0 users currently logged in
[22:11:54] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5)
[22:14:10] !log deployment-prep rebooting -web5, it's OOM'd
[22:14:54] RECOVERY Current Load is now: OK on test-oneiric test-oneiric output: OK - load average: 0.01, 0.91, 4.03
[22:15:54] RECOVERY Disk Space is now: OK on deployment-web5 deployment-web5 output: DISK OK
[22:15:54] RECOVERY dpkg-check is now: OK on deployment-web5 deployment-web5 output: All packages OK
[22:16:04] RECOVERY host: deployment-web5 is UP address: deployment-web5 PING OK - Packet loss = 0%, RTA = 0.77 ms
[22:17:44] RECOVERY Current Users is now: OK on deployment-web5 deployment-web5 output: USERS OK - 0 users currently logged in
[22:18:24] PROBLEM host: essex-4 is DOWN address: essex-4 check_ping: Invalid hostname/address - essex-4
[22:18:24] RECOVERY Free ram is now: OK on deployment-web5 deployment-web5 output: OK: 92% free memory
[22:18:34] RECOVERY SSH is now: OK on deployment-web5 deployment-web5 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:19:54] RECOVERY Current Load is now: OK on deployment-web5 deployment-web5 output: OK - load average: 0.15, 0.20, 0.09
[22:19:54] RECOVERY Total Processes is now: OK on deployment-web5 deployment-web5 output: PROCS OK: 108 processes
[22:47:54] PROBLEM host: deployment-web3 is DOWN address: deployment-web3 CRITICAL - Host Unreachable (deployment-web3)
[22:48:24] PROBLEM host: essex-4 is DOWN address: essex-4 check_ping: Invalid hostname/address - essex-4
[23:06:16] !log deployment-prep rebooting -web3 OOM
[23:07:44] RECOVERY Current Load is now: OK on deployment-web3 deployment-web3 output: OK - load average: 0.27, 0.08, 0.03
[23:07:54] RECOVERY host: deployment-web3 is UP address: deployment-web3 PING OK - Packet loss = 0%, RTA = 0.34 ms
[23:08:24] RECOVERY Free ram is now: OK on deployment-web3 deployment-web3 output: OK: 91% free memory
[23:08:34] RECOVERY SSH is now: OK on deployment-web3 deployment-web3 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[23:09:54] RECOVERY Total Processes is now: OK on deployment-web3 deployment-web3 output: PROCS OK: 108 processes
[23:10:54] RECOVERY Disk Space is now: OK on deployment-web3 deployment-web3 output: DISK OK
[23:10:54] RECOVERY Current Users is now: OK on deployment-web3 deployment-web3 output: USERS OK - 0 users currently logged in
[23:10:54] RECOVERY dpkg-check is now: OK on deployment-web3 deployment-web3 output: All packages OK
[23:18:24] PROBLEM host: essex-4 is DOWN address: essex-4 check_ping: Invalid hostname/address - essex-4
[23:48:24] PROBLEM host: essex-4 is DOWN address: essex-4 check_ping: Invalid hostname/address - essex-4
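[Editor's note: the deployment-prep webservers above were rebooted by hand after going OOM. Spotting those events could be automated by scanning the kernel log for the Linux OOM killer's "Out of memory: Kill process" line; the sketch below assumes that standard message format, and the sample host/process names are illustrative.]

```python
# Sketch: find OOM-killer events in kernel log lines, e.g.
#   Out of memory: Kill process 1234 (apache2) score 905 or sacrifice child
import re

OOM_RE = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process-name) pairs for every OOM kill found."""
    hits = []
    for line in log_lines:
        m = OOM_RE.search(line)
        if m:
            hits.append((int(m.group(1)), m.group(2)))
    return hits


if __name__ == "__main__":
    sample = ["Mar 21 22:03:01 deployment-web2 kernel: "
              "Out of memory: Kill process 4321 (apache2) "
              "score 905 or sacrifice child"]
    print(find_oom_kills(sample))
```

A cron job feeding `dmesg` or `/var/log/kern.log` through this would have flagged -web2/-web4/-web5/-web3 before anyone noticed them down.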