[00:26:44] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[00:34:36] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on deployment-web2 i-00000125 output: OK: 20% free memory
[00:58:46] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[01:16:16] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on deployment-web6 i-000001d9 output: OK: 20% free memory
[01:28:46] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[01:29:16] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on deployment-web6 i-000001d9 output: Warning: 19% free memory
[01:58:46] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[02:28:46] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[02:39:19] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on deployment-web6 i-000001d9 output: OK: 23% free memory
[02:58:54] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[02:58:55] <labs-nagios-wm_>	 RECOVERY Puppet freshness is now: OK on mobile-feeds i-000000c1 output: puppet ran at Sun Apr 22 02:58:45 UTC 2012
[03:02:24] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on deployment-web6 i-000001d9 output: Warning: 19% free memory
[03:26:06] <labs-nagios-wm_>	 PROBLEM Puppet freshness is now: CRITICAL on wikidata-dev-2 i-0000020a output: Puppet has not run in last 20 hours
[03:29:16] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[03:29:42] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on mobile-feeds i-000000c1 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:34:37] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on mobile-feeds i-000000c1 output: Warning: 7% free memory
[03:37:36] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on deployment-web2 i-00000125 output: Warning: 19% free memory
[03:43:55] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 12% free memory
[03:43:55] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on deployment-web4 i-00000163 output: Warning: 19% free memory
[03:47:56] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory
[03:49:02] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on deployment-web4 i-00000163 output: OK: 23% free memory
[03:49:28] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 14% free memory
[03:59:16] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[04:03:21] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory
[04:05:05] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:05:37] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:07] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:09:07] <labs-nagios-wm_>	 PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 6.83, 8.05, 6.00
[04:10:07] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 17% free memory
[04:10:07] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory
[04:11:07] <labs-nagios-wm_>	 PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.45, 5.69, 5.42
[04:13:47] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 92% free memory
[04:16:07] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.38, 2.34, 4.03
[04:19:07] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 2.48, 3.85, 4.64
[04:30:07] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 5% free memory
[04:30:07] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[04:40:07] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 96% free memory
[04:43:47] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on deployment-web4 i-00000163 output: Warning: 19% free memory
[05:00:07] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[05:30:07] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[05:42:38] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on deployment-web2 i-00000125 output: OK: 20% free memory
[06:00:11] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[06:00:38] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on deployment-web2 i-00000125 output: Warning: 19% free memory
[06:30:14] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[06:45:47] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on mobile-feeds i-000000c1 output: OK: 80% free memory
[06:48:39] <labs-nagios-wm_>	 PROBLEM Current Load is now: WARNING on mobile-feeds i-000000c1 output: WARNING - load average: 0.81, 7.61, 5.07
[06:53:06] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on mobile-feeds i-000000c1 output: OK - load average: 0.07, 2.85, 3.70
[07:00:42] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[07:30:51] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[08:01:01] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[08:31:01] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[09:00:11] <labs-nagios-wm_>	 PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours
[09:01:01] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[09:04:11] <labs-nagios-wm_>	 PROBLEM Puppet freshness is now: CRITICAL on nova-gsoc1 i-000001de output: Puppet has not run in last 20 hours
[09:31:01] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[10:01:01] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[10:31:06] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[10:56:54] <petan|wk>	 hey Ryan_Lane
[10:57:08] <petan|wk>	 should I recreate all web servers with 8gb?
[10:57:10] <petan|wk>	 or only some
[11:01:06] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[11:31:07] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[11:45:03] <Ryan_Lane>	 petan|wk: well, how many are there?
[11:45:06] <Ryan_Lane>	 i'd say all of them
[11:45:15] <Ryan_Lane>	 if we have too many, we can remove some later
[11:45:22] <Ryan_Lane>	 they have way too little ram right now, though
[12:01:10] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[12:15:27] <gerrit-wm>	 New patchset: Dzahn; "use the project- labs groups to allow puppet-created user to execute cron jobs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5555
[12:15:41] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/5555
[12:17:35] <gerrit-wm>	 New review: Dzahn; "without it my cron jobs are stuck" [operations/puppet] (test); V: 1 C: 2;  - https://gerrit.wikimedia.org/r/5555
[12:17:38] <gerrit-wm>	 Change merged: Dzahn; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5555
[12:31:10] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[13:01:18] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[13:22:39] <Platonides>	 mutante, if you can take a look at the grail instance when you have time...
[13:26:08] <labs-nagios-wm_>	 PROBLEM Puppet freshness is now: CRITICAL on wikidata-dev-2 i-0000020a output: Puppet has not run in last 20 hours
[13:26:30] <mutante>	 !log gareth adding myself to be able to reboot grail instance
[13:26:32] <labs-morebots>	 Logged the message, Master
[13:26:38] <labs-home-wm_>	 04/22/2012 - 13:26:37 - Creating a home directory for dzahn at /export/home/gareth/dzahn
[13:27:16] <Platonides>	 were you able to log in?
[13:27:31] <Platonides>	 I had already rebooted it yesterday, but didn't help
[13:27:39] <labs-home-wm_>	 04/22/2012 - 13:27:38 - Updating keys for dzahn
[13:27:47] <mutante>	 <-- need keys first.. hold on
[13:28:06] <Platonides>	 need keys?
[13:28:34] <mutante>	 the instance needs my keys or i cant login anyways
[13:29:00] <Platonides>	 is some special step needed to give the keys to the instance?
[13:29:09] <Platonides>	 I thought they fetched them automatically from ldap
[13:29:15] <mutante>	 that one i just did, added myself to project
[13:29:15] <Platonides>	 (once you were in the project)
[13:29:40] <mutante>	 well, grail does not let me login
[13:29:47] <Platonides>	 doesn't let me either
[13:29:54] <Platonides>	 it rejects the public key
[13:29:59] <mutante>	 same here
[13:30:08] <Platonides>	 also, the dhcp looks wrong
[13:30:14] <Platonides>	 see the frequency at the console log
[13:30:25] <Platonides>	 it's renewing the lease every 50 seconds or so
[13:30:36] <mutante>	 and this was an issue that happened before and was fixed by restart?
[13:30:55] <Platonides>	 no
[13:31:13] <Platonides>	 no action was done there after creating the instance
[13:31:18] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[13:31:28] <Platonides>	 is it possible that it got broken by the security rules?
[13:31:29] <mutante>	 but you dont mind if i restart it , right
[13:31:36] <Platonides>	 of course I don't
[13:32:02] <mutante>	  /me uses "get console output" on wiki
[13:32:47] <mutante>	 oh, i see the DHCP issues, yep
[13:32:59] <mutante>	 but eventually it bound to an IP in the end
[13:33:05] <Platonides>	 yes, it gets bound
[13:33:16] <Platonides>	 and we are able to connect with sshd
[13:33:23] <mutante>	 yea
[13:33:45] <Platonides>	 look at the security group
[13:33:53] <Platonides>	 that's the only piece I could have done wrong
[13:33:56] <mutante>	 security rules would also let us connect ..or not.. but not refuse the key
[13:34:02] <mutante>	 looking
[13:34:22] <Platonides>	 I thought that they might be blocking some check to the ldap, or something similar
[13:34:33] <Platonides>	 if it's ok, I'm out of ideas, then
[13:34:38] <mutante>	 true, that could be
[13:35:07] <Platonides>	 I don't know what is needed for that, though
[13:35:21] <Platonides>	 where is the ldap server?
[13:35:45] <mutante>	 afaik sanger
[13:35:50] <mutante>	 which had recent issues :p
[13:36:00] <mutante>	 but i can use my instance just fine.. so
[13:36:20] <mutante>	 i suggest: reboot without changes,, then reboot after removing security group and compare
[13:36:36] <Platonides>	 I thought you can't remove a security group from an instance?
[13:36:56] <mutante>	 users in "netadmin" should be able to
[13:37:05] <mutante>	 but also regarding this there are/were changes
[13:37:17] <mutante>	 i'll try
[13:37:48] <Platonides>	 ok, try
[13:37:52] <mutante>	 !log gareth rebooting grail
[13:37:53] <labs-morebots>	 Logged the message, Master
[13:39:26] <Platonides>	 puppet-agent[700]: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Package[apache2] is already defined in file /etc/puppet/manifests/webserver.pp at line 69; cannot redefine at /etc/puppet/manifests/webserver.pp:29
[13:39:29] <mutante>	 note how console output now looks normal and good
[13:39:30] <Platonides>	 could this be the cause?
[13:39:35] <mutante>	 but it still does not let us log in
[13:39:39] <mutante>	 the DHCP stuff is gone though
[13:39:58] <Platonides>	 it's there again
[13:40:43] <mutante>	 yea, if puppet run is broken... and that is supposed to add the keys
[13:40:50] <mutante>	 it explains i cant login
[13:41:44] <mutante>	 !log gareth removing webserver classes from grail instance temp. puppet breakage
[13:41:45] <labs-morebots>	 Logged the message, Master
[13:42:09] <mutante>	 Platonides: where did you just get that output ?
[13:42:21] <Platonides>	 it's in the console log
[13:42:35] <Platonides>	 just at the right of "i-00000210 login:"
[13:42:44] <mutante>	 ah, i see
[13:43:03] <mutante>	 can we also get the output of the next run without rebooting again and no login?
[13:43:12] <mutante>	 i was wondering
[13:43:31] <Platonides>	 I don't follow you
[13:43:39] <Platonides>	 when you reboot, the console output is cleared
[13:43:46] <mutante>	 well, we'll have to restart again to remove the classes from the instance
[13:43:49] <Platonides>	 you can however copy & paste the old one before rebooting
[13:44:48] <Jarry1250>	 Hey all. I had previously bookmarked http://orgcharts.wmflabs.org/, but now it times out. 
[13:45:37] <mutante>	 Platonides: i meant the output of puppet runs..after the first one at boot, don't think you can get them when not being able to login, but we need to restart instance anyways after changing the config which classes it is supposed to use
[13:46:15] <Platonides>	 I still can't login
[13:46:17] <Platonides>	 2012-04-22 13:45:13,261 - DataSourceEc2.py[WARNING]: 'http://169.254.169.254' failed: socket timeout [timed out]
[13:46:25] <Platonides>	 that might be interesting?
[13:46:53] <Platonides>	 seems a zeroconf address
[13:46:59] <Platonides>	 perhaps it's expected :/
[13:47:37] <mutante>	 hmm, but right after that it says "found data source" anyways
[13:48:14] <Platonides>	 automount[2404]: syntax error in nsswitch config near [ syntax error ]
[13:49:36] <mutante>	 did it work before with a specific set of puppet classes until classes were added?
[13:49:47] <Platonides>	 this instance never worked
[13:49:48] <mutante>	 or was it just like "create instance, apply classes" and did not work since then
[13:50:08] <mutante>	 ok, then i guess there never was a successful puppet run on it
[13:50:09] <Platonides>	 I created the instance trying to set the right classes
[13:50:15] <Platonides>	 probably not
[13:50:16] <mutante>	 which is needed to add our keys and users
[13:50:48] <Platonides>	 I don't know why the webserver classes conflicted
[13:51:06] <Platonides>	 I just wanted to give it a basic lamp config :s
[13:51:28] <mutante>	 it's probably better to create the instance without adding classes,
[13:51:35] <mutante>	 starting it up, and logging in once
[13:51:41] <mutante>	 trying
[13:52:53] <Platonides>	 "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED" hehe
[13:53:34] <Platonides>	 now it refuses the connection :S
[13:54:48] <Platonides>	 still rejecting the private key
[13:54:50] <mutante>	 give me a minute.. looking for docs 
[13:55:10] <Platonides>	 I'm going now
[13:55:15] <Platonides>	 will be back in a few hours
[13:55:26] <Platonides>	 play freely with it
[13:59:00] <mutante>	 ok, i am a little, but also going soon again.. sunday after all.. cya
[14:01:23] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000210 PING CRITICAL - Packet loss = 100%
[14:19:20] <mutante>	 !log gareth deleted non-default security group and instance, created new instance with same name, settings and default security group
[14:19:21] <labs-morebots>	 Logged the message, Master
[14:21:23] <labs-nagios-wm_>	 RECOVERY host: grail is UP address: i-00000212 PING OK - Packet loss = 0%, RTA = 0.69 ms
[14:22:53] <labs-nagios-wm_>	 PROBLEM Current Load is now: CRITICAL on grail i-00000212 output: Connection refused by host
[14:22:53] <labs-nagios-wm_>	 PROBLEM Current Users is now: CRITICAL on grail i-00000212 output: Connection refused by host
[14:24:13] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on grail i-00000212 output: Connection refused by host
[14:29:13] <labs-nagios-wm_>	 PROBLEM Total Processes is now: CRITICAL on grail i-00000212 output: Connection refused by host
[14:29:13] <labs-nagios-wm_>	 PROBLEM dpkg-check is now: CRITICAL on grail i-00000212 output: Connection refused by host
[14:29:13] <labs-nagios-wm_>	 PROBLEM Disk Space is now: CRITICAL on grail i-00000212 output: Connection refused by host
[14:29:37] <Shujench_>	 How can I request an account on wikilabs?
[14:31:34] <Shujen>	 !accountreq
[14:31:34] <wm-bot>	 in case you want to have an account on labs, please contact someone who is in charge of doing that: Ryan.Lane, m.utante or ssmolle.tt
[14:32:50] <mutante>	 Shujen: https://labsconsole.wikimedia.org/wiki/Help:Access#Access_FAQ
[14:34:13] <mutante>	 !accountreq is case you want to have an account on labs please read here: https://labsconsole.wikimedia.org/wiki/Help:Access#Access_FAQ
[14:34:13] <wm-bot>	 Key exist!
[14:34:23] <mutante>	 !del accountreq
[14:34:33] <mutante>	 !accountreq is case you want to have an account on labs please read here: https://labsconsole.wikimedia.org/wiki/Help:Access#Access_FAQ
[14:34:34] <wm-bot>	 Key exist!
[14:35:02] <Shujen>	 thx
[14:36:33] <mutante>	 you're welcome, via wiki request is the best way
[14:37:50] <Shujen>	 requested
[14:37:52] <Shujen>	 http://www.mediawiki.org/wiki/Developer_access#User:Shujenchang
[14:38:25] <Shujen>	 Is there any admin?
[14:40:44] <petan|wk>	 !accountreq del
[14:40:44] <wm-bot>	 Successfully removed accountreq
[14:40:59] <petan|wk>	 !accountreq is case you want to have an account on labs please read here: https://labsconsole.wikimedia.org/wiki/Help:Access#Access_FAQ
[14:40:59] <wm-bot>	 Key was added!
[14:44:10] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on deployment-web5 i-00000213 output: Connection refused by host
[14:44:10] <labs-nagios-wm_>	 PROBLEM dpkg-check is now: CRITICAL on deployment-web5 i-00000213 output: Connection refused by host
[14:44:50] <labs-nagios-wm_>	 PROBLEM Total Processes is now: CRITICAL on deployment-web5 i-00000213 output: Connection refused by host
[14:45:42] <labs-nagios-wm_>	 PROBLEM Current Users is now: CRITICAL on deployment-web5 i-00000213 output: Connection refused by host
[14:45:42] <labs-nagios-wm_>	 PROBLEM Current Load is now: CRITICAL on deployment-web5 i-00000213 output: Connection refused by host
[14:45:42] <labs-nagios-wm_>	 PROBLEM Disk Space is now: CRITICAL on deployment-web5 i-00000213 output: Connection refused by host
[14:46:40] <labs-nagios-wm_>	 ACKNOWLEDGEMENT Current Load is now: CRITICAL on grail i-00000212 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:47:10] <labs-nagios-wm_>	 ACKNOWLEDGEMENT Free ram is now: CRITICAL on grail i-00000212 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:47:25] <labs-nagios-wm_>	 ACKNOWLEDGEMENT Current Users is now: CRITICAL on grail i-00000212 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:47:25] <labs-nagios-wm_>	 ACKNOWLEDGEMENT Disk Space is now: CRITICAL on grail i-00000212 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:47:30] <labs-nagios-wm_>	 ACKNOWLEDGEMENT Total Processes is now: CRITICAL on grail i-00000212 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:47:40] <labs-nagios-wm_>	 ACKNOWLEDGEMENT dpkg-check is now: CRITICAL on grail i-00000212 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:48:58] <petan|wk>	 !bots
[14:48:59] <wm-bot>	 http://www.mediawiki.org/wiki/Wikimedia_Labs/Create_a_bot_running_infrastructure proposal for bots
[14:50:00] <labs-nagios-wm_>	 PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 2.86, 11.94, 9.44
[14:51:37] <labs-home-wm_>	 04/22/2012 - 14:51:36 - Creating a home directory for petrb at /export/home/gareth/petrb
[14:52:36] <labs-home-wm_>	 04/22/2012 - 14:52:36 - Updating keys for petrb
[14:53:43] <labs-nagios-wm_>	 PROBLEM Current Load is now: CRITICAL on deployment-web4 i-00000214 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:54:10] <mutante>	 !log gareth - added petrb to project netadmin to fix nagios for the instance
[14:54:12] <labs-morebots>	 Logged the message, Master
[14:54:23] <labs-nagios-wm_>	 PROBLEM Current Users is now: CRITICAL on deployment-web4 i-00000214 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:55:13] <labs-nagios-wm_>	 PROBLEM Disk Space is now: CRITICAL on deployment-web4 i-00000214 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:55:43] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on deployment-web4 i-00000214 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:56:53] <labs-nagios-wm_>	 PROBLEM Total Processes is now: CRITICAL on deployment-web4 i-00000214 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:57:33] <labs-nagios-wm_>	 PROBLEM dpkg-check is now: CRITICAL on deployment-web4 i-00000214 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:04:53] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.29, 0.91, 3.82
[15:08:02] <petan|wk>	 ok whoever is owner of instance grail it should work now
[17:41:07] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on deployment-web2 i-00000125 output: OK: 20% free memory
[17:49:07] <labs-nagios-wm_>	 PROBLEM Free ram is now: WARNING on deployment-web2 i-00000125 output: Warning: 18% free memory
[18:09:17] <labs-nagios-wm_>	 RECOVERY Disk Space is now: OK on grail i-00000215 output: DISK OK
[18:09:17] <labs-nagios-wm_>	 RECOVERY dpkg-check is now: OK on grail i-00000215 output: All packages OK
[18:09:17] <labs-nagios-wm_>	 RECOVERY Total Processes is now: OK on grail i-00000215 output: PROCS OK: 87 processes
[18:09:27] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on grail i-00000215 output: OK: 92% free memory
[18:11:57] <labs-nagios-wm_>	 RECOVERY Total Processes is now: OK on deployment-web4 i-00000214 output: PROCS OK: 146 processes
[18:12:27] <labs-nagios-wm_>	 RECOVERY dpkg-check is now: OK on deployment-web4 i-00000214 output: All packages OK
[18:12:41] <Platonides>	 mutante, petan, what if I wanted it to be in a non-default security group?
[18:13:47] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on deployment-web4 i-00000214 output: OK - load average: 0.33, 0.28, 0.11
[18:14:27] <labs-nagios-wm_>	 RECOVERY Current Users is now: OK on deployment-web4 i-00000214 output: USERS OK - 0 users currently logged in
[18:15:37] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on deployment-web4 i-00000214 output: OK: 96% free memory
[18:15:47] <labs-nagios-wm_>	 RECOVERY Disk Space is now: OK on deployment-web4 i-00000214 output: DISK OK
[18:16:17] <labs-nagios-wm_>	 PROBLEM host: grail is DOWN address: i-00000215 check_ping: Invalid hostname/address - i-00000215
[19:01:15] <labs-nagios-wm_>	 PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours
[19:05:05] <labs-nagios-wm_>	 PROBLEM Puppet freshness is now: CRITICAL on nova-gsoc1 i-000001de output: Puppet has not run in last 20 hours
[19:28:55] <labs-nagios-wm_>	 PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 9.58, 11.89, 5.55
[19:33:55] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.20, 4.56, 4.11
[19:44:23] <labs-nagios-wm_>	 PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: Connection refused
[21:46:07] <labs-home-wm_>	 04/22/2012 - 21:46:06 - Creating a home directory for vvv at /export/home/openstack/vvv
[21:46:58] <SpComb>	 bind mounts!
[21:47:07] <labs-home-wm_>	 04/22/2012 - 21:47:07 - Updating keys for vvv
[21:48:25] <Ryan_Lane>	 SpComb: ?
[21:48:44] * SpComb  set up his NFS server today
[21:48:56] <Ryan_Lane>	 where?
[21:48:59] <Ryan_Lane>	 in Labs?
[21:49:01] <SpComb>	 nah
[21:49:05] <Ryan_Lane>	 ah. heh
[21:49:15] <Ryan_Lane>	 we're moving away from NFS everywhere possible
[21:49:22] <SpComb>	 that "create home directory in /exports/home/..." sounds like you have your home fs mounted directly in /exports
[21:49:25] <SpComb>	 oh :(
[21:50:08] <SpComb>	 first thing I ran into once I rebooted the server with an NFS-mounted /home was that all files became owned by nobody:nogroup until some random kernel idmapd timeout expired
[21:50:31] <Ryan_Lane>	 we're using automounts for home directories
[21:50:36] <Ryan_Lane>	 in a creative way ;)
[21:50:39] <SpComb>	 NFSv4 automounts?
[21:50:43] <Ryan_Lane>	 no
[21:50:44] <Ryan_Lane>	 3
[21:50:49] <SpComb>	 hm
[21:50:52] <Ryan_Lane>	 nfs4 kind of sucks
[21:50:58] <Ryan_Lane>	 http://ryandlane.com/blog/2011/11/01/sharing-home-directories-to-instances-within-a-project-using-puppet-ldap-autofs-and-nova/
[21:51:00] <SpComb>	 I was noting the same
[21:51:18] <Ryan_Lane>	 nfs4 is great, except that support for it still isn't amazing
[21:53:23] <SpComb>	 I just configured my /etc/exports and client mounts using Puppet, only have an O(1) number of exports/mounts here :)
[21:54:07] <SpComb>	 some fine `exec { '/etc/exports': command => 'cat /etc/exports.d/* > /etc/exports', notify => Service['nfs-server'] }` magics and it's all nice :)
[21:55:22] <SpComb>	 + refreshonly and a `define nfs::export`
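The pattern SpComb describes above might look roughly like the sketch below. It is not his actual manifest: only the `exec` command, `refreshonly`, the `nfs::export` name and `Service['nfs-server']` come from the chat. The parameters of `nfs::export`, the fragment file naming, the example path and client range, and the `provider => shell` setting are assumptions for illustration, and in a real setup the define would normally live in an `nfs` module.

```puppet
# Hypothetical sketch of the /etc/exports assembly pattern.
# Assumes the directory /etc/exports.d already exists.

# One fragment file per export; parameter names are illustrative.
define nfs::export ($path, $clients, $options = 'rw,sync') {
  file { "/etc/exports.d/${name}":
    ensure  => file,
    content => "${path} ${clients}(${options})\n",
    notify  => Exec['/etc/exports'],
  }
}

# Rebuild /etc/exports only when a fragment changes (refreshonly),
# then notify the NFS service so it picks up the new file.
exec { '/etc/exports':
  command     => 'cat /etc/exports.d/* > /etc/exports',
  provider    => shell,  # assumption: lets /bin/sh handle the glob and redirect
  refreshonly => true,
  notify      => Service['nfs-server'],
}

service { 'nfs-server':
  ensure => running,
}

# Example use; the path and client range are made up.
nfs::export { 'home':
  path    => '/srv/home',
  clients => '192.0.2.0/24',
}
```

With `refreshonly => true` the `cat` only runs when one of the fragment files notifies it, so an unchanged set of exports does not restart the NFS service on every Puppet run.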
[22:00:45] <SpComb>	 hmm, you sure seem to have a lot of configuration stored in LDAP
[22:02:19] <Ryan_Lane>	 yep
[22:02:28] <Ryan_Lane>	 I have per-project sudo as well
[22:02:58] <Ryan_Lane>	 you are using exported resources?
[22:03:29] <SpComb>	 no, heard way too much bad stuff about the SQL storage stuff
[22:03:51] <Ryan_Lane>	 yep
[22:04:02] <Platonides>	 Ryan_Lane, how should I make a security group for an instance
[22:04:06] <SpComb>	 but I really only have a handful of servers to manage, 3-6 or so
[22:04:12] <Ryan_Lane>	 Platonides: you need to make it before hand
[22:04:12] <Platonides>	 so that it allows me to log in
[22:04:13] <Platonides>	 ?
[22:04:22] <SpComb>	 in my case, Puppet is more about managing changes to the servers over their lifetime
[22:04:24] <Ryan_Lane>	 wait. what do you mean?
[22:04:27] <Ryan_Lane>	 Platonides: via ssh?
[22:04:45] <Ryan_Lane>	 that rule is set up by default
[22:04:45] <Platonides>	 yes, the public key was rejected
[22:04:45] <Ryan_Lane>	 which instance are you having issues with?
[22:04:55] <Ryan_Lane>	 SpComb: yeah. my use case wouldn't let me do that, though
[22:05:15] <Ryan_Lane>	 SpComb: every project needs another export, and projects can be created at will
[22:05:16] <Platonides>	 it started working after <mutante> !log gareth deleted non-default security group and instance, created new instance with same name, settings and default security group
[22:05:45] <Ryan_Lane>	 did someone create the instance without default selected?
[22:05:53] <Ryan_Lane>	 default should always be selected
[22:06:00] <SpComb>	 Ryan_Lane: yeah... I just have a single monolithic puppet repo that contains everything, using just parametrized classes
[22:06:04] <Ryan_Lane>	 unless the person *really* knows what they are doing
[22:06:14] <SpComb>	 data in LDAP is strictly separate from configuration in Puppet
[22:06:17] <Ryan_Lane>	 SpComb: yeah. it's a sane way of handling things if you can do it that way
[22:06:40] <SpComb>	 it's been pretty fun working on it so far :)
[22:06:41] <Ryan_Lane>	 Platonides: gimme a sec. I'll look at it for you
[22:07:04] <SpComb>	 being able to plop down replicating LDAP slaves on hosts at will in under a minute is awesome :P
[22:07:37] <labs-home-wm_>	 04/22/2012 - 22:07:37 - Creating a home directory for laner at /export/home/gareth/laner
[22:08:24] <Ryan_Lane>	 Platonides: there's no instances in gareth
[22:08:31] <Platonides>	 yep, I deleted it
[22:08:36] <labs-home-wm_>	 04/22/2012 - 22:08:36 - Updating keys for laner
[22:08:48] <Ryan_Lane>	 I can't debug something that isn't there...
[22:08:48] <Platonides>	 so I wanted to create it again with a non-default SG
[22:08:51] <Platonides>	 but working
[22:08:59] <Ryan_Lane>	 ah
[22:09:03] <Platonides>	 does it need some special rule?
[22:09:07] <Ryan_Lane>	 make a new security group
[22:09:21] <Ryan_Lane>	 ensure the new security group and the default one are selected
[22:09:39] <Ryan_Lane>	 if you remove default you'll disable ping and ssh (and nagios)
[22:09:57] <Platonides>	 I had a rule of 22 to 0.0.0.0/0
[22:10:05] <Platonides>	 we could reach the sshd
[22:10:11] <Platonides>	 the problem is, it rejected our keys
[22:10:47] <Ryan_Lane>	 well, create the instance and let me know if it has the problem, then I can see why
[22:11:50] <Ryan_Lane>	 hm. load is spiking on labs. I wonder why
[22:11:54] <labs-nagios-wm_>	 PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 3.96, 6.68, 3.66
[22:12:11] <Ryan_Lane>	 bots and deployment prep are spiking the most
[22:12:18] <Ryan_Lane>	 http://ganglia.wmflabs.org/latest/ <— per-project ganglia ;)
[22:12:29] <Ryan_Lane>	 I guess sara hasn't sent that out to the list yet
[22:12:53] <labs-nagios-wm_>	 PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 3.37, 12.70, 7.60
[22:13:27] <Platonides>	 I wonder if I had the instance in the default security group
[22:14:00] <Platonides>	 also, which classes should I add so it provides a basic LAMP setup?
[22:14:28] <Platonides>	 using webserver classes caused conflicts
[22:14:38] <Ryan_Lane>	 hm.
[22:14:43] <Ryan_Lane>	 it's not terribly easy right now
[22:14:53] <Ryan_Lane>	 our apache puppet config is so beyond screwed up
[22:15:08] <Ryan_Lane>	 we need puppet repo documentation
[22:15:32] <Ryan_Lane>	 lemme see
[22:16:53] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.30, 2.72, 2.77
[22:17:41] <Ryan_Lane>	 there's no all-in-one class for a full lamp stack
[22:17:59] <Ryan_Lane>	 webserver::php5 will install apache and php
[22:18:12] <Platonides>	 maybe that was the source of the conflict
[22:18:20] <Ryan_Lane>	 webserver::php5-mysql will give you mysql
[22:18:21] <Ryan_Lane>	 well
[22:18:24] <Ryan_Lane>	 php5-mysql
[22:18:33] <Ryan_Lane>	 again, our classes aren't great for this
[22:18:37] <Platonides>	 I had chosen both webserver::apache2 and webserver::php5
[22:18:44] <Ryan_Lane>	 ah. that'll do it
[22:18:44] <Platonides>	 since I wanted apache and php
[22:18:55] <Ryan_Lane>	 we have a new class that makes a lot more sense, but it's not easy to use
[22:19:10] <Platonides>	 if webserver::php5 already provides apache, that could be the source of the conflict
[22:19:12] <Ryan_Lane>	 and we are constantly making changes to it
[22:19:16] <Ryan_Lane>	 it is
[22:19:51] <Platonides>	 why wouldn't webserver::php5 just depend on webserver::apache2, so an extra webserver::apache2 didn't break it?
[22:20:09] <Platonides>	 maybe that's not even possible
[22:20:13] <Platonides>	 it's far from intuitive :P
[22:20:17] <Ryan_Lane>	 agreed
[22:20:24] <Ryan_Lane>	 which is the point of the saner new class :)
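Platonides' suggestion above can be sketched as follows. This is not the content of the real `webserver.pp`: the class bodies, the `libapache2-mod-php5` package name and the `node` block are assumptions for illustration. The idea is that the `apache2` package is declared in exactly one class, and any other class that needs it uses `include` on that class instead of declaring `Package['apache2']` a second time, which is what produces a "Duplicate definition" error.

```puppet
# Hypothetical layout, not the real webserver.pp.
class webserver::apache2 {
  # The only place where Package['apache2'] is declared.
  package { 'apache2':
    ensure => installed,
  }
}

class webserver::php5 {
  # include is idempotent: a node that also lists webserver::apache2
  # directly does not end up with a second package declaration.
  include webserver::apache2

  package { 'libapache2-mod-php5':  # assumed package name
    ensure  => installed,
    require => Package['apache2'],
  }
}

# Both classes can now be applied to the same node without a conflict.
# The node name is only an example.
node 'grail' {
  include webserver::apache2
  include webserver::php5
}
```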
[22:21:52] <Platonides>	 btw, are the -----BEGIN SSH HOST KEY FINGERPRINTS----- output by a script made by us?
[22:22:03] <Platonides>	 it'd be nice to also have the ECDSA fingerprint there
[22:22:12] <Ryan_Lane>	 where do you see this?
[22:22:19] <Platonides>	 in the console output
[22:22:27] <Ryan_Lane>	 oh
[22:22:37] <Ryan_Lane>	 no. that's what happens when the instance generates them
[22:22:46] <Platonides>	 the ECDSA fingerprint is above in the noise, so not very important
[22:22:50] <Platonides>	 just a nice-to-have
[22:22:54] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.36, 1.94, 4.08
[22:23:10] <Ryan_Lane>	 I was planning on pulling the fingerprint in when the instance builds, so that it would be available in the web interface
[22:23:30] <Platonides>	 well, grail is now in that broken mode :)
[22:23:37] <Ryan_Lane>	 grail?
[22:23:38] <Ryan_Lane>	 oh
[22:23:44] <labs-nagios-wm_>	 PROBLEM Current Load is now: CRITICAL on grail i-00000216 output: Connection refused by host
[22:23:55] <Ryan_Lane>	 gimme a sec
[22:24:03] <Platonides>	 not actually refused, since I can connect to the host sshd
[22:24:14] <Platonides>	 as confirmed by the fingerprint
[22:24:24] <labs-nagios-wm_>	 PROBLEM Current Users is now: CRITICAL on grail i-00000216 output: Connection refused by host
[22:25:04] <labs-nagios-wm_>	 PROBLEM Disk Space is now: CRITICAL on grail i-00000216 output: Connection refused by host
[22:25:28] <Ryan_Lane>	 the puppet run isn't finished yet
[22:25:39] <Ryan_Lane>	 you can't log in until puppet is totally finished running.
[22:25:43] <Ryan_Lane>	 give it a couple more mins
[22:25:44] <labs-nagios-wm_>	 PROBLEM Free ram is now: CRITICAL on grail i-00000216 output: Connection refused by host
[22:25:45] <Ryan_Lane>	 brb
[22:26:54] <labs-nagios-wm_>	 PROBLEM Total Processes is now: CRITICAL on grail i-00000216 output: Connection refused by host
[22:27:34] <labs-nagios-wm_>	 PROBLEM dpkg-check is now: CRITICAL on grail i-00000216 output: Connection refused by host
[22:28:52] <Platonides>	 are you sure?
[22:28:54] <Platonides>	 Finished puppet run
[22:29:39] <Platonides>	 hmm.. Sub-process /usr/bin/dpkg returned an error code (1)
[22:30:37] * Ryan_Lane  groans
[22:30:37] <Ryan_Lane>	 Duplicate definition: Sshkey[10.4.0.2] is already defined
[22:31:01] <Ryan_Lane>	 seems it did run after that, though
[22:31:39] <Platonides>	 that definition is not my fault :P
[22:31:55] <Ryan_Lane>	 wait
[22:32:06] <Ryan_Lane>	 did you build an oneiric instance?
[22:32:11] <Ryan_Lane>	 or lucid?
[22:32:31] <Platonides>	 oneiric
[22:32:41] <Platonides>	 is that broken?
[22:32:50] <Ryan_Lane>	 yes
[22:32:55] <Platonides>	 sigh
[22:33:07] <Ryan_Lane>	 why would you use anything other than an LTS anyway? :)
[22:33:18] <Platonides>	 it's newer? ;)
[22:33:34] <Ryan_Lane>	 oneiric wasn't broken, but the per-project ganglia stuff likely put the final nail in the coffin for that version
[22:33:36] <Platonides>	 you should remove those options
[22:33:53] <Platonides>	 ok, I'll try with lucid tomorrow
[22:33:56] <Platonides>	 thanks
[22:33:58] <Ryan_Lane>	 yw
[23:08:44] <labs-nagios-wm_>	 RECOVERY Current Load is now: OK on grail i-00000216 output: OK - load average: 0.23, 0.42, 0.34
[23:09:24] <labs-nagios-wm_>	 RECOVERY Current Users is now: OK on grail i-00000216 output: USERS OK - 0 users currently logged in
[23:10:04] <labs-nagios-wm_>	 RECOVERY Disk Space is now: OK on grail i-00000216 output: DISK OK
[23:10:44] <labs-nagios-wm_>	 RECOVERY Free ram is now: OK on grail i-00000216 output: OK: 91% free memory
[23:11:54] <labs-nagios-wm_>	 RECOVERY Total Processes is now: OK on grail i-00000216 output: PROCS OK: 69 processes
[23:27:10] <labs-nagios-wm_>	 PROBLEM Puppet freshness is now: CRITICAL on wikidata-dev-2 i-0000020a output: Puppet has not run in last 20 hours
[23:57:29] <hexmode>	 right channel
[23:57:57] <hexmode>	 Ryan_Lane: clicking "add group" under deployment-prep
[23:58:15] <hexmode>	 ok, adding just "bz"
[23:58:27] <hexmode>	 created
[23:58:44] <hexmode>	 back to puppet group list
[23:59:10] <hexmode>	 clicking "delete" for "bz" group I just created
[23:59:23] <hexmode>	 Ryan_Lane: "The action you have requested is limited to users in the group: Administrators."