[00:02:32] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [00:02:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:13:34] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [00:32:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:43:59] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [01:03:08] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [01:14:01] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [01:16:11] PROBLEM Puppet freshness is now: CRITICAL on maps-test2 i-00000253 output: Puppet has not run in last 20 hours [01:33:06] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [01:44:02] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [01:50:52] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 15% free memory [02:03:09] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:05:58] i'm having trouble accessing bastion.pmpta.wmflabs.org via ssh using approach described at https://labsconsole.wikimedia.org/wiki/Access#Using_ProxyCommand_ssh_option. this worked for me on saturday, but not now, and to the best of my knowledge nothing's changed in my ssh agent configuration [02:06:41] e.g. when i run 'root@foo:~# ssh bastion.pmtpa.wmflabs', i get: [02:06:44] If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [02:06:46] channel 0: open failed: administratively prohibited: open failed [02:06:47] ssh_exchange_identification: Connection closed by remote host [02:08:29] however, when i run 'ssh emw@bastion.wmflabs.org', i get in fine [02:08:44] any ideas what's wrong here? [02:10:00] can anyone else successfully ssh into bastion with the ProxyCommand approach outlined at that first link i sent? [02:10:49] PROBLEM Free ram is now: CRITICAL on bots-3 i-000000e5 output: Critical: 4% free memory [02:11:03] also, on a separate note, i've got an instance that i delete on over the weekend still listed under 'manage instances'. is there a bug on this? [02:11:41] that i deleted* [02:14:59] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [02:25:49] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 52% free memory [02:30:00] Emw: Do you happen to know whether or not you are a member of the bastion project? [02:30:26] i know that i had been as of saturday. i'll check again [02:30:46] If you were then you probably still are. [02:31:05] andrewbogott: yup, i'm (still) a member [02:31:09] Um... wait... [02:31:17] in your paste above it looks like you're trying to ssh in as root? [02:31:23] That doesn't sound good. [02:33:02] (full disclosure, I've never used the particular method documented in that howto) [02:33:09] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:34:21] that might be it. however, i've got this pair of lines at the bottom of /root/.ssh/config: [02:34:22] Host *.wmflabs [02:34:24] User emw [02:35:00] Oh, I see what you mean. 
[02:35:16] Um, ok, sorry, now I'm fully ignorant :( [02:36:11] Emw: I would try to do ssh -v and see if it's actually picking up the settings. [02:37:58] it looks like it is [02:38:08] debug1: Reading configuration data /root/.ssh/config [02:38:10] debug1: Applying options for *.pmtpa.wmflabs [02:38:11] debug1: Applying options for *.wmflabs [02:38:13] debug1: Reading configuration data /etc/ssh/ssh_config [02:38:14] debug1: Applying options for * [02:38:16] debug1: Executing proxy command: exec ssh -a -W bastion.pmtpa.wmflabs:22 bastion1.pmtpa.wmflabs [02:38:40] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [02:39:18] i'm kind of unsure about that command on the last line -- does the omission of a username (e.g. emw@bastion...) imply that i'm inadvertently trying to log in as root? [02:39:29] I don't know, but it seems suspicious. [02:39:53] Try copy/paste that command, insert username, see if it works? [02:40:55] no luck [02:41:23] 'ssh -a -W emw@bastion.pmtpa.wmflabs:22 bastion1.pmtpa.wmflabs' and 'ssh -a -W bastion.pmtpa.wmflabs:22 emw@bastion1.pmtpa.wmflabs' both fail [02:42:33] possible it's something wrong with labs, but no idea what it would be. Ryan_Lane might have an idea, or you might be better off waiting until more people are online. [02:43:34] (and, I have to run.) [02:43:34] any chance the sshd config was changed to prohibit tunneling? i see a 'PermitTunnel Yes' setting mentioned in http://linuxindetails.wordpress.com/2010/02/18/channel-3-open-failed-administratively-prohibited-open-failed/ in reply to a similar error message to the one i'm receiving [02:43:39] umm [02:43:56] alright, thanks for the pointers andrew [02:44:26] Ryan_Lane, context here: https://labsconsole.wikimedia.org/wiki/Access#Using_ProxyCommand_ssh_option [02:44:30] there's no host called bastion.pmtpa.wmflabs [02:44:39] it's bastion.wmflabs.org [02:44:53] the instance's name is bastion1.pmtpa.wmflabs [02:45:06] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [02:45:14] that really looks like your local ssh config is fucked up [02:45:30] * andrewbogott flees in shame [02:45:34] heh [02:45:46] there's a chain of ProxyCommand settings i'm using from andrewbogott's link [02:46:00] that's your exact config? [02:46:04] what's your ssh command? [02:46:51] root@ocelot:~# ssh bastion.pmtpa.wmflabs [02:46:51] Emw: ? [02:46:55] yeah [02:46:59] that's not going to work [02:47:03] that host doesn't exist [02:47:10] ssh bastion.wmflabs.org [02:47:33] it's bastion1.pmtpa.wmflabs [02:47:37] or bastion.wmflabs.org [02:48:16] and it makes way more sense to use the public host, otherwise you're connecting through the public host, and back to itself over the private IP [02:49:04] the proxy command stuff is there so you can connect to instances behind the bastion [02:49:26] ah, alright [02:50:53] using bastion1 works in this case. 
i had thought i'd been able to get into bastion via 'ssh bastion.pmtpa.wmflabs' before, but i might have been thinking of a random instance where i got in via 'bastion1.pmtpa.wmflabs' (which just worked) [02:51:12] please just use bastion.wmflabs.org [02:51:21] it makes absolutely no sense to do it the way you are :D [02:51:28] agreed [02:52:00] i trying to get into bastion because i'm unable to get to some instances of mine using the same ProxyCommand approach [02:52:04] was trying* [02:54:43] instances i-000002f8 and i-000002f7, which i set up, have been redlinked for over a day in my 'Instance List' page. i deleted instance i-000002f7 (name: pdbhandler-dev) yesterday morning when i realized i didn't want to use oneiric, but it's still listed as 'running' [02:56:29] i think set up i-000002f8 (name: pdbhandler-1), running lucid, soon after i deleted that oneiric instance. it's been pending since yesterday morning. i recall other instances being created fairly quickly [02:56:47] i think set up -> i then set up [03:03:44] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:07:53] 07/03/2012 - 03:07:52 - User laner may have been modified in LDAP or locally, updating key in project(s): deployment-prep [03:11:27] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:43] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [03:15:43] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [03:33:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:38:36] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [03:38:42] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 17% free memory [03:43:35] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [03:44:00] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [03:44:15] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
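A minimal sketch, assuming the setup discussed above, of what a ~/.ssh/config ProxyCommand stanza for labs can look like. Host names follow the conversation (bastion.wmflabs.org is the public jump host, *.pmtpa.wmflabs are the instances behind it); the user name, indentation and exact wording on the labsconsole Access page may differ.

    # Public bastion: reached directly, no proxy needed.
    Host bastion.wmflabs.org
        User emw

    # Instances behind the bastion: tunnel through the public bastion.
    # -W does stdio forwarding; %h and %p expand to the target host/port.
    Host *.pmtpa.wmflabs
        User emw
        ProxyCommand ssh -a -W %h:%p bastion.wmflabs.org

As Ryan_Lane points out above, the bastion itself is not behind anything, so it is reached as bastion.wmflabs.org (or by its instance name bastion1.pmtpa.wmflabs); the ProxyCommand is only useful for reaching instances behind it.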
[03:46:17] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [03:46:17] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 14% free memory [03:47:17] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 16% free memory [03:48:33] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 18% free memory [03:48:42] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [03:52:47] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory [03:58:54] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory [04:01:05] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [04:02:22] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 54% free memory [04:03:58] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:03:58] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:06:08] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 84% free memory [04:08:06] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:12:44] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 6% free memory [04:17:15] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [04:19:07] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [04:23:57] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [04:33:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:48:46] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [05:03:56] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 208 processes [05:04:06] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:18:46] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [05:34:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:48:46] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [06:05:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:18:47] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [06:30:29] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM Current Users is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM Disk Space is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM dpkg-check is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:31:04] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 8.63, 5.63, 2.75 [06:32:06] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 153 processes [06:36:53] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [06:36:53] RECOVERY Current Users is now: OK on psm-precise i-000002f2 output: USERS OK - 0 users currently logged in [06:36:53] RECOVERY Disk Space is now: OK on psm-precise i-000002f2 output: DISK OK [06:36:53] RECOVERY dpkg-check is now: OK on psm-precise i-000002f2 output: All packages OK [06:36:53] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:37:17] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [06:37:17] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:16] PROBLEM Free ram is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Disk Space is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Current Load is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Current Users is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Total Processes is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:40] PROBLEM dpkg-check is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:45] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:10] PROBLEM Current Load is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:11] PROBLEM Current Users is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:11] PROBLEM Disk Space is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:11] PROBLEM Total Processes is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:17] PROBLEM Free ram is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:17] PROBLEM dpkg-check is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:50] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:29] PROBLEM Disk Space is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:29] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:39] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [06:48:39] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [06:48:39] PROBLEM Total Processes is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:50:53] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [06:50:54] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 5.45, 7.33, 5.93 [06:50:54] RECOVERY Disk Space is now: OK on grail i-000002c6 output: DISK OK [06:50:54] RECOVERY Current Users is now: OK on grail i-000002c6 output: USERS OK - 0 users currently logged in [06:50:54] RECOVERY Free ram is now: OK on grail i-000002c6 output: OK: 84% free memory [06:50:54] RECOVERY Total Processes is now: OK on grail i-000002c6 output: PROCS OK: 107 processes [06:51:02] RECOVERY dpkg-check is now: OK on grail i-000002c6 output: All packages OK [06:51:02] RECOVERY Current Load is now: OK on grail i-000002c6 output: OK - load average: 0.42, 0.91, 1.33 [06:51:02] PROBLEM Current Load is now: WARNING on wikidata-dev-2 i-00000259 output: WARNING - load average: 4.94, 5.07, 6.51 [06:51:07] RECOVERY Current Users is now: OK on ve-nodejs i-00000245 output: USERS OK - 0 users currently logged in [06:51:07] RECOVERY Disk Space is now: OK on ve-nodejs i-00000245 output: DISK OK [06:51:07] RECOVERY Current Load is now: OK on ve-nodejs i-00000245 output: OK - load average: 0.42, 1.13, 1.94 [06:51:07] RECOVERY Total Processes is now: OK on ve-nodejs i-00000245 output: PROCS OK: 94 processes [06:51:16] RECOVERY Free ram is now: OK on ve-nodejs i-00000245 output: OK: 77% free memory [06:51:16] RECOVERY dpkg-check is now: OK on ve-nodejs i-00000245 output: All packages OK [06:52:23] PROBLEM Total Processes is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:57:56] RECOVERY Disk Space is now: OK on nova-precise1 i-00000236 output: DISK OK [06:57:56] RECOVERY Current Load is now: OK on nova-precise1 i-00000236 output: OK - load average: 1.12, 2.79, 3.16 [07:03:50] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [07:05:31] !log integration Rebooting intergration-apache1. CPU and load has been raising linear for the past 2 hours up to 100% just now. Cause unknown, instance was not in use for the last 24 hours. [07:05:32] Logged the message, Master [07:11:07] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:11:07] RECOVERY Current Load is now: OK on wikidata-dev-2 i-00000259 output: OK - load average: 3.12, 2.59, 3.99 [07:11:07] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:22] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [07:14:43] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:01] PROBLEM Current Load is now: CRITICAL on integration-apache1 i-000002eb output: CRITICAL - load average: 9.87, 24.82, 29.26 [07:21:21] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [07:21:29] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:38] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 159 processes [07:26:58] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.62, 5.47, 6.92 [07:26:58] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:58] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:58] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:58] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:59] PROBLEM Current Load is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:07] PROBLEM dpkg-check is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:07] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:17] grmbllblbl [07:27:36] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:36] PROBLEM Current Load is now: WARNING on deployment-apache31 i-000002d4 output: WARNING - load average: 7.20, 8.44, 8.68 [07:27:37] PROBLEM Current Load is now: WARNING on configtest-main i-000002dd output: WARNING - load average: 8.06, 8.40, 7.69 [07:27:48] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 6.18, 5.04, 5.20 [07:27:48] PROBLEM Current Load is now: WARNING on wikisource-web i-000000fe output: WARNING - load average: 5.31, 5.18, 5.26 [07:27:57] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:58] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM Disk Space is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:29] PROBLEM dpkg-check is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:28:56] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 4.44, 5.79, 6.77 [07:28:56] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 4.14, 6.56, 10.57 [07:28:56] PROBLEM Current Load is now: WARNING on grail i-000002c6 output: WARNING - load average: 5.27, 5.79, 5.54 [07:28:56] PROBLEM Current Load is now: WARNING on maps-tilemill1 i-00000294 output: WARNING - load average: 0.71, 4.41, 6.65 [07:28:56] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 7.90, 7.90, 7.00 [07:29:13] PROBLEM SSH is now: CRITICAL on ganglia-test2 i-00000250 output: CRITICAL - Socket timeout after 10 seconds [07:29:13] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:13] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:45] PROBLEM Current Load is now: WARNING on wep i-000000c2 output: WARNING - load average: 4.70, 5.65, 5.57 [07:29:45] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 68 MB (5% inode=57%): [07:29:45] PROBLEM Current Load is now: WARNING on resourceloader2-apache i-000001d7 output: WARNING - load average: 6.34, 6.51, 5.62 [07:30:06] wtf [07:30:17] hashar: Argh, what the hell [07:30:25] http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&c=integration&h=integration-apache1 [07:30:28] all CPU Wait [07:30:39] well the labs seems dead somehow [07:30:41] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 6.34, 6.71, 5.72 [07:30:41] There's 1 http request made per minute :P [07:30:52] yeah [07:30:58] http://ganglia.wmflabs.org/latest/ [07:31:10] there is 600 load spike at 7am every day [07:31:27] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 2.42, 4.63, 5.62 [07:31:27] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [07:31:27] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [07:31:27] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [07:31:27] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 32% free memory [07:31:28] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [07:31:28] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 2.73, 4.38, 5.01 [07:31:29] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [07:31:29] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK [07:31:30] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 88% free memory [07:31:30] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 84 processes [07:31:32] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 134 processes [07:31:37] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in [07:31:38] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [07:31:38] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK [07:31:38] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 95% free memory [07:31:38] RECOVERY 
dpkg-check is now: OK on configtest-main i-000002dd output: All packages OK [07:31:39] someone must be doing some heavy I/O on one of the instance [07:31:43] RECOVERY dpkg-check is now: OK on mobile-testing i-00000271 output: All packages OK [07:32:11] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 3.08, 5.55, 6.26 [07:33:04] RECOVERY Current Load is now: OK on wikisource-web i-000000fe output: OK - load average: 0.18, 2.37, 4.07 [07:33:04] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 3.06, 4.19, 4.84 [07:33:04] PROBLEM Current Load is now: WARNING on mwreview i-000002ae output: WARNING - load average: 5.67, 7.16, 7.75 [07:33:04] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 69% free memory [07:33:04] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [07:33:33] Krinkle: so I end up not doing anything on labs till 10am :-D [07:33:34] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.10, 1.70, 4.85 [07:33:35] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.38, 3.59, 4.99 [07:33:43] RECOVERY dpkg-check is now: OK on ganglia-test2 i-00000250 output: All packages OK [07:33:43] RECOVERY SSH is now: OK on ganglia-test2 i-00000250 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:34:13] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [07:34:41] RECOVERY Current Load is now: OK on wep i-000000c2 output: OK - load average: 0.35, 2.62, 4.32 [07:34:41] RECOVERY Current Load is now: OK on resourceloader2-apache i-000001d7 output: OK - load average: 1.09, 3.27, 4.48 [07:34:41] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:35:13] PROBLEM Current Load is now: WARNING on integration-apache1 i-000002eb output: WARNING - load average: 3.58, 5.67, 13.66 [07:35:33] RECOVERY Current Load is now: OK on pybal-precise i-00000289 output: OK - load average: 0.28, 2.90, 4.37 [07:35:55] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 0.72, 1.99, 4.49 [07:35:55] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 0.25, 2.32, 4.40 [07:35:55] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.06, 1.63, 3.63 [07:35:55] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [07:35:55] PROBLEM Current Load is now: WARNING on psm-precise i-000002f2 output: WARNING - load average: 3.24, 5.52, 6.79 [07:36:53] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 0.07, 2.11, 4.57 [07:36:53] PROBLEM Total Processes is now: WARNING on ganglia-test2 i-00000250 output: PROCS WARNING: 187 processes [07:37:28] RECOVERY Total Processes is now: OK on psm-precise i-000002f2 output: PROCS OK: 141 processes [07:37:43] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 18% free memory [07:38:33] RECOVERY Current Load is now: OK on hugglewiki i-000000aa output: OK - load average: 0.01, 0.95, 3.70 [07:40:22] back [07:40:46] I love my laptop, but sometime it become very hot for no reason :/ [07:41:32] Krinkle: http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [07:41:39] Krinkle: that is the production servers that support labs [07:41:45] load reached 4K [07:41:53] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:42:08] hashar: heh [07:42:23] RECOVERY Current Load is now: OK on deployment-apache31 i-000002d4 output: OK - load average: 0.02, 1.38, 4.58 [07:42:23] RECOVERY Current Load is now: OK on configtest-main i-000002dd output: OK - load average: 0.02, 1.32, 4.14 [07:42:23] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 0.04, 1.14, 4.22 [07:42:23] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [07:42:41] hashar: CPU stays low though [07:42:41] with lot of waiting I/O on the real hardware http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_wio&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [07:43:33] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.22, 1.02, 4.81 [07:43:33] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.36, 1.49, 3.84 [07:45:53] RECOVERY Current Load is now: OK on psm-precise i-000002f2 output: OK - load average: 0.07, 0.92, 3.71 [07:49:01] Krinkle: also I rebooted your instance Sunday [07:49:11] Krinkle: it was struck by the leap second kernel bug [07:49:23] yeah, I just noticed while logging. 
[07:51:53] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [07:52:41] rm: cannot remove `labs-config': Input/output error [07:52:42] yeahhhh [07:52:47] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.20, 0.52, 3.97 [07:53:49] I give up will do something else [07:55:06] RECOVERY Current Load is now: OK on integration-apache1 i-000002eb output: OK - load average: 0.15, 0.43, 3.97 [08:04:43] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf output: Warning: 19% free memory [08:12:19] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [08:22:14] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [08:38:44] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:33] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [08:49:18] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:27] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [08:54:08] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [08:57:26] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [09:12:48] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:14:37] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 67 MB (5% inode=57%): [09:22:30] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [09:23:00] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours [09:25:00] PROBLEM Puppet freshness is now: CRITICAL on gerrit i-000000ff output: Puppet has not run in last 20 hours [09:42:11] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:01] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:47:01] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 18% free memory [09:53:06] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [09:56:44] !log deployment-prep Asked reboot for deployment-transcoding through labsconsole [09:56:46] Logged the message, Master [10:00:17] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [10:04:46] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:09:36] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [10:13:05] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:18:26] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 16% free memory [10:23:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [10:30:49] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [10:43:19] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:52:11] Change abandoned: Jens Ohlig; "(no reason)" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/10567 [10:52:19] Change restored: Jens Ohlig; "(no reason)" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/10567 [10:53:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [10:54:23] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:13] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [11:13:22] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:16:23] PROBLEM Puppet freshness is now: CRITICAL on maps-test2 i-00000253 output: Puppet has not run in last 20 hours [11:23:14] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [11:43:23] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:53:13] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [12:10:29] <^demon> hashar: Good morning [12:10:42] ^demon: hello :) [12:12:19] <^demon> So I was looking at what Ryan suggested yesterday about role classes, but I was a little bit confused. Do you know of any examples I could follow that do what he suggested? [12:13:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:15:03] ^demon: sure let me find it [12:16:05] <^demon> k :) [12:17:40] ahh applicationserver::labs [12:17:55] ^demon: in manifests/site.pp [12:18:06] though that is probably not the most straightforward [12:20:14] <^demon> I can't find that class in site.pp [12:20:24] what is it? you are going to remove the "# TODO: rewrite this old mess."? or just any example for using role classes [12:20:28] there is a comment [12:20:33] the class is labs [12:20:38] a subclass of applicationserver [12:20:44] anyway that is not the best example [12:20:46] ^demon: https://gerrit.wikimedia.org/r/#/c/13304/ [12:21:04] ^demon: that one is easy. It creates the bits::labs under the role::cache class [12:21:08] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.36, 6.07, 5.36 [12:21:09] so end result is role::cache::bits::labs [12:21:30] then it uses the varnish::instance which is a parameterized class [12:21:41] I guess that is the best example [12:21:58] <^demon> Hmm. 
[12:22:02] I have that change deployed on my instance, works fine [12:22:31] on the diff https://gerrit.wikimedia.org/r/#/c/13304/2/manifests/role/cache.pp,unified line 522 is the call to the parameterized class [12:22:37] think about it like calling a php function [12:22:54] then you give it named parameters (think about it like passing a $options array of settings in mw) [12:23:12] to find out which parameters and their defaults, you obviously need to look at the class definition [12:23:18] order of parameters does not matter [12:23:28] so all that would let you define a new role [12:23:37] the most simple role class, without the parameterized classes, is probably role/pdf.pp , which is then included in site.pp on node "pdf1" [12:23:43] which you can nicely pack in something like role::gerrit::labs or something similar [12:24:32] the applicationserver classes define a common class from which various classes can inherits (I think the parent is application server::parent, think of it like a PHP abstract class you extend) [12:25:04] <^demon> Right, but I guess my question is: how would I update site.pp to use the new role class? [12:25:19] for labs, it is not needed [12:25:33] you add the new class (aka role::gerrit::labs ) in the labsconsole under 'Manage puppet group' [12:25:37] in prod you would include role::pdf in a "node" [12:25:44] then configure the instance and tick the [] role::gerrit::labs checkbox [12:25:59] that will instruct puppet on your instance to install that class on your instance [12:26:06] so basically site.pp is not needed for labs ;) [12:26:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [12:26:25] <^demon> I understand how to use labs :) [12:26:30] <^demon> mutante: Thanks, that's what I needed [12:26:47] I am just making sure I know that you know about what we know ! [12:28:00] ^demon: grep role site.pp when being on the prod branch, also gives a nice overview besides the example [12:28:18] <^demon> Yeah I have site.pp open, but it's so wildly inconsistent :) [12:28:51] well, the ones that include role::something are good, and the ones that include a lot of classes directly should likely be moved to role classes [12:28:59] to keep site.pp shorter [12:31:38] hashar: so you have puppetmaster::self on your instance now? i gotta start using it now [12:33:24] I got one on the deployment-cache-bits instance [12:33:33] I followed some tutorial Faidon wrote [12:33:36] ^demon: we should probably block any commits to the test branch in gerrit. easy for you? [12:33:50] reported a few minor glitches et voila :) [12:33:59] <^demon> mutante: Doable. [12:34:04] mutante: I think test is readonly already [12:34:18] mutante: doc https://labsconsole.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [12:34:22] ok, good, then we just have 5 open gerrit requests [12:34:29] it is really straightforward [12:34:34] thanks [12:34:48] https://gerrit.wikimedia.org/r/#/q/branch:test,n,z [12:34:53] <^demon> Man, permissions on operations/puppet are a mess :\ [12:35:49] doh there are two 'test' reference [12:36:19] <^demon> Well, refs/heads/test and refs/for/refs/heads/test [12:36:28] I guess refs/for/refs/heads/test is to only allow wmf people to submit changes in gerrit [12:39:18] <^demon> Yeah, but they also had push rights on refs/heads/test, allowing you to skip gerrit. 
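To make the role-class pattern hashar describes above concrete, here is a hypothetical Puppet sketch. Class names, parameters and values are illustrative only, not the actual operations/puppet code; the real examples referenced in the conversation are role::cache::bits::labs (which wraps the parameterized varnish::instance class) and role/pdf.pp.

    # A labs role class wrapping a parameterized class. Calling a
    # parameterized class is like calling a PHP function with named
    # arguments: order does not matter and defaults come from the
    # class definition. All names below are made up for illustration.
    class role::gerrit::labs {
        class { 'gerritsite::instance':
            host    => 'gerrit-dev.wmflabs.org',
            use_ssl => false,
        }
    }

    # In production the role is included on a node in site.pp, e.g.:
    #   node 'pdf1.wikimedia.org' { include role::pdf }
    # On labs, site.pp is not needed: the class is added under
    # 'Manage puppet groups' in labsconsole and ticked per instance.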
[12:40:27] <^demon> mutante: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=978720d6ca4a133b445acf8446739ecaf254276a should take care of it. [12:41:36] <^demon> ops can still do pushes if they really need to (via the refs/* permissions), but this should stop anyone else from submitting for review to the branch. [12:41:43] <^demon> (And in practice, most ops play in production :p) [12:43:44] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:50:03] New review: Dzahn; "please consider abandoning / resubmitting / saving this, since the test branch does not exist anymore" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/10567 [12:53:43] RECOVERY Free ram is now: OK on ganglia-test2 i-00000250 output: OK: 86% free memory [12:55:14] ^demon: thanks, eh, does this need review? when i follow the review link from gitweb i end up in gerrit search but no results [12:55:22] <^demon> Nope :p [12:55:31] <^demon> It's done from the gerrit ui, in a special branch called refs/meta/config [12:55:39] aha [12:55:51] yet you can have gitweb links, gotcha [12:56:13] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [13:07:06] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 151 processes [13:10:50] <^demon> Man, Ryan was right. Gerrit manifest is a mess :\ [13:13:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:16:31] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:48] <^demon> mutante: Work in progress https://gerrit.wikimedia.org/r/#/c/13484/ :) [13:21:17] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.89, 7.69, 7.48 [13:26:37] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [13:28:47] New review: Dzahn; "please consider resubmitting to other branch or abandoning since the test branch does not exist any..." [operations/puppet] (test); V: 0 C: -2; - https://gerrit.wikimedia.org/r/12164 [13:29:14] New review: Dzahn; "please consider resubmitting to other branch or abandoning since the test branch does not exist any..." [operations/puppet] (test); V: 0 C: -2; - https://gerrit.wikimedia.org/r/6541 [13:30:22] New review: Dzahn; "this is still in test but the branch is gone meanwhile." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9392 [13:31:44] New review: Dzahn; "please consider resubmitting to another branch since the test branch is gone meanwhile we can't merg..." 
[operations/puppet] (test); V: 0 C: -2; - https://gerrit.wikimedia.org/r/6109 [13:43:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:56:17] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 19% free memory [13:56:38] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [14:04:07] PROBLEM Total Processes is now: WARNING on psm-precise i-000002f2 output: PROCS WARNING: 151 processes [14:17:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:26:40] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [14:48:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:56:40] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [15:12:57] <^demon> hashar: The guys at openstack built a glue layer to make the jenkins <-> gerrit tie ins easier. Don't know how useful it might be, but thought I'd pass it along: http://ci.openstack.org/zuul/ [15:13:14] oh my god [15:13:25] just when I almost finished my own glue :/ [15:13:40] ^demon: thanks for sharing :) [15:13:51] <^demon> You're welcome :) [15:14:06] <^demon> Again, it might not work for us at all, but just thought it interesting since we're developing the same wheel. [15:16:04] I am writing some ant targets [15:16:10] to easily clone from gerrit project [15:16:25] will put everything in a shared dir on gallium then clone from it to the jobs workspace [15:16:33] as easy as: :-] [15:16:45] <^demon> Granted, they do a couple of things differently. For one, they have jenkins handle the SUBMIT after it's CR+2/V+1 [15:17:58] I would want us to do that [15:17:59] <^demon> Ok, I'm going to grab an early lunch. Back later. [15:18:13] bon appétit ! [15:18:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:26:41] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [15:48:42] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:56:49] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:06:40] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 21% free memory [16:18:49] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:26:59] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:40:12] andrewbogott: so, my tests are broken for the bgp stuff [16:40:19] well, two of the four tests, anyway [16:40:39] I'm pretty sure I'm not setting the function properly [16:42:46] Ryan_Lane: link? [16:43:07] andrewbogott: https://review.openstack.org/#/c/9255/ [16:44:15] ooh openstack's gerrit looks nice [16:44:48] yeah, good skinning [16:45:47] ok, I'm looking... [16:46:38] Ryan_Lane, manager test is working for you, right? 
[16:49:10] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:50:09] andrewbogott: yeah [16:50:23] this is one of the failures I'm getting: https://jenkins.openstack.org/job/gate-nova-python26/1442/testReport/nova.tests.network.test_linux_net/LinuxNetworkTestCase/test_get_bgp_routes/ [16:51:12] I'm trying to make db.floating_ip_get_all_by_host return a specific set of IPs [16:51:30] In both tests you're setting the stub after the function is invoked. So switching those two lines around gets me further... [16:51:39] hahaha. right [16:51:43] Then the assert is between two very long strings that differ, rather than one long string and an empty one. [16:51:49] Are you set up to run the tests locally? [16:51:52] yeah [16:51:55] lemme fix that. thanks :) [16:52:55] not terribly sure how i missed that :) [16:53:01] One thing I always use when debugging tests: When a test fails, the test output displays stdout. So debug lines are easy to see. [16:53:23] *shrug* Stubbing still feels somewhat dark and mysterious to me. [16:53:45] I mostly get it. it's just seeding your tests by replacing functions [16:53:50] to ensure they return what you want [16:54:13] that said, it took me nearly the same amount of time to write the test than it did the code :D [16:54:54] Yeah, I guess I understand how it works, it's just that mucking around with __dict__ makes me uneasy. [16:55:03] * Ryan_Lane nods [16:57:18] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:59:45] oh [16:59:51] I'm not setting any of the options either [17:00:26] wait. no. yes I am [17:00:32] I must be doing that incorrectly [17:07:26] PROBLEM Total Processes is now: WARNING on orgcharts-dev i-0000018f output: PROCS WARNING: 453 processes [17:11:11] 07/03/2012 - 17:11:11 - Updating keys for kareneddy at /export/keys/kareneddy [17:18:41] petan: hey, you around? [17:18:52] petan: I'd like to introduce you to someone, if so [17:19:11] yeah [17:19:16] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:19:24] but this irc client isn't really working well atm [17:19:51] my server is being ddosed so quassel core doesn't work [17:20:11] :( [17:20:17] lame [17:20:29] Application level? [17:20:36] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [17:20:54] damianz logging in to ssh takes about 1 hour, I don't know yet [17:21:04] Ah, nice [17:21:04] I guess load is near to 1000 [17:21:07] :D [17:21:17] <^demon> Have you tried turning it off and on again? [17:26:54] ^demon I am on vacation far away from town [17:27:11] if it won't boot up back I don't want to go back and reboot it by hand :D [17:27:20] <^demon> Aw :\ [17:27:56] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:29:03] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 65 MB (5% inode=57%): [17:29:11] Ryan_Lane, are you unstumped or restumped? [17:29:26] I switched tasks for a little bit [17:30:13] andrewbogott: but, I'm setting the flags like elsewhere in the file [17:33:07] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [17:33:40] maybe I need to set the flags on the driver directly? [17:35:16] setting the flags locally should work [17:35:29] maybe I should be stubbing self.driver.db [17:35:47] wait. 
no [17:35:53] because that's returning properly [17:36:00] Yeah, the stubbing is clearly doing something, so that part seems right. [17:37:57] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [17:43:07] Ryan_Lane: So, I may not understand how this is meant to work. It should be responding that the current ip (FLAGS.my_ip) is the next hop for any given floating ip? [17:43:17] yep [17:43:50] it'll tell the router that the floating IP's route points to the server its hosted on [17:43:55] Don't you need to specify my_ip, then? Otherwise the test results will vary depending on what host you run it on? [17:43:59] yes [17:44:05] oh. maybe that's the issue [17:44:25] Well, or else you need to format FLAGS.my_ip into the expected result. [17:47:18] Damn, now the strings look the same to me but are still comparing as !=. Might have to parse these results in order to avoid whitespace diffs :( [17:49:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:52:45] Ryan_Lane: This is fragile, but makes the first test pass: http://pastebin.com/xjMvNW97 [17:53:20] * Ryan_Lane nods [17:55:17] Probably the most reasonable thing is to use something like assertNotEquals(actual.find("blah"),-1) rather than one big string comparison. [17:57:54] Change on 12mediawiki a page OAuth/status was modified, changed by RobLa-WMF link https://www.mediawiki.org/w/index.php?diff=558224 edit summary: new status update [17:57:57] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:58:27] andrewbogott: well, I'm trying to ensure that the entire string is correct [17:58:33] because otherwise exabgp won't start [17:58:56] Of, if exabgp cares about formatting and everything then maybe a total comparison is right. [17:59:13] I was briefly confused by the trailing \n but otherwise it works. [17:59:17] cool [17:59:38] I think I'm going to set: self.flags(my_ip='192.168.1.3') [18:00:09] should work. [18:20:07] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:38:43] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [18:38:56] wm [18:39:12] Are proxy servers legal in the UK [18:40:00] is there a way to tell ssh "don't use key" [18:40:35] Anyway [18:40:43] I need an unblock [18:40:52] Please unblock my account [18:40:53] maybe ssh -o PreferredAuthentications=keyboard-interactive,password ? [18:41:17] Please unblock my account [18:41:28] Guest57916: which account? Which wiki? [18:41:42] Don't pretend you don't know [18:42:16] Every-one knows me [18:42:26] ;--) [18:42:32] so [18:42:36] Do u know me [18:42:41] guest57916 not really [18:42:54] hashar pm please [18:43:36] I can't pm on this client sorry :/ [18:43:46] <^demon> Ohnos. How will you unblock him then? [18:44:09] Guest57916 try #wikimedia-ops [18:44:23] this is not a right channel anyway [18:44:28] <^demon> petan2: Nooooo :\ [18:44:32] <^demon> None of these channels are right [18:44:50] !log deployment-prep Successfully deleted instance deployment-deb, but failed to remove deployment-deb DNS entry. ohno [18:44:50] wikimedia-ops is place where people can complain about irc [18:44:53] Logged the message, Master [18:44:56] and bans [18:45:04] so I think it's kind of correct :o [18:45:19] <^demon> Yeah, if you were banned at the server level for abuse or something. [18:45:26] nah [18:45:29] serious stuff now. 
Need to find out why http://deployment.wikimedia.beta.wmflabs.org/ 404 :-D [18:45:29] <^demon> If you were banned on-wiki, I don't know any of the tech channels that would be useful. [18:45:34] it's -ops it's not -operations [18:45:42] -ops is for irc ops [18:45:44] :) [18:45:47] <^demon> Oh, heh [18:45:53] <^demon> Missed that ;-) [18:46:00] <^demon> Too many channels. [18:46:05] heh [18:46:25] we took a bit of freenode :P [18:46:48] that's why I thought donating some hardware resources to freenode could be nice [18:47:05] given that wikimedia has 50 times more servers than freenode [18:47:36] <^demon> yeah, but we're also a non-profit, so it's not necessarily as easy as "here's a server, enjoy" [18:47:44] mutante heya did someone check that RT ticket? I dodn't get any response [18:48:03] hm, I thought we could spare 1 :P [18:48:15] or few decommisioned :D [18:48:31] if no one wants them I will take them :D [18:48:51] <^demon> 501(c)3 means we have to be careful about that sort of thing :) [18:49:04] what is that [18:49:23] <^demon> The section of the IRS code that denotes non-profits. [18:49:28] Ryan_Lane if you were going to throw some old servers to dustbin tell me XD [18:49:29] <^demon> Specifically the part that we fall under. [18:49:35] <^demon> Every couple of years when we have decom'd servers, RobH usually finds other non-profits who want/need them to donate them to. [18:49:42] petan2: heh [18:49:50] <^demon> We can't just give them away to random people on the street :) [18:49:51] robh is the guy to talk to about that [18:49:57] and we announce it on the blog [18:50:00] Change on 12mediawiki a page OAuth/status was modified, changed by RobLa-WMF link https://www.mediawiki.org/w/index.php?diff=558267 edit summary: new status update [18:50:05] we only give to non-profits [18:50:07] aha, that's what I meant with freenode [18:50:15] that is non profit too [18:50:22] petan is non profit guy as well [18:50:34] if it counts of course [18:52:53] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:05:49] !log deployment-prep no machine send their syslog anywhere, some change got lost during the merge. See https://gerrit.wikimedia.org/r/14090 [19:05:51] Logged the message, Master [19:10:19] !log deployment killed -apache30 network access by running tcpdump ! [19:10:19] deployment is not a valid project. [19:10:40] Ryan_Lane: I managed to kill an instance network access by just running tcpdump :-D [19:10:48] eh? [19:10:51] I don't see how [19:10:55] neither do it [19:10:57] do i [19:11:10] deployment-apache30 if you want to investigate, if not I will just reboot it :-) [19:11:19] hashar@deployment-apache30:~$ ./syslog [19:11:19] [sudo] password for hashar: [19:11:34] then nothing but " (113) No route to host" errors returned from squid ;-) [19:12:24] PROBLEM host: deployment-apache30 is DOWN address: i-000002d3 CRITICAL - Host Unreachable (i-000002d3) [19:13:52] Ryan_Lane: Would you expect me to be able to launch a new instance? Or is that known to be broken at the moment? [19:14:43] andrewbogott: that should definitely work [19:14:49] what failed? [19:14:57] did it say "failed to launch instance"? [19:15:03] and in which project? [19:15:17] !log deployment-prep touched wmf-config/mwblocker.log [19:15:19] Logged the message, Master [19:15:33] in project mwreview, instances 'novapluginreview' and 'pluginreview'. [19:15:37] Ryan_Lane: are you interested in investigate the tcpdump or should I just reboot the instance? 
[19:15:43] reboot
[19:15:50] Neither built, I can't get logs from either, also have tried repeatedly to delete pluginreview and failed.
[19:15:53] hm
[19:15:58] which host did they launch on
[19:15:59] ?
[19:16:25] Neither has a doc page, so not sure.
[19:16:32] oh
[19:16:36] what state are they in?
[19:16:37] pending?
[19:16:39] RECOVERY host: deployment-apache30 is UP address: i-000002d3 PING OK - Packet loss = 0%, RTA = 0.61 ms
[19:16:50] doh
[19:16:55] it is back on
[19:16:58] novapluginreview is 'pending'. pluginreview was 'pending' until I deleted it, at which point it switched its state to 'running'
[19:17:08] o.O
[19:17:32] I haven't looked on virt1 yet.
[19:17:46] I had a similar issue from time to time, just asked to delete several times and it eventually got deleted
[19:18:03] 'several' = > 3 times?
[19:18:08] not sure
[19:18:16] just do it when I look at the instance list
[19:18:38] you should check if the instance failed or not before
[19:18:42] !console
[19:18:42] in case you want to see what is happening on the terminal of your vm, check console output
[19:18:48] oh Successfully deleted instance, but failed to remove deployment-deb DNS entry.
[19:18:49] different issue this time :)
[19:19:05] No, hashar, that's what it's telling me too.
[19:19:05] probably already gone
[19:19:15] But then the instance remains in the list, state 'running'.
[19:19:21] might just be the entry still in LDAP or something
[19:19:33] andrewbogott maybe ?action=purge :P
[19:19:35] grr
[19:19:37] /usr/local/apache/common/docroot/labs
[19:19:38] heh
[19:19:46] Hydriz broke everything
[19:19:50] what is that
[19:19:56] or i did -)
[19:19:57] hydriz has access to the project?
[19:20:00] since when
[19:20:02] :O
[19:20:12] I gave him access
[19:20:15] ah
[19:20:18] monday morning
[19:20:21] hmm
[19:20:33] I am not sure what he ran on the cluster but he definitely broke a couple things :-D
[19:20:41] he actually asked me several times before, but i wanted to set up some sql backup system first
[19:20:41] need to contact him to find out what he did hehe
[19:20:51] so that we can recover in case something gets fucked up
[19:21:25] we should be careful when giving out access to beta since it's still fragile :P
[19:21:36] hydriz is known for breaking stuff XD
[19:22:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[19:23:08] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours
[19:23:31] hashar I'm thinking of replacing the motd with a huge: PLEASE USE !log AFTER EVERY CHANGE YOU MAKE
[19:23:54] I removed his access btw
[19:24:18] or we could just change the configuration of bash to execute a log command after every line :D
[19:24:26] that would be fun
[19:25:08] PROBLEM Puppet freshness is now: CRITICAL on gerrit i-000000ff output: Puppet has not run in last 20 hours
[19:25:17] I don't really have troubles with hydriz or anyone else, but we really need some way to recover the system
[19:25:19] first
[19:25:40] before we start giving out shell en masse
[19:26:14] last backup of sql is 2 months old
[19:26:26] it wouldn't be fun if someone deleted the db's by accident
[19:30:12] AHHHH
[19:30:13] http://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login&returnto=Main+Page&foo !!!
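petan's half-joke at 19:24 about making bash log every command is easy enough to sketch, though nothing like it is actually deployed on labs, the syslog tag below is made up, and it would only land in the instance's local syslog, not in the project's !log:

    # Illustrative only: append to /etc/bash.bashrc to copy each interactive
    # command line into syslog via logger(1) before every prompt.
    export PROMPT_COMMAND='history -a; logger -t labs-shell "$(history 1)"'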
[19:30:20] fixed the floobar
[19:30:31] can't believe it took me one hour to figure it out
[19:30:48] DamianZ is there a way to tell ssh to give me a few more minutes on the login screen before it times out
[19:31:14] because I am still not able to ssh to my p00r box
[19:31:45] debug3: Wrote 64 bytes for a total of 1127Connection closed by
[19:32:18] it's trying to transfer my key but it never makes it because the response is so slow...
[19:42:59] !log deployment-prep {{bug|38118}} fixed http://deployment.wikimedia.beta.wmflabs.org/ (was missing the docroot, submitted to gerrit https://gerrit.wikimedia.org/r/14099)
[19:43:01] Logged the message, Master
[19:43:12] andrewbogott: http://deployment.wikimedia.beta.wmflabs.org/ is back again :D
[19:43:24] hashar:
[19:43:35] Cool, then I can stop worrying about my inability to launch new instances for a bit :/
[19:43:40] that wiki docroot disappeared on monday :/
[19:44:20] andrewbogott: definitely sounds like something broken with nova :/
[19:44:29] yep!
[19:44:41] and I can't help on that topic sorry :-( no access there hehe
[19:47:52] hashar yay
[19:47:55] nice
[19:50:53] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 19% free memory
[19:53:03] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[19:53:09] petan2: the lack of syslog does not help for sure
[19:53:13] I will try to poke at that next week
[19:53:36] we have a conflict between base::remote-syslog, which installs rsyslog on all machines, and syslog-ng, which is used as the syslog package
[19:54:27] I still have to find a proper way to handle the .dblist files too :/
[20:00:09] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours
[20:06:09] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 20% free memory
[20:09:13] I am off++
[20:18:53] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output
[20:23:23] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[20:24:13] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 19% free memory
[20:49:04] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds.
[20:53:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[20:53:55] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output
[21:13:47] andrewbogott: so, still having issues?
[21:17:10] PROBLEM Puppet freshness is now: CRITICAL on maps-test2 i-00000253 output: Puppet has not run in last 20 hours
[21:23:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[21:24:20] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds.
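Back on hashar's question at 19:30 about giving ssh a few more minutes: as far as I know there is no client-side option that stretches the authentication phase itself; ConnectTimeout only covers the initial TCP connection, and the timer that usually drops a slow key exchange is LoginGraceTime on the server, which defaults to 2 minutes. So if the box is merely crawling rather than dropping packets, the lever is on the server side, roughly (the instance name is a placeholder):

    # client side (~/.ssh/config or -o): only affects the TCP connect phase
    ssh -o ConnectTimeout=60 <instance>
    # server side (/etc/ssh/sshd_config): time allowed to finish authenticating
    LoginGraceTime 600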
[21:29:10] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output
[21:41:20] RECOVERY Total Processes is now: OK on psm-precise i-000002f2 output: PROCS OK: 143 processes
[21:41:50] RECOVERY Disk Space is now: OK on patchtest i-000000f1 output: DISK OK
[21:42:21] andrewbogott: seems that rabbit had issues, which were likely due to the reboot from yesterday
[21:42:33] we didn't reboot after the dns changes
[21:42:40] and we have another IP address on eth0
[21:42:47] it's likely that rabbit was confused
[21:43:03] so, I set some settings explicitly so that it always gets the correct information no matter what now
[21:43:32] I restarted all of the nova services after fixing it
[21:43:36] now the deletes worked
[21:44:10] RECOVERY HTTP is now: OK on blamemaps-s1 i-000002c3 output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.142 second response time
[21:44:10] RECOVERY HTTP is now: OK on demo-deployment1 i-00000276 output: HTTP OK: HTTP/1.1 200 OK - 911 bytes in 0.497 second response time
[21:49:50] PROBLEM Disk Space is now: CRITICAL on patchtest i-000000f1 output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:20] PROBLEM HTTP is now: CRITICAL on blamemaps-s1 i-000002c3 output: CRITICAL - Socket timeout after 10 seconds
[21:52:20] PROBLEM HTTP is now: CRITICAL on demo-deployment1 i-00000276 output: CRITICAL - Socket timeout after 10 seconds
[21:53:31] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[22:03:10] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 21% free memory
[22:10:17] Ryan_Lane: things seem better now. Did you fix something?
[22:10:26] mentioned above :)
[22:10:32] oh, maybe you missed it
[22:10:41] rabbit was having issues
[22:10:46] likely due to the service IP for DNS
[22:11:03] so, I put in config to make rabbit use specific IPs/hostnames
[22:11:14] then cleared rabbit's queues and restarted all nova services
[22:11:25] that cleared it up
[22:14:15] PROBLEM Puppet freshness is now: CRITICAL on bots-sql2 i-000000af output: Puppet has not run in last 20 hours
[22:14:25] PROBLEM Free ram is now: UNKNOWN on pluginreview i-000002fc output: NRPE: Unable to read output
[22:19:38] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[22:53:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[23:07:34] PROBLEM host: pluginreview is DOWN address: i-000002fc check_ping: Invalid hostname/address - i-000002fc
[23:24:28] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[23:54:51] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
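For the record, the log doesn't say exactly which settings were pinned when Ryan "put in config to make rabbit use specific IPs/hostnames". In the Nova of that period the services located the queue via the rabbit_host/rabbit_port flags in nova.conf, and RabbitMQ itself can be told which address to bind with RABBITMQ_NODE_IP_ADDRESS in rabbitmq-env.conf, so the fix plausibly looked something like the sketch below (addresses are illustrative, not the real controller's):

    # /etc/nova/nova.conf -- point every nova service at an explicit rabbit endpoint
    rabbit_host=192.0.2.10
    rabbit_port=5672

    # /etc/rabbitmq/rabbitmq-env.conf -- bind rabbit to one fixed address
    RABBITMQ_NODE_IP_ADDRESS=192.0.2.10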