[00:02:32] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [00:02:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:13:34] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [00:32:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:43:59] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [01:03:08] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [01:14:01] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [01:16:11] PROBLEM Puppet freshness is now: CRITICAL on maps-test2 i-00000253 output: Puppet has not run in last 20 hours [01:33:06] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [01:44:02] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [01:50:52] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 15% free memory [02:03:09] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:05:58] i'm having trouble accessing bastion.pmpta.wmflabs.org via ssh using approach described at https://labsconsole.wikimedia.org/wiki/Access#Using_ProxyCommand_ssh_option. this worked for me on saturday, but not now, and to the best of my knowledge nothing's changed in my ssh agent configuration [02:06:41] e.g. when i run 'root@foo:~# ssh bastion.pmtpa.wmflabs', i get: [02:06:44] If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [02:06:46] channel 0: open failed: administratively prohibited: open failed [02:06:47] ssh_exchange_identification: Connection closed by remote host [02:08:29] however, when i run 'ssh emw@bastion.wmflabs.org', i get in fine [02:08:44] any ideas what's wrong here? [02:10:00] can anyone else successfully ssh into bastion with the ProxyCommand approach outlined at that first link i sent? [02:10:49] PROBLEM Free ram is now: CRITICAL on bots-3 i-000000e5 output: Critical: 4% free memory [02:11:03] also, on a separate note, i've got an instance that i delete on over the weekend still listed under 'manage instances'. is there a bug on this? [02:11:41] that i deleted* [02:14:59] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [02:25:49] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 52% free memory [02:30:00] Emw: Do you happen to know whether or not you are a member of the bastion project? [02:30:26] i know that i had been as of saturday. i'll check again [02:30:46] If you were then you probably still are. [02:31:05] andrewbogott: yup, i'm (still) a member [02:31:09] Um... wait... [02:31:17] in your paste above it looks like you're trying to ssh in as root? [02:31:23] That doesn't sound good. [02:33:02] (full disclosure, I've never used the particular method documented in that howto) [02:33:09] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:34:21] that might be it. however, i've got this pair of lines at the bottom of /root/.ssh/config: [02:34:22] Host *.wmflabs [02:34:24] User emw [02:35:00] Oh, I see what you mean. 
[02:35:16] Um, ok, sorry, now I'm fully ignorant :( [02:36:11] Emw: I would try to do ssh -v and see if it's actually picking up the settings. [02:37:58] it looks like it is [02:38:08] debug1: Reading configuration data /root/.ssh/config [02:38:10] debug1: Applying options for *.pmtpa.wmflabs [02:38:11] debug1: Applying options for *.wmflabs [02:38:13] debug1: Reading configuration data /etc/ssh/ssh_config [02:38:14] debug1: Applying options for * [02:38:16] debug1: Executing proxy command: exec ssh -a -W bastion.pmtpa.wmflabs:22 bastion1.pmtpa.wmflabs [02:38:40] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [02:39:18] i'm kind of unsure about that command on the last line -- does the omission of a username (e.g. emw@bastion...) imply that i'm inadvertently trying to log in as root? [02:39:29] I don't know, but it seems suspicious. [02:39:53] Try copy/paste that command, insert username, see if it works? [02:40:55] no luck [02:41:23] 'ssh -a -W emw@bastion.pmtpa.wmflabs:22 bastion1.pmtpa.wmflabs' and 'ssh -a -W bastion.pmtpa.wmflabs:22 emw@bastion1.pmtpa.wmflabs' both fail [02:42:33] possible it's something wrong with labs, but no idea what it would be. Ryan_Lane might have an idea, or you might be better off waiting until more people are online. [02:43:34] (and, I have to run.) [02:43:34] any chance the sshd config was changed to prohibit tunneling? i see a 'PermitTunnel Yes' setting mentioned in http://linuxindetails.wordpress.com/2010/02/18/channel-3-open-failed-administratively-prohibited-open-failed/ in reply to a similar error message to the one i'm receiving [02:43:39] umm [02:43:56] alright, thanks for the pointers andrew [02:44:26] Ryan_Lane, context here: https://labsconsole.wikimedia.org/wiki/Access#Using_ProxyCommand_ssh_option [02:44:30] there's no host called bastion.pmtpa.wmflabs [02:44:39] it's bastion.wmflabs.org [02:44:53] the instance's name is bastion1.pmtpa.wmflabs [02:45:06] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [02:45:14] that really looks like your local ssh config is fucked up [02:45:30] * andrewbogott flees in shame [02:45:34] heh [02:45:46] there's a chain of ProxyCommand settings i'm using from andrewbogott's link [02:46:00] that's your exact config? [02:46:04] what's your ssh command? [02:46:51] root@ocelot:~# ssh bastion.pmtpa.wmflabs [02:46:51] Emw: ? [02:46:55] yeah [02:46:59] that's not going to work [02:47:03] that host doesn't exist [02:47:10] ssh bastion.wmflabs.org [02:47:33] it's bastion1.pmtpa.wmflabs [02:47:37] or bastion.wmflabs.org [02:48:16] and it makes way more sense to use the public host, otherwise you're connecting through the public host, and back to itself over the private IP [02:49:04] the proxy command stuff is there so you can connect to instances behind the bastion [02:49:26] ah, alright [02:50:53] using bastion1 works in this case. 
i had thought i'd been able to get into bastion via 'ssh bastion.pmtpa.wmflabs' before, but i might have been thinking of a random instance where i got in via 'bastion1.pmtpa.wmflabs' (which just worked) [02:51:12] please just use bastion.wmflabs.org [02:51:21] it makes absolutely no sense to do it the way you are :D [02:51:28] agreed [02:52:00] i trying to get into bastion because i'm unable to get to some instances of mine using the same ProxyCommand approach [02:52:04] was trying* [02:54:43] instances i-000002f8 and i-000002f7, which i set up, have been redlinked for over a day in my 'Instance List' page. i deleted instance i-000002f7 (name: pdbhandler-dev) yesterday morning when i realized i didn't want to use oneiric, but it's still listed as 'running' [02:56:29] i think set up i-000002f8 (name: pdbhandler-1), running lucid, soon after i deleted that oneiric instance. it's been pending since yesterday morning. i recall other instances being created fairly quickly [02:56:47] i think set up -> i then set up [03:03:44] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:07:53] 07/03/2012 - 03:07:52 - User laner may have been modified in LDAP or locally, updating key in project(s): deployment-prep [03:11:27] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:43] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [03:15:43] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [03:33:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:38:36] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [03:38:42] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 17% free memory [03:43:35] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [03:44:00] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [03:44:15] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
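A minimal sketch, assuming the setup discussed above, of what a ~/.ssh/config ProxyCommand stanza for labs can look like. Host names follow the conversation (bastion.wmflabs.org is the public jump host, *.pmtpa.wmflabs are the instances behind it); the user name, indentation and exact wording on the labsconsole Access page may differ.

    # Public bastion: reached directly, no proxy needed.
    Host bastion.wmflabs.org
        User emw

    # Instances behind the bastion: tunnel through the public bastion.
    # -W does stdio forwarding; %h and %p expand to the target host/port.
    Host *.pmtpa.wmflabs
        User emw
        ProxyCommand ssh -a -W %h:%p bastion.wmflabs.org

As Ryan_Lane points out above, the bastion itself is not behind anything, so it is reached as bastion.wmflabs.org (or by its instance name bastion1.pmtpa.wmflabs); the ProxyCommand is only useful for reaching instances behind it.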
[03:46:17] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [03:46:17] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 14% free memory [03:47:17] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 16% free memory [03:48:33] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 18% free memory [03:48:42] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [03:52:47] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory [03:58:54] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory [04:01:05] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [04:02:22] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 54% free memory [04:03:58] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:03:58] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:06:08] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 84% free memory [04:08:06] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:12:44] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 6% free memory [04:17:15] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [04:19:07] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [04:23:57] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [04:33:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:48:46] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [05:03:56] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 208 processes [05:04:06] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:18:46] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [05:34:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:48:46] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [06:05:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:18:47] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [06:30:29] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM Current Users is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM Disk Space is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM dpkg-check is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:31:04] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 8.63, 5.63, 2.75 [06:32:06] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 153 processes [06:36:53] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [06:36:53] RECOVERY Current Users is now: OK on psm-precise i-000002f2 output: USERS OK - 0 users currently logged in [06:36:53] RECOVERY Disk Space is now: OK on psm-precise i-000002f2 output: DISK OK [06:36:53] RECOVERY dpkg-check is now: OK on psm-precise i-000002f2 output: All packages OK [06:36:53] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:37:17] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [06:37:17] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:16] PROBLEM Free ram is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Disk Space is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Current Load is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Current Users is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:26] PROBLEM Total Processes is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:40] PROBLEM dpkg-check is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:45] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:10] PROBLEM Current Load is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:11] PROBLEM Current Users is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:11] PROBLEM Disk Space is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:11] PROBLEM Total Processes is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:17] PROBLEM Free ram is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:17] PROBLEM dpkg-check is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:50] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:29] PROBLEM Disk Space is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:29] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:39] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [06:48:39] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [06:48:39] PROBLEM Total Processes is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:50:53] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [06:50:54] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 5.45, 7.33, 5.93 [06:50:54] RECOVERY Disk Space is now: OK on grail i-000002c6 output: DISK OK [06:50:54] RECOVERY Current Users is now: OK on grail i-000002c6 output: USERS OK - 0 users currently logged in [06:50:54] RECOVERY Free ram is now: OK on grail i-000002c6 output: OK: 84% free memory [06:50:54] RECOVERY Total Processes is now: OK on grail i-000002c6 output: PROCS OK: 107 processes [06:51:02] RECOVERY dpkg-check is now: OK on grail i-000002c6 output: All packages OK [06:51:02] RECOVERY Current Load is now: OK on grail i-000002c6 output: OK - load average: 0.42, 0.91, 1.33 [06:51:02] PROBLEM Current Load is now: WARNING on wikidata-dev-2 i-00000259 output: WARNING - load average: 4.94, 5.07, 6.51 [06:51:07] RECOVERY Current Users is now: OK on ve-nodejs i-00000245 output: USERS OK - 0 users currently logged in [06:51:07] RECOVERY Disk Space is now: OK on ve-nodejs i-00000245 output: DISK OK [06:51:07] RECOVERY Current Load is now: OK on ve-nodejs i-00000245 output: OK - load average: 0.42, 1.13, 1.94 [06:51:07] RECOVERY Total Processes is now: OK on ve-nodejs i-00000245 output: PROCS OK: 94 processes [06:51:16] RECOVERY Free ram is now: OK on ve-nodejs i-00000245 output: OK: 77% free memory [06:51:16] RECOVERY dpkg-check is now: OK on ve-nodejs i-00000245 output: All packages OK [06:52:23] PROBLEM Total Processes is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:57:56] RECOVERY Disk Space is now: OK on nova-precise1 i-00000236 output: DISK OK [06:57:56] RECOVERY Current Load is now: OK on nova-precise1 i-00000236 output: OK - load average: 1.12, 2.79, 3.16 [07:03:50] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [07:05:31] !log integration Rebooting intergration-apache1. CPU and load has been raising linear for the past 2 hours up to 100% just now. Cause unknown, instance was not in use for the last 24 hours. [07:05:32] Logged the message, Master [07:11:07] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:11:07] RECOVERY Current Load is now: OK on wikidata-dev-2 i-00000259 output: OK - load average: 3.12, 2.59, 3.99 [07:11:07] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:22] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [07:14:43] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:01] PROBLEM Current Load is now: CRITICAL on integration-apache1 i-000002eb output: CRITICAL - load average: 9.87, 24.82, 29.26 [07:21:21] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [07:21:29] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:38] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 159 processes [07:26:58] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.62, 5.47, 6.92 [07:26:58] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:58] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:58] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:58] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:59] PROBLEM Current Load is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:07] PROBLEM dpkg-check is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:07] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:17] grmbllblbl [07:27:36] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:36] PROBLEM Current Load is now: WARNING on deployment-apache31 i-000002d4 output: WARNING - load average: 7.20, 8.44, 8.68 [07:27:37] PROBLEM Current Load is now: WARNING on configtest-main i-000002dd output: WARNING - load average: 8.06, 8.40, 7.69 [07:27:48] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 6.18, 5.04, 5.20 [07:27:48] PROBLEM Current Load is now: WARNING on wikisource-web i-000000fe output: WARNING - load average: 5.31, 5.18, 5.26 [07:27:57] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:57] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:58] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:07] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:20] PROBLEM Disk Space is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:29] PROBLEM dpkg-check is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:28:56] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 4.44, 5.79, 6.77 [07:28:56] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 4.14, 6.56, 10.57 [07:28:56] PROBLEM Current Load is now: WARNING on grail i-000002c6 output: WARNING - load average: 5.27, 5.79, 5.54 [07:28:56] PROBLEM Current Load is now: WARNING on maps-tilemill1 i-00000294 output: WARNING - load average: 0.71, 4.41, 6.65 [07:28:56] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 7.90, 7.90, 7.00 [07:29:13] PROBLEM SSH is now: CRITICAL on ganglia-test2 i-00000250 output: CRITICAL - Socket timeout after 10 seconds [07:29:13] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:13] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:45] PROBLEM Current Load is now: WARNING on wep i-000000c2 output: WARNING - load average: 4.70, 5.65, 5.57 [07:29:45] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 68 MB (5% inode=57%): [07:29:45] PROBLEM Current Load is now: WARNING on resourceloader2-apache i-000001d7 output: WARNING - load average: 6.34, 6.51, 5.62 [07:30:06] wtf [07:30:17] hashar: Argh, what the hell [07:30:25] http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&c=integration&h=integration-apache1 [07:30:28] all CPU Wait [07:30:39] well the labs seems dead somehow [07:30:41] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 6.34, 6.71, 5.72 [07:30:41] There's 1 http request made per minute :P [07:30:52] yeah [07:30:58] http://ganglia.wmflabs.org/latest/ [07:31:10] there is 600 load spike at 7am every day [07:31:27] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 2.42, 4.63, 5.62 [07:31:27] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [07:31:27] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [07:31:27] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [07:31:27] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 32% free memory [07:31:28] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [07:31:28] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 2.73, 4.38, 5.01 [07:31:29] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [07:31:29] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK [07:31:30] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 88% free memory [07:31:30] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 84 processes [07:31:32] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 134 processes [07:31:37] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in [07:31:38] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [07:31:38] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK [07:31:38] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 95% free memory [07:31:38] RECOVERY 
dpkg-check is now: OK on configtest-main i-000002dd output: All packages OK [07:31:39] someone must be doing some heavy I/O on one of the instance [07:31:43] RECOVERY dpkg-check is now: OK on mobile-testing i-00000271 output: All packages OK [07:32:11] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 3.08, 5.55, 6.26 [07:33:04] RECOVERY Current Load is now: OK on wikisource-web i-000000fe output: OK - load average: 0.18, 2.37, 4.07 [07:33:04] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 3.06, 4.19, 4.84 [07:33:04] PROBLEM Current Load is now: WARNING on mwreview i-000002ae output: WARNING - load average: 5.67, 7.16, 7.75 [07:33:04] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 69% free memory [07:33:04] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [07:33:33] Krinkle: so I end up not doing anything on labs till 10am :-D [07:33:34] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.10, 1.70, 4.85 [07:33:35] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.38, 3.59, 4.99 [07:33:43] RECOVERY dpkg-check is now: OK on ganglia-test2 i-00000250 output: All packages OK [07:33:43] RECOVERY SSH is now: OK on ganglia-test2 i-00000250 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:34:13] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [07:34:41] RECOVERY Current Load is now: OK on wep i-000000c2 output: OK - load average: 0.35, 2.62, 4.32 [07:34:41] RECOVERY Current Load is now: OK on resourceloader2-apache i-000001d7 output: OK - load average: 1.09, 3.27, 4.48 [07:34:41] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:35:13] PROBLEM Current Load is now: WARNING on integration-apache1 i-000002eb output: WARNING - load average: 3.58, 5.67, 13.66 [07:35:33] RECOVERY Current Load is now: OK on pybal-precise i-00000289 output: OK - load average: 0.28, 2.90, 4.37 [07:35:55] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 0.72, 1.99, 4.49 [07:35:55] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 0.25, 2.32, 4.40 [07:35:55] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.06, 1.63, 3.63 [07:35:55] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [07:35:55] PROBLEM Current Load is now: WARNING on psm-precise i-000002f2 output: WARNING - load average: 3.24, 5.52, 6.79 [07:36:53] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 0.07, 2.11, 4.57 [07:36:53] PROBLEM Total Processes is now: WARNING on ganglia-test2 i-00000250 output: PROCS WARNING: 187 processes [07:37:28] RECOVERY Total Processes is now: OK on psm-precise i-000002f2 output: PROCS OK: 141 processes [07:37:43] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 18% free memory [07:38:33] RECOVERY Current Load is now: OK on hugglewiki i-000000aa output: OK - load average: 0.01, 0.95, 3.70 [07:40:22] back [07:40:46] I love my laptop, but sometime it become very hot for no reason :/ [07:41:32] Krinkle: http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [07:41:39] Krinkle: that is the production servers that support labs [07:41:45] load reached 4K [07:41:53] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:42:08] hashar: heh [07:42:23] RECOVERY Current Load is now: OK on deployment-apache31 i-000002d4 output: OK - load average: 0.02, 1.38, 4.58 [07:42:23] RECOVERY Current Load is now: OK on configtest-main i-000002dd output: OK - load average: 0.02, 1.32, 4.14 [07:42:23] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 0.04, 1.14, 4.22 [07:42:23] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [07:42:41] hashar: CPU stays low though [07:42:41] with lot of waiting I/O on the real hardware http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_wio&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [07:43:33] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.22, 1.02, 4.81 [07:43:33] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.36, 1.49, 3.84 [07:45:53] RECOVERY Current Load is now: OK on psm-precise i-000002f2 output: OK - load average: 0.07, 0.92, 3.71 [07:49:01] Krinkle: also I rebooted your instance Sunday [07:49:11] Krinkle: it was struck by the leap second kernel bug [07:49:23] yeah, I just noticed while logging. 
[07:51:53] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [07:52:41] rm: cannot remove `labs-config': Input/output error [07:52:42] yeahhhh [07:52:47] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.20, 0.52, 3.97 [07:53:49] I give up will do something else [07:55:06] RECOVERY Current Load is now: OK on integration-apache1 i-000002eb output: OK - load average: 0.15, 0.43, 3.97 [08:04:43] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf output: Warning: 19% free memory [08:12:19] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [08:22:14] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [08:38:44] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:33] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [08:49:18] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:27] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [08:54:08] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [08:57:26] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [09:12:48] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:14:37] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 67 MB (5% inode=57%): [09:22:30] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [09:23:00] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours [09:25:00] PROBLEM Puppet freshness is now: CRITICAL on gerrit i-000000ff output: Puppet has not run in last 20 hours [09:42:11] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:01] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:47:01] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 18% free memory [09:53:06] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [09:56:44] !log deployment-prep Asked reboot for deployment-transcoding through labsconsole [09:56:46] Logged the message, Master [10:00:17] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [10:04:46] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:09:36] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [10:13:05] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:18:26] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 16% free memory [10:23:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [10:30:49] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [10:43:19] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:52:11] Change abandoned: Jens Ohlig; "(no reason)" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/10567 [10:52:19] Change restored: Jens Ohlig; "(no reason)" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/10567 [10:53:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [10:54:23] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:13] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [11:13:22] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:16:23] PROBLEM Puppet freshness is now: CRITICAL on maps-test2 i-00000253 output: Puppet has not run in last 20 hours [11:23:14] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [11:43:23] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:53:13] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [12:10:29] <^demon> hashar: Good morning [12:10:42] ^demon: hello :) [12:12:19] <^demon> So I was looking at what Ryan suggested yesterday about role classes, but I was a little bit confused. Do you know of any examples I could follow that do what he suggested? [12:13:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:15:03] ^demon: sure let me find it [12:16:05] <^demon> k :) [12:17:40] ahh applicationserver::labs [12:17:55] ^demon: in manifests/site.pp [12:18:06] though that is probably not the most straightforward [12:20:14] <^demon> I can't find that class in site.pp [12:20:24] what is it? you are going to remove the "# TODO: rewrite this old mess."? or just any example for using role classes [12:20:28] there is a comment [12:20:33] the class is labs [12:20:38] a subclass of applicationserver [12:20:44] anyway that is not the best example [12:20:46] ^demon: https://gerrit.wikimedia.org/r/#/c/13304/ [12:21:04] ^demon: that one is easy. It creates the bits::labs under the role::cache class [12:21:08] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.36, 6.07, 5.36 [12:21:09] so end result is role::cache::bits::labs [12:21:30] then it uses the varnish::instance which is a parameterized class [12:21:41] I guess that is the best example [12:21:58] <^demon> Hmm. 
[12:22:02] I have that change deployed on my instance, works fine [12:22:31] on the diff https://gerrit.wikimedia.org/r/#/c/13304/2/manifests/role/cache.pp,unified line 522 is the call to the parameterized class [12:22:37] think about it like calling a php function [12:22:54] then you give it named parameters (think about it like passing a $options array of settings in mw) [12:23:12] to find out which parameters and their defaults, you obviously need to look at the class definition [12:23:18] order of parameters does not matter [12:23:28] so all that would let you define a new role [12:23:37] the most simple role class, without the parameterized classes, is probably role/pdf.pp , which is then included in site.pp on node "pdf1" [12:23:43] which you can nicely pack in something like role::gerrit::labs or something similar [12:24:32] the applicationserver classes define a common class from which various classes can inherits (I think the parent is application server::parent, think of it like a PHP abstract class you extend) [12:25:04] <^demon> Right, but I guess my question is: how would I update site.pp to use the new role class? [12:25:19] for labs, it is not needed [12:25:33] you add the new class (aka role::gerrit::labs ) in the labsconsole under 'Manage puppet group' [12:25:37] in prod you would include role::pdf in a "node" [12:25:44] then configure the instance and tick the [] role::gerrit::labs checkbox [12:25:59] that will instruct puppet on your instance to install that class on your instance [12:26:06] so basically site.pp is not needed for labs ;) [12:26:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [12:26:25] <^demon> I understand how to use labs :) [12:26:30] <^demon> mutante: Thanks, that's what I needed [12:26:47] I am just making sure I know that you know about what we know ! [12:28:00] ^demon: grep role site.pp when being on the prod branch, also gives a nice overview besides the example [12:28:18] <^demon> Yeah I have site.pp open, but it's so wildly inconsistent :) [12:28:51] well, the ones that include role::something are good, and the ones that include a lot of classes directly should likely be moved to role classes [12:28:59] to keep site.pp shorter [12:31:38] hashar: so you have puppetmaster::self on your instance now? i gotta start using it now [12:33:24] I got one on the deployment-cache-bits instance [12:33:33] I followed some tutorial Faidon wrote [12:33:36] ^demon: we should probably block any commits to the test branch in gerrit. easy for you? [12:33:50] reported a few minor glitches et voila :) [12:33:59] <^demon> mutante: Doable. [12:34:04] mutante: I think test is readonly already [12:34:18] mutante: doc https://labsconsole.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [12:34:22] ok, good, then we just have 5 open gerrit requests [12:34:29] it is really straightforward [12:34:34] thanks [12:34:48] https://gerrit.wikimedia.org/r/#/q/branch:test,n,z [12:34:53] <^demon> Man, permissions on operations/puppet are a mess :\ [12:35:49] doh there are two 'test' reference [12:36:19] <^demon> Well, refs/heads/test and refs/for/refs/heads/test [12:36:28] I guess refs/for/refs/heads/test is to only allow wmf people to submit changes in gerrit [12:39:18] <^demon> Yeah, but they also had push rights on refs/heads/test, allowing you to skip gerrit. 
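To make the role-class pattern hashar describes above concrete, here is a hypothetical Puppet sketch. Class names, parameters and values are illustrative only, not the actual operations/puppet code; the real examples referenced in the conversation are role::cache::bits::labs (which wraps the parameterized varnish::instance class) and role/pdf.pp.

    # A labs role class wrapping a parameterized class. Calling a
    # parameterized class is like calling a PHP function with named
    # arguments: order does not matter and defaults come from the
    # class definition. All names below are made up for illustration.
    class role::gerrit::labs {
        class { 'gerritsite::instance':
            host    => 'gerrit-dev.wmflabs.org',
            use_ssl => false,
        }
    }

    # In production the role is included on a node in site.pp, e.g.:
    #   node 'pdf1.wikimedia.org' { include role::pdf }
    # On labs, site.pp is not needed: the class is added under
    # 'Manage puppet groups' in labsconsole and ticked per instance.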
[12:40:27] <^demon> mutante: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=978720d6ca4a133b445acf8446739ecaf254276a should take care of it. [12:41:36] <^demon> ops can still do pushes if they really need to (via the refs/* permissions), but this should stop anyone else from submitting for review to the branch. [12:41:43] <^demon> (And in practice, most ops play in production :p) [12:43:44] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:50:03] New review: Dzahn; "please consider abandoning / resubmitting / saving this, since the test branch does not exist anymore" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/10567 [12:53:43] RECOVERY Free ram is now: OK on ganglia-test2 i-00000250 output: OK: 86% free memory [12:55:14] ^demon: thanks, eh, does this need review? when i follow the review link from gitweb i end up in gerrit search but no results [12:55:22] <^demon> Nope :p [12:55:31] <^demon> It's done from the gerrit ui, in a special branch called refs/meta/config [12:55:39] aha [12:55:51] yet you can have gitweb links, gotcha [12:56:13] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [13:07:06] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 151 processes [13:10:50] <^demon> Man, Ryan was right. Gerrit manifest is a mess :\ [13:13:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:16:31] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:48] <^demon> mutante: Work in progress https://gerrit.wikimedia.org/r/#/c/13484/ :) [13:21:17] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.89, 7.69, 7.48 [13:26:37] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [13:28:47] New review: Dzahn; "please consider resubmitting to other branch or abandoning since the test branch does not exist any..." [operations/puppet] (test); V: 0 C: -2; - https://gerrit.wikimedia.org/r/12164 [13:29:14] New review: Dzahn; "please consider resubmitting to other branch or abandoning since the test branch does not exist any..." [operations/puppet] (test); V: 0 C: -2; - https://gerrit.wikimedia.org/r/6541 [13:30:22] New review: Dzahn; "this is still in test but the branch is gone meanwhile." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9392 [13:31:44] New review: Dzahn; "please consider resubmitting to another branch since the test branch is gone meanwhile we can't merg..." 
[operations/puppet] (test); V: 0 C: -2; - https://gerrit.wikimedia.org/r/6109 [13:43:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:56:17] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 19% free memory [13:56:38] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [14:04:07] PROBLEM Total Processes is now: WARNING on psm-precise i-000002f2 output: PROCS WARNING: 151 processes [14:17:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:26:40] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [14:48:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:56:40] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [15:12:57] <^demon> hashar: The guys at openstack built a glue layer to make the jenkins <-> gerrit tie ins easier. Don't know how useful it might be, but thought I'd pass it along: http://ci.openstack.org/zuul/ [15:13:14] oh my god [15:13:25] just when I almost finished my own glue :/ [15:13:40] ^demon: thanks for sharing :) [15:13:51] <^demon> You're welcome :) [15:14:06] <^demon> Again, it might not work for us at all, but just thought it interesting since we're developing the same wheel. [15:16:04] I am writing some ant targets [15:16:10] to easily clone from gerrit project [15:16:25] will put everything in a shared dir on gallium then clone from it to the jobs workspace [15:16:33] as easy as: :-] [15:16:45] <^demon> Granted, they do a couple of things differently. For one, they have jenkins handle the SUBMIT after it's CR+2/V+1 [15:17:58] I would want us to do that [15:17:59] <^demon> Ok, I'm going to grab an early lunch. Back later. [15:18:13] bon appétit ! [15:18:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:26:41] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [15:48:42] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:56:49] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:06:40] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 21% free memory [16:18:49] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:26:59] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:40:12] andrewbogott: so, my tests are broken for the bgp stuff [16:40:19] well, two of the four tests, anyway [16:40:39] I'm pretty sure I'm not setting the function properly [16:42:46] Ryan_Lane: link? [16:43:07] andrewbogott: https://review.openstack.org/#/c/9255/ [16:44:15] ooh openstack's gerrit looks nice [16:44:48] yeah, good skinning [16:45:47] ok, I'm looking... [16:46:38] Ryan_Lane, manager test is working for you, right? 
[16:49:10] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:50:09] andrewbogott: yeah [16:50:23] this is one of the failures I'm getting: https://jenkins.openstack.org/job/gate-nova-python26/1442/testReport/nova.tests.network.test_linux_net/LinuxNetworkTestCase/test_get_bgp_routes/ [16:51:12] I'm trying to make db.floating_ip_get_all_by_host return a specific set of IPs [16:51:30] In both tests you're setting the stub after the function is invoked. So switching those two lines around gets me further... [16:51:39] hahaha. right [16:51:43] Then the assert is between two very long strings that differ, rather than one long string and an empty one. [16:51:49] Are you set up to run the tests locally? [16:51:52] yeah [16:51:55] lemme fix that. thanks :) [16:52:55] not terribly sure how i missed that :) [16:53:01] One thing I always use when debugging tests: When a test fails, the test output displays stdout. So debug lines are easy to see. [16:53:23] *shrug* Stubbing still feels somewhat dark and mysterious to me. [16:53:45] I mostly get it. it's just seeding your tests by replacing functions [16:53:50] to ensure they return what you want [16:54:13] that said, it took me nearly the same amount of time to write the test than it did the code :D [16:54:54] Yeah, I guess I understand how it works, it's just that mucking around with __dict__ makes me uneasy. [16:55:03] * Ryan_Lane nods [16:57:18] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:59:45] oh [16:59:51] I'm not setting any of the options either [17:00:26] wait. no. yes I am [17:00:32] I must be doing that incorrectly [17:07:26] PROBLEM Total Processes is now: WARNING on orgcharts-dev i-0000018f output: PROCS WARNING: 453 processes [17:11:11] 07/03/2012 - 17:11:11 - Updating keys for kareneddy at /export/keys/kareneddy [17:18:41] petan: hey, you around? [17:18:52] petan: I'd like to introduce you to someone, if so [17:19:11] yeah [17:19:16] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:19:24] but this irc client isn't really working well atm [17:19:51] my server is being ddosed so quassel core doesn't work [17:20:11] :( [17:20:17] lame [17:20:29] Application level? [17:20:36] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [17:20:54] damianz logging in to ssh takes about 1 hour, I don't know yet [17:21:04] Ah, nice [17:21:04] I guess load is near to 1000 [17:21:07] :D [17:21:17] <^demon> Have you tried turning it off and on again? [17:26:54] ^demon I am on vacation far away from town [17:27:11] if it won't boot up back I don't want to go back and reboot it by hand :D [17:27:20] <^demon> Aw :\ [17:27:56] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:29:03] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 65 MB (5% inode=57%): [17:29:11] Ryan_Lane, are you unstumped or restumped? [17:29:26] I switched tasks for a little bit [17:30:13] andrewbogott: but, I'm setting the flags like elsewhere in the file [17:33:07] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [17:33:40] maybe I need to set the flags on the driver directly? [17:35:16] setting the flags locally should work [17:35:29] maybe I should be stubbing self.driver.db [17:35:47] wait. 
no [17:35:53] because that's returning properly [17:36:00] Yeah, the stubbing is clearly doing something, so that part seems right. [17:37:57] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [17:43:07] Ryan_Lane: So, I may not understand how this is meant to work. It should be responding that the current ip (FLAGS.my_ip) is the next hop for any given floating ip? [17:43:17] yep [17:43:50] it'll tell the router that the floating IP's route points to the server its hosted on [17:43:55] Don't you need to specify my_ip, then? Otherwise the test results will vary depending on what host you run it on? [17:43:59] yes [17:44:05] oh. maybe that's the issue [17:44:25] Well, or else you need to format FLAGS.my_ip into the expected result. [17:47:18] Damn, now the strings look the same to me but are still comparing as !=. Might have to parse these results in order to avoid whitespace diffs :( [17:49:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:52:45] Ryan_Lane: This is fragile, but makes the first test pass: http://pastebin.com/xjMvNW97 [17:53:20] * Ryan_Lane nods [17:55:17] Probably the most reasonable thing is to use something like assertNotEquals(actual.find("blah"),-1) rather than one big string comparison. [17:57:54] Change on 12mediawiki a page OAuth/status was modified, changed by RobLa-WMF link https://www.mediawiki.org/w/index.php?diff=558224 edit summary: new status update [17:57:57] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:58:27] andrewbogott: well, I'm trying to ensure that the entire string is correct [17:58:33] because otherwise exabgp won't start [17:58:56] Of, if exabgp cares about formatting and everything then maybe a total comparison is right. [17:59:13] I was briefly confused by the trailing \n but otherwise it works. [17:59:17] cool [17:59:38] I think I'm going to set: self.flags(my_ip='192.168.1.3') [18:00:09] should work. [18:20:07] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:38:43] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [18:38:56] wm [18:39:12] Are proxy servers legal in the UK [18:40:00] is there a way to tell ssh "don't use key" [18:40:35] Anyway [18:40:43] I need an unblock [18:40:52] Please unblock my account [18:40:53] maybe ssh -o PreferredAuthentications=keyboard-interactive,password ? [18:41:17] Please unblock my account [18:41:28] Guest57916: which account? Which wiki? [18:41:42] Don't pretend you don't know [18:42:16] Every-one knows me [18:42:26] ;--) [18:42:32] so [18:42:36] Do u know me [18:42:41] guest57916 not really [18:42:54] hashar pm please [18:43:36] I can't pm on this client sorry :/ [18:43:46] <^demon> Ohnos. How will you unblock him then? [18:44:09] Guest57916 try #wikimedia-ops [18:44:23] this is not a right channel anyway [18:44:28] <^demon> petan2: Nooooo :\ [18:44:32] <^demon> None of these channels are right [18:44:50] !log deployment-prep Successfully deleted instance deployment-deb, but failed to remove deployment-deb DNS entry. ohno [18:44:50] wikimedia-ops is place where people can complain about irc [18:44:53] Logged the message, Master [18:44:56] and bans [18:45:04] so I think it's kind of correct :o [18:45:19] <^demon> Yeah, if you were banned at the server level for abuse or something. [18:45:26] nah [18:45:29] serious stuff now. 
Need to find out why http://deployment.wikimedia.beta.wmflabs.org/ 404 :-D [18:45:29] <^demon> If you were banned on-wiki, I don't know any of the tech channels that would be useful. [18:45:34] it's -ops it's not -operations [18:45:42] -ops is for irc ops [18:45:44] :) [18:45:47] <^demon> Oh, heh [18:45:53] <^demon> Missed that ;-) [18:46:00] <^demon> Too many channels. [18:46:05] heh [18:46:25] we took a bit of freenode :P [18:46:48] that's why I thought donating some hardware resources to freenode could be nice [18:47:05] given that wikimedia has 50 times more servers than freenode [18:47:36] <^demon> yeah, but we're also a non-profit, so it's not necessarily as easy as "here's a server, enjoy" [18:47:44] mutante heya did someone check that RT ticket? I dodn't get any response [18:48:03] hm, I thought we could spare 1 :P [18:48:15] or few decommisioned :D [18:48:31] if no one wants them I will take them :D [18:48:51] <^demon> 501(c)3 means we have to be careful about that sort of thing :) [18:49:04] what is that [18:49:23] <^demon> The section of the IRS code that denotes non-profits. [18:49:28] Ryan_Lane if you were going to throw some old servers to dustbin tell me XD [18:49:29] <^demon> Specifically the part that we fall under. [18:49:35] <^demon> Every couple of years when we have decom'd servers, RobH usually finds other non-profits who want/need them to donate them to. [18:49:42] petan2: heh [18:49:50] <^demon> We can't just give them away to random people on the street :) [18:49:51] robh is the guy to talk to about that [18:49:57] and we announce it on the blog [18:50:00] Change on 12mediawiki a page OAuth/status was modified, changed by RobLa-WMF link https://www.mediawiki.org/w/index.php?diff=558267 edit summary: new status update [18:50:05] we only give to non-profits [18:50:07] aha, that's what I meant with freenode [18:50:15] that is non profit too [18:50:22] petan is non profit guy as well [18:50:34] if it counts of course [18:52:53] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:05:49] !log deployment-prep no machine send their syslog anywhere, some change got lost during the merge. See https://gerrit.wikimedia.org/r/14090 [19:05:51] Logged the message, Master [19:10:19] !log deployment killed -apache30 network access by running tcpdump ! [19:10:19] deployment is not a valid project. [19:10:40] Ryan_Lane: I managed to kill an instance network access by just running tcpdump :-D [19:10:48] eh? [19:10:51] I don't see how [19:10:55] neither do it [19:10:57] do i [19:11:10] deployment-apache30 if you want to investigate, if not I will just reboot it :-) [19:11:19] hashar@deployment-apache30:~$ ./syslog [19:11:19] [sudo] password for hashar: [19:11:34] then nothing but " (113) No route to host" errors returned from squid ;-) [19:12:24] PROBLEM host: deployment-apache30 is DOWN address: i-000002d3 CRITICAL - Host Unreachable (i-000002d3) [19:13:52] Ryan_Lane: Would you expect me to be able to launch a new instance? Or is that known to be broken at the moment? [19:14:43] andrewbogott: that should definitely work [19:14:49] what failed? [19:14:57] did it say "failed to launch instance"? [19:15:03] and in which project? [19:15:17] !log deployment-prep touched wmf-config/mwblocker.log [19:15:19] Logged the message, Master [19:15:33] in project mwreview, instances 'novapluginreview' and 'pluginreview'. [19:15:37] Ryan_Lane: are you interested in investigate the tcpdump or should I just reboot the instance? 
[19:15:43] reboot
[19:15:50] Neither built, I can't get logs from either, also have tried repeatedly to delete pluginreview and failed.
[19:15:53] hm
[19:15:58] which host did they launch on
[19:15:59] ?
[19:16:25] Neither has a doc page, so not sure.
[19:16:32] oh
[19:16:36] what state are they in?
[19:16:37] pending?
[19:16:39] RECOVERY host: deployment-apache30 is UP address: i-000002d3 PING OK - Packet loss = 0%, RTA = 0.61 ms
[19:16:50] doh
[19:16:55] it is back on
[19:16:58] novapluginreview is 'pending'. pluginreview was 'pending' until I deleted it, at which point it switched its state to 'running'
[19:17:08] o.O
[19:17:32] I haven't looked on virt1 yet.
[19:17:46] I had a similar issue from time to time, just asked to delete several times and it eventually got deleted
[19:18:03] 'several' = > 3 times?
[19:18:08] not sure
[19:18:16] just do it when I look at the instance list
[19:18:38] you should check if the instance failed or not before
[19:18:42] !console
[19:18:42] in case you want to see what is happening on the terminal of your vm, check console output
[19:18:48] oh Successfully deleted instance, but failed to remove deployment-deb DNS entry.
[19:18:49] different issue this time :)
[19:19:05] No, hashar, that's what it's telling me too.
[19:19:05] probably already gone
[19:19:15] But then the instance remains in the list, state 'running'.
[19:19:21] might just be the entry still in LDAP or something
[19:19:33] andrewbogott maybe ?action=purge :P
[19:19:35] grr
[19:19:37] /usr/local/apache/common/docroot/labs
[19:19:38] heh
[19:19:46] Hydriz broke everything
[19:19:50] what is that
[19:19:56] or i did -)
[19:19:57] hydriz has access to the project?
[19:20:00] since when
[19:20:02] :O
[19:20:12] I gave him access
[19:20:15] ah
[19:20:18] monday morning
[19:20:21] hmm
[19:20:33] I am not sure what he ran on the cluster but he definitely broke a couple things :-D
[19:20:41] he actually asked me several times before, but i wanted to set up some sql backup system first
[19:20:41] need to contact him to find out what he did hehe
[19:20:51] so that we can recover in case something gets fucked up
[19:21:25] we should be careful when giving out access to beta since it's still fragile :P
[19:21:36] hydriz is known for breaking stuff XD
[19:22:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[19:23:08] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours
[19:23:31] hashar I'm thinking of replacing the motd with a huge: PLEASE USE !log AFTER EVERY CHANGE YOU MAKE
[19:23:54] I removed his access btw
[19:24:18] or we could just change the configuration of bash to execute a log command after every line :D
[19:24:26] that would be fun
[19:25:08] PROBLEM Puppet freshness is now: CRITICAL on gerrit i-000000ff output: Puppet has not run in last 20 hours
[19:25:17] I don't really have troubles with hydriz or anyone else, but we really need some way to recover the system
[19:25:19] first
[19:25:40] before we start giving out shell en masse
[19:26:14] last backup of sql is 2 months old
[19:26:26] it wouldn't be fun if someone deleted the db's by accident
[19:30:12] AHHHH
[19:30:13] http://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login&returnto=Main+Page&foo !!!
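petan's half-joke at 19:24 about making bash log every command is easy enough to sketch, though nothing like it is actually deployed on labs, the syslog tag below is made up, and it would only land in the instance's local syslog, not in the project's !log:

    # Illustrative only: append to /etc/bash.bashrc to copy each interactive
    # command line into syslog via logger(1) before every prompt.
    export PROMPT_COMMAND='history -a; logger -t labs-shell "$(history 1)"'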
[19:30:20] fixed the floobar
[19:30:31] can't believe it took me one hour to figure it out
[19:30:48] DamianZ is there a way to tell ssh to give me a few more minutes on the login screen before it times out
[19:31:14] because I am still not able to ssh to my p00r box
[19:31:45] debug3: Wrote 64 bytes for a total of 1127Connection closed by
[19:32:18] it's trying to transfer my key but it never makes it because the response is so slow...
[19:42:59] !log deployment-prep {{bug|38118}} fixed http://deployment.wikimedia.beta.wmflabs.org/ (was missing the docroot, submitted to gerrit https://gerrit.wikimedia.org/r/14099)
[19:43:01] Logged the message, Master
[19:43:12] andrewbogott: http://deployment.wikimedia.beta.wmflabs.org/ is back again :D
[19:43:24] hashar:
[19:43:35] Cool, then I can stop worrying about my inability to launch new instances for a bit :/
[19:43:40] that wiki docroot disappeared on monday :/
[19:44:20] andrewbogott: definitely sounds like something broken with nova :/
[19:44:29] yep!
[19:44:41] and I can't help on that topic sorry :-( no access there hehe
[19:47:52] hashar yay
[19:47:55] nice
[19:50:53] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 19% free memory
[19:53:03] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[19:53:09] petan2: the lack of syslog does not help for sure
[19:53:13] I will try to poke at that next week
[19:53:36] we have a conflict between base::remote-syslog, which installs rsyslog on all machines, and syslog-ng, which is used as the syslog package
[19:54:27] I still have to find a proper way to handle the .dblist files too :/
[20:00:09] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours
[20:06:09] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 20% free memory
[20:09:13] I am off++
[20:18:53] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output
[20:23:23] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[20:24:13] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 19% free memory
[20:49:04] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds.
[20:53:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[20:53:55] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output
[21:13:47] andrewbogott: so, still having issues?
[21:17:10] PROBLEM Puppet freshness is now: CRITICAL on maps-test2 i-00000253 output: Puppet has not run in last 20 hours
[21:23:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[21:24:20] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds.
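Back on hashar's question at 19:30 about giving ssh a few more minutes: as far as I know there is no client-side option that stretches the authentication phase itself; ConnectTimeout only covers the initial TCP connection, and the timer that usually drops a slow key exchange is LoginGraceTime on the server, which defaults to 2 minutes. So if the box is merely crawling rather than dropping packets, the lever is on the server side, roughly (the instance name is a placeholder):

    # client side (~/.ssh/config or -o): only affects the TCP connect phase
    ssh -o ConnectTimeout=60 <instance>
    # server side (/etc/ssh/sshd_config): time allowed to finish authenticating
    LoginGraceTime 600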
[21:29:10] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output
[21:41:20] RECOVERY Total Processes is now: OK on psm-precise i-000002f2 output: PROCS OK: 143 processes
[21:41:50] RECOVERY Disk Space is now: OK on patchtest i-000000f1 output: DISK OK
[21:42:21] andrewbogott: seems that rabbit had issues, which were likely due to the reboot from yesterday
[21:42:33] we didn't reboot after the dns changes
[21:42:40] and we have another IP address on eth0
[21:42:47] it's likely that rabbit was confused
[21:43:03] so, I set some settings explicitly so that it always gets the correct information no matter what now
[21:43:32] I restarted all of the nova services after fixing it
[21:43:36] now the deletes worked
[21:44:10] RECOVERY HTTP is now: OK on blamemaps-s1 i-000002c3 output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.142 second response time
[21:44:10] RECOVERY HTTP is now: OK on demo-deployment1 i-00000276 output: HTTP OK: HTTP/1.1 200 OK - 911 bytes in 0.497 second response time
[21:49:50] PROBLEM Disk Space is now: CRITICAL on patchtest i-000000f1 output: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:20] PROBLEM HTTP is now: CRITICAL on blamemaps-s1 i-000002c3 output: CRITICAL - Socket timeout after 10 seconds
[21:52:20] PROBLEM HTTP is now: CRITICAL on demo-deployment1 i-00000276 output: CRITICAL - Socket timeout after 10 seconds
[21:53:31] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[22:03:10] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 21% free memory
[22:10:17] Ryan_Lane: things seem better now. Did you fix something?
[22:10:26] mentioned above :)
[22:10:32] oh, maybe you missed it
[22:10:41] rabbit was having issues
[22:10:46] likely due to the service IP for DNS
[22:11:03] so, I put in config to make rabbit use specific IPs/hostnames
[22:11:14] then cleared rabbit's queues and restarted all nova services
[22:11:25] that cleared it up
[22:14:15] PROBLEM Puppet freshness is now: CRITICAL on bots-sql2 i-000000af output: Puppet has not run in last 20 hours
[22:14:25] PROBLEM Free ram is now: UNKNOWN on pluginreview i-000002fc output: NRPE: Unable to read output
[22:19:38] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[22:53:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[23:07:34] PROBLEM host: pluginreview is DOWN address: i-000002fc check_ping: Invalid hostname/address - i-000002fc
[23:24:28] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[23:54:51] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
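For the record, the log doesn't say exactly which settings were pinned when Ryan "put in config to make rabbit use specific IPs/hostnames". In the Nova of that period the services located the queue via the rabbit_host/rabbit_port flags in nova.conf, and RabbitMQ itself can be told which address to bind with RABBITMQ_NODE_IP_ADDRESS in rabbitmq-env.conf, so the fix plausibly looked something like the sketch below (addresses are illustrative, not the real controller's):

    # /etc/nova/nova.conf -- point every nova service at an explicit rabbit endpoint
    rabbit_host=192.0.2.10
    rabbit_port=5672

    # /etc/rabbitmq/rabbitmq-env.conf -- bind rabbit to one fixed address
    RABBITMQ_NODE_IP_ADDRESS=192.0.2.10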