[00:01:08] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 30.75, 22.43, 18.59 [00:15:49] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [00:22:10] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [00:23:14] RECOVERY Total Processes is now: OK on bots-3 i-000000e5 output: PROCS OK: 137 processes [00:25:04] PROBLEM Current Load is now: CRITICAL on nova-production1 i-0000007b output: CRITICAL - load average: 19.67, 19.56, 20.16 [00:27:04] PROBLEM Current Users is now: CRITICAL on bastion-restricted1 i-0000019b output: USERS CRITICAL - 16 users currently logged in [00:27:44] PROBLEM HTTP is now: CRITICAL on maps-test2 i-00000253 output: CRITICAL - Socket timeout after 10 seconds [00:28:34] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:32:14] PROBLEM SSH is now: CRITICAL on nova-production1 i-0000007b output: Connection refused [00:32:24] PROBLEM Free ram is now: CRITICAL on nova-production1 i-0000007b output: Connection refused by host [00:32:24] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 8.92, 7.66, 6.04 [00:33:41] PROBLEM dpkg-check is now: CRITICAL on nova-production1 i-0000007b output: Connection refused by host [00:35:05] PROBLEM Total Processes is now: CRITICAL on nova-production1 i-0000007b output: Connection refused or timed out [00:35:52] PROBLEM Current Users is now: CRITICAL on nova-production1 i-0000007b output: Connection refused or timed out [00:36:33] PROBLEM Disk Space is now: CRITICAL on nova-production1 i-0000007b output: Connection refused or timed out [00:44:08] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [00:54:59] RECOVERY host: nova-production1 is UP address: i-0000007b PING OK - Packet loss = 0%, RTA = 93.10 ms [00:57:29] RECOVERY SSH is now: OK on nova-production1 i-0000007b output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:59:14] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:59:14] RECOVERY Current Users is now: OK on nova-production1 i-0000007b output: USERS OK - 0 users currently logged in [00:59:24] RECOVERY Disk Space is now: OK on nova-production1 i-0000007b output: DISK OK [00:59:43] I wanted to use a local repository, so I included the puppet class misc:labsdebrepo and ran puppet [00:59:53] But it fails with "change from absent to directory failed: Cannot create /data/project/repo; parent directory /data/project does not exist" [01:03:29] RECOVERY Total Processes is now: OK on nova-production1 i-0000007b output: PROCS OK: 118 processes [01:03:34] RECOVERY Current Load is now: OK on nova-production1 i-0000007b output: OK - load average: 5.40, 6.94, 3.95 [01:04:19] PROBLEM Current Load is now: CRITICAL on nova-gsoc1 i-000001de output: CRITICAL - load average: 18.45, 19.56, 20.13 [01:05:43] PROBLEM Current Load is now: WARNING on translation-memory-1 i-0000013a output: WARNING - load average: 4.18, 5.71, 5.33 [01:10:28] In case someone is around, I also have a question regarding using a local puppet repository to test new configurations? [01:10:29] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.18, 7.55, 5.94 [01:14:39] apmon: which instance is this? 
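A note on the misc::labsdebrepo failure apmon reports above: Puppet's file resource does not create missing parent directories, so the class cannot create /data/project/repo while /data/project (normally the project's shared storage on Labs) is absent. The following is only a rough sketch of that dependency, with an illustrative class name rather than the real operations/puppet manifest; on an actual instance the proper fix may simply be making sure the shared storage is mounted first.

# Hypothetical sketch, not the actual misc::labsdebrepo class.
# Declaring whatever provides /data/project before the repo directory avoids
# the "parent directory /data/project does not exist" failure.
class labsdebrepo_sketch {
    file { '/data/project':
        ensure => directory,
    }

    file { '/data/project/repo':
        ensure  => directory,
        require => File['/data/project'],
    }
}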
[01:14:51] maps-test2
[01:15:29] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 1.79, 3.97, 4.84
[01:15:39] RECOVERY Current Load is now: OK on translation-memory-1 i-0000013a output: OK - load average: 4.14, 4.40, 4.78
[01:16:52] I am finally trying to figure out how to use the local debian repository and puppet master to set up the map rendering stack
[01:17:16] PROBLEM Current Users is now: WARNING on bastion-restricted1 i-0000019b output: USERS WARNING - 9 users currently logged in
[01:17:47] which includes postgresql, postgis, mapnik that are included in the standard ubuntu repository and osm2pgsql, mod_tile and renderd which would need to be local packages
[01:19:06] gimme a min
[01:19:10] thanks
[01:20:30] I am new to puppet (and the wikilabs setup), so I might end up asking a bunch of silly questions until I figure everything out
[01:24:26] RECOVERY Total Processes is now: OK on bastion-restricted1 i-0000019b output: PROCS OK: 115 processes
[01:29:26] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[01:29:26] apmon: ok. should be fixed now.
[01:29:47] great, thanks!
[01:29:51] yw
[01:31:13] I created a postgresql class in the labsconsole and created a postgresql.pp in /var/lib/git/operations/puppet/manifest Should that be picked up, or do I need to do more to create a new class?
[01:32:50] did you add it to the instance via "configure"?
[01:33:30] yes. After running puppet -tv on the instance, it complains it can't find the postgresql class though
[01:34:05] hm
[01:34:36] let me just try it again, and see if the failures from before effected it
[01:35:30] It gives the error message: "err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class postgresql for i-00000253.pmtpa.wmflabs on node i-00000253.pmtpa.wmflabs"
[01:35:39] apmon: you need to add it to site.pp
[01:35:46] via import
[01:37:08] just edit the site.pp file with a test editor and add the line 'import "postgresql.pp"'?
[01:37:17] text editor
[01:37:24] yes
[01:37:51] you also need to add a service to that class
[01:38:17] and have it enable the service. otherwise, on reboots postgres won't start
[01:38:28] also good to have it set to running, so puppet will restart the service if it's down
[01:38:57] Well, at the moment I am just experimenting, so I tried the smallest possible puppet manifest to see if it gets applied
[01:39:03] * Ryan_Lane nods
[01:39:15] Once that works, I'll need to expand it to set up all the configs
[01:39:26] * Ryan_Lane nods
[01:39:31] have fun :)
[01:40:05] I am sure I will ;-)
[01:42:39] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 11.52, 14.29, 16.46
[01:44:29] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:39] PROBLEM Current Load is now: CRITICAL on nova-gsoc1 i-000001de output: CRITICAL - load average: 20.80, 20.00, 20.27
[01:48:29] RECOVERY Current Load is now: OK on bots-3 i-000000e5 output: OK - load average: 4.58, 4.36, 4.88
[01:49:19] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output
[01:52:49] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds.
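For reference, the smallest workable class along the lines Ryan_Lane sketches above would look roughly like this; the package and service names are assumed to be the stock Ubuntu ones, and this is not the manifest apmon actually wrote. With the single-site.pp layout discussed here, the new file also has to be pulled in with the line 'import "postgresql.pp"' in site.pp before labsconsole's "configure" page can apply the class to an instance.

# Rough sketch only, assuming the stock Ubuntu "postgresql" package/service names.
class postgresql {
    package { 'postgresql':
        ensure => present,
    }

    # Per the advice above: enable => true so postgres starts on reboot,
    # ensure => running so puppet restarts the service if it is down.
    service { 'postgresql':
        ensure  => running,
        enable  => true,
        require => Package['postgresql'],
    }
}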
[01:57:39] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 11.94, 15.37, 16.91 [01:59:29] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:22:51] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 31.25, 20.82, 17.97 [02:29:11] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 30.10, 21.13, 16.53 [02:29:51] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:34:17] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 10.46, 15.30, 15.43 [02:40:02] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 147 processes [02:40:41] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [03:00:24] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:21] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:01:40] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:54] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 4.97, 4.47, 3.49 [03:05:54] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [03:05:57] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:14] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [03:10:54] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 11% free memory [03:11:20] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 4.73, 5.43, 4.29 [03:16:20] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 2.37, 3.91, 3.97 [03:21:14] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:35] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:31:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:33:37] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [03:36:07] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:27] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:58:28] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [03:59:06] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 15% free memory [04:00:16] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 17% free memory [04:00:16] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 16% free memory [04:02:16] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:05:47] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory [04:19:28] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory [04:25:51] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory [04:26:44] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:29:21] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory [04:30:41] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 66% free memory [04:33:31] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:36:11] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:40:45] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:50:46] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [05:00:18] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [05:03:18] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 11% free memory [05:04:28] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:05:18] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [05:10:18] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [05:10:23] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [05:15:10] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [05:18:20] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [05:25:00] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.31, 16.61, 15.99 [05:35:05] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:52:15] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 4.14, 5.43, 5.03 [05:57:15] RECOVERY Current Load is now: OK on bots-3 i-000000e5 output: OK - load average: 4.22, 4.53, 4.73 [06:05:17] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:32:17] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:33:17] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 153 processes [06:39:14] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:43:31] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [06:43:58] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:58] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:44:04] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:37] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:38] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:34] PROBLEM Current Load is now: WARNING on wikidata-dev-2 i-00000259 output: WARNING - load average: 2.93, 4.52, 6.45 [06:50:41] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.95, 9.10, 7.59 [06:59:46] PROBLEM Current Load is now: CRITICAL on nova-ldap1 i-000000df output: CRITICAL - load average: 19.46, 21.58, 20.76 [07:01:40] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 156 processes [07:02:40] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:46] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: Connection refused or timed out [07:02:52] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:52] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:52] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:52] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:19] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:19] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:19] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:24] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:05:30] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [07:06:19] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:10:13] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:10:24] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 7.08, 6.29, 5.44 [07:10:24] PROBLEM Current Load is now: WARNING on wikidata-dev-3 i-00000225 output: WARNING - load average: 4.33, 4.79, 5.54 [07:10:24] PROBLEM Current Load is now: WARNING on translation-memory-2 i-000002d9 output: WARNING - load average: 9.48, 8.45, 7.94 [07:10:29] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 9.84, 11.07, 10.02 [07:10:35] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:35] PROBLEM Total Processes is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:41] PROBLEM dpkg-check is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:41] PROBLEM Disk Space is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:09] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:09] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:09] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:16] PROBLEM HTTP is now: CRITICAL on integration-apache1 i-000002eb output: CRITICAL - Socket timeout after 10 seconds [07:11:21] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:22] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:22] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM Free ram is now: CRITICAL on su-be3 i-000002e9 output: (Service Check Timed Out) [07:14:57] RECOVERY Current Load is now: OK on wikidata-dev-2 i-00000259 output: OK - load average: 1.60, 2.54, 4.27 [07:17:18] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:18] PROBLEM Current Load is now: WARNING on configtest-main i-000002dd output: WARNING - load average: 6.94, 8.67, 8.38 [07:17:23] PROBLEM Current Load is now: WARNING on deployment-apache31 i-000002d4 output: WARNING - load average: 5.90, 7.07, 7.07 [07:17:23] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 6.38, 7.94, 7.12 [07:17:28] PROBLEM Current Users is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:28] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:28] PROBLEM Total Processes is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:17:33] PROBLEM dpkg-check is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:45] PROBLEM Disk Space is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:46] PROBLEM Current Load is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:17] PROBLEM dpkg-check is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:28] RECOVERY Total Processes is now: OK on redis1 i-000002b6 output: PROCS OK: 90 processes [07:19:33] RECOVERY dpkg-check is now: OK on redis1 i-000002b6 output: All packages OK [07:19:34] RECOVERY Disk Space is now: OK on redis1 i-000002b6 output: DISK OK [07:19:34] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 1.00, 2.66, 4.04 [07:19:39] PROBLEM Free ram is now: UNKNOWN on su-be3 i-000002e9 output: NRPE: Unable to read output [07:19:39] PROBLEM Current Load is now: WARNING on nova-ldap1 i-000000df output: WARNING - load average: 16.89, 18.50, 19.57 [07:19:39] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in [07:19:39] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 5.92, 5.87, 5.55 [07:19:39] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK [07:19:40] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 105 processes [07:19:44] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 71% free memory [07:19:44] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK [07:19:44] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:25] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 4.41, 4.43, 5.87 [07:20:25] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 7.04, 8.31, 8.26 [07:20:25] PROBLEM Current Load is now: WARNING on mwreview i-000002ae output: WARNING - load average: 1.74, 4.44, 5.43 [07:21:00] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK [07:21:00] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [07:21:00] RECOVERY Current Users is now: OK on mwreview i-000002ae output: USERS OK - 0 users currently logged in [07:21:00] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 74% free memory [07:23:02] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 11.28, 7.09, 7.46 [07:23:02] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.31, 2.73, 6.00 [07:23:02] PROBLEM Current Load is now: WARNING on signwriting-ase i-000002f5 output: WARNING - load average: 4.57, 4.96, 5.22 [07:23:04] PROBLEM Current Load is now: WARNING on psm-precise i-000002f2 output: WARNING - load average: 6.01, 8.45, 7.97 [07:23:10] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 7.18, 6.86, 6.57 [07:23:10] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 3.07, 4.43, 5.50 [07:23:10] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:23:10] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:10] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:18] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:20] PROBLEM Disk Space is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:20] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [07:23:20] PROBLEM Total Processes is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:44] RECOVERY Current Users is now: OK on integration-apache1 i-000002eb output: USERS OK - 0 users currently logged in [07:23:44] RECOVERY Disk Space is now: OK on integration-apache1 i-000002eb output: DISK OK [07:23:44] PROBLEM Current Load is now: WARNING on integration-apache1 i-000002eb output: WARNING - load average: 7.75, 14.06, 16.26 [07:24:02] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [07:24:09] RECOVERY Total Processes is now: OK on integration-apache1 i-000002eb output: PROCS OK: 110 processes [07:24:14] RECOVERY dpkg-check is now: OK on integration-apache1 i-000002eb output: All packages OK [07:24:16] PROBLEM Current Load is now: CRITICAL on deployment-apache31 i-000002d4 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:52] RECOVERY HTTP is now: OK on integration-apache1 i-000002eb output: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.005 second response time [07:24:52] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK [07:24:52] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:52] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:04] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 13.20, 10.79, 9.81 [07:25:13] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:18] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:22] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 0.92, 2.34, 4.25 [07:26:25] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:25] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:43] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [07:26:43] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 88% free memory [07:26:43] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 84 processes [07:26:59] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 1.88, 4.16, 5.66 [07:27:35] RECOVERY Current Load is now: OK on signwriting-ase i-000002f5 output: OK - load average: 1.57, 3.76, 4.72 [07:27:35] RECOVERY Current Load is now: OK on pybal-precise i-00000289 output: OK - load average: 3.24, 3.71, 4.91 [07:27:35] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 0.42, 3.56, 5.26 [07:27:35] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [07:27:35] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 84 processes [07:27:40] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 93% free memory [07:27:40] RECOVERY Disk Space is now: OK on bots-sql2 i-000000af output: DISK OK [07:27:40] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:50] RECOVERY Total Processes is now: OK on bots-sql2 i-000000af output: PROCS OK: 91 processes [07:27:55] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:28:30] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%): [07:28:30] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [07:28:53] PROBLEM Free ram is now: UNKNOWN on incubator-bot1 i-00000251 output: NRPE: Call to fork() failed [07:29:28] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [07:29:28] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 31% free memory [07:29:28] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 10% free memory [07:29:28] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:30:13] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 1.78, 3.54, 4.69 [07:30:37] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 12.81, 13.64, 15.60 [07:30:37] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [07:30:37] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [07:30:37] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 84 processes [07:30:42] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [07:32:05] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in [07:32:05] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK [07:32:45] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.21, 1.36, 3.88 [07:32:53] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 2.23, 2.84, 4.90 [07:32:54] RECOVERY Current Load is now: OK on psm-precise i-000002f2 output: OK - load average: 1.12, 2.65, 4.90 [07:32:54] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.07, 1.33, 3.79 [07:33:35] apergos: the labs went wild this weekend due to the leap sec bug :-D [07:33:45] he is not there of course bah [07:33:53] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:34:03] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 1.09, 3.43, 5.39 [07:34:03] PROBLEM Current Load is now: WARNING on deployment-apache31 i-000002d4 output: WARNING - load average: 0.42, 3.46, 5.44 [07:34:50] hasher: How do you identify if a host is suffering from the bug? [07:35:13] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af output: All packages OK [07:35:13] RECOVERY Current Load is now: OK on wikidata-dev-3 i-00000225 output: OK - load average: 2.76, 3.37, 4.60 [07:35:14] Hydriz: high load for now reason [07:35:20] Hydriz: some process stuck in 100% [07:35:30] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK [07:35:30] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 139 processes [07:35:45] Hydriz: you have to manually reboot the box I think. Labsconsole does not seem to be able to trigger reboots anymore [07:35:46] hashar: its currently due to gluster :P [07:35:54] I mean, right now [07:35:56] Hydriz: :-)))))))))))))))))))) [07:36:00] what is your project? 
[07:36:06] I will help you identify them [07:36:17] hmm, its okay :P [07:36:25] RECOVERY Current Load is now: OK on hugglewiki i-000000aa output: OK - load average: 1.03, 1.58, 4.05 [07:37:01] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 1 users currently logged in [07:37:02] just waiting for labs to cool down right now [07:37:28] Hydriz: also, any machine that started showing 100% cpu usage on one core since sunday midnight is definitely bugged :-] [07:37:41] for bots project :http://ganglia.wmflabs.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=bots&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [07:37:58] that would be bots-3 and bots-sql3 at least [07:38:08] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 1.71, 2.71, 4.51 [07:38:14] ah [07:38:15] I see [07:38:20] Hydriz: bots-sql3 it is probably just mysql being wild. So just try to restart it [07:38:47] nope, none of the hosts I operate suffers under this :P [07:39:03] ohh you are on dumps project aren't you ? [07:39:08] RECOVERY Current Load is now: OK on configtest-main i-000002dd output: OK - load average: 1.37, 2.11, 4.71 [07:39:29] yep [07:39:32] and incubator [07:40:12] well incubator-apache show a bit more CPU usage than before [07:40:18] same for incubator-bot0 [07:40:21] but that might not be related [07:40:28] RECOVERY Current Load is now: OK on translation-memory-2 i-000002d9 output: OK - load average: 5.34, 4.49, 4.94 [07:41:05] possibly not, its always high for no apparent reason [07:41:16] was thinking of switching to squid instead [07:41:28] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:41:31] (referring to -apache) [07:42:48] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 14.06, 17.47, 17.10 [07:44:12] grr Mount Everest kind of load graph... [07:45:01] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:02] Hydriz: well just to be safe, I would reboot all my instance anyway :-] [07:45:32] grr load increasing again [07:47:00] reboot! hehe [07:48:41] RECOVERY Current Load is now: OK on integration-apache1 i-000002eb output: OK - load average: 0.77, 0.77, 4.43 [07:49:16] hashar: Is it possible for me to join deployment-prep, so that I can look around how things are set up? [07:49:30] sure :-D [07:49:34] still need to write a doc though [07:49:51] yeah, some rough ideas are on wiki, I see [07:50:08] Successfully added Hydriz to deployment-prep. [07:50:13] :) [07:50:15] thanks! [07:50:20] !log deployment-prep Gave access to the cluster to Hydriz [07:50:24] Logged the message, Master [07:50:37] Hydriz: the main machine is deployment-dbdump , some conf is in /home/wikipedia/ [07:50:46] hmm [07:50:46] Hydriz: and most of the setup is in puppet actually [07:51:01] is there any testing for the dumps infrastructure on that project? [07:51:19] nop [07:51:25] the name is misleading :-]]] [07:51:33] hmm, not sure if apergos needs to do so [07:51:47] anyway its not under my control haha [07:51:50] 07/02/2012 - 07:51:50 - Created a home directory for hydriz in project(s): deployment-prep [07:52:40] ah, virt cluster load has gone down... 
[07:52:50] 07/02/2012 - 07:52:50 - User hydriz may have been modified in LDAP or locally, updating key in project(s): deployment-prep [07:54:40] PROBLEM host: dumps-1 is DOWN address: i-00000170 CRITICAL - Host Unreachable (i-00000170) [08:02:51] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 2.64, 2.59, 4.00 [08:05:59] hashar: Before I do anything, what can I actually do on the cluster? [08:06:08] look :-] [08:06:18] still in the process of documenting it [08:06:47] and I still need to apply some changes from labs to production [08:06:50] or in short, don't run any kind of script? :P [08:06:57] exactly :-D [08:06:58] sorry [08:07:03] lol haha [08:07:09] will be better once we have some documentation [08:07:21] yeah, its hell messy [08:10:15] RECOVERY host: dumps-1 is UP address: i-00000170 PING OK - Packet loss = 0%, RTA = 0.74 ms [08:10:54] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 2.41, 3.57, 4.84 [08:12:34] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [08:13:44] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 31.84, 20.73, 18.05 [08:16:05] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:17:54] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.65, 2.05, 2.91 [08:18:34] PROBLEM Current Load is now: WARNING on translation-memory-2 i-000002d9 output: WARNING - load average: 5.01, 5.34, 5.07 [08:18:44] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 12.46, 17.25, 17.56 [08:20:54] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 19.90, 13.42, 13.35 [08:28:55] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 32.08, 23.40, 19.15 [08:29:55] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 5.25, 5.25, 5.06 [08:32:56] Hydriz: don't you get access to the bots project ? [08:33:08] hmm? [08:33:17] I have access to the bots project [08:34:00] Hydriz: would you mind rebooting bots-sql3 ? :-] [08:34:07] going to cause some issues I guess [08:34:12] hmm [08:34:15] PROBLEM Free ram is now: UNKNOWN on incubator-bot1 i-00000251 output: NRPE: Call to fork() failed [08:34:15] but that machine is doing 100% CPU cause of a kernel bug [08:34:21] I can reboot it using the command line [08:34:27] but not from the labsconsole thing [08:34:28] that would be great [08:34:33] yeah labsconsole seems bugged [08:34:42] I wonder if it will cause things to break :P [08:34:58] I don't have sysadmin access to the project [08:35:02] ohh [08:35:05] so we have to wait :-] [08:36:12] sudo policy allows me to sudo everything in the project, but no access on the wiki side :( [08:36:25] and the deployment-prep cluster... [08:36:33] I can't seem to view the sudo policy [08:37:15] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[08:37:15] PROBLEM SSH is now: CRITICAL on incubator-bot1 i-00000251 output: Server answer:
[08:37:17] Hydriz: if you got sudo access, just reboot the bots-sql3 using sudo reboot
[08:37:19] that would do it
[08:37:35] yeah, I know :P
[08:37:36] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes
[08:37:46] but I am having trouble logging in to the server itself
[08:37:53] its stuck :(
[08:39:22] hashar: I see how small the sql server is lol
[08:40:08] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 4.43, 4.84, 4.96
[08:41:24] Hydriz: you might need to raise the ssh client default timeout. Try sshing to bastion then: ssh -o ConnectTimeout=600 bots-sql3
[08:41:43] yeah, I am in bots-sql3
[08:41:53] grats
[08:42:06] and its possible that its due to bots-2, and ultimately by Beetstra's bots
[08:42:16] RECOVERY SSH is now: OK on incubator-bot1 i-00000251 output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[08:42:30] * Beetstra hides
[08:42:30] and the leap second bug
[08:42:34] :P
[08:42:56] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[08:42:56] I can't help that people spam Wikipedia so much that one instance is hardly enough to run LiWa3 ..
[08:43:17] get a big server :)
[08:43:28] I am using 16G myself :P
[08:43:48] Well, you were already a bigger server ...
[08:43:52] you .. labs that is
[08:44:12] hehe
[08:44:17] The previous box (a 4 processor Sun Sparc with 8 Gb internal memory) had troubles running the bots
[08:44:21] Hydriz: is it rebooting ?
[08:44:29] not yet :P
[08:44:32] I got to check bots-2 too
[08:44:38] Beetstra: so optimize the bots to use less CPU :-]
[08:44:54] hmm... deployment-transcoding isn't ssh-able
[08:45:05] there is a limit to optimisation ...
[08:45:28] Beetstra: Shall I be evil and reboot the sql3 server, and make your bots go mad? :P
[08:45:46] you think that they care?
[08:46:13] lol
[08:46:54] the sub-process has the following:
[08:46:55] eval {
[08:46:55] $mysql=DBI->connect("dbi:mysql:linkwatcher;bots-sql2",$settings{'mysqluser'},$settings{'mysqlpassword'}) or die "Can't connect to MySQL: $DBI::errstr";
[08:46:55] };
[08:46:55] if ($@) {
[08:46:56] die "Dying painfully because of faulthy mysql handle - error $@\n";
[08:47:00] }
[08:47:02] $mysql->{mysql_auto_reconnect} = 1;
[08:47:04]
[08:47:06] unless ($mysql) {
[08:47:08] die "Dying painfully because of faulthy mysql handle";
[08:47:12] }
[08:47:16] so .. if it dies, it respawns until it connects again - reboot what you want ...
[08:47:16] hmm
[08:47:22] okie :)
[08:47:56] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output
[08:48:16] PROBLEM Current Load is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host
[08:48:21] !log bots Rebooting bots-sql3 as we humans are born evil (aka leap second bug/high CPU)
[08:48:22] Logged the message, Master
[08:49:05] actually .. all my bots should by now use bots-sql2 ...
:-D [08:49:31] right, then who is using bots-sql3 [08:51:16] PROBLEM dpkg-check is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:51:51] where in the world is the foreachwiki script [08:52:16] PROBLEM Total Processes is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:52:22] PROBLEM SSH is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused [08:52:56] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [08:53:06] PROBLEM Disk Space is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:53:06] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 6.17, 6.09, 5.51 [08:53:47] PROBLEM Current Users is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:56:16] RECOVERY dpkg-check is now: OK on bots-sql3 i-000000b4 output: All packages OK [08:56:44] Special:NovaSudoer Y U NO LOAD [08:57:16] RECOVERY Total Processes is now: OK on bots-sql3 i-000000b4 output: PROCS OK: 69 processes [08:58:06] RECOVERY Disk Space is now: OK on bots-sql3 i-000000b4 output: DISK OK [08:58:16] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 9.16, 9.62, 9.11 [08:58:51] RECOVERY Current Users is now: OK on bots-sql3 i-000000b4 output: USERS OK - 0 users currently logged in [09:13:23] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:21:10] just remembered that labsconsole can't reboot instances for now [09:25:48] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 10% free memory [09:43:38] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 206 processes [09:44:20] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:48:36] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [09:58:22] PROBLEM host: bots-sql3 is DOWN address: i-000000b4 CRITICAL - Host Unreachable (i-000000b4) [10:00:34] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [10:02:28] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 2.17, 3.25, 3.04 [10:06:09] !log deployment-prep ukwiki was giving errors regarding flaggedrevs's flaggedpages table not existing. Fixed it by running mwscript update.php ukwiki. [10:06:10] Logged the message, Master [10:07:42] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 3.08, 2.88, 2.91 [10:08:42] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%): [10:10:22] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 14.00, 13.39, 14.79 [10:11:19] Hydriz: grats :) [10:11:26] hmm? [10:12:30] lol what did I do [10:14:40] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:16:48] Hydriz: fixed the flagged revs tables :) [10:16:54] lol [10:17:08] I believe most of the docs in Wikitech is applicable here? [10:17:30] almost [10:17:48] I spend a good part of may upgrading beta to use the tools from production [10:18:11] heh [10:18:21] oh yes [10:18:25] RECOVERY SSH is now: OK on bots-sql3 i-000000b4 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:18:26] do we use swift on beta? 
[10:18:37] RECOVERY host: bots-sql3 is UP address: i-000000b4 PING OK - Packet loss = 0%, RTA = 1.10 ms [10:19:01] !log deployment-prep Easy fix for the leap second bug: /etc/init.d/ntp stop; date `date +"%m%d%H%M%C%y.%S"`; /etc/init.d/ntp start [10:19:03] Logged the message, Master [10:20:24] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 3.64, 2.80, 1.28 [10:21:46] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 6.03, 4.50, 3.62 [10:29:07] hehe, abuse and run addwiki.php to create a private wiki just for me on beta cluster [10:33:31] Hydriz: please dont [10:33:35] Hydriz: will be deleted [10:33:40] :P [10:33:55] Hydriz: still have a lot of work to do before adding new wikis for testing [10:34:10] I wonder if there are evil people that are going to do that haha [10:34:36] if you want to test out, create a new instance on one of your projects :-) [10:36:05] 07/02/2012 - 10:36:05 - Updating keys for bachsau at /export/keys/bachsau [10:40:12] 07/02/2012 - 10:40:12 - Updating keys for bachsau at /export/keys/bachsau [10:44:54] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:45:06] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 9.04, 7.53, 5.85 [10:54:10] hashar: If you got the time now, can you fix deployment.wikimedia.beta.wmflabs.org? It seems to be down for no reason [10:54:35] there is a bug about it [10:54:39] since beginning of may [10:54:44] Ì don't think it ever worked anyway [10:55:08] ever? lool [10:57:23] deployment.beta.wmflabs.org returns the holding wiki for not exisitng, deployment.wikimedia.beta.wmflabs.org returns an apache error - not worked as long as I've known it exists. [11:00:22] Hmm, how did I fix the Locale errors for the instances a few months back? [11:01:11] any progress of the planned reversed http proxy? 
[11:14:06] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 31.79, 23.09, 17.41 [11:14:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:15:09] RECOVERY Current Load is now: OK on bots-3 i-000000e5 output: OK - load average: 4.30, 4.48, 4.88 [11:19:00] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.21, 17.25, 16.74 [11:32:28] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 30.69, 20.05, 17.91 [11:37:28] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 11.57, 16.20, 17.07 [11:38:21] PROBLEM Total Processes is now: WARNING on bots-3 i-000000e5 output: PROCS WARNING: 151 processes [11:45:20] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:15:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:45:49] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:50:20] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [12:55:32] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [12:58:37] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 30.50, 20.13, 17.76 [13:09:47] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 30.50, 20.22, 16.45 [13:14:51] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.77, 17.01, 16.43 [13:15:51] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:22:17] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours [13:30:14] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:41] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 11% free memory [13:40:57] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [13:42:36] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:46:29] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:49:07] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 69 MB (5% inode=57%): [13:55:11] <^demon> hashar: Thanks for the review :) [13:55:19] <^demon> I'm hoping to get the labs install using puppet today [13:55:20] yw :) [13:55:39] need to find some ops to submit your change now :) [13:55:45] meanwhile, you can use puppetmaster::self [13:55:55] that will install a git clone of operations/puppet on your instance [13:56:13] then you can make your changes directly in /var/lib/git/operations/puppet && puppetd -tv [13:56:21] you will have to manually update from the reference repo though [13:56:29] cd /var/lib/git/operations/puppet && git pull --rebase [13:56:50] that is a great way to make your puppet changes, test them and then submit for review notifying that works for you [13:57:34] ^demon: also Nike opened a bug about having the rows in the change list to alternate colors [13:57:48] <^demon> I saw. 
[13:58:26] so hmm [13:58:28] it is 4pm [13:58:46] I have done nothing but fixing labs instances, investigating db issue and merging a few changes :( [13:59:17] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [13:59:43] hashar: I know that feeling, it's a monday [14:02:26] <^demon> hashar: Since you're on the subject, you're more than welcome to help with the gerrit instance :) [14:02:50] Damianz: that happens to me more and more often. I guess I have too many projects in parallels :D [14:03:13] ^demon: well already have to manage a whole beta cluster :-] [14:03:21] <^demon> :) [14:03:27] which does not work yet [14:03:28] <^demon> File["/root/.my.cnf"] -- why is this a dependency?? [14:03:35] <^demon> No docs :( [14:03:43] I pushed a change last week that killed stylesheets on enwiki (at least) :-(((( [14:04:00] I kinda prefered the minamalstic look [14:04:09] ^demon: that is probably a file coming from the private repo to setup some password? [14:04:14] ^demon: ask a root? [14:04:25] And yeah, should help with beta more but my projects are insanly time consuming and I suck at making mw work on more than 1 instance :P [14:04:37] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 19% free memory [14:04:45] <^demon> hashar: Nope, gerrit::database-server isn't included in production since it runs on db9. [14:05:02] no more idea sorry [14:05:16] <^demon> Yeah, gonna have to pester Ryan once he shows up. [14:05:46] Can we put gerrit's db on the same box and make it fast again? :D [14:05:56] <^demon> no. [14:06:02] <^demon> but we can put them in the same cluster again. [14:06:25] That works too :D [14:07:10] <^demon> s/cluster/datacenter/ [14:10:13] !log chmod /mnt/public_html/damian/api.php to 000 on bots-apache1 - think the report/review sync for cbng is broken and looping, testing if this fixed the bw/spam issues. [14:10:15] chmod is not a valid project. [14:10:18] !log bots chmod /mnt/public_html/damian/api.php to 000 on bots-apache1 - think the report/review sync for cbng is broken and looping, testing if this fixed the bw/spam issues. [14:10:20] Logged the message, Master [14:10:25] -.- [14:10:30] * Damianz eats labs-morebots, figure out what I mean dammit [14:16:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:18:20] ^demon: we had a very nasty kernel bug over the weekend [14:18:28] <^demon> Stupid leap second? [14:18:30] yup [14:19:03] Tim found an easy fix: stop the ntp service, set date manually, start ntp again [14:19:14] instantly drop the load to the groun [14:19:15] d [14:19:26] nike did open a bug this morning about Gerrit being slow. That solve it [14:19:29] solved it [14:19:31] damn I am tired [14:21:33] <^demon> go to bed? [14:23:06] by the time I am home, I will get to bed at 5pm [14:23:10] not a nice idea ;) [14:23:17] I will fill disoriented [14:23:21] feel [14:23:28] STUPID APPLE AUTO CORRECTER [14:47:40] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:11:40] Trying to get everything configured, I'm having trouble using SSH to access an instance using agent forwarding [15:12:18] debug1: Offering public key: /Users/stephenslevinski/.ssh/id_rsa [15:12:26] debug1: Server accepts key: pkalg ssh-rsa blen 277 [15:12:33] debug1: read PEM private key done: type RSA [15:12:40] Connection closed by 208.80.153.207 [15:16:20] I added my SSH key to gerrit. The same key is working fine for github. 
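On the File["/root/.my.cnf"] question ^demon raised earlier this hour: .my.cnf is MySQL's per-user client configuration file, so a /root/.my.cnf shipped from the private repo would typically hold admin credentials that let provisioning steps run mysql non-interactively, which is presumably why gerrit::database-server depends on it. A hedged sketch of that pattern follows; the resource names, the private-repo path, and Gerrit's conventional reviewdb database name are all illustrative assumptions, not confirmed by this log.

# Illustrative only, not the actual gerrit::database-server manifest.
file { '/root/.my.cnf':
    ensure => present,
    owner  => 'root',
    group  => 'root',
    mode   => '0600',
    source => 'puppet:///private/mysql/root.my.cnf',  # hypothetical private-repo path
}

# With credentials in place, database bootstrap can run without a password prompt.
exec { 'create-reviewdb':
    command => '/usr/bin/mysql -e "CREATE DATABASE IF NOT EXISTS reviewdb;"',
    unless  => '/usr/bin/mysql -e "SHOW DATABASES;" | /bin/grep -qx reviewdb',
    require => File['/root/.my.cnf'],
}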
[15:16:50] following the instructions on https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [15:17:33] confused by the line: "ssh-add ~/.ssh/labs" because of error "No such file or directory" [15:18:09] slevinski: You have to add your keys into gerrit seperatly [15:18:11] slevinski: that line presumes you put your key in a file called "labs" [15:18:18] There's an open bug about it not pulling them from ldap [15:18:23] it could also have any other name [15:18:42] (Settings -> SSH public keys -> Add) [15:20:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:20:47] Yes, I did add my SSH key to gerrit already. Able to see them in lab console as well. [15:29:40] .win 7 [15:29:48] (sorry, irssi fail) [15:41:38] hm... deployment is kaput today? [15:47:56] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 10.35, 11.23, 12.57 [15:50:56] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:55:52] andrewbogott: had to reboot several instance yesterday cause of a kernel bug [15:56:04] andrewbogott: but it should be more or less working though still not as expected [15:56:12] mainly bits not enabled yet :( [15:57:02] andrewbogott: also the whole labs have some issue due to that kernel bug :/ [15:57:29] hashar: I get nothin' but 404s from deployment. [15:57:38] Which, I guess, means it's sort of up :) [15:57:38] which URL ? [15:57:43] hehe [15:58:12] http://deployment.wikimedia.beta.wmflabs.org/wiki [15:58:18] ohh that one [15:58:24] I think it stopped working like 2 months ago [15:58:46] Hm, I've been using it as a test wiki until somewhat recently. What wiki do you advise I use instead? [15:58:48] maybe not hmm [15:58:56] (I'm setting my own up right now anyway, so, not a big concern.) [15:59:15] yup I think it is better to set your own till I finally have beta fixed properly [15:59:27] someone wrote some puppet classes to easily bootstrap a fresh wiki [16:00:28] so indeed http://deployment.wikimedia.beta.wmflabs.org/wiki should work [16:00:32] gr [16:00:43] hashar: 'someone' = me :) [16:01:08] ohhh [16:01:11] \O/ [16:01:27] If I can get an instance to launch at all, this should be easy. [16:01:34] I will fix http://deployment.wikimedia.beta.wmflabs.org/ tonight [16:01:38] for now I am out :-D [16:01:41] sorry bout that [16:01:53] np, just curious. [16:02:10] PROBLEM Current Load is now: WARNING on nova-gsoc1 i-000001de output: WARNING - load average: 20.99, 19.60, 19.33 [16:05:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:09:28] yes all of labs needs to be rebooted [16:09:51] I'd *really* like to move the instances to the new hardware, so that I only need to do it once, though [16:12:30] 'all of labs'? Because of the leap-second? [16:17:33] well, I'm fixing it with the "date" fix [16:17:39] so, I won't need to reboot it [16:17:54] but, the instances will definitely need to be rebooted for moving them to the new hardware [16:18:00] since we're switching away from shared storage [16:21:49] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:26:38] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:31:18] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 10% free memory [16:32:58] RECOVERY Current Load is now: OK on wikistats-01 i-00000042 output: OK - load average: 0.01, 0.68, 3.67 [16:32:58] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 30.07, 18.77, 15.03 [16:35:18] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:37:58] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.30, 16.70, 15.54 [16:41:11] RECOVERY Current Load is now: OK on search-test i-000000cb output: OK - load average: 0.00, 1.25, 4.61 [16:53:03] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:55:51] 07/02/2012 - 16:55:51 - User slevinski may have been modified in LDAP or locally, updating key in project(s): signwriting [16:56:10] 07/02/2012 - 16:56:10 - Updating keys for slevinski at /export/keys/slevinski [16:56:56] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 207 processes [16:58:47] RECOVERY Current Load is now: OK on dev-solr i-00000152 output: OK - load average: 0.45, 1.80, 3.86 [16:59:42] RECOVERY Current Load is now: OK on nova-ldap1 i-000000df output: OK - load average: 0.22, 0.26, 3.95 [17:02:02] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [17:07:02] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:12:55] Out of ideas to get ssh to work for bastion. Reloaded, recreated ssh keys. Changes file permissions. Able to use new keys on github just fine. Failure for bastion.wmflabs.org [17:13:11] Can anyone check the server log for a more meaningful error message? [17:13:18] your name is slevinski there too? [17:13:22] yes [17:13:27] gimme a minute [17:14:58] oh, wait you dont get on bastion itself or you dont get on an instance from there? [17:15:24] labs is slow right now, it may have just not updated your key yet [17:15:42] did you ever see a bot here on IRC telling you it creates a home dir for you / updates your keys? [17:15:54] < labs-home-wm> 07/02/2012 - 16:55:51 - User slevinski may have been modified in LDAP or locally, updating key in project(s): signwriting [17:16:12] <-- ah, like that [17:16:39] 07/02/2012 - 16:56:10 - Updating keys for slevinski at /export/keys/slevinski ..that looks ok so far [17:16:50] I updated the keys several times. [17:17:31] is he in the project? [17:17:41] slevinski: Did you try logigng in before you saw that? [17:17:44] slevinski: what's your username on labsconsole? [17:17:59] same on labconsole [17:18:00] Unless Ryan was awesome and fixed nscd's insane cache time :D [17:18:06] I did not [17:18:09] :( [17:18:37] slevinski: you weren't a member of the bastion project [17:18:39] gimme a sec [17:18:50] I really need to finish unpacking my house [17:19:04] slevinski: it'll work now [17:19:34] awesome. It works. Thanks a lot. [17:19:53] 07/02/2012 - 17:19:53 - Created a home directory for slevinski in project(s): bastion [17:20:06] Did you get gerrit sorted in the end? 
(I ran off from work to come home) [17:20:52] 07/02/2012 - 17:20:51 - User slevinski may have been modified in LDAP or locally, updating key in project(s): bastion [17:23:59] RECOVERY Current Load is now: OK on nova-essex-test i-000001f9 output: OK - load average: 0.19, 0.41, 3.74 [17:25:02] <^demon> Ryan_Lane: I was going over the gerrit docs, and I want to enable individual users to have their own branch space that doesn't require approval and you can just push to. Kinda like a sandbox or workspace. [17:25:09] <^demon> From the gerrit docs "References can have the current user name automatically included, creating dynamic access controls that change to match the currently logged in user. For example to provide a personal sandbox space to all developers, refs/heads/sandbox/${username}/* allowing the user joe to use refs/heads/sandbox/joe/foo." [17:25:12] <^demon> Does that sound sane? [17:25:52] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:26:04] ^demon: yes [17:26:31] <^demon> Mmk, I'll go ahead and enable that on All-Projects. It won't affect Read permissions, so any private repos will still be private. [17:26:37] greay [17:26:40] *great [17:38:15] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:39:13] <^demon> Ryan_Lane: Done, and notified wikitech-l. [17:56:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:02:05] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [18:08:25] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [18:20:53] <^demon> hashar: Something to think about that may be faster for both jenkins && gerrit. What if we replicated the repos to jenkins, so you could just do local clone/checkouts? [18:26:25] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:33:00] ^demon: that is what is done already [18:33:10] <^demon> We're using gerrit's replication? [18:33:20] noo not at all [18:33:23] just git clone [18:33:25] for the init [18:33:43] and then git fetch / git remote update / git fetch refs/something [18:34:03] <^demon> ok. [18:34:11] that is the bin/fetch_gerrit_head.sh script in integration/jenkins [18:34:16] I don't think we need to setup a replication [18:34:23] <^demon> Ryan_Lane: I'm trying to get the labs install of gerrit using the puppet config, but I've hit a couple of snags. Quick questions: 1) Is the public/private key for replication purposes, so I can probably skip that for labs? 2) What is /root/.my.cnf? Is that done by private repo? [18:34:24] that is not the slower part anyway :) [18:34:45] oh. sorry. gerrit's puppet manifests are really fucked up [18:34:48] that's going to be hard [18:34:57] one issue I have is that Gerrit Trigger does not seem to react when a change is merged :/ Only on new patchsets [18:35:13] ^demon: yeah, that's done by the private repo [18:35:23] <^demon> I started some cleanup :) https://gerrit.wikimedia.org/r/#/c/13484/ [18:35:51] heh [18:35:55] the package now installs the user [18:36:29] ^demon: have you managed to setup puppetmaster::self ? It is working fine now thanks to Faidon. [18:36:35] I have tested it on some instance :) [18:36:39] <^demon> Yeah, I got self working. 
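[note] The per-user sandbox refs ^demon describes at 17:25 correspond to an access section in the All-Projects project.config. A rough sketch of what such a section might look like, assuming "Registered Users" is the group being granted access and that forced push is wanted in personal sandboxes; the exact group and permission set actually enabled on gerrit.wikimedia.org are not shown in the log:

    [access "refs/heads/sandbox/${username}/*"]
        create = group Registered Users
        push = +force group Registered Users

Because this only touches refs/heads/sandbox/..., it leaves Read permissions alone, which is why private repos stay private as noted above.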
[18:36:44] also, the real way to fix this, is to make gerrit a paramaterized class [18:36:59] noooo [18:37:01] and to move the config into a role class [18:37:01] my nightmare :) [18:37:16] then we can have role::gerrit and role::gerrit::labs [18:37:36] <^demon> Ryan_Lane: Why do I feel like this isn't something I can finish off in an hour or two? [18:37:36] indeed, that has been successful for me [18:37:47] makes the prod / labs split easier to follow [18:38:08] ^demon: welcome in my world :) I will definitely assist as much as can during your mornings [18:38:17] heh [18:38:24] <^demon> Well I was hoping to get it done today or tomorrow :p [18:38:25] ^demon: if you know puppet well enough, you totally can :) [18:38:28] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [18:38:53] ^demon: make the config options parameters to the gerrit class [18:38:58] remove the config class [18:39:14] and call the gerrit class from a role::gerrit class in role/gerrit.pp [18:39:39] set the config through the class call [18:40:16] add sane defaults for parameters, where possible [18:50:16] Ryan_Lane: I was at OSBridge in Portland last week, talking to some people that also do puppet, and I realized that we don't have a public tracker for things that are wrong with our puppet repo (like, "Squid not puppetized at all", "Gerrit manifest not reusable", etc.), so it's difficult for volunteers that want to clean up our puppet code to know where to start [18:56:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:02:25] RoanKattouw: we have a Labs product [19:02:29] let's make a puppet component [19:03:34] I noticed that, yeah [19:03:35] Then they can sit in bz and be ignored for, forever :D [19:03:37] That seems reasonable [19:03:48] I figured we might have some tickets in RT for it, but these tickets should really be public [19:03:59] And I also suspect many of these things don't have tickets filed for them at all [19:08:54] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [19:23:33] <^demon> qotd: "good thing this isn't the mars rover or something else expensive" [19:26:43] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:39:03] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [19:47:54] !log deployment-prep hacking to get files on bits.beta [19:47:56] Logged the message, Master [19:51:37] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 31% free memory [19:51:37] RECOVERY Total Processes is now: OK on bots-3 i-000000e5 output: PROCS OK: 115 processes [19:56:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:09:37] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [20:17:22] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:22:16] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [20:28:36] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:28:55] !log deployment-prep deployed a hack in mw config using https://gerrit.wikimedia.org/r/#/c/13932/ . That is a simply git fetch, pending review. 
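[note] The refactor Ryan_Lane outlines at 18:38-18:40 (make the gerrit class parameterized, drop the config class, set values from role::gerrit / role::gerrit::labs in role/gerrit.pp) follows the usual role-class pattern. A sketch only, with invented parameter names and values; the real gerrit manifest has many more settings:

    # manifests/gerrit.pp -- parameters with sane defaults replace the old config class
    class gerrit(
        $host    = 'gerrit.example.org',   # hypothetical parameter
        $db_host = 'localhost'             # hypothetical parameter
    ) {
        package { 'gerrit':
            ensure => present,
        }
        service { 'gerrit':
            ensure  => running,
            enable  => true,
            require => Package['gerrit'],
        }
    }

    # manifests/role/gerrit.pp -- prod and labs each pass their own settings
    class role::gerrit {
        class { 'gerrit':
            host    => 'gerrit.wikimedia.org',
            db_host => 'gerrit-db.example.net',   # illustrative value only
        }
    }
    class role::gerrit::labs {
        class { 'gerrit':
            host => 'gerrit-dev.wmflabs.org',     # illustrative value only
        }
    }

This is what makes the prod / labs split easier to follow: the shared logic lives in one place, and only the role classes differ.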
[20:28:57] Logged the message, Master [20:40:17] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [20:45:59] !log deployment-prep set back robots.txt to disallow / [20:46:01] Logged the message, Master [20:47:38] !log deployment-prep Removed Hydriz from deployment-prep. Messed up the whole dblist files :-D Contact me! ;) [20:47:40] Logged the message, Master [20:50:35] !log deployment-prep restarting squid to purge whole cache (yeah I know that is lame) [20:50:36] Logged the message, Master [20:58:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [21:10:30] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [21:15:39] andrewbogott: so, for tests… what kind of stuff are they looking for? [21:15:53] I have the bgp support added, and working [21:16:00] I just need to add tests now [21:16:18] Hm... [21:16:50] Generally the process is to stub out any external API calls... then write tests that traverse your code and verify that the stubs were called as expected. [21:17:17] I don't mind writing the tests for you, if figuring out about stub/mox sounds like no fun. [21:17:23] heh [21:17:44] well, the only calls that can affect this are associate and disassociate floating ip [21:17:54] and anything that calls that [21:18:03] so, delete could also affect that [21:18:39] Is your patch on gerrit someplace? [21:18:45] not yet [21:18:54] I want to do a few more manual tests first [21:18:59] If you aren't adding any new function calls, then you can probably just add extra asserts to existing tests. [21:19:00] best part? no more rootwrap :) [21:19:07] the process doesn't need to run as root, at all [21:19:12] cool! [21:19:44] I could have done this as a plugin [21:20:01] I think it's useful enough to be core, though [21:20:17] (the plugin hooks already exist and everything) [21:20:42] If they'll take it as core, then core is good. [21:20:49] * Ryan_Lane nods [21:20:55] Did the thing I said about 'just add extra asserts' make sense? [21:20:59] yes [21:21:30] My experience is that people notice if there are not tests, but aren't usually very uptight about what the tests actually do. [21:21:33] For better or for worse [21:21:38] heh [21:24:55] hm [21:26:19] well, it doesn't make much sense to add asserts for associate and disassociate floating ip [21:26:24] this doesn't change its behavior [21:26:27] I think [21:27:25] it just adds an extra function call in three places, and doesn't affect the behavior [21:28:26] and the function it calls doesn't raise exceptions [21:28:49] so, I think I just need to add tests for the exabgp functions I added [21:28:57] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [21:29:37] PROBLEM Total Processes is now: WARNING on psm-precise i-000002f2 output: PROCS WARNING: 151 processes [21:40:47] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [21:40:53] Ryan_Lane: It'd also be good to test that those extra function calls happen (possibly as part of the same test) [21:45:45] how can I verify that? [21:49:47] So, your new code adds functions which themselves make system calls, right? [21:50:41] You can stub out those system calls, replacing them with mini functions that just log the arguments they received into a global. 
Then, at the bottom of the existing test (that calls the function that calls your function) check the logged args to make sure they make sense. [21:51:04] That tests everything in one go, I think. [21:51:47] ah [21:51:49] makes sense [21:51:52] cool [21:52:31] There should be examples of how to stub out an external call littered around pretty much any test you look at. [21:53:20] * Ryan_Lane nods [21:53:26] seems I'm not checking something properlty [21:53:37] it's launching multiple versions of exabgp [21:53:38] heh [21:53:47] gotta fix that, for sure :) [21:54:29] <3 mock [21:55:44] andrewbogott: like this code? [21:55:45] if FLAGS.fake_network: [21:55:45] LOG.debug('FAKE NET: %s', ' '.join(map(str, cmd))) [21:55:45] return 'fake', 0 [21:58:51] Fake net aka ASN 645 [21:58:58] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:00:03] Ryan_Lane: That's not a pattern that I've used, but it looks like what you want. [22:00:11] the thing I was thinking of is more like this: [22:00:12] def fake_get_nw_info(cls, ctxt, instance): [22:00:12] self.assertTrue(ctxt.is_admin) [22:00:18] self.stubs.Set(nova.network.API, 'get_instance_nw_info', [22:00:18] fake_get_nw_info) [22:00:18] bah. the way I was doing this the first time worked fine [22:00:34] ahhh [22:00:34] ok [22:00:52] So, I guess what I said before isn't quite right; you don't have to log the args if you're only using the fake function once, you can just assert valid args. [22:01:03] But you also need to raise a flag verifying that the fake function was called at all, and check that higher up. [22:01:12] (Otherwise a failure to get called looks like success) [22:01:22] right [22:01:46] that's easy enough. I can set a fake function, have it set something, then check it [22:02:33] Ryan_Lane: Could you fix prod pretty please :) [22:02:41] Damianz: fix *what* in prod? [22:02:49] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 22:02:39 GMT [22:02:52] the network stuff that two people have been working on all day? [22:02:52] 503 errors [22:03:15] :( That still not fixed? [22:03:28] seeing as that they're on the phone with juniper, I don't think I can help much [22:03:33] Looking at nagios spam I'd say not, *goes back to leaving you alone* [22:04:28] Juniper... that explains a lot, wonder if they've fixed the VRRP issue I hit last time I was shifting switches around yet =/ [22:09:42] Ryan_Lane: Heading to dinner, will be back on this evening. [22:10:04] ok. 
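[note] The stub-and-check approach andrewbogott describes between 21:50 and 22:01 (replace the external call with a mini function that logs its arguments, then assert on what was logged so a never-called stub shows up as a failure) looks roughly like the self-contained sketch below. It is a generic illustration of the pattern, not nova's actual test code; nova's own tests of this era use self.stubs.Set, as quoted at 22:00, and every name here is invented:

    import unittest

    # Hypothetical class under test: associate() ends by making an external call
    # (the stand-in for shelling out to exabgp) that the test wants to stub.
    class FloatingIPManager(object):
        def announce_route(self, ip):
            raise RuntimeError("would call out to exabgp")

        def associate(self, ip):
            # ...bookkeeping elided...
            self.announce_route(ip)

    class TestAssociate(unittest.TestCase):
        def test_associate_announces_route(self):
            calls = []  # the "global" the fake logs its arguments into

            manager = FloatingIPManager()
            # Stub out the external call with a mini function that records args.
            manager.announce_route = lambda ip: calls.append(ip)

            manager.associate('10.4.0.17')

            # If the stub was never reached, calls is empty and this fails loudly;
            # otherwise it sanity-checks the argument the real code passed.
            self.assertEqual(calls, ['10.4.0.17'])

    if __name__ == '__main__':
        unittest.main()

The assertEqual at the end covers both points raised above: it verifies the fake was called at all and that the arguments make sense, in one assertion.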
see you later [22:10:48] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [22:30:27] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:41:07] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [23:01:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:12:11] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [23:22:11] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours [23:32:10] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:43:30] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [23:57:42] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [23:59:12] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours