[00:01:08] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 30.75, 22.43, 18.59 [00:15:49] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [00:22:10] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [00:23:14] RECOVERY Total Processes is now: OK on bots-3 i-000000e5 output: PROCS OK: 137 processes [00:25:04] PROBLEM Current Load is now: CRITICAL on nova-production1 i-0000007b output: CRITICAL - load average: 19.67, 19.56, 20.16 [00:27:04] PROBLEM Current Users is now: CRITICAL on bastion-restricted1 i-0000019b output: USERS CRITICAL - 16 users currently logged in [00:27:44] PROBLEM HTTP is now: CRITICAL on maps-test2 i-00000253 output: CRITICAL - Socket timeout after 10 seconds [00:28:34] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:32:14] PROBLEM SSH is now: CRITICAL on nova-production1 i-0000007b output: Connection refused [00:32:24] PROBLEM Free ram is now: CRITICAL on nova-production1 i-0000007b output: Connection refused by host [00:32:24] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 8.92, 7.66, 6.04 [00:33:41] PROBLEM dpkg-check is now: CRITICAL on nova-production1 i-0000007b output: Connection refused by host [00:35:05] PROBLEM Total Processes is now: CRITICAL on nova-production1 i-0000007b output: Connection refused or timed out [00:35:52] PROBLEM Current Users is now: CRITICAL on nova-production1 i-0000007b output: Connection refused or timed out [00:36:33] PROBLEM Disk Space is now: CRITICAL on nova-production1 i-0000007b output: Connection refused or timed out [00:44:08] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [00:54:59] RECOVERY host: nova-production1 is UP address: i-0000007b PING OK - Packet loss = 0%, RTA = 93.10 ms [00:57:29] RECOVERY SSH is now: OK on nova-production1 i-0000007b output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:59:14] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [00:59:14] RECOVERY Current Users is now: OK on nova-production1 i-0000007b output: USERS OK - 0 users currently logged in [00:59:24] RECOVERY Disk Space is now: OK on nova-production1 i-0000007b output: DISK OK [00:59:43] I wanted to use a local repository, so I included the puppet class misc:labsdebrepo and ran puppet [00:59:53] But it fails with "change from absent to directory failed: Cannot create /data/project/repo; parent directory /data/project does not exist" [01:03:29] RECOVERY Total Processes is now: OK on nova-production1 i-0000007b output: PROCS OK: 118 processes [01:03:34] RECOVERY Current Load is now: OK on nova-production1 i-0000007b output: OK - load average: 5.40, 6.94, 3.95 [01:04:19] PROBLEM Current Load is now: CRITICAL on nova-gsoc1 i-000001de output: CRITICAL - load average: 18.45, 19.56, 20.13 [01:05:43] PROBLEM Current Load is now: WARNING on translation-memory-1 i-0000013a output: WARNING - load average: 4.18, 5.71, 5.33 [01:10:28] In case someone is around, I also have a question regarding using a local puppet repository to test new configurations? [01:10:29] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.18, 7.55, 5.94 [01:14:39] apmon: which instance is this? 
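A note on the misc::labsdebrepo failure apmon reports above: Puppet's file resource does not create missing parent directories, so the class cannot create /data/project/repo while /data/project (normally the project's shared storage on Labs) is absent. The following is only a rough sketch of that dependency, with an illustrative class name rather than the real operations/puppet manifest; on an actual instance the proper fix may simply be making sure the shared storage is mounted first.

# Hypothetical sketch, not the actual misc::labsdebrepo class.
# Declaring whatever provides /data/project before the repo directory avoids
# the "parent directory /data/project does not exist" failure.
class labsdebrepo_sketch {
    file { '/data/project':
        ensure => directory,
    }

    file { '/data/project/repo':
        ensure  => directory,
        require => File['/data/project'],
    }
}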
[01:14:51] maps-test2
[01:15:29] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 1.79, 3.97, 4.84
[01:15:39] RECOVERY Current Load is now: OK on translation-memory-1 i-0000013a output: OK - load average: 4.14, 4.40, 4.78
[01:16:52] I am finally trying to figure out how to use the local debian repository and puppet master to set up the map rendering stack
[01:17:16] PROBLEM Current Users is now: WARNING on bastion-restricted1 i-0000019b output: USERS WARNING - 9 users currently logged in
[01:17:47] which includes postgresql, postgis, mapnik that are included in the standard ubuntu repository and osm2pgsql, mod_tile and renderd which would need to be local packages
[01:19:06] gimme a min
[01:19:10] thanks
[01:20:30] I am new to puppet (and the wikilabs setup), so I might end up asking a bunch of silly questions until I figure everything out
[01:24:26] RECOVERY Total Processes is now: OK on bastion-restricted1 i-0000019b output: PROCS OK: 115 processes
[01:29:26] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[01:29:26] apmon: ok. should be fixed now.
[01:29:47] great, thanks!
[01:29:51] yw
[01:31:13] I created a postgresql class in the labsconsole and created a postgresql.pp in /var/lib/git/operations/puppet/manifest Should that be picked up, or do I need to do more to create a new class?
[01:32:50] did you add it to the instance via "configure"?
[01:33:30] yes. After running puppet -tv on the instance, it complains it can't find the postgresql class though
[01:34:05] hm
[01:34:36] let me just try it again, and see if the failures from before effected it
[01:35:30] It gives the error message: "err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class postgresql for i-00000253.pmtpa.wmflabs on node i-00000253.pmtpa.wmflabs"
[01:35:39] apmon: you need to add it to site.pp
[01:35:46] via import
[01:37:08] just edit the site.pp file with a test editor and add the line 'import "postgresql.pp"'?
[01:37:17] text editor
[01:37:24] yes
[01:37:51] you also need to add a service to that class
[01:38:17] and have it enable the service. otherwise, on reboots postgres won't start
[01:38:28] also good to have it set to running, so puppet will restart the service if it's down
[01:38:57] Well, at the moment I am just experimenting, so I tried the smallest possible puppet manifest to see if it gets applied
[01:39:03] * Ryan_Lane nods
[01:39:15] Once that works, I'll need to expand it to set up all the configs
[01:39:26] * Ryan_Lane nods
[01:39:31] have fun :)
[01:40:05] I am sure I will ;-)
[01:42:39] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 11.52, 14.29, 16.46
[01:44:29] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:39] PROBLEM Current Load is now: CRITICAL on nova-gsoc1 i-000001de output: CRITICAL - load average: 20.80, 20.00, 20.27
[01:48:29] RECOVERY Current Load is now: OK on bots-3 i-000000e5 output: OK - load average: 4.58, 4.36, 4.88
[01:49:19] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output
[01:52:49] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds.
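For reference, the smallest workable class along the lines Ryan_Lane sketches above would look roughly like this; the package and service names are assumed to be the stock Ubuntu ones, and this is not the manifest apmon actually wrote. With the single-site.pp layout discussed here, the new file also has to be pulled in with the line 'import "postgresql.pp"' in site.pp before labsconsole's "configure" page can apply the class to an instance.

# Rough sketch only, assuming the stock Ubuntu "postgresql" package/service names.
class postgresql {
    package { 'postgresql':
        ensure => present,
    }

    # Per the advice above: enable => true so postgres starts on reboot,
    # ensure => running so puppet restarts the service if it is down.
    service { 'postgresql':
        ensure  => running,
        enable  => true,
        require => Package['postgresql'],
    }
}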
[01:57:39] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 11.94, 15.37, 16.91 [01:59:29] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:22:51] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 31.25, 20.82, 17.97 [02:29:11] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 30.10, 21.13, 16.53 [02:29:51] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [02:34:17] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 10.46, 15.30, 15.43 [02:40:02] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 147 processes [02:40:41] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [03:00:24] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:21] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:01:40] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:54] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 4.97, 4.47, 3.49 [03:05:54] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [03:05:57] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:14] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [03:10:54] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 11% free memory [03:11:20] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 4.73, 5.43, 4.29 [03:16:20] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 2.37, 3.91, 3.97 [03:21:14] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:35] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:31:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [03:33:37] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [03:36:07] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:27] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:58:28] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [03:59:06] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 15% free memory [04:00:16] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 17% free memory [04:00:16] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 16% free memory [04:02:16] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:05:47] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory [04:19:28] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory [04:25:51] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory [04:26:44] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:29:21] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory [04:30:41] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 66% free memory [04:33:31] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [04:36:11] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:40:45] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:50:46] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [05:00:18] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [05:03:18] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 11% free memory [05:04:28] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:05:18] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [05:10:18] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [05:10:23] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [05:15:10] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [05:18:20] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [05:25:00] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.31, 16.61, 15.99 [05:35:05] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [05:52:15] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 4.14, 5.43, 5.03 [05:57:15] RECOVERY Current Load is now: OK on bots-3 i-000000e5 output: OK - load average: 4.22, 4.53, 4.73 [06:05:17] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:32:17] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:33:17] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 153 processes [06:39:14] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [06:43:31] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [06:43:58] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:58] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:44:04] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:37] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:38] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:34] PROBLEM Current Load is now: WARNING on wikidata-dev-2 i-00000259 output: WARNING - load average: 2.93, 4.52, 6.45 [06:50:41] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.95, 9.10, 7.59 [06:59:46] PROBLEM Current Load is now: CRITICAL on nova-ldap1 i-000000df output: CRITICAL - load average: 19.46, 21.58, 20.76 [07:01:40] PROBLEM Total Processes is now: WARNING on incubator-bot2 i-00000252 output: PROCS WARNING: 156 processes [07:02:40] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:46] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: Connection refused or timed out [07:02:52] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:52] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:52] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:52] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:19] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:19] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:19] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:24] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:05:30] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [07:06:19] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:10:13] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:10:24] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 7.08, 6.29, 5.44 [07:10:24] PROBLEM Current Load is now: WARNING on wikidata-dev-3 i-00000225 output: WARNING - load average: 4.33, 4.79, 5.54 [07:10:24] PROBLEM Current Load is now: WARNING on translation-memory-2 i-000002d9 output: WARNING - load average: 9.48, 8.45, 7.94 [07:10:29] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 9.84, 11.07, 10.02 [07:10:35] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:35] PROBLEM Total Processes is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:41] PROBLEM dpkg-check is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:41] PROBLEM Disk Space is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:09] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:09] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:09] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:16] PROBLEM HTTP is now: CRITICAL on integration-apache1 i-000002eb output: CRITICAL - Socket timeout after 10 seconds [07:11:21] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:22] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:22] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:27] PROBLEM Free ram is now: CRITICAL on su-be3 i-000002e9 output: (Service Check Timed Out) [07:14:57] RECOVERY Current Load is now: OK on wikidata-dev-2 i-00000259 output: OK - load average: 1.60, 2.54, 4.27 [07:17:18] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:18] PROBLEM Current Load is now: WARNING on configtest-main i-000002dd output: WARNING - load average: 6.94, 8.67, 8.38 [07:17:23] PROBLEM Current Load is now: WARNING on deployment-apache31 i-000002d4 output: WARNING - load average: 5.90, 7.07, 7.07 [07:17:23] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 6.38, 7.94, 7.12 [07:17:28] PROBLEM Current Users is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:28] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:28] PROBLEM Total Processes is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:17:33] PROBLEM dpkg-check is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:45] PROBLEM Disk Space is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:46] PROBLEM Current Load is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:17] PROBLEM dpkg-check is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:28] RECOVERY Total Processes is now: OK on redis1 i-000002b6 output: PROCS OK: 90 processes [07:19:33] RECOVERY dpkg-check is now: OK on redis1 i-000002b6 output: All packages OK [07:19:34] RECOVERY Disk Space is now: OK on redis1 i-000002b6 output: DISK OK [07:19:34] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 1.00, 2.66, 4.04 [07:19:39] PROBLEM Free ram is now: UNKNOWN on su-be3 i-000002e9 output: NRPE: Unable to read output [07:19:39] PROBLEM Current Load is now: WARNING on nova-ldap1 i-000000df output: WARNING - load average: 16.89, 18.50, 19.57 [07:19:39] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in [07:19:39] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 5.92, 5.87, 5.55 [07:19:39] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK [07:19:40] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 105 processes [07:19:44] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 71% free memory [07:19:44] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK [07:19:44] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:25] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 4.41, 4.43, 5.87 [07:20:25] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 7.04, 8.31, 8.26 [07:20:25] PROBLEM Current Load is now: WARNING on mwreview i-000002ae output: WARNING - load average: 1.74, 4.44, 5.43 [07:21:00] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK [07:21:00] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [07:21:00] RECOVERY Current Users is now: OK on mwreview i-000002ae output: USERS OK - 0 users currently logged in [07:21:00] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 74% free memory [07:23:02] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 11.28, 7.09, 7.46 [07:23:02] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.31, 2.73, 6.00 [07:23:02] PROBLEM Current Load is now: WARNING on signwriting-ase i-000002f5 output: WARNING - load average: 4.57, 4.96, 5.22 [07:23:04] PROBLEM Current Load is now: WARNING on psm-precise i-000002f2 output: WARNING - load average: 6.01, 8.45, 7.97 [07:23:10] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 7.18, 6.86, 6.57 [07:23:10] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 3.07, 4.43, 5.50 [07:23:10] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:23:10] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:10] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:18] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:20] PROBLEM Disk Space is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:20] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [07:23:20] PROBLEM Total Processes is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:44] RECOVERY Current Users is now: OK on integration-apache1 i-000002eb output: USERS OK - 0 users currently logged in [07:23:44] RECOVERY Disk Space is now: OK on integration-apache1 i-000002eb output: DISK OK [07:23:44] PROBLEM Current Load is now: WARNING on integration-apache1 i-000002eb output: WARNING - load average: 7.75, 14.06, 16.26 [07:24:02] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [07:24:09] RECOVERY Total Processes is now: OK on integration-apache1 i-000002eb output: PROCS OK: 110 processes [07:24:14] RECOVERY dpkg-check is now: OK on integration-apache1 i-000002eb output: All packages OK [07:24:16] PROBLEM Current Load is now: CRITICAL on deployment-apache31 i-000002d4 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:52] RECOVERY HTTP is now: OK on integration-apache1 i-000002eb output: HTTP OK: HTTP/1.1 200 OK - 1308 bytes in 0.005 second response time [07:24:52] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK [07:24:52] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:52] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:04] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 13.20, 10.79, 9.81 [07:25:13] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:18] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:22] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 0.92, 2.34, 4.25 [07:26:25] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:25] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:43] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [07:26:43] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 88% free memory [07:26:43] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 84 processes [07:26:59] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 1.88, 4.16, 5.66 [07:27:35] RECOVERY Current Load is now: OK on signwriting-ase i-000002f5 output: OK - load average: 1.57, 3.76, 4.72 [07:27:35] RECOVERY Current Load is now: OK on pybal-precise i-00000289 output: OK - load average: 3.24, 3.71, 4.91 [07:27:35] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 0.42, 3.56, 5.26 [07:27:35] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [07:27:35] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 84 processes [07:27:40] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 93% free memory [07:27:40] RECOVERY Disk Space is now: OK on bots-sql2 i-000000af output: DISK OK [07:27:40] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:27:50] RECOVERY Total Processes is now: OK on bots-sql2 i-000000af output: PROCS OK: 91 processes [07:27:55] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:28:30] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%): [07:28:30] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [07:28:53] PROBLEM Free ram is now: UNKNOWN on incubator-bot1 i-00000251 output: NRPE: Call to fork() failed [07:29:28] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [07:29:28] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 31% free memory [07:29:28] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 10% free memory [07:29:28] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:30:13] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 1.78, 3.54, 4.69 [07:30:37] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 12.81, 13.64, 15.60 [07:30:37] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [07:30:37] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [07:30:37] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 84 processes [07:30:42] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [07:32:05] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in [07:32:05] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK [07:32:45] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.21, 1.36, 3.88 [07:32:53] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 2.23, 2.84, 4.90 [07:32:54] RECOVERY Current Load is now: OK on psm-precise i-000002f2 output: OK - load average: 1.12, 2.65, 4.90 [07:32:54] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.07, 1.33, 3.79 [07:33:35] apergos: the labs went wild this weekend due to the leap sec bug :-D [07:33:45] he is not there of course bah [07:33:53] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:34:03] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 1.09, 3.43, 5.39 [07:34:03] PROBLEM Current Load is now: WARNING on deployment-apache31 i-000002d4 output: WARNING - load average: 0.42, 3.46, 5.44 [07:34:50] hasher: How do you identify if a host is suffering from the bug? [07:35:13] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af output: All packages OK [07:35:13] RECOVERY Current Load is now: OK on wikidata-dev-3 i-00000225 output: OK - load average: 2.76, 3.37, 4.60 [07:35:14] Hydriz: high load for now reason [07:35:20] Hydriz: some process stuck in 100% [07:35:30] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK [07:35:30] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 139 processes [07:35:45] Hydriz: you have to manually reboot the box I think. Labsconsole does not seem to be able to trigger reboots anymore [07:35:46] hashar: its currently due to gluster :P [07:35:54] I mean, right now [07:35:56] Hydriz: :-)))))))))))))))))))) [07:36:00] what is your project? 
[07:36:06] I will help you identify them [07:36:17] hmm, its okay :P [07:36:25] RECOVERY Current Load is now: OK on hugglewiki i-000000aa output: OK - load average: 1.03, 1.58, 4.05 [07:37:01] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 1 users currently logged in [07:37:02] just waiting for labs to cool down right now [07:37:28] Hydriz: also, any machine that started showing 100% cpu usage on one core since sunday midnight is definitely bugged :-] [07:37:41] for bots project :http://ganglia.wmflabs.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=bots&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [07:37:58] that would be bots-3 and bots-sql3 at least [07:38:08] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 1.71, 2.71, 4.51 [07:38:14] ah [07:38:15] I see [07:38:20] Hydriz: bots-sql3 it is probably just mysql being wild. So just try to restart it [07:38:47] nope, none of the hosts I operate suffers under this :P [07:39:03] ohh you are on dumps project aren't you ? [07:39:08] RECOVERY Current Load is now: OK on configtest-main i-000002dd output: OK - load average: 1.37, 2.11, 4.71 [07:39:29] yep [07:39:32] and incubator [07:40:12] well incubator-apache show a bit more CPU usage than before [07:40:18] same for incubator-bot0 [07:40:21] but that might not be related [07:40:28] RECOVERY Current Load is now: OK on translation-memory-2 i-000002d9 output: OK - load average: 5.34, 4.49, 4.94 [07:41:05] possibly not, its always high for no apparent reason [07:41:16] was thinking of switching to squid instead [07:41:28] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [07:41:31] (referring to -apache) [07:42:48] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 14.06, 17.47, 17.10 [07:44:12] grr Mount Everest kind of load graph... [07:45:01] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:02] Hydriz: well just to be safe, I would reboot all my instance anyway :-] [07:45:32] grr load increasing again [07:47:00] reboot! hehe [07:48:41] RECOVERY Current Load is now: OK on integration-apache1 i-000002eb output: OK - load average: 0.77, 0.77, 4.43 [07:49:16] hashar: Is it possible for me to join deployment-prep, so that I can look around how things are set up? [07:49:30] sure :-D [07:49:34] still need to write a doc though [07:49:51] yeah, some rough ideas are on wiki, I see [07:50:08] Successfully added Hydriz to deployment-prep. [07:50:13] :) [07:50:15] thanks! [07:50:20] !log deployment-prep Gave access to the cluster to Hydriz [07:50:24] Logged the message, Master [07:50:37] Hydriz: the main machine is deployment-dbdump , some conf is in /home/wikipedia/ [07:50:46] hmm [07:50:46] Hydriz: and most of the setup is in puppet actually [07:51:01] is there any testing for the dumps infrastructure on that project? [07:51:19] nop [07:51:25] the name is misleading :-]]] [07:51:33] hmm, not sure if apergos needs to do so [07:51:47] anyway its not under my control haha [07:51:50] 07/02/2012 - 07:51:50 - Created a home directory for hydriz in project(s): deployment-prep [07:52:40] ah, virt cluster load has gone down... 
[07:52:50] 07/02/2012 - 07:52:50 - User hydriz may have been modified in LDAP or locally, updating key in project(s): deployment-prep [07:54:40] PROBLEM host: dumps-1 is DOWN address: i-00000170 CRITICAL - Host Unreachable (i-00000170) [08:02:51] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 2.64, 2.59, 4.00 [08:05:59] hashar: Before I do anything, what can I actually do on the cluster? [08:06:08] look :-] [08:06:18] still in the process of documenting it [08:06:47] and I still need to apply some changes from labs to production [08:06:50] or in short, don't run any kind of script? :P [08:06:57] exactly :-D [08:06:58] sorry [08:07:03] lol haha [08:07:09] will be better once we have some documentation [08:07:21] yeah, its hell messy [08:10:15] RECOVERY host: dumps-1 is UP address: i-00000170 PING OK - Packet loss = 0%, RTA = 0.74 ms [08:10:54] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 2.41, 3.57, 4.84 [08:12:34] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [08:13:44] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 31.84, 20.73, 18.05 [08:16:05] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:17:54] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.65, 2.05, 2.91 [08:18:34] PROBLEM Current Load is now: WARNING on translation-memory-2 i-000002d9 output: WARNING - load average: 5.01, 5.34, 5.07 [08:18:44] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 12.46, 17.25, 17.56 [08:20:54] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 19.90, 13.42, 13.35 [08:28:55] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 32.08, 23.40, 19.15 [08:29:55] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 5.25, 5.25, 5.06 [08:32:56] Hydriz: don't you get access to the bots project ? [08:33:08] hmm? [08:33:17] I have access to the bots project [08:34:00] Hydriz: would you mind rebooting bots-sql3 ? :-] [08:34:07] going to cause some issues I guess [08:34:12] hmm [08:34:15] PROBLEM Free ram is now: UNKNOWN on incubator-bot1 i-00000251 output: NRPE: Call to fork() failed [08:34:15] but that machine is doing 100% CPU cause of a kernel bug [08:34:21] I can reboot it using the command line [08:34:27] but not from the labsconsole thing [08:34:28] that would be great [08:34:33] yeah labsconsole seems bugged [08:34:42] I wonder if it will cause things to break :P [08:34:58] I don't have sysadmin access to the project [08:35:02] ohh [08:35:05] so we have to wait :-] [08:36:12] sudo policy allows me to sudo everything in the project, but no access on the wiki side :( [08:36:25] and the deployment-prep cluster... [08:36:33] I can't seem to view the sudo policy [08:37:15] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[08:37:15] PROBLEM SSH is now: CRITICAL on incubator-bot1 i-00000251 output: Server answer:
[08:37:17] Hydriz: if you got sudo access, just reboot the bots-sql3 using sudo reboot
[08:37:19] that would do it
[08:37:35] yeah, I know :P
[08:37:36] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes
[08:37:46] but I am having trouble logging in to the server itself
[08:37:53] its stuck :(
[08:39:22] hashar: I see how small the sql server is lol
[08:40:08] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 4.43, 4.84, 4.96
[08:41:24] Hydriz: you might need to raise the ssh client default timeout. Try sshing to bastion then: ssh -o ConnectTimeout=600 bots-sql3
[08:41:43] yeah, I am in bots-sql3
[08:41:53] grats
[08:42:06] and its possible that its due to bots-2, and ultimately by Beetstra's bots
[08:42:16] RECOVERY SSH is now: OK on incubator-bot1 i-00000251 output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[08:42:30] * Beetstra hides
[08:42:30] and the leap second bug
[08:42:34] :P
[08:42:56] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0)
[08:42:56] I can't help that people spam Wikipedia so much that one instance is hardly enough to run LiWa3 ..
[08:43:17] get a big server :)
[08:43:28] I am using 16G myself :P
[08:43:48] Well, you were already a bigger server ...
[08:43:52] you .. labs that is
[08:44:12] hehe
[08:44:17] The previous box (a 4 processor Sun Sparc with 8 Gb internal memory) had troubles running the bots
[08:44:21] Hydriz: is it rebooting ?
[08:44:29] not yet :P
[08:44:32] I got to check bots-2 too
[08:44:38] Beetstra: so optimize the bots to use less CPU :-]
[08:44:54] hmm... deployment-transcoding isn't ssh-able
[08:45:05] there is a limit to optimisation ...
[08:45:28] Beetstra: Shall I be evil and reboot the sql3 server, and make your bots go mad? :P
[08:45:46] you think that they care?
[08:46:13] lol
[08:46:54] the sub-process has the following:
[08:46:55] eval {
[08:46:55] $mysql=DBI->connect("dbi:mysql:linkwatcher;bots-sql2",$settings{'mysqluser'},$settings{'mysqlpassword'}) or die "Can't connect to MySQL: $DBI::errstr";
[08:46:55] };
[08:46:55] if ($@) {
[08:46:56] die "Dying painfully because of faulthy mysql handle - error $@\n";
[08:47:00] }
[08:47:02] $mysql->{mysql_auto_reconnect} = 1;
[08:47:04]
[08:47:06] unless ($mysql) {
[08:47:08] die "Dying painfully because of faulthy mysql handle";
[08:47:12] }
[08:47:16] so .. if it dies, it respawns until it connects again - reboot what you want ...
[08:47:16] hmm
[08:47:22] okie :)
[08:47:56] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output
[08:48:16] PROBLEM Current Load is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host
[08:48:21] !log bots Rebooting bots-sql3 as we humans are born evil (aka leap second bug/high CPU)
[08:48:22] Logged the message, Master
[08:49:05] actually .. all my bots should by now use bots-sql2 ...
:-D [08:49:31] right, then who is using bots-sql3 [08:51:16] PROBLEM dpkg-check is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:51:51] where in the world is the foreachwiki script [08:52:16] PROBLEM Total Processes is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:52:22] PROBLEM SSH is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused [08:52:56] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [08:53:06] PROBLEM Disk Space is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:53:06] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 6.17, 6.09, 5.51 [08:53:47] PROBLEM Current Users is now: CRITICAL on bots-sql3 i-000000b4 output: Connection refused by host [08:56:16] RECOVERY dpkg-check is now: OK on bots-sql3 i-000000b4 output: All packages OK [08:56:44] Special:NovaSudoer Y U NO LOAD [08:57:16] RECOVERY Total Processes is now: OK on bots-sql3 i-000000b4 output: PROCS OK: 69 processes [08:58:06] RECOVERY Disk Space is now: OK on bots-sql3 i-000000b4 output: DISK OK [08:58:16] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 9.16, 9.62, 9.11 [08:58:51] RECOVERY Current Users is now: OK on bots-sql3 i-000000b4 output: USERS OK - 0 users currently logged in [09:13:23] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:21:10] just remembered that labsconsole can't reboot instances for now [09:25:48] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 10% free memory [09:43:38] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 206 processes [09:44:20] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [09:48:36] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [09:58:22] PROBLEM host: bots-sql3 is DOWN address: i-000000b4 CRITICAL - Host Unreachable (i-000000b4) [10:00:34] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [10:02:28] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 2.17, 3.25, 3.04 [10:06:09] !log deployment-prep ukwiki was giving errors regarding flaggedrevs's flaggedpages table not existing. Fixed it by running mwscript update.php ukwiki. [10:06:10] Logged the message, Master [10:07:42] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 3.08, 2.88, 2.91 [10:08:42] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%): [10:10:22] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 14.00, 13.39, 14.79 [10:11:19] Hydriz: grats :) [10:11:26] hmm? [10:12:30] lol what did I do [10:14:40] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:16:48] Hydriz: fixed the flagged revs tables :) [10:16:54] lol [10:17:08] I believe most of the docs in Wikitech is applicable here? [10:17:30] almost [10:17:48] I spend a good part of may upgrading beta to use the tools from production [10:18:11] heh [10:18:21] oh yes [10:18:25] RECOVERY SSH is now: OK on bots-sql3 i-000000b4 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:18:26] do we use swift on beta? 
[10:18:37] RECOVERY host: bots-sql3 is UP address: i-000000b4 PING OK - Packet loss = 0%, RTA = 1.10 ms [10:19:01] !log deployment-prep Easy fix for the leap second bug: /etc/init.d/ntp stop; date `date +"%m%d%H%M%C%y.%S"`; /etc/init.d/ntp start [10:19:03] Logged the message, Master [10:20:24] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 3.64, 2.80, 1.28 [10:21:46] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 6.03, 4.50, 3.62 [10:29:07] hehe, abuse and run addwiki.php to create a private wiki just for me on beta cluster [10:33:31] Hydriz: please dont [10:33:35] Hydriz: will be deleted [10:33:40] :P [10:33:55] Hydriz: still have a lot of work to do before adding new wikis for testing [10:34:10] I wonder if there are evil people that are going to do that haha [10:34:36] if you want to test out, create a new instance on one of your projects :-) [10:36:05] 07/02/2012 - 10:36:05 - Updating keys for bachsau at /export/keys/bachsau [10:40:12] 07/02/2012 - 10:40:12 - Updating keys for bachsau at /export/keys/bachsau [10:44:54] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [10:45:06] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 9.04, 7.53, 5.85 [10:54:10] hashar: If you got the time now, can you fix deployment.wikimedia.beta.wmflabs.org? It seems to be down for no reason [10:54:35] there is a bug about it [10:54:39] since beginning of may [10:54:44] Ì don't think it ever worked anyway [10:55:08] ever? lool [10:57:23] deployment.beta.wmflabs.org returns the holding wiki for not exisitng, deployment.wikimedia.beta.wmflabs.org returns an apache error - not worked as long as I've known it exists. [11:00:22] Hmm, how did I fix the Locale errors for the instances a few months back? [11:01:11] any progress of the planned reversed http proxy? 
[11:14:06] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 31.79, 23.09, 17.41 [11:14:59] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:15:09] RECOVERY Current Load is now: OK on bots-3 i-000000e5 output: OK - load average: 4.30, 4.48, 4.88 [11:19:00] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.21, 17.25, 16.74 [11:32:28] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 30.69, 20.05, 17.91 [11:37:28] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 11.57, 16.20, 17.07 [11:38:21] PROBLEM Total Processes is now: WARNING on bots-3 i-000000e5 output: PROCS WARNING: 151 processes [11:45:20] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:15:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:45:49] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:50:20] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [12:55:32] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [12:58:37] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CRITICAL - load average: 30.50, 20.13, 17.76 [13:09:47] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 30.50, 20.22, 16.45 [13:14:51] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.77, 17.01, 16.43 [13:15:51] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:22:17] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours [13:30:14] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:41] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 11% free memory [13:40:57] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [13:42:36] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:46:29] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:49:07] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 69 MB (5% inode=57%): [13:55:11] <^demon> hashar: Thanks for the review :) [13:55:19] <^demon> I'm hoping to get the labs install using puppet today [13:55:20] yw :) [13:55:39] need to find some ops to submit your change now :) [13:55:45] meanwhile, you can use puppetmaster::self [13:55:55] that will install a git clone of operations/puppet on your instance [13:56:13] then you can make your changes directly in /var/lib/git/operations/puppet && puppetd -tv [13:56:21] you will have to manually update from the reference repo though [13:56:29] cd /var/lib/git/operations/puppet && git pull --rebase [13:56:50] that is a great way to make your puppet changes, test them and then submit for review notifying that works for you [13:57:34] ^demon: also Nike opened a bug about having the rows in the change list to alternate colors [13:57:48] <^demon> I saw. 
[13:58:26] so hmm [13:58:28] it is 4pm [13:58:46] I have done nothing but fixing labs instances, investigating db issue and merging a few changes :( [13:59:17] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [13:59:43] hashar: I know that feeling, it's a monday [14:02:26] <^demon> hashar: Since you're on the subject, you're more than welcome to help with the gerrit instance :) [14:02:50] Damianz: that happens to me more and more often. I guess I have too many projects in parallels :D [14:03:13] ^demon: well already have to manage a whole beta cluster :-] [14:03:21] <^demon> :) [14:03:27] which does not work yet [14:03:28] <^demon> File["/root/.my.cnf"] -- why is this a dependency?? [14:03:35] <^demon> No docs :( [14:03:43] I pushed a change last week that killed stylesheets on enwiki (at least) :-(((( [14:04:00] I kinda prefered the minamalstic look [14:04:09] ^demon: that is probably a file coming from the private repo to setup some password? [14:04:14] ^demon: ask a root? [14:04:25] And yeah, should help with beta more but my projects are insanly time consuming and I suck at making mw work on more than 1 instance :P [14:04:37] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 19% free memory [14:04:45] <^demon> hashar: Nope, gerrit::database-server isn't included in production since it runs on db9. [14:05:02] no more idea sorry [14:05:16] <^demon> Yeah, gonna have to pester Ryan once he shows up. [14:05:46] Can we put gerrit's db on the same box and make it fast again? :D [14:05:56] <^demon> no. [14:06:02] <^demon> but we can put them in the same cluster again. [14:06:25] That works too :D [14:07:10] <^demon> s/cluster/datacenter/ [14:10:13] !log chmod /mnt/public_html/damian/api.php to 000 on bots-apache1 - think the report/review sync for cbng is broken and looping, testing if this fixed the bw/spam issues. [14:10:15] chmod is not a valid project. [14:10:18] !log bots chmod /mnt/public_html/damian/api.php to 000 on bots-apache1 - think the report/review sync for cbng is broken and looping, testing if this fixed the bw/spam issues. [14:10:20] Logged the message, Master [14:10:25] -.- [14:10:30] * Damianz eats labs-morebots, figure out what I mean dammit [14:16:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:18:20] ^demon: we had a very nasty kernel bug over the weekend [14:18:28] <^demon> Stupid leap second? [14:18:30] yup [14:19:03] Tim found an easy fix: stop the ntp service, set date manually, start ntp again [14:19:14] instantly drop the load to the groun [14:19:15] d [14:19:26] nike did open a bug this morning about Gerrit being slow. That solve it [14:19:29] solved it [14:19:31] damn I am tired [14:21:33] <^demon> go to bed? [14:23:06] by the time I am home, I will get to bed at 5pm [14:23:10] not a nice idea ;) [14:23:17] I will fill disoriented [14:23:21] feel [14:23:28] STUPID APPLE AUTO CORRECTER [14:47:40] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:11:40] Trying to get everything configured, I'm having trouble using SSH to access an instance using agent forwarding [15:12:18] debug1: Offering public key: /Users/stephenslevinski/.ssh/id_rsa [15:12:26] debug1: Server accepts key: pkalg ssh-rsa blen 277 [15:12:33] debug1: read PEM private key done: type RSA [15:12:40] Connection closed by 208.80.153.207 [15:16:20] I added my SSH key to gerrit. The same key is working fine for github. 
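On the File["/root/.my.cnf"] question ^demon raised earlier this hour: .my.cnf is MySQL's per-user client configuration file, so a /root/.my.cnf shipped from the private repo would typically hold admin credentials that let provisioning steps run mysql non-interactively, which is presumably why gerrit::database-server depends on it. A hedged sketch of that pattern follows; the resource names, the private-repo path, and Gerrit's conventional reviewdb database name are all illustrative assumptions, not confirmed by this log.

# Illustrative only, not the actual gerrit::database-server manifest.
file { '/root/.my.cnf':
    ensure => present,
    owner  => 'root',
    group  => 'root',
    mode   => '0600',
    source => 'puppet:///private/mysql/root.my.cnf',  # hypothetical private-repo path
}

# With credentials in place, database bootstrap can run without a password prompt.
exec { 'create-reviewdb':
    command => '/usr/bin/mysql -e "CREATE DATABASE IF NOT EXISTS reviewdb;"',
    unless  => '/usr/bin/mysql -e "SHOW DATABASES;" | /bin/grep -qx reviewdb',
    require => File['/root/.my.cnf'],
}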
[15:16:50] following the instructions on https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [15:17:33] confused by the line: "ssh-add ~/.ssh/labs" because of error "No such file or directory" [15:18:09] slevinski: You have to add your keys into gerrit seperatly [15:18:11] slevinski: that line presumes you put your key in a file called "labs" [15:18:18] There's an open bug about it not pulling them from ldap [15:18:23] it could also have any other name [15:18:42] (Settings -> SSH public keys -> Add) [15:20:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:20:47] Yes, I did add my SSH key to gerrit already. Able to see them in lab console as well. [15:29:40] .win 7 [15:29:48] (sorry, irssi fail) [15:41:38] hm... deployment is kaput today? [15:47:56] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 10.35, 11.23, 12.57 [15:50:56] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:55:52] andrewbogott: had to reboot several instance yesterday cause of a kernel bug [15:56:04] andrewbogott: but it should be more or less working though still not as expected [15:56:12] mainly bits not enabled yet :( [15:57:02] andrewbogott: also the whole labs have some issue due to that kernel bug :/ [15:57:29] hashar: I get nothin' but 404s from deployment. [15:57:38] Which, I guess, means it's sort of up :) [15:57:38] which URL ? [15:57:43] hehe [15:58:12] http://deployment.wikimedia.beta.wmflabs.org/wiki [15:58:18] ohh that one [15:58:24] I think it stopped working like 2 months ago [15:58:46] Hm, I've been using it as a test wiki until somewhat recently. What wiki do you advise I use instead? [15:58:48] maybe not hmm [15:58:56] (I'm setting my own up right now anyway, so, not a big concern.) [15:59:15] yup I think it is better to set your own till I finally have beta fixed properly [15:59:27] someone wrote some puppet classes to easily bootstrap a fresh wiki [16:00:28] so indeed http://deployment.wikimedia.beta.wmflabs.org/wiki should work [16:00:32] gr [16:00:43] hashar: 'someone' = me :) [16:01:08] ohhh [16:01:11] \O/ [16:01:27] If I can get an instance to launch at all, this should be easy. [16:01:34] I will fix http://deployment.wikimedia.beta.wmflabs.org/ tonight [16:01:38] for now I am out :-D [16:01:41] sorry bout that [16:01:53] np, just curious. [16:02:10] PROBLEM Current Load is now: WARNING on nova-gsoc1 i-000001de output: WARNING - load average: 20.99, 19.60, 19.33 [16:05:08] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:09:28] yes all of labs needs to be rebooted [16:09:51] I'd *really* like to move the instances to the new hardware, so that I only need to do it once, though [16:12:30] 'all of labs'? Because of the leap-second? [16:17:33] well, I'm fixing it with the "date" fix [16:17:39] so, I won't need to reboot it [16:17:54] but, the instances will definitely need to be rebooted for moving them to the new hardware [16:18:00] since we're switching away from shared storage [16:21:49] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:26:38] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:31:18] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 10% free memory [16:32:58] RECOVERY Current Load is now: OK on wikistats-01 i-00000042 output: OK - load average: 0.01, 0.68, 3.67 [16:32:58] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CRITICAL - load average: 30.07, 18.77, 15.03 [16:35:18] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [16:37:58] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 11.30, 16.70, 15.54 [16:41:11] RECOVERY Current Load is now: OK on search-test i-000000cb output: OK - load average: 0.00, 1.25, 4.61 [16:53:03] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:55:51] 07/02/2012 - 16:55:51 - User slevinski may have been modified in LDAP or locally, updating key in project(s): signwriting [16:56:10] 07/02/2012 - 16:56:10 - Updating keys for slevinski at /export/keys/slevinski [16:56:56] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 207 processes [16:58:47] RECOVERY Current Load is now: OK on dev-solr i-00000152 output: OK - load average: 0.45, 1.80, 3.86 [16:59:42] RECOVERY Current Load is now: OK on nova-ldap1 i-000000df output: OK - load average: 0.22, 0.26, 3.95 [17:02:02] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 200 processes [17:07:02] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:12:55] Out of ideas to get ssh to work for bastion. Reloaded, recreated ssh keys. Changes file permissions. Able to use new keys on github just fine. Failure for bastion.wmflabs.org [17:13:11] Can anyone check the server log for a more meaningful error message? [17:13:18] your name is slevinski there too? [17:13:22] yes [17:13:27] gimme a minute [17:14:58] oh, wait you dont get on bastion itself or you dont get on an instance from there? [17:15:24] labs is slow right now, it may have just not updated your key yet [17:15:42] did you ever see a bot here on IRC telling you it creates a home dir for you / updates your keys? [17:15:54] < labs-home-wm> 07/02/2012 - 16:55:51 - User slevinski may have been modified in LDAP or locally, updating key in project(s): signwriting [17:16:12] <-- ah, like that [17:16:39] 07/02/2012 - 16:56:10 - Updating keys for slevinski at /export/keys/slevinski ..that looks ok so far [17:16:50] I updated the keys several times. [17:17:31] is he in the project? [17:17:41] slevinski: Did you try logigng in before you saw that? [17:17:44] slevinski: what's your username on labsconsole? [17:17:59] same on labconsole [17:18:00] Unless Ryan was awesome and fixed nscd's insane cache time :D [17:18:06] I did not [17:18:09] :( [17:18:37] slevinski: you weren't a member of the bastion project [17:18:39] gimme a sec [17:18:50] I really need to finish unpacking my house [17:19:04] slevinski: it'll work now [17:19:34] awesome. It works. Thanks a lot. [17:19:53] 07/02/2012 - 17:19:53 - Created a home directory for slevinski in project(s): bastion [17:20:06] Did you get gerrit sorted in the end? 
(I ran off from work to come home) [17:20:52] 07/02/2012 - 17:20:51 - User slevinski may have been modified in LDAP or locally, updating key in project(s): bastion [17:23:59] RECOVERY Current Load is now: OK on nova-essex-test i-000001f9 output: OK - load average: 0.19, 0.41, 3.74 [17:25:02] <^demon> Ryan_Lane: I was going over the gerrit docs, and I want to enable individual users to have their own branch space that doesn't require approval and you can just push to. Kinda like a sandbox or workspace. [17:25:09] <^demon> From the gerrit docs "References can have the current user name automatically included, creating dynamic access controls that change to match the currently logged in user. For example to provide a personal sandbox space to all developers, refs/heads/sandbox/${username}/* allowing the user joe to use refs/heads/sandbox/joe/foo." [17:25:12] <^demon> Does that sound sane? [17:25:52] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:26:04] ^demon: yes [17:26:31] <^demon> Mmk, I'll go ahead and enable that on All-Projects. It won't affect Read permissions, so any private repos will still be private. [17:26:37] greay [17:26:40] *great [17:38:15] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [17:39:13] <^demon> Ryan_Lane: Done, and notified wikitech-l. [17:56:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:02:05] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: PROCS CRITICAL: 201 processes [18:08:25] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [18:20:53] <^demon> hashar: Something to think about that may be faster for both jenkins && gerrit. What if we replicated the repos to jenkins, so you could just do local clone/checkouts? [18:26:25] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:33:00] ^demon: that is what is done already [18:33:10] <^demon> We're using gerrit's replication? [18:33:20] noo not at all [18:33:23] just git clone [18:33:25] for the init [18:33:43] and then git fetch / git remote update / git fetch refs/something [18:34:03] <^demon> ok. [18:34:11] that is the bin/fetch_gerrit_head.sh script in integration/jenkins [18:34:16] I don't think we need to setup a replication [18:34:23] <^demon> Ryan_Lane: I'm trying to get the labs install of gerrit using the puppet config, but I've hit a couple of snags. Quick questions: 1) Is the public/private key for replication purposes, so I can probably skip that for labs? 2) What is /root/.my.cnf? Is that done by private repo? [18:34:24] that is not the slower part anyway :) [18:34:45] oh. sorry. gerrit's puppet manifests are really fucked up [18:34:48] that's going to be hard [18:34:57] one issue I have is that Gerrit Trigger does not seem to react when a change is merged :/ Only on new patchsets [18:35:13] ^demon: yeah, that's done by the private repo [18:35:23] <^demon> I started some cleanup :) https://gerrit.wikimedia.org/r/#/c/13484/ [18:35:51] heh [18:35:55] the package now installs the user [18:36:29] ^demon: have you managed to setup puppetmaster::self ? It is working fine now thanks to Faidon. [18:36:35] I have tested it on some instance :) [18:36:39] <^demon> Yeah, I got self working. 
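[note] The per-user sandbox refs ^demon describes at 17:25 correspond to an access section in the All-Projects project.config. A rough sketch of what such a section might look like, assuming "Registered Users" is the group being granted access and that forced push is wanted in personal sandboxes; the exact group and permission set actually enabled on gerrit.wikimedia.org are not shown in the log:

    [access "refs/heads/sandbox/${username}/*"]
        create = group Registered Users
        push = +force group Registered Users

Because this only touches refs/heads/sandbox/..., it leaves Read permissions alone, which is why private repos stay private as noted above.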
[18:36:44] also, the real way to fix this, is to make gerrit a paramaterized class [18:36:59] noooo [18:37:01] and to move the config into a role class [18:37:01] my nightmare :) [18:37:16] then we can have role::gerrit and role::gerrit::labs [18:37:36] <^demon> Ryan_Lane: Why do I feel like this isn't something I can finish off in an hour or two? [18:37:36] indeed, that has been successful for me [18:37:47] makes the prod / labs split easier to follow [18:38:08] ^demon: welcome in my world :) I will definitely assist as much as can during your mornings [18:38:17] heh [18:38:24] <^demon> Well I was hoping to get it done today or tomorrow :p [18:38:25] ^demon: if you know puppet well enough, you totally can :) [18:38:28] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [18:38:53] ^demon: make the config options parameters to the gerrit class [18:38:58] remove the config class [18:39:14] and call the gerrit class from a role::gerrit class in role/gerrit.pp [18:39:39] set the config through the class call [18:40:16] add sane defaults for parameters, where possible [18:50:16] Ryan_Lane: I was at OSBridge in Portland last week, talking to some people that also do puppet, and I realized that we don't have a public tracker for things that are wrong with our puppet repo (like, "Squid not puppetized at all", "Gerrit manifest not reusable", etc.), so it's difficult for volunteers that want to clean up our puppet code to know where to start [18:56:38] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:02:25] RoanKattouw: we have a Labs product [19:02:29] let's make a puppet component [19:03:34] I noticed that, yeah [19:03:35] Then they can sit in bz and be ignored for, forever :D [19:03:37] That seems reasonable [19:03:48] I figured we might have some tickets in RT for it, but these tickets should really be public [19:03:59] And I also suspect many of these things don't have tickets filed for them at all [19:08:54] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [19:23:33] <^demon> qotd: "good thing this isn't the mars rover or something else expensive" [19:26:43] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:39:03] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [19:47:54] !log deployment-prep hacking to get files on bits.beta [19:47:56] Logged the message, Master [19:51:37] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 31% free memory [19:51:37] RECOVERY Total Processes is now: OK on bots-3 i-000000e5 output: PROCS OK: 115 processes [19:56:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:09:37] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [20:17:22] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [20:22:16] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [20:28:36] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:28:55] !log deployment-prep deployed a hack in mw config using https://gerrit.wikimedia.org/r/#/c/13932/ . That is a simply git fetch, pending review. 
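[note] The refactor Ryan_Lane outlines at 18:38-18:40 (make the gerrit class parameterized, drop the config class, set values from role::gerrit / role::gerrit::labs in role/gerrit.pp) follows the usual role-class pattern. A sketch only, with invented parameter names and values; the real gerrit manifest has many more settings:

    # manifests/gerrit.pp -- parameters with sane defaults replace the old config class
    class gerrit(
        $host    = 'gerrit.example.org',   # hypothetical parameter
        $db_host = 'localhost'             # hypothetical parameter
    ) {
        package { 'gerrit':
            ensure => present,
        }
        service { 'gerrit':
            ensure  => running,
            enable  => true,
            require => Package['gerrit'],
        }
    }

    # manifests/role/gerrit.pp -- prod and labs each pass their own settings
    class role::gerrit {
        class { 'gerrit':
            host    => 'gerrit.wikimedia.org',
            db_host => 'gerrit-db.example.net',   # illustrative value only
        }
    }
    class role::gerrit::labs {
        class { 'gerrit':
            host => 'gerrit-dev.wmflabs.org',     # illustrative value only
        }
    }

This is what makes the prod / labs split easier to follow: the shared logic lives in one place, and only the role classes differ.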
[20:28:57] Logged the message, Master [20:40:17] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [20:45:59] !log deployment-prep set back robots.txt to disallow / [20:46:01] Logged the message, Master [20:47:38] !log deployment-prep Removed Hydriz from deployment-prep. Messed up the whole dblist files :-D Contact me! ;) [20:47:40] Logged the message, Master [20:50:35] !log deployment-prep restarting squid to purge whole cache (yeah I know that is lame) [20:50:36] Logged the message, Master [20:58:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [21:10:30] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [21:15:39] andrewbogott: so, for tests… what kind of stuff are they looking for? [21:15:53] I have the bgp support added, and working [21:16:00] I just need to add tests now [21:16:18] Hm... [21:16:50] Generally the process is to stub out any external API calls... then write tests that traverse your code and verify that the stubs were called as expected. [21:17:17] I don't mind writing the tests for you, if figuring out about stub/mox sounds like no fun. [21:17:23] heh [21:17:44] well, the only calls that can affect this are associate and disassociate floating ip [21:17:54] and anything that calls that [21:18:03] so, delete could also affect that [21:18:39] Is your patch on gerrit someplace? [21:18:45] not yet [21:18:54] I want to do a few more manual tests first [21:18:59] If you aren't adding any new function calls, then you can probably just add extra asserts to existing tests. [21:19:00] best part? no more rootwrap :) [21:19:07] the process doesn't need to run as root, at all [21:19:12] cool! [21:19:44] I could have done this as a plugin [21:20:01] I think it's useful enough to be core, though [21:20:17] (the plugin hooks already exist and everything) [21:20:42] If they'll take it as core, then core is good. [21:20:49] * Ryan_Lane nods [21:20:55] Did the thing I said about 'just add extra asserts' make sense? [21:20:59] yes [21:21:30] My experience is that people notice if there are not tests, but aren't usually very uptight about what the tests actually do. [21:21:33] For better or for worse [21:21:38] heh [21:24:55] hm [21:26:19] well, it doesn't make much sense to add asserts for associate and disassociate floating ip [21:26:24] this doesn't change its behavior [21:26:27] I think [21:27:25] it just adds an extra function call in three places, and doesn't affect the behavior [21:28:26] and the function it calls doesn't raise exceptions [21:28:49] so, I think I just need to add tests for the exabgp functions I added [21:28:57] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [21:29:37] PROBLEM Total Processes is now: WARNING on psm-precise i-000002f2 output: PROCS WARNING: 151 processes [21:40:47] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [21:40:53] Ryan_Lane: It'd also be good to test that those extra function calls happen (possibly as part of the same test) [21:45:45] how can I verify that? [21:49:47] So, your new code adds functions which themselves make system calls, right? [21:50:41] You can stub out those system calls, replacing them with mini functions that just log the arguments they received into a global. 
Then, at the bottom of the existing test (that calls the function that calls your function) check the logged args to make sure they make sense. [21:51:04] That tests everything in one go, I think. [21:51:47] ah [21:51:49] makes sense [21:51:52] cool [21:52:31] There should be examples of how to stub out an external call littered around pretty much any test you look at. [21:53:20] * Ryan_Lane nods [21:53:26] seems I'm not checking something properlty [21:53:37] it's launching multiple versions of exabgp [21:53:38] heh [21:53:47] gotta fix that, for sure :) [21:54:29] <3 mock [21:55:44] andrewbogott: like this code? [21:55:45] if FLAGS.fake_network: [21:55:45] LOG.debug('FAKE NET: %s', ' '.join(map(str, cmd))) [21:55:45] return 'fake', 0 [21:58:51] Fake net aka ASN 645 [21:58:58] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:00:03] Ryan_Lane: That's not a pattern that I've used, but it looks like what you want. [22:00:11] the thing I was thinking of is more like this: [22:00:12] def fake_get_nw_info(cls, ctxt, instance): [22:00:12] self.assertTrue(ctxt.is_admin) [22:00:18] self.stubs.Set(nova.network.API, 'get_instance_nw_info', [22:00:18] fake_get_nw_info) [22:00:18] bah. the way I was doing this the first time worked fine [22:00:34] ahhh [22:00:34] ok [22:00:52] So, I guess what I said before isn't quite right; you don't have to log the args if you're only using the fake function once, you can just assert valid args. [22:01:03] But you also need to raise a flag verifying that the fake function was called at all, and check that higher up. [22:01:12] (Otherwise a failure to get called looks like success) [22:01:22] right [22:01:46] that's easy enough. I can set a fake function, have it set something, then check it [22:02:33] Ryan_Lane: Could you fix prod pretty please :) [22:02:41] Damianz: fix *what* in prod? [22:02:49] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 22:02:39 GMT [22:02:52] the network stuff that two people have been working on all day? [22:02:52] 503 errors [22:03:15] :( That still not fixed? [22:03:28] seeing as that they're on the phone with juniper, I don't think I can help much [22:03:33] Looking at nagios spam I'd say not, *goes back to leaving you alone* [22:04:28] Juniper... that explains a lot, wonder if they've fixed the VRRP issue I hit last time I was shifting switches around yet =/ [22:09:42] Ryan_Lane: Heading to dinner, will be back on this evening. [22:10:04] ok. 
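[note] The stub-and-check approach andrewbogott describes between 21:50 and 22:01 (replace the external call with a mini function that logs its arguments, then assert on what was logged so a never-called stub shows up as a failure) looks roughly like the self-contained sketch below. It is a generic illustration of the pattern, not nova's actual test code; nova's own tests of this era use self.stubs.Set, as quoted at 22:00, and every name here is invented:

    import unittest

    # Hypothetical class under test: associate() ends by making an external call
    # (the stand-in for shelling out to exabgp) that the test wants to stub.
    class FloatingIPManager(object):
        def announce_route(self, ip):
            raise RuntimeError("would call out to exabgp")

        def associate(self, ip):
            # ...bookkeeping elided...
            self.announce_route(ip)

    class TestAssociate(unittest.TestCase):
        def test_associate_announces_route(self):
            calls = []  # the "global" the fake logs its arguments into

            manager = FloatingIPManager()
            # Stub out the external call with a mini function that records args.
            manager.announce_route = lambda ip: calls.append(ip)

            manager.associate('10.4.0.17')

            # If the stub was never reached, calls is empty and this fails loudly;
            # otherwise it sanity-checks the argument the real code passed.
            self.assertEqual(calls, ['10.4.0.17'])

    if __name__ == '__main__':
        unittest.main()

The assertEqual at the end covers both points raised above: it verifies the fake was called at all and that the arguments make sense, in one assertion.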
see you later [22:10:48] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [22:30:27] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:41:07] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [23:01:30] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:12:11] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [23:22:11] PROBLEM Puppet freshness is now: CRITICAL on deployment-transcoding i-00000105 output: Puppet has not run in last 20 hours [23:32:10] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:43:30] PROBLEM host: pluginreview is DOWN address: i-000002fa check_ping: Invalid hostname/address - i-000002fa [23:57:42] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [23:59:12] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours