[00:03:30] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:20] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [00:06:19] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Current Users is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Disk Space is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Free ram is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Total Processes is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:42] PROBLEM dpkg-check is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:08:33] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [00:09:27] PROBLEM Current Users is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:10:07] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:13] PROBLEM Current Users is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:13] PROBLEM Current Load is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:13] PROBLEM Disk Space is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:12:08] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [00:12:08] RECOVERY Disk Space is now: OK on deployment-jobrunner05 i-0000028c output: DISK OK [00:12:08] RECOVERY Current Users is now: OK on ve-nodejs i-00000245 output: USERS OK - 0 users currently logged in [00:12:08] RECOVERY Disk Space is now: OK on ve-nodejs i-00000245 output: DISK OK [00:12:09] RECOVERY Free ram is now: OK on ve-nodejs i-00000245 output: OK: 77% free memory [00:12:09] RECOVERY Total Processes is now: OK on ve-nodejs i-00000245 output: PROCS OK: 97 processes [00:12:20] RECOVERY dpkg-check is now: OK on ve-nodejs i-00000245 output: All packages OK [00:12:59] PROBLEM Current Load is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:12:59] PROBLEM Disk Space is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:00] PROBLEM Free ram is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:00] PROBLEM Total Processes is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:06] PROBLEM dpkg-check is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:14:09] RECOVERY Current Users is now: OK on wikistats-history-01 i-000002e2 output: USERS OK - 0 users currently logged in [00:14:25] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf output: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:14:39] PROBLEM Free ram is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:17:50] RECOVERY Current Load is now: OK on deployment-apache30 i-000002d3 output: OK - load average: 0.95, 3.14, 2.31 [00:17:50] RECOVERY Disk Space is now: OK on deployment-apache30 i-000002d3 output: DISK OK [00:17:50] RECOVERY Free ram is now: OK on deployment-apache30 i-000002d3 output: OK: 92% free memory [00:17:50] RECOVERY Total Processes is now: OK on deployment-apache30 i-000002d3 output: PROCS OK: 119 processes [00:18:00] RECOVERY dpkg-check is now: OK on zeromq1 i-000002b7 output: All packages OK [00:19:09] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [00:19:29] RECOVERY Free ram is now: OK on zeromq1 i-000002b7 output: OK: 81% free memory [00:19:39] PROBLEM Current Load is now: WARNING on aggregator-test1 i-000002bf output: WARNING - load average: 5.80, 7.01, 5.20 [00:19:50] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 188 processes [00:20:59] RECOVERY Current Users is now: OK on zeromq1 i-000002b7 output: USERS OK - 0 users currently logged in [00:20:59] RECOVERY Disk Space is now: OK on zeromq1 i-000002b7 output: DISK OK [00:20:59] RECOVERY Current Load is now: OK on zeromq1 i-000002b7 output: OK - load average: 4.38, 4.94, 3.52 [00:22:49] PROBLEM Current Load is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:39] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:40] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:18] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf output: Warning: 6% free memory [00:24:39] RECOVERY Current Load is now: OK on aggregator-test1 i-000002bf output: OK - load average: 1.25, 4.47, 4.68 [00:25:30] PROBLEM Puppet freshness is now: CRITICAL on su-fe1 i-000002e5 output: Puppet has not run in last 20 hours [00:25:59] RECOVERY host: p-b is UP address: i-000000ae PING OK - Packet loss = 0%, RTA = 0.77 ms [00:26:29] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [00:27:34] RECOVERY Current Load is now: OK on build-precise1 i-00000273 output: OK - load average: 0.21, 1.81, 1.49 [00:28:29] RECOVERY Current Load is now: OK on pediapress-ocg2 i-00000234 output: OK - load average: 0.13, 1.64, 1.41 [00:28:30] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000234 output: All packages OK [00:48:14] PROBLEM host: labs-build1 is DOWN address: i-0000006b CRITICAL - Host Unreachable (i-0000006b) [00:49:34] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:50:03] RECOVERY host: labs-build1 is UP address: i-0000006b PING OK - Packet loss = 0%, RTA = 0.64 ms [00:54:03] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [00:56:33] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [01:27:01] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [01:45:00] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 3.11, 3.50, 4.61 [01:53:30] PROBLEM host: dumps-incr is DOWN address: i-000002bb CRITICAL - Host Unreachable (i-000002bb) [01:56:40] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [02:08:00] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 8.09, 7.24, 5.76 [02:08:10] PROBLEM host: dumps-2 is DOWN address: i-000002d8 CRITICAL - Host Unreachable (i-000002d8) [02:08:10] PROBLEM Disk Space is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:10] PROBLEM Current Users is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:11] PROBLEM Free ram is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:11] PROBLEM Total Processes is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:17] PROBLEM dpkg-check is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:13:00] RECOVERY Disk Space is now: OK on e3 i-00000291 output: DISK OK [02:13:00] RECOVERY Current Users is now: OK on e3 i-00000291 output: USERS OK - 0 users currently logged in [02:13:00] RECOVERY Free ram is now: OK on e3 i-00000291 output: OK: 89% free memory [02:13:01] RECOVERY Total Processes is now: OK on e3 i-00000291 output: PROCS OK: 107 processes [02:13:07] RECOVERY dpkg-check is now: OK on e3 i-00000291 output: All packages OK [02:13:26] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:27] PROBLEM Current Users is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:27] PROBLEM Disk Space is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:27] PROBLEM Total Processes is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:32] PROBLEM dpkg-check is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:19:11] RECOVERY HTTP is now: OK on wmde-test i-000002ad output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.342 second response time [02:21:25] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:26] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:25] PROBLEM Current Load is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:56] PROBLEM host: dumps-incr is DOWN address: i-000002bb CRITICAL - Host Unreachable (i-000002bb) [02:24:22] PROBLEM Current Load is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:24:22] PROBLEM Current Users is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:24:22] PROBLEM Disk Space is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:25:17] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: CHECK_NRPE: Socket timeout after 10 seconds. [02:26:15] RECOVERY Current Users is now: OK on etherpad-lite i-000002de output: USERS OK - 0 users currently logged in [02:26:16] RECOVERY Disk Space is now: OK on etherpad-lite i-000002de output: DISK OK [02:26:16] RECOVERY Total Processes is now: OK on etherpad-lite i-000002de output: PROCS OK: 121 processes [02:26:27] RECOVERY Current Load is now: OK on deployment-jobrunner05 i-0000028c output: OK - load average: 0.99, 3.29, 2.31 [02:26:27] RECOVERY Total Processes is now: OK on deployment-jobrunner05 i-0000028c output: PROCS OK: 110 processes [02:26:32] RECOVERY dpkg-check is now: OK on etherpad-lite i-000002de output: All packages OK [02:26:46] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [02:27:16] PROBLEM HTTP is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [02:28:16] RECOVERY Current Load is now: OK on etherpad-lite i-000002de output: OK - load average: 0.26, 2.83, 2.80 [02:28:16] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [02:29:06] RECOVERY Current Load is now: OK on zeromq1 i-000002b7 output: OK - load average: 4.60, 4.94, 2.78 [02:29:56] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 182 processes [02:31:54] PROBLEM Current Users is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:31:54] PROBLEM Disk Space is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:45] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:45] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:14] PROBLEM dpkg-check is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:34] PROBLEM Current Users is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:35] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:35] PROBLEM Free ram is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:35] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:06] RECOVERY Disk Space is now: OK on zeromq1 i-000002b7 output: DISK OK [02:34:06] RECOVERY Current Users is now: OK on zeromq1 i-000002b7 output: USERS OK - 0 users currently logged in [02:34:16] PROBLEM dpkg-check is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:34] PROBLEM Free ram is now: UNKNOWN on puppet-abogott i-0000030b output: NRPE: Unable to read output [02:35:47] PROBLEM Current Load is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:35:47] PROBLEM Current Users is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:47] PROBLEM Total Processes is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:38] PROBLEM host: dumps-2 is DOWN address: i-000002d8 CRITICAL - Host Unreachable (i-000002d8) [02:38:38] PROBLEM Total Processes is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Current Load is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Current Users is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Disk Space is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Free ram is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Total Processes is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:21] PROBLEM dpkg-check is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:21] RECOVERY dpkg-check is now: OK on mobile-wlm i-000002bc output: All packages OK [02:39:21] RECOVERY host: dumps-incr is UP address: i-000002bb PING OK - Packet loss = 0%, RTA = 0.53 ms [02:40:44] RECOVERY Current Load is now: OK on build-precise1 i-00000273 output: OK - load average: 1.60, 4.45, 3.20 [02:40:44] RECOVERY Current Users is now: OK on build-precise1 i-00000273 output: USERS OK - 0 users currently logged in [02:40:44] RECOVERY Total Processes is now: OK on build-precise1 i-00000273 output: PROCS OK: 83 processes [02:40:54] RECOVERY Puppet freshness is now: OK on maps-test3 i-0000028f output: puppet ran at Fri Jul 6 02:40:43 UTC 2012 [02:41:52] 07/06/2012 - 02:41:52 - User laner may have been modified in LDAP or locally, updating key in project(s): deployment-prep [02:41:55] RECOVERY Current Users is now: OK on mobile-wlm i-000002bc output: USERS OK - 0 users currently logged in [02:41:55] RECOVERY Disk Space is now: OK on mobile-wlm i-000002bc output: DISK OK [02:43:03] RECOVERY Total Processes is now: OK on mobile-wlm i-000002bc output: PROCS OK: 107 processes [02:43:08] RECOVERY Current Load is now: OK on pediapress-ocg2 i-00000234 output: OK - load average: 0.24, 2.96, 2.71 [02:43:08] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000234 output: All packages OK [02:43:15] RECOVERY dpkg-check is now: OK on zeromq1 i-000002b7 output: All packages OK [02:43:32] RECOVERY Current Users is now: OK on pediapress-ocg2 i-00000234 output: USERS OK - 0 users currently logged in [02:43:32] RECOVERY Disk Space is now: OK on pediapress-ocg2 i-00000234 output: DISK OK [02:43:32] RECOVERY Free ram is now: OK on pediapress-ocg2 i-00000234 output: OK: 86% free memory [02:43:32] RECOVERY Total Processes is now: OK on pediapress-ocg2 i-00000234 output: PROCS OK: 86 processes [02:44:02] RECOVERY Current Load is now: OK on incubator-bot0 i-00000296 output: OK - load average: 2.74, 3.87, 3.16 [02:44:02] RECOVERY Current Users is now: OK on incubator-bot0 i-00000296 output: USERS OK - 0 users currently logged in [02:44:02] RECOVERY Disk Space is now: OK on incubator-bot0 i-00000296 output: DISK OK [02:44:02] RECOVERY Free ram is now: OK on incubator-bot0 i-00000296 output: OK: 85% free memory [02:44:02] RECOVERY Total 
Processes is now: OK on incubator-bot0 i-00000296 output: PROCS OK: 86 processes [02:44:07] RECOVERY dpkg-check is now: OK on incubator-bot0 i-00000296 output: All packages OK [02:49:22] RECOVERY host: dumps-2 is UP address: i-000002d8 PING OK - Packet loss = 0%, RTA = 1.73 ms [02:56:02] PROBLEM Current Load is now: WARNING on dumps-incr i-000002bb output: WARNING - load average: 9.16, 8.74, 5.95 [02:56:52] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [03:04:42] RECOVERY Puppet freshness is now: OK on labs-nfs1 i-0000005d output: puppet ran at Fri Jul 6 03:04:31 UTC 2012 [03:08:02] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 3.68, 4.08, 4.86 [03:08:42] PROBLEM Total Processes is now: WARNING on dumps-incr i-000002bb output: PROCS WARNING: 158 processes [03:17:42] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [03:17:42] PROBLEM host: test3 is DOWN address: i-00000093 CRITICAL - Host Unreachable (i-00000093) [03:27:02] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [03:36:22] PROBLEM host: ganglia-collector is DOWN address: i-000000b7 CRITICAL - Host Unreachable (i-000000b7) [03:37:02] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [03:41:10] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 13% free memory [03:41:10] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:10] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:10] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:56] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:36] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:44:09] PROBLEM Total Processes is now: CRITICAL on dumps-incr i-000002bb output: PROCS CRITICAL: 207 processes [03:44:23] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 6.97, 5.86, 3.31 [03:45:58] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 9% free memory [03:45:58] RECOVERY Current Load is now: OK on incubator-bot1 i-00000251 output: OK - load average: 1.18, 2.29, 1.63 [03:45:58] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in [03:45:58] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK [03:46:27] RECOVERY host: test3 is UP address: i-00000093 PING OK - Packet loss = 0%, RTA = 0.52 ms [03:48:27] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [03:49:27] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [03:54:18] PROBLEM Current Load is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:54:18] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.75, 1.86, 2.52 [03:55:17] PROBLEM dpkg-check is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:55:37] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory [03:57:17] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [03:57:47] PROBLEM SSH is now: CRITICAL on test3 i-00000093 output: Server answer: [03:58:37] PROBLEM Current Users is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:58:37] PROBLEM Total Processes is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:58:42] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:59:37] PROBLEM Disk Space is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Error - Could not complete SSL handshake. [04:00:37] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:02:47] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 13% free memory [04:03:37] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 17% free memory [04:04:37] PROBLEM host: test3 is DOWN address: i-00000093 CRITICAL - Host Unreachable (i-00000093) [04:06:27] PROBLEM host: ganglia-collector is DOWN address: i-000000b7 CRITICAL - Host Unreachable (i-000000b7) [04:19:48] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [04:21:42] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [04:22:39] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [04:23:14] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:24:01] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:01] PROBLEM Total Processes is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. 
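Most of the alerts above are "CHECK_NRPE: Socket timeout after 10 seconds", which only means the Nagios server gave up waiting for the agent. A quick way to tell a dead agent from a host that is merely crawling under iowait is to run check_nrpe by hand from the Nagios server. This is a generic sketch, assuming the standard Debian/Ubuntu plugin path; the host name and the "check_ram" command name are examples only, not the actual command names configured on Labs.

    # Bare connectivity test: a healthy agent answers with its NRPE version banner.
    /usr/lib/nagios/plugins/check_nrpe -H ve-nodejs

    # Re-run a specific check with a longer timeout (-t) to see whether the agent
    # is slow rather than dead; "check_ram" is an assumed command name.
    /usr/lib/nagios/plugins/check_nrpe -H ve-nodejs -c check_ram -t 30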
[04:25:34] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory [04:27:22] RECOVERY host: wep is UP address: i-000000c2 PING OK - Packet loss = 0%, RTA = 3.53 ms [04:27:31] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [04:27:53] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [04:28:15] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.69, 7.21, 4.10 [04:28:25] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 16% free memory [04:28:25] PROBLEM Total Processes is now: WARNING on ganglia-test2 i-00000250 output: PROCS WARNING: 184 processes [04:29:15] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:32:15] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [04:32:45] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [04:34:45] PROBLEM host: test3 is DOWN address: i-00000093 CRITICAL - Host Unreachable (i-00000093) [04:36:35] PROBLEM host: ganglia-collector is DOWN address: i-000000b7 CRITICAL - Host Unreachable (i-000000b7) [04:37:05] RECOVERY host: mobile-feeds is UP address: i-000000c1 PING OK - Packet loss = 0%, RTA = 0.67 ms [04:50:45] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [04:54:35] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [04:55:45] PROBLEM host: nova-daas-1 is DOWN address: i-000000e7 CRITICAL - Host Unreachable (i-000000e7) [04:56:25] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.64, 7.01, 5.47 [04:57:55] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [04:59:45] RECOVERY host: analytics is UP address: i-000000e2 PING OK - Packet loss = 0%, RTA = 0.54 ms [04:59:55] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [11:55:37] Rage.... [11:57:59] * Damianz eats methecooldude [11:58:22] Damianz: What broke the MySQL connection... did the server kick ClueBot out [11:58:33] Exactally what the erorr said [11:58:41] The bot needs patching to bail out if it can't get an id [11:58:54] I though I'd fixed it but clearly not, don't have time until the weekend to work on it [11:59:05] I was replying but your last edit conflicted [11:59:13] Oh, whoops :P [12:00:31] also it shouldn't currently have an issue so the notice can die [12:01:09] Damianz: On a random note, couldn't the report and review interfaces become one thing, save the bandwidth issues for a start [12:01:18] I'd love that [12:01:26] I'm putting off working on it until oauth is enabled [12:01:34] Ah, makes sense [12:01:37] So people can register with their wp details and we can track it [12:01:44] Ideally it will be 1 interface with 1 database and a nice api [12:01:45] * methecooldude slaps Bastion... let me in [12:02:03] Apache seems down atm though hmm [12:02:06] Or really really slow [12:02:23] Yea, it's REALLY slow [12:02:31] methecooldude: Also I thought the issue was the api looping forever so I disabled it... 
but it's still going over bw
[12:02:37] Which is weird because it uses iframes
[12:03:08] Wow
[12:03:11] Cluster load is high
[12:03:24] Nope, MySQL is still kicking out
[12:03:28] Looks like it's iowait again
[12:03:50] methecooldude: Last revert has an id
[12:04:04] Urm, just the interface then
[12:04:10] yeah
[12:04:13] http://ganglia.wmflabs.org/latest/?m=load_one&r=hour&s=by%20name&hc=4&mc=2
[12:04:16] See cluster load
[12:04:24] :( @ wait....
[12:04:37] Hopefully the new nodes are ready soon and we can go to non-redundant storage
[12:04:51] Damianz: For Bots or the whole grid?
[12:04:58] Everything
[12:05:10] As far as I know vms are going to local storage with project storage being redundant
[12:05:21] Short term solution until we find a long term working cluster
[12:05:23] Oh, ok
[12:05:38] As gluster broke horribly?
[12:06:02] Keep hitting bugs and it's slow, lags out on io rather a lot
[12:18:36] Damianz:
[12:18:36] bots-apache1
[12:18:36]
[12:18:36] Current Load
[12:18:37] WARNING 2012-07-06 12:18:04 0d 0h 13m 33s 4/4 WARNING - load average: 5.51, 6.80, 6.80
[12:18:42] Ouch!
[12:18:58] not really
[12:19:11] Look at the graphs for everything on the grid
[12:19:12] Although bot-cb has higher :P
[12:19:28] I'm looking on Nagios
[12:19:41] Look at http://ganglia.wmflabs.org/latest/?m=load_one&r=hour&s=by%20name&hc=4&mc=2
[12:21:00] Yea, I saw that earlier
[14:33:18] I created a new instance, waited for puppet to finish and then added puppet class role::mediawiki-install::labs
[14:33:23] I manually invoked puppet with sudo puppetd -tv
[14:33:33] It timed out cloning mediawiki: Git::Clone[mediawiki]/Exec[git_clone_mediawiki]/returns: change from notrun to 0 failed: Command exceeded timeout at /etc/puppet/manifests/generic-definitions.pp:750
[14:33:42] Repeated runs of puppet failed because mediawiki was never cloned: File[/srv/mediawiki/orig]/ensure: change from absent to directory failed: Cannot create /srv/mediawiki/orig; parent directory /srv/mediawiki does not exist
[14:33:58] Any ideas how I can proceed? Can I extend the timeout or configure this manually? Any help would be appreciated.
[17:54:20] preilly or someone: the vumi machine seems to be unhappy at the moment.
[17:54:42] jerith: you can force restart it in the labs console
[17:55:59] Hrm. It seems responsive enough now that I'm logged in.
[17:56:40] jerith: hmm indeed
[17:56:52] But top doesn't want to do anything.
[17:57:34] The cluster seems to have been under high load most of the day from iowait, everything's a little sluggish still.. hoping Ryan will appear at some point and boot it or the new hardware is done.
[17:58:30] wow. wtf
[17:58:33] the load is crazy
[17:59:49] I've been migrating vms
[18:00:02] I'd expect the load to go *down* though
[18:00:22] ah
[18:00:39] in fact: http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[18:00:46] It's been like it most of the day :( Uber issues with apache -> mysql and generally accessing machines, though my scripts seem to be running fine
[18:01:07] well
[18:01:12] I'll migrate some more
[18:01:18] which ones need to be migrated the most?
[18:01:23] give me recommendations ;)
[18:05:24] It seems to be taking forever to start something for the first time, and then being fine with it after that.
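On the Git::Clone timeout question at 14:33: Puppet's Exec resource takes a timeout parameter (300 seconds by default, 0 for no limit), so the clean fix is to raise it on the git_clone exec in generic-definitions.pp if the manifest exposes it. As a manual stopgap, something like the sketch below pre-seeds the clone so later agent runs can converge; the repository URL and the clone target are assumptions, not values taken from the labs manifests.

    # Pre-clone mediawiki by hand so the puppet Exec has nothing left to time out on.
    # <clone-target> stands for whatever directory Git::Clone[mediawiki] uses in
    # generic-definitions.pp; the gerrit URL is an assumption.
    sudo mkdir -p /srv/mediawiki
    sudo git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git <clone-target>
    # Then re-run the agent as before:
    sudo puppetd -tv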
[18:05:36] Personally I'd say the bots-sql* servers (keep randomly disappearing and dropping connections) or deployment-prep stuff (running like a dog) but my views are relative to my interests
[18:06:35] Also mysql needs to DIAF for 'is blocked because of many connection errors' ... yeah it had a load time out and blocking it doesn't fix the issue :(
[18:07:05] Ooh server is more responsive than earlier, slightly
[18:07:28] jerith: start something?
[18:07:34] have people been creating a shit-ton of vms? :)
[18:07:46] hm. nope
[18:07:54] anyway. I'll move bots and deployment-prep
[18:07:59] there's gonna be downtime
[18:08:14] this is cold migration, since we're moving away from gluster
[18:08:17] to the new hardware
[18:08:29] Shame really, gluster is awesome in theory
[18:08:40] yeah
[18:08:42] in theory
[18:09:22] supervisorctl worked, but then I needed to sudo it.
[18:09:35] * Ryan_Lane nods
[18:09:36] Then sudo was slow.
[18:09:43] I wonder if LDAP is overloaded
[18:09:58] Earlier it got to the point that it wouldn't read my key and bounced me out as unauthorized :(
[18:10:00] nope
[18:10:03] puppet is
[18:10:15] ldap isn't very loaded at all
[18:11:06] I'm probably going to take down deployment-prep for a while
[18:11:08] (sorry)
[18:11:17] I should write an email to the list
[18:12:41] What's the betting it doesn't start back up again? :D
[18:12:41] I've already migrated 20
[18:12:41] Though hopefully gluster won't just eat mysql's data files again
[18:12:41] * Damianz thinks petan moved them to local storage anyway
[18:12:41] he did
[18:14:37] Hmm
[18:14:57] 17:37, 1 July 2012 Hashar (Talk | contribs) deleted page Nova Resource:I-000002b5 < Still shows in instance list, is that related to your queue issues the other day or just random issue?
[18:15:24] from the queue issues
[18:15:29] should be fine if he re-deletes it now
[18:15:40] will do
[18:17:55] You said instances twice :)
[18:18:07] Ryan_Lane: I-000002b5 still has the same issue: Successfully deleted instance, but failed to remove deployment-deb DNS entry.
[18:18:28] failed from dns is normal
[18:18:35] it deleted it from dns the first time
[18:18:37] got deleted anyway ( The requested host does not exist. )
[18:20:09] now it deleted it from nova
[18:20:09] * hashar starts processing his 140 Gerrit notifications
[18:20:18] I might order a take away, it's been a long week, I'm lazy and it's a friday... seems like a good enough reason
[18:21:02] Also heh if you think the kvm migration is slow you should try restoring a Virtuozzo box from incremental backups (gzip'd tars)... took about 40 hours for ~8 containers
[18:21:22] ouch
[18:58:35] well, load is back down again after doing more migrations
[19:00:42] :)
[19:02:45] Does anyone know of an awesome dpkg guide? Like how to package stuff that isn't 'run compile, make, make install' which the init command pretty much does for you? Really should package up some of my scripts and puppetize them.
[19:04:43] the debian guide is pretty much it
[19:05:53] I'll take a read over it, used to packaging rpms and the redhat docs on it were lacking last time I read them lol. Ubuntu/Debian does tend to be a little better with docs tbf
[19:12:06] all packaging docs suck
[19:39:30] so labs is dead again ? : /
[19:39:36] I got I/O errors
[19:41:18] dumps-incr has like 1500 processes !! http://ganglia.wmflabs.org/latest/graph.php?r=4hr&z=xlarge&c=dumps&h=dumps-incr&v=1539&m=proc_total&jr=&js=&vl=+&ti=Total+Processes
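On the dpkg question at 19:02: for script-only projects with no build step, debhelper plus a one-line debian/install file is usually enough. A minimal sketch, assuming the dh-make and debhelper packages are installed; "mytool" and the file names are placeholders, not anything from this channel.

    mkdir mytool-1.0 && cd mytool-1.0
    cp ~/scripts/mytool.sh .           # the script(s) to ship
    dh_make --native --single          # generate the debian/ skeleton (name/version from the dir name)
    echo "mytool.sh usr/bin" > debian/install   # map each file to its install path
    dpkg-buildpackage -us -uc          # builds an unsigned .deb one level up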
[19:41:31] !log dumps dump-incr has skyrocketed to 1500 processes
[19:41:32] Logged the message, Master
[19:42:17] I know ryan is moving some vms, if they are being moved they should be offline though.... rather high load/io generally though
[19:42:37] dumps-incr?
[19:42:51] is that one of hydriz's systems?
[19:44:07] I think yes
[19:44:40] seems based on https://labsconsole.wikimedia.org/wiki/Nova_Resource:Dumps/SAL
[20:35:18] petan: seems that nagios instance is totally screwed
[20:35:45] eh
[20:35:47] really?
[20:35:50] yes
[20:35:56] was it recently upgraded?
[20:35:57] how
[20:36:00] no
[20:36:03] ok
[20:36:06] I didn't touch it for months
[20:36:11] does it work?
[20:36:14] maybe the block migration screwed it up
[20:36:43] the binaries have ELF errors, some services don't work
[20:36:51] it won't start its network
[20:36:53] ooh
[20:36:55] that sucks
[20:36:58] yes
[20:37:04] don't delete it before I recover all scripts
[20:37:11] I really hope the block migrations didn't screw anything else up
[20:37:12] that's fine
[20:37:14] I can mount the disks
[20:37:16] I have backups but I don't want to find them
[20:37:18] what directories would you like?
[20:37:36] I will try to ssh there, /var/nparser is most important
[20:38:01] but it contains some configs in /etc/nagios3
[20:38:06] I would like to recover that as well
[20:39:02] Ryan_Lane: did you run fsck?
[20:39:15] yes
[20:39:20] it had errors
[20:39:24] petrb@bastion1:~$ ssh nagios
[20:39:24] ssh: connect to host nagios port 22: No route to host
[20:39:26] but that didn't seem to solve it
[20:39:28] yeah
[20:39:30] it's down
[20:39:34] I'll need to recover the files for you
[20:39:41] as mentioned it won't start networking
[20:39:47] ok
[20:39:51] my home will stay I guess
[20:39:57] yep
[20:40:01] ok
[20:40:14] seems some other instances are down too
[20:40:18] * Ryan_Lane grumbles
[20:40:18] but keep the disks as backup if I find out there is more to recover
[20:40:44] it sucks to check what is down when nagios is down :)
[20:41:02] maybe we should recover it first
[20:41:08] * Ryan_Lane nods
[20:41:13] well, create a new one ;)
[20:41:18] ok
[20:41:40] but we will lose the nlogin (http admin for users of labs)
[20:41:50] so that we won't be able to control nagios much
[20:42:33] Ryan_Lane did you make any backups before you did that patch?
[20:42:38] patch?
[20:42:39] I guess no
[20:42:40] what patch?
[20:42:49] we don't have disk space for backups
[20:42:55] whatever you did before it happened
[20:43:01] I'm doing migrations
[20:43:04] to the new hardware
[20:43:07] aah
[20:43:14] I thought it's new gluster :)
[20:43:17] seems kvm's block migration support is pretty fucked up
[20:43:56] well, this isn't good
[20:44:03] it seems almost every migrated instance is fucked up
[20:46:01] how fucked? the storage? or it's just down
[20:46:29] well, some virtual machines won't boot
[20:46:41] hm, that could be a problem somewhere else
[20:46:53] I doubt it
[20:47:46] we can try to recover broken storage somehow, nagios isn't so important, but some other instances might be
[20:48:03] I can always mount the storage
[20:48:10] all these are ext4?
[20:48:10] that's fine
[20:48:23] did you fsck?
[20:48:34] ext3 and yes, as I mentioned before ;)
[20:48:52] aha, what was the result of that
[20:49:02] <^demon> Ryan_Lane: If you need an instance to mess around with, feel free to use gerrit.
It's fucked up right now anyway config-wise, so feel free to break it horribly. [20:49:22] well, any instance that migrated poorly will likely need to be rebuilt [20:49:28] that's a lot of them, so far [20:49:35] some of them are running fine [20:49:45] ^demon actually I have some "fuck-me" instances as well - dozens of [20:49:49] <^demon> I was able to ssh into gerrit. [20:50:42] !ping [20:50:42] pong [20:50:54] bots are alive [20:51:08] or at least they still seem to work [20:51:16] I've been migrating deployment-prep [20:51:31] aha, I hope you didn't migrate -sql [20:51:39] or at least -backup [20:51:42] :P [20:51:47] I did migrate backup [20:51:48] it seems [20:51:54] ok and sql [20:52:01] no clue [20:52:04] that is most data critical instance atm [20:52:20] everything else is in puppet except stuff hashar worked on [20:53:49] that is mostly in puppet though :-] [20:53:59] new nagios will be here in few min [20:55:02] the Apaches and Squid conf are only on instance [20:55:07] i need to puppetize them [20:55:07] hm [20:55:24] seems that this instance wants to come up, but can't because of the stupid nfs mount [20:55:41] yup need to migrate to /data/project [20:55:53] would have to move all the upload data first though [20:56:04] and sync with op to apply the puppet change I have not written yet [20:56:44] this really sucks [20:59:50] we never were stable so... :) [20:59:54] people could expect [21:00:00] something like this [21:00:05] strange [21:00:09] the virts on virt8 seem to be fine [21:00:24] the ones on virt6 not so much [21:00:56] I take that back, just found a corrupted one on virt8 too [21:01:13] I wonder if I can stop the instances before I move them [21:01:58] shutdown -h now [21:02:02] :P [21:02:16] I'm not sure the block-migration will happen if the instance isn't running [21:02:52] oh yeah [21:03:10] I think even the ones that came back up are fucked in some way [21:03:10] meh [21:08:29] can you open http://nagios.wmflabs.org/nagios3/ [21:09:57] yep [21:10:05] so I'm going to stop migrations now [21:10:10] this fucking sucks [21:10:36] I guess I'll do the migrations via rsync and modify the database directly [21:20:50] 34 instances are probably going to have to be rebuilt [21:21:08] petan: and as much as you're going to hate me for this, most of them are in deployment-prep and bots [21:21:49] I was trying to be nice and prioritize instances that needed performance :( [21:21:55] * Ryan_Lane sighs [21:22:16] kvm migrate removes the old storage after I take it? [21:23:18] unfortunately it does [21:23:25] sadtimes [21:23:25] openstack nova does, anyway [21:25:21] Hmm someone is playing death metal outside [21:27:47] annoying way seems it'll work without issues [21:28:52] Ryan_Lane I won't make nagios work unless you give me that /etc/nagios3 [21:29:03] there's too many mods I made [21:29:09] yeah, that's fine [21:29:12] I'll pull any file you need [21:29:20] Total Warnings: 0 [21:29:21] Total Errors: 1281 [21:29:22] ***> One or more problems was encountered while running the pre-flight check... [21:29:24] heh [21:29:45] is this new instance in the nagios project? [21:29:50] yes [21:29:53] ok [21:29:57] nagios-main [21:30:13] I think I will make more instance in nagios, like nagios-bot, nagios-apache etc later [21:30:15] * Ryan_Lane nods [21:30:25] because old nagios had troubles with that [21:30:32] load was always 5+ [21:31:03] which files out of etc do you want? 
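For the file recovery being discussed here (/var/nparser and /etc/nagios3, requested at 20:37 and 20:38), one way to do it from the compute host is to attach the dead instance's disk image read-only and copy the paths out. A rough sketch, assuming an image in the usual nova layout under /var/lib/nova/instances/; the exact image path and partition number are guesses, not something stated in this log.

    sudo modprobe nbd max_part=16
    sudo qemu-nbd --connect=/dev/nbd0 /var/lib/nova/instances/instance-00000xxx/disk
    sudo mkdir -p /mnt/rescue /srv/recovered
    sudo mount -o ro /dev/nbd0p1 /mnt/rescue         # first partition, read-only
    sudo rsync -a /mnt/rescue/etc/nagios3 /mnt/rescue/var/nparser /srv/recovered/
    sudo umount /mnt/rescue && sudo qemu-nbd --disconnect /dev/nbd0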
[21:31:13] nagios3 [21:31:16] ok [21:31:16] whole fir [21:31:20] dir [21:31:42] also /var/nparser [21:31:53] I have that but not sure if it's fresh [21:33:16] it's in /root [21:33:18] on the instance [21:33:37] aha, this new client I made doesn't change topic well [21:33:38] :D [21:37:26] Ryan_Lane /etc/nagios3 [21:37:32] you did without 3 [21:37:38] that's nrpe [21:38:06] doh [21:38:08] ok. sec [21:41:36] petan2: ok [21:41:40] it's there no [21:41:43] *now [21:45:35] nagios is back [21:46:14] Ok Warning Unknown Critical Pending [21:46:15] 1 0 0 0 1485 [21:46:19] :D [21:46:25] that's gonna take a while [21:49:13] do the other instances even trust this one? [21:49:19] or do we need to fix it in puppet? [21:49:29] it's by IP, right? [21:49:42] the nrpe config, that is [21:49:49] yes [21:49:51] true [21:49:56] new ip is 120 [21:54:33] I will spam now [21:54:49] petan2: ok, I fixed the nrpe address [21:55:01] Ryan_Lane: could the 'pdbhandler' project get allocated a public IP? i'd like to be able to demo a new extension i've developed (summarized here: http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html). the only instance in the project has ID i-0000030a and name pdbhandler-dev [21:55:50] are you on labs-l mailing list? [21:56:02] yes [21:56:07] I have a little bit of bad news [21:56:12] did you read the migration email? [21:56:17] your instance is likely corrupted [21:56:26] yes. i imagined that might affect this, but wasn't sure [21:56:30] oh. [21:56:48] second one in the list [21:56:53] so, your instance is up [21:57:18] you may be one of the lucky non-corrupted ones [21:57:37] how would i know if it's not corrupt? [21:57:54] the instance doesn't have much on it at the moment. i created it yesterday. [21:57:59] ah ok [21:58:03] I'd rebuild it then [21:58:11] it's better to know for sure it's not corrupted [21:58:13] On the bright side bots-sql2 seems ok :D [21:58:37] test [21:58:38] * Damianz really should get around to taking backups more often of mysql data [21:58:43] labs-morebots: Test failed [21:58:57] Ryan_Lane: ok, and should i re-request to have an IP allocated once i build a new instance? 
[21:59:08] I can likely give you one now [21:59:24] i'd appreciate that a ton [21:59:27] project storage can generally be used for backups [21:59:31] it's gluster [21:59:49] but, as long as you aren't doing direct io (like mysql does) it should work without bugs [22:00:14] PROBLEM Current Load is now: UNKNOWN on aggregator-test1 i-000002bf output: (No output returned from plugin) [22:00:14] PROBLEM Current Users is now: UNKNOWN on aggregator1 i-0000010c output: (No output returned from plugin) [22:00:14] PROBLEM Disk Space is now: UNKNOWN on aggregator2 i-000002c0 output: (No output returned from plugin) [22:00:14] PROBLEM Free ram is now: UNKNOWN on analytics i-000000e2 output: (No output returned from plugin) [22:00:14] PROBLEM Total Processes is now: UNKNOWN on bastion-restricted1 i-0000019b output: (No output returned from plugin) [22:00:19] PROBLEM dpkg-check is now: UNKNOWN on bastion1 i-000000ba output: (No output returned from plugin) [22:00:19] PROBLEM dpkg-check is now: UNKNOWN on blamemaps-s1 i-000002c3 output: (No output returned from plugin) [22:00:19] PROBLEM Current Load is now: UNKNOWN on bots-1 i-000000a9 output: (No output returned from plugin) [22:00:19] PROBLEM Current Users is now: UNKNOWN on bots-2 i-0000009c output: (No output returned from plugin) [22:00:19] PROBLEM Disk Space is now: UNKNOWN on bots-3 i-000000e5 output: (No output returned from plugin) [22:01:33] Emw: ok I upped your quota for floating ips [22:01:40] Damianz can you take a care of nagios bot [22:01:43] Emw: you should be able to allocate one through "Manage addresses" [22:01:54] Ryan_Lane: also, for what it's worth, i noticed an odd thing where i had deleted an instance, then created another instance with the same name (but obviously different ID), and noticed a temporary file in my home directory from the deleted instance in the new instance [22:02:03] Ryan_Lane: thank you [22:02:23] Emw: all instances in a project have shared home directories [22:02:31] Damianz I will be afk a bit, if there was problem just quiet it [22:02:33] ah, ok [22:02:33] additionally, there's large shared storage at /data/project [22:02:36] because it's gonna spam a bit [22:02:45] good to know about /data/project [22:03:00] it doesn't appear till you try to access it [22:03:04] petan2: Sure. 
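Tying together the two points just above, Damianz's note at 21:58 about taking MySQL backups more often and Ryan's point at 21:59 that gluster-backed project storage is fine for anything not doing direct I/O: a logical dump written to /data/project avoids the direct I/O problem because it is a plain sequential file. A minimal sketch; the credentials file and target directory are placeholders.

    # Dump everything to project storage; mysqldump output is a sequential write,
    # so it sidesteps the direct-io trouble mysqld itself hits on gluster.
    sudo mkdir -p /data/project/backups
    mysqldump --defaults-extra-file=/root/.my.cnf --single-transaction --all-databases \
      | gzip > /data/project/backups/mysql-$(date +%F).sql.gz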
[22:03:10] but it'll be there when you try to write to it, or read from it [22:03:13] (it auto-mounts) [22:03:32] there's also public datasets at /public/datasets [22:03:46] PROBLEM Free ram is now: UNKNOWN on bots-4 i-000000e8 output: (No output returned from plugin) [22:03:46] PROBLEM Total Processes is now: UNKNOWN on bots-dev i-00000190 output: (No output returned from plugin) [22:03:50] Ryan_Lane: Good point, forgot about that [22:03:51] PROBLEM dpkg-check is now: UNKNOWN on bots-labs i-0000015e output: (No output returned from plugin) [22:03:52] PROBLEM Current Load is now: UNKNOWN on bots-sql1 i-000000b5 output: (No output returned from plugin) [22:03:52] PROBLEM Current Users is now: UNKNOWN on bots-sql2 i-000000af output: (No output returned from plugin) [22:03:52] PROBLEM Disk Space is now: UNKNOWN on bots-sql3 i-000000b4 output: (No output returned from plugin) [22:03:52] PROBLEM Free ram is now: UNKNOWN on build-precise1 i-00000273 output: (No output returned from plugin) [22:03:52] PROBLEM Total Processes is now: UNKNOWN on building i-0000014d output: (No output returned from plugin) [22:03:57] PROBLEM dpkg-check is now: UNKNOWN on catsort-pub i-000001cc output: (No output returned from plugin) [22:03:57] PROBLEM Current Load is now: UNKNOWN on configtest-main i-000002dd output: (No output returned from plugin) [22:03:57] PROBLEM Current Load is now: UNKNOWN on demo-deployment1 i-00000276 output: (No output returned from plugin) [22:03:57] PROBLEM Current Load is now: UNKNOWN on demo-mysql1 i-00000256 output: (No output returned from plugin) [22:03:57] PROBLEM Current Users is now: UNKNOWN on demo-web1 i-00000255 output: (No output returned from plugin) [22:03:57] PROBLEM Current Users is now: UNKNOWN on demo-web2 i-00000285 output: (No output returned from plugin) [22:03:58] PROBLEM Disk Space is now: UNKNOWN on deployment-apache30 i-000002d3 output: (No output returned from plugin) [22:03:59] PROBLEM Disk Space is now: UNKNOWN on deployment-apache31 i-000002d4 output: (No output returned from plugin) [22:03:59] PROBLEM Free ram is now: UNKNOWN on deployment-bastion i-000002bd output: (No output returned from plugin) [22:03:59] PROBLEM Total Processes is now: UNKNOWN on deployment-cache-upload i-00000263 output: (No output returned from plugin) [22:04:02] PROBLEM dpkg-check is now: UNKNOWN on deployment-dbdump i-000000d2 output: (No output returned from plugin) [22:04:02] PROBLEM Current Load is now: UNKNOWN on deployment-imagescaler01 i-0000025a output: (No output returned from plugin) [22:04:02] PROBLEM Current Users is now: UNKNOWN on deployment-jobrunner05 i-0000028c output: (No output returned from plugin) [22:04:02] PROBLEM Disk Space is now: UNKNOWN on deployment-mc i-0000021b output: (No output returned from plugin) [22:04:11] Ryan_Lane: this is slightly off topic but relevant to public data sets: is the page-view data from http://stats.grok.se on wmflabs? 
[22:04:17] PROBLEM dpkg-check is now: UNKNOWN on en-wiki-db-lucid i-0000023b output: (No output returned from plugin) [22:04:19] hm [22:04:22] PROBLEM Current Load is now: UNKNOWN on exim-test i-00000265 output: (No output returned from plugin) [22:04:22] PROBLEM Current Users is now: UNKNOWN on feeds i-000000fa output: (No output returned from plugin) [22:04:22] PROBLEM Disk Space is now: UNKNOWN on firstinstance i-0000013e output: (No output returned from plugin) [22:04:22] PROBLEM Free ram is now: UNKNOWN on fundraising-civicrm i-00000169 output: (No output returned from plugin) [22:04:39] if it isn't in /public/datasets, we should look at somehow getting it added there [22:05:08] before starting on this new media handling project i began a project using java, hadoop and amazon ec2 to aggregate the hourly data that stats.grok.se uses and getting it into daily data [22:05:13] I think it's just dumps right now [22:05:20] Btw is nagios on the same ip or is nrpe just allowing the whole range? [22:05:27] Damianz: new ip [22:05:30] I added it to puppet [22:05:35] Oh cool [22:05:44] PROBLEM Current Users is now: UNKNOWN on grail i-000002c6 output: (No output returned from plugin) [22:06:05] PROBLEM Total Processes is now: UNKNOWN on opengrok-web i-000001e1 output: (No output returned from plugin) [22:06:10] PROBLEM dpkg-check is now: UNKNOWN on orgcharts-dev i-0000018f output: (No output returned from plugin) [22:06:10] PROBLEM Current Users is now: UNKNOWN on outreacheval i-0000012e output: (No output returned from plugin) [22:06:10] PROBLEM Disk Space is now: UNKNOWN on p-b i-000000ae output: (No output returned from plugin) [22:06:11] PROBLEM Free ram is now: UNKNOWN on pageviews i-000000b2 output: (No output returned from plugin) [22:06:11] PROBLEM Total Processes is now: UNKNOWN on pdbhandler-dev i-0000030a output: (No output returned from plugin) [22:06:15] PROBLEM dpkg-check is now: UNKNOWN on pediapress-ocg1 i-00000233 output: (No output returned from plugin) [22:06:16] PROBLEM Current Load is now: UNKNOWN on pediapress-packager i-000001e4 output: (No output returned from plugin) [22:06:16] PROBLEM Current Users is now: UNKNOWN on precise-test i-00000231 output: (No output returned from plugin) [22:06:16] PROBLEM Disk Space is now: UNKNOWN on psm-precise i-000002f2 output: (No output returned from plugin) [22:06:16] PROBLEM Disk Space is now: UNKNOWN on publicdata-administration i-0000019e output: (No output returned from plugin) [22:06:16] PROBLEM Free ram is now: UNKNOWN on puppet-abogott i-0000030b output: (No output returned from plugin) [22:06:30] !nagios [22:06:31] http://208.80.153.210/nagios3 http://nagios.wmflabs.org/nagios3 [22:07:15] should an instance that i plan to map a public IP to be in availability zone 'nova' or 'pmtpa'? 
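On the NRPE question at 22:05 (and the earlier "it's by IP" exchange around 21:49): each monitored instance lists the Nagios server in the allowed_hosts directive of its nrpe.cfg, which is why the new server's address had to be pushed out, and on Labs that change went through puppet per the 22:05:30 remark. For a single host the change looks roughly like this; <NEW_NAGIOS_IP> is a placeholder, since the log only says the new address ends in 120.

    # Point the NRPE agent at the new Nagios server and restart it.
    sudo sed -i 's/^allowed_hosts=.*/allowed_hosts=<NEW_NAGIOS_IP>/' /etc/nagios/nrpe.cfg
    sudo service nagios-nrpe-server restart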
[22:07:37] PROBLEM Current Users is now: UNKNOWN on aggregator-test1 i-000002bf output: (No output returned from plugin) [22:07:37] PROBLEM Disk Space is now: UNKNOWN on aggregator1 i-0000010c output: (No output returned from plugin) [22:07:37] PROBLEM Free ram is now: UNKNOWN on aggregator2 i-000002c0 output: (No output returned from plugin) [22:07:39] well, destroying 1/6 of the instances sure seems to have brought the load down [22:07:48] PROBLEM Total Processes is now: UNKNOWN on asher1 i-0000003a output: (No output returned from plugin) [22:07:49] :D [22:07:53] PROBLEM dpkg-check is now: UNKNOWN on bastion-restricted1 i-0000019b output: (No output returned from plugin) [22:07:53] PROBLEM Current Load is now: UNKNOWN on blamemaps-s1 i-000002c3 output: (No output returned from plugin) [22:07:53] PROBLEM Current Load is now: UNKNOWN on bob i-0000012d output: (No output returned from plugin) [22:07:53] PROBLEM Current Users is now: UNKNOWN on bots-1 i-000000a9 output: (No output returned from plugin) [22:07:53] PROBLEM Disk Space is now: UNKNOWN on bots-2 i-0000009c output: (No output returned from plugin) [22:07:53] PROBLEM Free ram is now: UNKNOWN on bots-3 i-000000e5 output: (No output returned from plugin) [22:07:59] Emw: "nova" is there due to a really old bug. it doesn't actually matter [22:08:02] they map to the same thing [22:08:08] PROBLEM Total Processes is now: UNKNOWN on bots-cb i-0000009e output: (No output returned from plugin) [22:08:13] PROBLEM dpkg-check is now: UNKNOWN on bots-dev i-00000190 output: (No output returned from plugin) [22:08:13] PROBLEM Current Load is now: UNKNOWN on bots-nfs i-000000b1 output: (No output returned from plugin) [22:08:13] PROBLEM Current Users is now: UNKNOWN on bots-sql1 i-000000b5 output: (No output returned from plugin) [22:08:13] PROBLEM Disk Space is now: UNKNOWN on bots-sql2 i-000000af output: (No output returned from plugin) [22:08:13] PROBLEM Free ram is now: UNKNOWN on bots-sql3 i-000000b4 output: (No output returned from plugin) [22:08:16] I'm really pissed off about this [22:08:23] PROBLEM Total Processes is now: UNKNOWN on build1 i-000002b3 output: (No output returned from plugin) [22:08:28] I checked the first 5 instances I did [22:08:29] PROBLEM dpkg-check is now: UNKNOWN on building i-0000014d output: (No output returned from plugin) [22:08:29] PROBLEM Current Load is now: UNKNOWN on cn-wiki-db-lucid i-00000241 output: (No output returned from plugin) [22:08:29] PROBLEM Current Users is now: UNKNOWN on configtest-main i-000002dd output: (No output returned from plugin) [22:08:29] PROBLEM Current Users is now: UNKNOWN on demo-deployment1 i-00000276 output: (No output returned from plugin) [22:08:29] PROBLEM Current Users is now: UNKNOWN on demo-mysql1 i-00000256 output: (No output returned from plugin) [22:08:30] PROBLEM Disk Space is now: UNKNOWN on demo-web1 i-00000255 output: (No output returned from plugin) [22:08:30] PROBLEM Disk Space is now: UNKNOWN on demo-web2 i-00000285 output: (No output returned from plugin) [22:08:30] PROBLEM Free ram is now: UNKNOWN on deployment-apache30 i-000002d3 output: (No output returned from plugin) [22:08:31] PROBLEM Free ram is now: UNKNOWN on deployment-apache31 i-000002d4 output: (No output returned from plugin) [22:08:43] Ryan_Lane: ok, thanks. 
in that case i'll go with pmtpa [22:08:45] apparently I didn't check well enough :( [22:08:59] PROBLEM Total Processes is now: UNKNOWN on deployment-cache-bits i-00000264 output: (No output returned from plugin) [22:09:05] PROBLEM dpkg-check is now: UNKNOWN on deployment-cache-upload i-00000263 output: (No output returned from plugin) [22:09:05] PROBLEM Current Load is now: UNKNOWN on deployment-feed i-00000118 output: (No output returned from plugin) [22:09:05] PROBLEM Current Users is now: UNKNOWN on deployment-imagescaler01 i-0000025a output: (No output returned from plugin) [22:09:05] PROBLEM Disk Space is now: UNKNOWN on deployment-jobrunner05 i-0000028c output: (No output returned from plugin) [22:09:05] PROBLEM Free ram is now: UNKNOWN on deployment-mc i-0000021b output: (No output returned from plugin) [22:09:20] PROBLEM Total Processes is now: UNKNOWN on deployment-sql i-000000d0 output: (No output returned from plugin) [22:09:25] PROBLEM dpkg-check is now: UNKNOWN on deployment-squid i-000000dc output: (No output returned from plugin) [22:09:25] PROBLEM Current Load is now: UNKNOWN on deployment-wmsearch i-000000e1 output: (No output returned from plugin) [22:09:25] PROBLEM Current Users is now: UNKNOWN on dev-solr i-00000152 output: (No output returned from plugin) [22:09:26] PROBLEM Disk Space is now: UNKNOWN on dumps-1 i-00000170 output: (No output returned from plugin) [22:11:06] Shouldn't Public IP in labsconsole list assigned floating ips on the nova recourse page or are they mapped outside of the recourse knowing about them? [22:12:48] PROBLEM Free ram is now: UNKNOWN on dumps-2 i-000002d8 output: (No output returned from plugin) [22:12:48] PROBLEM Total Processes is now: UNKNOWN on e3 i-00000291 output: (No output returned from plugin) [22:12:53] PROBLEM Total Processes is now: UNKNOWN on ee-prototype i-0000013d output: (No output returned from plugin) [22:13:03] PROBLEM dpkg-check is now: UNKNOWN on embed-sandbox i-000000d1 output: (No output returned from plugin) [22:13:03] PROBLEM Current Load is now: UNKNOWN on etherpad-lite i-000002de output: (No output returned from plugin) [22:13:03] PROBLEM Current Users is now: UNKNOWN on exim-test i-00000265 output: (No output returned from plugin) [22:13:03] PROBLEM Disk Space is now: UNKNOWN on feeds i-000000fa output: (No output returned from plugin) [22:13:03] PROBLEM Free ram is now: UNKNOWN on firstinstance i-0000013e output: (No output returned from plugin) [22:13:04] PROBLEM Total Processes is now: UNKNOWN on fundraising-db i-0000015c output: (No output returned from plugin) [22:13:08] PROBLEM Current Load is now: UNKNOWN on gerrit i-000000ff output: (No output returned from plugin) [22:13:24] PROBLEM Current Load is now: UNKNOWN on nova-gsoc1 i-000001de output: (No output returned from plugin) [22:13:24] PROBLEM Current Users is now: UNKNOWN on nova-ldap1 i-000000df output: (No output returned from plugin) [22:13:24] PROBLEM Disk Space is now: UNKNOWN on nova-ldap2 i-00000238 output: (No output returned from plugin) [22:13:24] PROBLEM Free ram is now: UNKNOWN on nova-precise1 i-00000236 output: (No output returned from plugin) [22:13:24] PROBLEM dpkg-check is now: UNKNOWN on opengrok-web i-000001e1 output: (No output returned from plugin) [22:13:25] PROBLEM Current Load is now: UNKNOWN on otrs-jgreen i-0000015a output: (No output returned from plugin) [22:14:05] Damianz: that's a bug [22:14:13] Ah [22:14:17] I'm not planning on fixing it [22:14:26] page editing is going to be handled by nova [22:14:29] When the new api 
support is done maybe? [22:14:48] andrewbogott has it written as a nova plugin [22:14:55] :) [22:14:58] <3 andrewbogott [22:15:10] So we can get nova keys and use sexy cli tools? :) [22:17:40] eventually that's the idea ;) [22:18:09] dns, puppet, and page editing is all we needed for that, I think [22:20:00] Page editing is a bit of a weird one in regards to sync, dns at least is either there or not [22:20:20] PROBLEM Current Load is now: UNKNOWN on mwreview i-000002ae output: (No output returned from plugin) [22:20:20] PROBLEM Free ram is now: UNKNOWN on nagios-main i-0000030d output: (No output returned from plugin) [22:20:20] PROBLEM Total Processes is now: UNKNOWN on nginx-ffuqua-doom1-3 i-00000196 output: (No output returned from plugin) [22:20:25] PROBLEM Current Load is now: UNKNOWN on nova-dev3 i-000000e9 output: (No output returned from plugin) [22:20:25] PROBLEM Current Users is now: UNKNOWN on nova-gsoc1 i-000001de output: (No output returned from plugin) [22:20:25] PROBLEM Disk Space is now: UNKNOWN on nova-ldap1 i-000000df output: (No output returned from plugin) [22:20:25] PROBLEM Free ram is now: UNKNOWN on nova-ldap2 i-00000238 output: (No output returned from plugin) [22:20:31] PROBLEM Disk Space is now: UNKNOWN on hugglewa-1 i-000001e0 output: (No output returned from plugin) [22:20:31] PROBLEM Free ram is now: UNKNOWN on hugglewa-db i-00000188 output: (No output returned from plugin) [22:20:31] PROBLEM Total Processes is now: UNKNOWN on incubator-bot1 i-00000251 output: (No output returned from plugin) [22:20:36] PROBLEM dpkg-check is now: UNKNOWN on incubator-bot2 i-00000252 output: (No output returned from plugin) [22:20:36] PROBLEM Current Load is now: UNKNOWN on integration-apache1 i-000002eb output: (No output returned from plugin) [22:20:36] PROBLEM Current Load is now: UNKNOWN on ipv6test1 i-00000282 output: (No output returned from plugin) [22:20:36] PROBLEM Current Users is now: UNKNOWN on kripke i-00000268 output: (No output returned from plugin) [22:20:36] PROBLEM Free ram is now: UNKNOWN on labs-build1 i-0000006b output: (No output returned from plugin) [22:20:36] PROBLEM Total Processes is now: UNKNOWN on labs-nfs1 i-0000005d output: (No output returned from plugin) [22:20:41] PROBLEM dpkg-check is now: UNKNOWN on labs-realserver i-00000104 output: (No output returned from plugin) [22:20:41] PROBLEM Current Load is now: UNKNOWN on localpuppet1 i-0000020b output: (No output returned from plugin) [22:20:41] PROBLEM Current Users is now: UNKNOWN on localpuppet2 i-0000029b output: (No output returned from plugin) [22:20:41] PROBLEM Disk Space is now: UNKNOWN on log1 i-00000239 output: (No output returned from plugin) [22:20:41] PROBLEM Free ram is now: UNKNOWN on mailman-01 i-00000235 output: (No output returned from plugin) [22:20:42] PROBLEM Free ram is now: UNKNOWN on maps-test2 i-00000253 output: (No output returned from plugin) [22:20:42] PROBLEM Free ram is now: UNKNOWN on maps-test3 i-0000028f output: (No output returned from plugin) [22:20:43] PROBLEM Total Processes is now: UNKNOWN on maps-tilemill1 i-00000294 output: (No output returned from plugin) [22:20:46] PROBLEM dpkg-check is now: UNKNOWN on master i-0000007a output: (No output returned from plugin) [22:20:47] PROBLEM Current Load is now: UNKNOWN on migration1 i-00000261 output: (No output returned from plugin) [22:20:47] PROBLEM Current Users is now: UNKNOWN on mingledbtest i-00000283 output: (No output returned from plugin) [22:20:47] PROBLEM Disk Space is now: UNKNOWN on mobile-feeds 
i-000000c1 output: (No output returned from plugin) [22:20:47] PROBLEM Free ram is now: UNKNOWN on mobile-testing i-00000271 output: (No output returned from plugin) [22:20:52] Damianz: well, that's why nova should handle editing [22:20:58] it knows when things change [22:21:01] PROBLEM Total Processes is now: UNKNOWN on pybal-precise i-00000289 output: (No output returned from plugin) [22:21:06] PROBLEM dpkg-check is now: UNKNOWN on queue-wiki1 i-000002b8 output: (No output returned from plugin) [22:21:06] PROBLEM Current Load is now: UNKNOWN on redis1 i-000002b6 output: (No output returned from plugin) [22:21:06] PROBLEM Current Users is now: UNKNOWN on reportcard2 i-000001ea output: (No output returned from plugin) [22:21:06] PROBLEM Disk Space is now: UNKNOWN on resourceloader2-apache i-000001d7 output: (No output returned from plugin) [22:21:06] PROBLEM Disk Space is now: UNKNOWN on robh2 i-000001a2 output: (No output returned from plugin) [22:21:06] PROBLEM Free ram is now: UNKNOWN on scribunto i-0000022c output: (No output returned from plugin) [22:21:07] PROBLEM Total Processes is now: UNKNOWN on signwriting-ase5 i-0000030c output: (No output returned from plugin) [22:21:11] PROBLEM dpkg-check is now: UNKNOWN on simplewikt i-00000149 output: (No output returned from plugin) [22:21:11] PROBLEM Current Load is now: UNKNOWN on su-be1 i-000002e7 output: (No output returned from plugin) [22:21:11] PROBLEM Current Users is now: UNKNOWN on su-be2 i-000002e8 output: (No output returned from plugin) [22:21:11] PROBLEM Disk Space is now: UNKNOWN on su-be3 i-000002e9 output: (No output returned from plugin) [22:21:11] PROBLEM Free ram is now: UNKNOWN on su-fe1 i-000002e5 output: (No output returned from plugin) [22:21:12] PROBLEM Total Processes is now: UNKNOWN on swift-aux1 i-0000024b output: (No output returned from plugin) [22:21:17] PROBLEM dpkg-check is now: UNKNOWN on swift-aux2 i-0000024c output: (No output returned from plugin) [22:21:17] PROBLEM Current Load is now: UNKNOWN on swift-be2 i-000001c8 output: (No output returned from plugin) [22:21:17] PROBLEM Current Users is now: UNKNOWN on swift-be3 i-000001c9 output: (No output returned from plugin) [22:21:17] PROBLEM Disk Space is now: UNKNOWN on swift-be4 i-000001ca output: (No output returned from plugin) [22:21:17] PROBLEM Free ram is now: UNKNOWN on swift-fe1 i-000001d2 output: (No output returned from plugin) [22:21:17] PROBLEM Total Processes is now: UNKNOWN on test2 i-0000013c output: (No output returned from plugin) [22:21:22] PROBLEM Current Load is now: UNKNOWN on testforx i-000002f3 output: (No output returned from plugin) [22:21:22] PROBLEM Current Users is now: UNKNOWN on testing-virt6 i-00000302 output: (No output returned from plugin) [22:21:22] PROBLEM Disk Space is now: UNKNOWN on testing-virt7 i-00000308 output: (No output returned from plugin) [22:21:22] PROBLEM Free ram is now: UNKNOWN on testing-virt8 i-00000309 output: (No output returned from plugin) [22:21:42] PROBLEM Total Processes is now: UNKNOWN on translation-memory-2 i-000002d9 output: (No output returned from plugin) [22:21:47] PROBLEM dpkg-check is now: UNKNOWN on tutorial-mysql i-0000028b output: (No output returned from plugin) [22:21:47] PROBLEM Current Load is now: UNKNOWN on udp-filter i-000001df output: (No output returned from plugin) [22:21:47] PROBLEM Current Users is now: UNKNOWN on upload-wizard i-0000021c output: (No output returned from plugin) [22:21:47] PROBLEM Current Users is now: UNKNOWN on utils-abogott i-00000131 output: (No output 
returned from plugin) [22:21:47] PROBLEM Disk Space is now: UNKNOWN on varnish i-000001ac output: (No output returned from plugin) [22:21:48] PROBLEM Free ram is now: UNKNOWN on ve-nodejs i-00000245 output: (No output returned from plugin) [22:22:02] PROBLEM Total Processes is now: UNKNOWN on vivek-puppet i-000000ca output: (No output returned from plugin) [22:22:07] PROBLEM dpkg-check is now: UNKNOWN on vumi i-000001e5 output: (No output returned from plugin) [22:22:08] PROBLEM Current Load is now: UNKNOWN on webserver-lcarr i-00000134 output: (No output returned from plugin) [22:22:08] PROBLEM Current Load is now: UNKNOWN on wep i-000000c2 output: (No output returned from plugin) [22:22:08] PROBLEM Current Users is now: UNKNOWN on wikidata-dev-1 i-0000020c output: (No output returned from plugin) [22:22:08] PROBLEM Disk Space is now: UNKNOWN on wikidata-dev-2 i-00000259 output: (No output returned from plugin) [22:22:08] PROBLEM Free ram is now: UNKNOWN on wikidata-dev-3 i-00000225 output: (No output returned from plugin) [22:22:18] PROBLEM Total Processes is now: UNKNOWN on wikistats-01 i-00000042 output: (No output returned from plugin) [22:22:23] PROBLEM dpkg-check is now: UNKNOWN on wikistats-history-01 i-000002e2 output: (No output returned from plugin) [22:22:24] PROBLEM Current Load is now: UNKNOWN on wmde-test i-000002ad output: (No output returned from plugin) [22:22:24] PROBLEM Current Load is now: UNKNOWN on worker1 i-00000208 output: (No output returned from plugin) [22:22:24] PROBLEM Current Users is now: UNKNOWN on zeromq1 i-000002b7 output: (No output returned from plugin) [22:23:14] PROBLEM Current Load is now: UNKNOWN on maps-tilemill1 i-00000294 output: (No output returned from plugin) [22:23:14] PROBLEM Current Users is now: UNKNOWN on master i-0000007a output: (No output returned from plugin) [22:23:15] PROBLEM Disk Space is now: UNKNOWN on memcache-puppet i-00000153 output: (No output returned from plugin) [22:23:15] PROBLEM Free ram is now: UNKNOWN on migration1 i-00000261 output: (No output returned from plugin) [22:23:15] PROBLEM Total Processes is now: UNKNOWN on mobile-feeds i-000000c1 output: (No output returned from plugin) [22:23:20] PROBLEM dpkg-check is now: UNKNOWN on mobile-testing i-00000271 output: (No output returned from plugin) [22:23:20] PROBLEM Current Users is now: UNKNOWN on mwreview i-000002ae output: (No output returned from plugin) [22:23:20] PROBLEM Total Processes is now: UNKNOWN on nginx-dev1 i-000000f0 output: (No output returned from plugin) [22:23:25] PROBLEM dpkg-check is now: UNKNOWN on nginx-ffuqua-doom1-3 i-00000196 output: (No output returned from plugin) [22:23:25] PROBLEM Disk Space is now: WARNING on nagios 127.0.0.1 output: DISK WARNING - free space: /public/keys 2832 MB (16% inode=74%): /home/petrb 2832 MB (16% inode=74%): /home/autofs_check 2832 MB (16% inode=74%): [22:25:08] PROBLEM dpkg-check is now: UNKNOWN on log1 i-00000239 output: (No output returned from plugin) [22:25:08] PROBLEM dpkg-check is now: UNKNOWN on mailman-01 i-00000235 output: (No output returned from plugin) [22:25:09] PROBLEM dpkg-check is now: UNKNOWN on maps-test2 i-00000253 output: (No output returned from plugin) [22:25:09] PROBLEM Current Users is now: UNKNOWN on maps-tilemill1 i-00000294 output: (No output returned from plugin) [22:25:09] PROBLEM Disk Space is now: UNKNOWN on master i-0000007a output: (No output returned from plugin) [22:25:09] PROBLEM Free ram is now: UNKNOWN on memcache-puppet i-00000153 output: (No output returned from plugin) 
[22:25:09] PROBLEM Total Processes is now: UNKNOWN on mingledbtest i-00000283 output: (No output returned from plugin) [22:25:14] PROBLEM dpkg-check is now: UNKNOWN on mobile-feeds i-000000c1 output: (No output returned from plugin) [22:25:14] PROBLEM Current Load is now: UNKNOWN on mobile-wlm i-000002bc output: (No output returned from plugin) [22:27:07] PROBLEM Current Load is now: UNKNOWN on ganglia-test2 i-00000250 output: (No output returned from plugin) [22:27:07] PROBLEM Disk Space is now: UNKNOWN on grail i-000002c6 output: (No output returned from plugin) [22:27:07] PROBLEM Free ram is now: UNKNOWN on hugglewa-1 i-000001e0 output: (No output returned from plugin) [22:27:07] PROBLEM Total Processes is now: UNKNOWN on incubator-bot0 i-00000296 output: (No output returned from plugin) [22:27:12] PROBLEM dpkg-check is now: UNKNOWN on incubator-bot1 i-00000251 output: (No output returned from plugin) [22:27:12] PROBLEM Current Load is now: UNKNOWN on incubator-common i-00000254 output: (No output returned from plugin) [22:27:12] PROBLEM Current Users is now: UNKNOWN on integration-apache1 i-000002eb output: (No output returned from plugin) [22:27:12] PROBLEM Current Users is now: UNKNOWN on ipv6test1 i-00000282 output: (No output returned from plugin) [22:27:12] PROBLEM Disk Space is now: UNKNOWN on kripke i-00000268 output: (No output returned from plugin) [22:27:12] PROBLEM Total Processes is now: UNKNOWN on labs-lvs1 i-00000057 output: (No output returned from plugin) [22:27:17] PROBLEM dpkg-check is now: UNKNOWN on labs-nfs1 i-0000005d output: (No output returned from plugin) [22:27:17] PROBLEM Current Load is now: UNKNOWN on labs-relay i-00000103 output: (No output returned from plugin) [22:27:17] PROBLEM Current Users is now: UNKNOWN on localpuppet1 i-0000020b output: (No output returned from plugin) [22:27:17] PROBLEM Disk Space is now: UNKNOWN on localpuppet2 i-0000029b output: (No output returned from plugin) [22:27:17] PROBLEM Free ram is now: UNKNOWN on log1 i-00000239 output: (No output returned from plugin) [22:27:22] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: Connection refused [22:27:22] PROBLEM Disk Space is now: UNKNOWN on outreacheval i-0000012e output: (No output returned from plugin) [22:27:22] PROBLEM Free ram is now: UNKNOWN on p-b i-000000ae output: (No output returned from plugin) [22:27:22] PROBLEM Total Processes is now: UNKNOWN on patchtest i-000000f1 output: (No output returned from plugin) [22:27:27] PROBLEM Total Processes is now: UNKNOWN on patchtest2 i-000000fd output: (No output returned from plugin) [22:27:33] PROBLEM Current Load is now: UNKNOWN on pediapress-ocg2 i-00000234 output: (No output returned from plugin) [22:27:33] PROBLEM Current Users is now: UNKNOWN on pediapress-packager i-000001e4 output: (No output returned from plugin) [22:27:33] PROBLEM Disk Space is now: UNKNOWN on precise-test i-00000231 output: (No output returned from plugin) [22:27:33] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: (No output returned from plugin) [22:27:33] PROBLEM Free ram is now: UNKNOWN on publicdata-administration i-0000019e output: (No output returned from plugin) [22:27:34] PROBLEM Total Processes is now: UNKNOWN on puppet-lucid i-00000080 output: (No output returned from plugin) [22:27:48] PROBLEM Current Users is now: UNKNOWN on redis1 i-000002b6 output: (No output returned from plugin) [22:27:48] PROBLEM Disk Space is now: UNKNOWN on reportcard2 i-000001ea output: (No output returned from plugin) [22:27:53] 
PROBLEM Current Load is now: UNKNOWN on labs-build1 i-0000006b output: (No output returned from plugin) [22:27:53] PROBLEM Current Users is now: UNKNOWN on labs-lvs1 i-00000057 output: (No output returned from plugin) [22:27:54] PROBLEM Disk Space is now: UNKNOWN on labs-nfs1 i-0000005d output: (No output returned from plugin) [22:27:54] PROBLEM Free ram is now: UNKNOWN on labs-realserver i-00000104 output: (No output returned from plugin) [22:27:54] PROBLEM Total Processes is now: UNKNOWN on secondinstance i-0000015b output: (No output returned from plugin) [22:27:58] PROBLEM Free ram is now: UNKNOWN on robh2 i-000001a2 output: (No output returned from plugin) [22:27:59] PROBLEM Current Load is now: UNKNOWN on rds i-00000207 output: (No output returned from plugin) [22:27:59] PROBLEM Free ram is now: UNKNOWN on resourceloader2-apache i-000001d7 output: (No output returned from plugin) [22:27:59] PROBLEM dpkg-check is now: UNKNOWN on pybal-precise i-00000289 output: (No output returned from plugin) [22:27:59] PROBLEM Total Processes is now: UNKNOWN on localpuppet1 i-0000020b output: (No output returned from plugin) [22:28:04] PROBLEM dpkg-check is now: UNKNOWN on localpuppet2 i-0000029b output: (No output returned from plugin) [22:28:04] PROBLEM Total Processes is now: UNKNOWN on shop-analytics-main i-000001e6 output: (No output returned from plugin) [22:28:09] PROBLEM dpkg-check is now: UNKNOWN on signwriting-ase5 i-0000030c output: (No output returned from plugin) [22:28:09] PROBLEM Current Load is now: UNKNOWN on su-aux1 i-000002ea output: (No output returned from plugin) [22:28:09] PROBLEM Current Users is now: UNKNOWN on su-be1 i-000002e7 output: (No output returned from plugin) [22:28:09] PROBLEM Disk Space is now: UNKNOWN on su-be2 i-000002e8 output: (No output returned from plugin) [22:28:09] PROBLEM Free ram is now: UNKNOWN on su-be3 i-000002e9 output: (No output returned from plugin) [22:28:10] PROBLEM Current Load is now: UNKNOWN on mailman-01 i-00000235 output: (No output returned from plugin) [22:28:10] PROBLEM Current Load is now: UNKNOWN on maps-test2 i-00000253 output: (No output returned from plugin) [22:28:10] PROBLEM Current Load is now: UNKNOWN on maps-test3 i-0000028f output: (No output returned from plugin) [22:28:19] PROBLEM Total Processes is now: UNKNOWN on su-fe2 i-000002e6 output: (No output returned from plugin) [22:28:24] PROBLEM dpkg-check is now: UNKNOWN on swift-aux1 i-0000024b output: (No output returned from plugin) [22:28:24] PROBLEM Current Load is now: UNKNOWN on swift-be1 i-000001c7 output: (No output returned from plugin) [22:28:24] PROBLEM Current Users is now: UNKNOWN on swift-be2 i-000001c8 output: (No output returned from plugin) [22:28:24] PROBLEM Disk Space is now: UNKNOWN on swift-be3 i-000001c9 output: (No output returned from plugin) [22:28:25] PROBLEM Free ram is now: UNKNOWN on swift-be4 i-000001ca output: (No output returned from plugin) [22:28:40] PROBLEM Total Processes is now: UNKNOWN on test-oneiric i-00000187 output: (No output returned from plugin) [22:28:45] PROBLEM dpkg-check is now: UNKNOWN on test2 i-0000013c output: (No output returned from plugin) [22:29:41] PROBLEM Current Load is now: UNKNOWN on testblog i-00000167 output: (No output returned from plugin) [22:29:41] PROBLEM Current Users is now: UNKNOWN on testforx i-000002f3 output: (No output returned from plugin) [22:29:41] PROBLEM Disk Space is now: UNKNOWN on testing-virt6 i-00000302 output: (No output returned from plugin) [22:29:42] PROBLEM Free ram is now: UNKNOWN on 
testing-virt7 i-00000308 output: (No output returned from plugin) [22:30:52] Sorry nagios but I'm not scrolling up 3 pages to read the last time :( [22:33:17] Some interesting-looking EuroPython videos out from this last week [22:34:05] it'll be a while till it stops spamming, for sure [22:34:15] Yeah... [22:34:37] It probably hasn't checked all the hosts yet, was going to look but I can't see the ip lol [22:34:39] <^demon|away> I /ignored that bot ages ago :p [22:34:44] Load is pretty constant [22:38:58] yeah, load is back to being sane [22:39:18] each host has 1/4 fewer instances [22:39:36] which means less swap, and less io [22:40:17] Well we have more ram now too... or sorta do :D [22:41:44] on the new hosts we have more than 3x the amount of ram [22:42:04] so, the good thing is any new instance will be launched on the new hardware [22:48:22] are the instructions on forcing puppet runs given on https://labsconsole.wikimedia.org/wiki/Help:Instances#Configuring_instance still valid? [22:48:25] You switched the scheduler back to most-free node rather than round robin? Or was that the last switch from round robin to next free, totally can't remember [22:48:44] Emw: Yes [22:50:50] when i try sudo-running a command, e.g. 'sudo puppetd -tv' as instructed at that link, i'm told 'emw is not allowed to run sudo on i-0000030e'. i'm a sysadmin and netadmin on the project that instance is a part of. [22:51:29] Hmm, might not have a sudo policy set up right [22:51:44] when i try running the command mentioned at that linked section without sudo, i.e. 'puppetd -tv', i get the following message: [22:51:46] err: Could not request certificate: getaddrinfo: Name or service not known [22:51:48] Exiting; failed to retrieve certificate and waitforcert is disabled [22:51:55] There's a "Manage Sudo Policies" link that lets you configure sudo access per project [22:52:09] ah, i'll look at that. thanks. [22:52:28] I thought the default was to allow but as all my projects have funky rules I'm not sure :* [22:52:31] :(* [22:53:53] Damianz: last change was next free [22:54:04] so, hosts with fewer instances will get the next instance [22:54:41] I'm not sure why the default doesn't get set [22:54:48] it's a bug in labsconsole [22:55:05] is there any help content on 'Modify Sudo Policy'? site-searching 'sudo' doesn't turn up much and i don't recall seeing anything relevant to that while looking through the other help content. i can probably guess how 'Modify Sudo Policy' works, but ideally i'd like to read over any documentation first before changing project configuration around. [22:55:29] Emw: make a policy called "default" [22:55:30] Basically just set users to all, commands to all if you don't require restricting access [22:55:43] * Damianz adds hosts all in there somewhere [22:55:45] for hosts, set "ALL", for users set "ALL", and for commands, set "ALL" [22:55:53] none of those should have quotes, of course [22:56:49] ok. 'ALL' is a checkbox option for users and hosts. 'Commands' and 'Options' are both text inputs -- i'll put ALL into those too [22:56:57] not for options [22:57:06] whoops [22:57:12] in general you'll never use options [22:57:54] modified policy to have options left blank [22:58:29] I kinda really hate the idea of sudo... DAC is really lame for some things, need to get down with MAC/RBAC some more [22:58:57] 'sudo puppetd -tv' seems to be working now, woo! [23:00:26] oh.
seems OpenStackManager doesn't have a default policy created by default [23:00:27] weird [23:00:45] and my test instance is down [23:00:55] think it'd make sense to add a note to add a new policy 'Manage Sudo Policy' named 'default' with 'ALL' set for users, hosts, and commands (but not options) to https://labsconsole.wikimedia.org/wiki/Help:Instances#Configuring_instance? [23:00:56] because it's one of the migrated ones [23:01:04] if so, i can do that [23:01:11] hm [23:01:14] probably not there [23:01:36] alright [23:01:50] on here somewhere: https://labsconsole.wikimedia.org/wiki/Help:Contents [23:01:59] probably under the interface section [23:02:06] a new page for managing sudo policies would be good [23:03:11] i'll add that basic note there, then, and if it makes sense link to it from https://labsconsole.wikimedia.org/wiki/Help:Instances#Configuring_instance (where i ran into this issue) [23:03:25] * Ryan_Lane nods [23:03:29] sounds good [23:12:24] must a user be a sysadmin for the given project in order to create/modify/delete sudo policies for that project? i'd imagine so, but i just want to check. [23:12:32] Emw: did you delete 30a? [23:12:40] yes [23:12:42] ok [23:12:54] you must be a sysadmin in a project to modify sudo policies, yes [23:32:18] a new page https://labsconsole.wikimedia.org/wiki/Help:Sudo_Policies and an edit: https://labsconsole.wikimedia.org/w/index.php?title=Help:Instances&diff=4661&oldid=3923. reviews for accuracy welcome. [23:40:48] thanks [23:45:38] mutante: I probably broke some of your instances [23:45:45] Ryan_Lane: you killed > i-000000c1 <-- mobile-feeds in mobile project ?!? [23:46:01] kvm block migration did [23:46:06] you need anything saved from it? [23:46:13] I still have access to its data [23:46:30] Ryan_Lane: no it's fine [23:46:34] Ryan_Lane: but damn dude [23:46:43] you sure? not a problem for me to pull crap from it [23:46:51] DUDE [23:46:54] Ryan_Lane: no it's all good [23:46:57] heh [23:47:08] preilly: wasn't on purpose ;) [23:47:25] I tested a bunch of instance before I started migrating en masse [23:47:29] *instances [23:48:00] actually, that instance is up [23:48:47] though it could be corrupted [23:48:55] preilly: so, if it isn't broken, keep on using it [23:50:09] hah [23:50:13] it's broken. for sure [23:50:17] root@i-000000c1:/var/log# file /usr/bin/w.procps [23:50:18] -bash: /usr/bin/file: cannot execute binary file [23:50:18] r [23:50:23] lol [23:50:28] root@i-000000c1:/var/log# ldd /usr/bin/file [23:50:28] -bash: /usr/bin/ldd: cannot execute binary file [23:51:01] when that instance reboots it's a goner [23:51:05] Who needs ldd, I mean meh just static compile everything :D [23:51:57] for the rest of the instances, I'm going to shut them down, rsync their disk files, modify the nova database, and bring them up [23:52:31] I've tested that and it works perfectly [23:54:35] I wonder what performance difference you'd get rsyncing from the data dir vs rsyncing from the gluster mount... should be the same file in theory but the io difference could be interesting. [23:55:16] eww [23:55:17] no [23:55:24] I'm going to go from the gluster mount [23:55:24] heh [23:55:55] lol [23:56:11] I just had corruption issues, I'd prefer not to press my luck [23:56:38] petan2: anything I can do to make you guys' life easier in deployment-prep? [23:56:44] Well yeah but doing it that way and not using the delete flag would be pretty safe :) [23:56:48] you mentioned they should all be puppetized, right?
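(A sketch for orientation: the 'default' sudo policy walked through above at 22:55, with users, hosts, and commands all set to ALL and options left blank, amounts to the classic catch-all sudoers rule, though OpenStackManager stores the policy in LDAP rather than in /etc/sudoers. The lines below are illustrative only; the instance FQDN format is an assumption, and only 'sudo puppetd -tv' itself comes from the log.)

    # Roughly what the "default" labsconsole sudo policy amounts to,
    # shown here as a plain sudoers rule purely for orientation:
    #     ALL ALL=(ALL) ALL
    #
    # With the policy in place, a project sysadmin can force a puppet run
    # on an instance, per Help:Instances#Configuring_instance:
    ssh i-0000030e.pmtpa.wmflabs    # FQDN format assumed for illustration
    sudo puppetd -tv                # one-off verbose puppet agent run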
[23:57:01] I think squid isn't, that would be the main thing :( [23:57:16] I'm not deleting anything until I've verified it's working [23:57:21] /etc is in git though... not pushed anywhere as far as I know [23:57:39] it wasn't in the list [23:57:45] maybe the upload is a squid [23:59:00] I haven't seen hyperon in a while....
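(A sketch for orientation: the per-instance move described at 23:51, shut the instance down, copy its disk files with rsync, repoint the nova database, and bring it back up, might look roughly like the following. The compute host names, paths, database host, and table columns are assumptions rather than details from the log; only the overall shut-down / rsync / edit-DB / start sequence comes from the conversation above.)

    #!/bin/bash
    # Hypothetical outline of the migration steps described at 23:51,
    # assumed to run on the old compute host. Names and paths are made up.
    instance=i-000000c1        # example instance ID borrowed from the log
    newhost=virt9              # hypothetical destination compute host

    # 1. stop the instance cleanly
    virsh shutdown "$instance"

    # 2. copy its disk files across (rsync -S keeps sparse images sparse);
    #    the chat prefers copying via the gluster mount rather than the
    #    per-host data dir, so this path is only illustrative
    rsync -aSP /var/lib/nova/instances/"$instance"/ \
          "$newhost":/var/lib/nova/instances/"$instance"/

    # 3. tell nova which compute host now owns the instance
    #    (database host and column names are assumptions)
    mysql -h nova-db nova -e \
        "UPDATE instances SET host='$newhost' WHERE hostname='$instance';"

    # 4. bring it back up; going through the nova API lets nova recreate
    #    the libvirt domain on the new host
    nova reboot "$instance"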