[00:03:30] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:20] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [00:06:19] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Current Users is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Disk Space is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Free ram is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:37] PROBLEM Total Processes is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:42] PROBLEM dpkg-check is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:08:33] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [00:09:27] PROBLEM Current Users is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:10:07] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:13] PROBLEM Current Users is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:13] PROBLEM Current Load is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:11:13] PROBLEM Disk Space is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:12:08] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [00:12:08] RECOVERY Disk Space is now: OK on deployment-jobrunner05 i-0000028c output: DISK OK [00:12:08] RECOVERY Current Users is now: OK on ve-nodejs i-00000245 output: USERS OK - 0 users currently logged in [00:12:08] RECOVERY Disk Space is now: OK on ve-nodejs i-00000245 output: DISK OK [00:12:09] RECOVERY Free ram is now: OK on ve-nodejs i-00000245 output: OK: 77% free memory [00:12:09] RECOVERY Total Processes is now: OK on ve-nodejs i-00000245 output: PROCS OK: 97 processes [00:12:20] RECOVERY dpkg-check is now: OK on ve-nodejs i-00000245 output: All packages OK [00:12:59] PROBLEM Current Load is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:12:59] PROBLEM Disk Space is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:00] PROBLEM Free ram is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:00] PROBLEM Total Processes is now: CRITICAL on deployment-apache30 i-000002d3 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:06] PROBLEM dpkg-check is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:14:09] RECOVERY Current Users is now: OK on wikistats-history-01 i-000002e2 output: USERS OK - 0 users currently logged in [00:14:25] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf output: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:14:39] PROBLEM Free ram is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:17:50] RECOVERY Current Load is now: OK on deployment-apache30 i-000002d3 output: OK - load average: 0.95, 3.14, 2.31 [00:17:50] RECOVERY Disk Space is now: OK on deployment-apache30 i-000002d3 output: DISK OK [00:17:50] RECOVERY Free ram is now: OK on deployment-apache30 i-000002d3 output: OK: 92% free memory [00:17:50] RECOVERY Total Processes is now: OK on deployment-apache30 i-000002d3 output: PROCS OK: 119 processes [00:18:00] RECOVERY dpkg-check is now: OK on zeromq1 i-000002b7 output: All packages OK [00:19:09] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [00:19:29] RECOVERY Free ram is now: OK on zeromq1 i-000002b7 output: OK: 81% free memory [00:19:39] PROBLEM Current Load is now: WARNING on aggregator-test1 i-000002bf output: WARNING - load average: 5.80, 7.01, 5.20 [00:19:50] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 188 processes [00:20:59] RECOVERY Current Users is now: OK on zeromq1 i-000002b7 output: USERS OK - 0 users currently logged in [00:20:59] RECOVERY Disk Space is now: OK on zeromq1 i-000002b7 output: DISK OK [00:20:59] RECOVERY Current Load is now: OK on zeromq1 i-000002b7 output: OK - load average: 4.38, 4.94, 3.52 [00:22:49] PROBLEM Current Load is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:39] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:40] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:18] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf output: Warning: 6% free memory [00:24:39] RECOVERY Current Load is now: OK on aggregator-test1 i-000002bf output: OK - load average: 1.25, 4.47, 4.68 [00:25:30] PROBLEM Puppet freshness is now: CRITICAL on su-fe1 i-000002e5 output: Puppet has not run in last 20 hours [00:25:59] RECOVERY host: p-b is UP address: i-000000ae PING OK - Packet loss = 0%, RTA = 0.77 ms [00:26:29] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [00:27:34] RECOVERY Current Load is now: OK on build-precise1 i-00000273 output: OK - load average: 0.21, 1.81, 1.49 [00:28:29] RECOVERY Current Load is now: OK on pediapress-ocg2 i-00000234 output: OK - load average: 0.13, 1.64, 1.41 [00:28:30] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000234 output: All packages OK [00:48:14] PROBLEM host: labs-build1 is DOWN address: i-0000006b CRITICAL - Host Unreachable (i-0000006b) [00:49:34] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:50:03] RECOVERY host: labs-build1 is UP address: i-0000006b PING OK - Packet loss = 0%, RTA = 0.64 ms [00:54:03] PROBLEM Free ram is now: UNKNOWN on integration-apache1 i-000002eb output: NRPE: Unable to read output [00:56:33] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [01:27:01] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [01:45:00] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 3.11, 3.50, 4.61 [01:53:30] PROBLEM host: dumps-incr is DOWN address: i-000002bb CRITICAL - Host Unreachable (i-000002bb) [01:56:40] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [02:08:00] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 8.09, 7.24, 5.76 [02:08:10] PROBLEM host: dumps-2 is DOWN address: i-000002d8 CRITICAL - Host Unreachable (i-000002d8) [02:08:10] PROBLEM Disk Space is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:10] PROBLEM Current Users is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:11] PROBLEM Free ram is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:11] PROBLEM Total Processes is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:08:17] PROBLEM dpkg-check is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:13:00] RECOVERY Disk Space is now: OK on e3 i-00000291 output: DISK OK [02:13:00] RECOVERY Current Users is now: OK on e3 i-00000291 output: USERS OK - 0 users currently logged in [02:13:00] RECOVERY Free ram is now: OK on e3 i-00000291 output: OK: 89% free memory [02:13:01] RECOVERY Total Processes is now: OK on e3 i-00000291 output: PROCS OK: 107 processes [02:13:07] RECOVERY dpkg-check is now: OK on e3 i-00000291 output: All packages OK [02:13:26] PROBLEM Free ram is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:27] PROBLEM Current Users is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:27] PROBLEM Disk Space is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:27] PROBLEM Total Processes is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:16:32] PROBLEM dpkg-check is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:19:11] RECOVERY HTTP is now: OK on wmde-test i-000002ad output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.342 second response time [02:21:25] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:26] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:25] PROBLEM Current Load is now: CRITICAL on etherpad-lite i-000002de output: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:56] PROBLEM host: dumps-incr is DOWN address: i-000002bb CRITICAL - Host Unreachable (i-000002bb) [02:24:22] PROBLEM Current Load is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:24:22] PROBLEM Current Users is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:24:22] PROBLEM Disk Space is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:25:17] PROBLEM Total Processes is now: CRITICAL on aggregator-test1 i-000002bf output: CHECK_NRPE: Socket timeout after 10 seconds. [02:26:15] RECOVERY Current Users is now: OK on etherpad-lite i-000002de output: USERS OK - 0 users currently logged in [02:26:16] RECOVERY Disk Space is now: OK on etherpad-lite i-000002de output: DISK OK [02:26:16] RECOVERY Total Processes is now: OK on etherpad-lite i-000002de output: PROCS OK: 121 processes [02:26:27] RECOVERY Current Load is now: OK on deployment-jobrunner05 i-0000028c output: OK - load average: 0.99, 3.29, 2.31 [02:26:27] RECOVERY Total Processes is now: OK on deployment-jobrunner05 i-0000028c output: PROCS OK: 110 processes [02:26:32] RECOVERY dpkg-check is now: OK on etherpad-lite i-000002de output: All packages OK [02:26:46] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [02:27:16] PROBLEM HTTP is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [02:28:16] RECOVERY Current Load is now: OK on etherpad-lite i-000002de output: OK - load average: 0.26, 2.83, 2.80 [02:28:16] PROBLEM Free ram is now: UNKNOWN on etherpad-lite i-000002de output: NRPE: Unable to read output [02:29:06] RECOVERY Current Load is now: OK on zeromq1 i-000002b7 output: OK - load average: 4.60, 4.94, 2.78 [02:29:56] PROBLEM Total Processes is now: WARNING on aggregator-test1 i-000002bf output: PROCS WARNING: 182 processes [02:31:54] PROBLEM Current Users is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:31:54] PROBLEM Disk Space is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:45] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:45] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:14] PROBLEM dpkg-check is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:34] PROBLEM Current Users is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:35] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:35] PROBLEM Free ram is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:35] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:06] RECOVERY Disk Space is now: OK on zeromq1 i-000002b7 output: DISK OK [02:34:06] RECOVERY Current Users is now: OK on zeromq1 i-000002b7 output: USERS OK - 0 users currently logged in [02:34:16] PROBLEM dpkg-check is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:34] PROBLEM Free ram is now: UNKNOWN on puppet-abogott i-0000030b output: NRPE: Unable to read output [02:35:47] PROBLEM Current Load is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:35:47] PROBLEM Current Users is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:47] PROBLEM Total Processes is now: CRITICAL on build-precise1 i-00000273 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:38] PROBLEM host: dumps-2 is DOWN address: i-000002d8 CRITICAL - Host Unreachable (i-000002d8) [02:38:38] PROBLEM Total Processes is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Current Load is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Current Users is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Disk Space is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Free ram is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:16] PROBLEM Total Processes is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:21] PROBLEM dpkg-check is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:21] RECOVERY dpkg-check is now: OK on mobile-wlm i-000002bc output: All packages OK [02:39:21] RECOVERY host: dumps-incr is UP address: i-000002bb PING OK - Packet loss = 0%, RTA = 0.53 ms [02:40:44] RECOVERY Current Load is now: OK on build-precise1 i-00000273 output: OK - load average: 1.60, 4.45, 3.20 [02:40:44] RECOVERY Current Users is now: OK on build-precise1 i-00000273 output: USERS OK - 0 users currently logged in [02:40:44] RECOVERY Total Processes is now: OK on build-precise1 i-00000273 output: PROCS OK: 83 processes [02:40:54] RECOVERY Puppet freshness is now: OK on maps-test3 i-0000028f output: puppet ran at Fri Jul 6 02:40:43 UTC 2012 [02:41:52] 07/06/2012 - 02:41:52 - User laner may have been modified in LDAP or locally, updating key in project(s): deployment-prep [02:41:55] RECOVERY Current Users is now: OK on mobile-wlm i-000002bc output: USERS OK - 0 users currently logged in [02:41:55] RECOVERY Disk Space is now: OK on mobile-wlm i-000002bc output: DISK OK [02:43:03] RECOVERY Total Processes is now: OK on mobile-wlm i-000002bc output: PROCS OK: 107 processes [02:43:08] RECOVERY Current Load is now: OK on pediapress-ocg2 i-00000234 output: OK - load average: 0.24, 2.96, 2.71 [02:43:08] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000234 output: All packages OK [02:43:15] RECOVERY dpkg-check is now: OK on zeromq1 i-000002b7 output: All packages OK [02:43:32] RECOVERY Current Users is now: OK on pediapress-ocg2 i-00000234 output: USERS OK - 0 users currently logged in [02:43:32] RECOVERY Disk Space is now: OK on pediapress-ocg2 i-00000234 output: DISK OK [02:43:32] RECOVERY Free ram is now: OK on pediapress-ocg2 i-00000234 output: OK: 86% free memory [02:43:32] RECOVERY Total Processes is now: OK on pediapress-ocg2 i-00000234 output: PROCS OK: 86 processes [02:44:02] RECOVERY Current Load is now: OK on incubator-bot0 i-00000296 output: OK - load average: 2.74, 3.87, 3.16 [02:44:02] RECOVERY Current Users is now: OK on incubator-bot0 i-00000296 output: USERS OK - 0 users currently logged in [02:44:02] RECOVERY Disk Space is now: OK on incubator-bot0 i-00000296 output: DISK OK [02:44:02] RECOVERY Free ram is now: OK on incubator-bot0 i-00000296 output: OK: 85% free memory [02:44:02] RECOVERY Total 
Processes is now: OK on incubator-bot0 i-00000296 output: PROCS OK: 86 processes [02:44:07] RECOVERY dpkg-check is now: OK on incubator-bot0 i-00000296 output: All packages OK [02:49:22] RECOVERY host: dumps-2 is UP address: i-000002d8 PING OK - Packet loss = 0%, RTA = 1.73 ms [02:56:02] PROBLEM Current Load is now: WARNING on dumps-incr i-000002bb output: WARNING - load average: 9.16, 8.74, 5.95 [02:56:52] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [03:04:42] RECOVERY Puppet freshness is now: OK on labs-nfs1 i-0000005d output: puppet ran at Fri Jul 6 03:04:31 UTC 2012 [03:08:02] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 3.68, 4.08, 4.86 [03:08:42] PROBLEM Total Processes is now: WARNING on dumps-incr i-000002bb output: PROCS WARNING: 158 processes [03:17:42] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [03:17:42] PROBLEM host: test3 is DOWN address: i-00000093 CRITICAL - Host Unreachable (i-00000093) [03:27:02] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [03:36:22] PROBLEM host: ganglia-collector is DOWN address: i-000000b7 CRITICAL - Host Unreachable (i-000000b7) [03:37:02] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [03:41:10] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 13% free memory [03:41:10] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:10] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:10] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:56] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:36] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:44:09] PROBLEM Total Processes is now: CRITICAL on dumps-incr i-000002bb output: PROCS CRITICAL: 207 processes [03:44:23] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 6.97, 5.86, 3.31 [03:45:58] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 9% free memory [03:45:58] RECOVERY Current Load is now: OK on incubator-bot1 i-00000251 output: OK - load average: 1.18, 2.29, 1.63 [03:45:58] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in [03:45:58] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK [03:46:27] RECOVERY host: test3 is UP address: i-00000093 PING OK - Packet loss = 0%, RTA = 0.52 ms [03:48:27] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [03:49:27] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [03:54:18] PROBLEM Current Load is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:54:18] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.75, 1.86, 2.52 [03:55:17] PROBLEM dpkg-check is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:55:37] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory [03:57:17] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [03:57:47] PROBLEM SSH is now: CRITICAL on test3 i-00000093 output: Server answer: [03:58:37] PROBLEM Current Users is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:58:37] PROBLEM Total Processes is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:58:42] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: NRPE: Unable to read output [03:59:37] PROBLEM Disk Space is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Error - Could not complete SSL handshake. [04:00:37] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:02:47] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 13% free memory [04:03:37] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 17% free memory [04:04:37] PROBLEM host: test3 is DOWN address: i-00000093 CRITICAL - Host Unreachable (i-00000093) [04:06:27] PROBLEM host: ganglia-collector is DOWN address: i-000000b7 CRITICAL - Host Unreachable (i-000000b7) [04:19:48] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [04:21:42] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [04:22:39] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [04:23:14] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:24:01] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:01] PROBLEM Total Processes is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. 
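Most of the alerts above are "CHECK_NRPE: Socket timeout after 10 seconds", which only means the Nagios server gave up waiting for the agent. A quick way to tell a dead agent from a host that is merely crawling under iowait is to run check_nrpe by hand from the Nagios server. This is a generic sketch, assuming the standard Debian/Ubuntu plugin path; the host name and the "check_ram" command name are examples only, not the actual command names configured on Labs.

    # Bare connectivity test: a healthy agent answers with its NRPE version banner.
    /usr/lib/nagios/plugins/check_nrpe -H ve-nodejs

    # Re-run a specific check with a longer timeout (-t) to see whether the agent
    # is slow rather than dead; "check_ram" is an assumed command name.
    /usr/lib/nagios/plugins/check_nrpe -H ve-nodejs -c check_ram -t 30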
[04:25:34] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory [04:27:22] RECOVERY host: wep is UP address: i-000000c2 PING OK - Packet loss = 0%, RTA = 3.53 ms [04:27:31] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [04:27:53] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [04:28:15] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.69, 7.21, 4.10 [04:28:25] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 16% free memory [04:28:25] PROBLEM Total Processes is now: WARNING on ganglia-test2 i-00000250 output: PROCS WARNING: 184 processes [04:29:15] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:32:15] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [04:32:45] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [04:34:45] PROBLEM host: test3 is DOWN address: i-00000093 CRITICAL - Host Unreachable (i-00000093) [04:36:35] PROBLEM host: ganglia-collector is DOWN address: i-000000b7 CRITICAL - Host Unreachable (i-000000b7) [04:37:05] RECOVERY host: mobile-feeds is UP address: i-000000c1 PING OK - Packet loss = 0%, RTA = 0.67 ms [04:50:45] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [04:54:35] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [04:55:45] PROBLEM host: nova-daas-1 is DOWN address: i-000000e7 CRITICAL - Host Unreachable (i-000000e7) [04:56:25] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.64, 7.01, 5.47 [04:57:55] PROBLEM host: nova-production1 is DOWN address: i-0000007b CRITICAL - Host Unreachable (i-0000007b) [04:59:45] RECOVERY host: analytics is UP address: i-000000e2 PING OK - Packet loss = 0%, RTA = 0.54 ms [04:59:55] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Socket timeout after 10 seconds. [11:55:37] Rage.... [11:57:59] * Damianz eats methecooldude [11:58:22] Damianz: What broke the MySQL connection... did the server kick ClueBot out [11:58:33] Exactally what the erorr said [11:58:41] The bot needs patching to bail out if it can't get an id [11:58:54] I though I'd fixed it but clearly not, don't have time until the weekend to work on it [11:59:05] I was replying but your last edit conflicted [11:59:13] Oh, whoops :P [12:00:31] also it shouldn't currently have an issue so the notice can die [12:01:09] Damianz: On a random note, couldn't the report and review interfaces become one thing, save the bandwidth issues for a start [12:01:18] I'd love that [12:01:26] I'm putting off working on it until oauth is enabled [12:01:34] Ah, makes sense [12:01:37] So people can register with their wp details and we can track it [12:01:44] Ideally it will be 1 interface with 1 database and a nice api [12:01:45] * methecooldude slaps Bastion... let me in [12:02:03] Apache seems down atm though hmm [12:02:06] Or really really slow [12:02:23] Yea, it's REALLY slow [12:02:31] methecooldude: Also I thought the issue was the api looping forever so I disabled it... 
but it's still going over bw
[12:02:37] Which is weird because it uses iframes
[12:03:08] Wow
[12:03:11] Cluster load is high
[12:03:24] Nope, MySQL is still kicking out
[12:03:28] Looks like it's iowait again
[12:03:50] methecooldude: Last revert has an id
[12:04:04] Urm, just the interface then
[12:04:10] yeah
[12:04:13] http://ganglia.wmflabs.org/latest/?m=load_one&r=hour&s=by%20name&hc=4&mc=2
[12:04:16] See cluster load
[12:04:24] :( @ wait....
[12:04:37] Hopefully the new nodes are ready soon and we can go to non-redundant storage
[12:04:51] Damianz: For Bots or the whole grid?
[12:04:58] Everything
[12:05:10] As far as I know vms are going to local storage with project storage being redundant
[12:05:21] Short term solution until we find a long term working cluster
[12:05:23] Oh, ok
[12:05:38] As gluster broke horribly?
[12:06:02] Keep hitting bugs and it's slow, lags out on io rather a lot
[12:18:36] Damianz:
[12:18:36] bots-apache1
[12:18:36]
[12:18:36] Current Load
[12:18:37] WARNING 2012-07-06 12:18:04 0d 0h 13m 33s 4/4 WARNING - load average: 5.51, 6.80, 6.80
[12:18:42] Ouch!
[12:18:58] not really
[12:19:11] Look at the graphs for everything on the grid
[12:19:12] Although bot-cb has higher :P
[12:19:28] I'm looking on Nagios
[12:19:41] Look at http://ganglia.wmflabs.org/latest/?m=load_one&r=hour&s=by%20name&hc=4&mc=2
[12:21:00] Yea, I saw that earlier
[14:33:18] I created a new instance, waited for puppet to finish and then added puppet class role::mediawiki-install::labs
[14:33:23] I manually invoked puppet with sudo puppetd -tv
[14:33:33] It timed out cloning mediawiki: Git::Clone[mediawiki]/Exec[git_clone_mediawiki]/returns: change from notrun to 0 failed: Command exceeded timeout at /etc/puppet/manifests/generic-definitions.pp:750
[14:33:42] Repeated runs of puppet failed because mediawiki was never cloned: File[/srv/mediawiki/orig]/ensure: change from absent to directory failed: Cannot create /srv/mediawiki/orig; parent directory /srv/mediawiki does not exist
[14:33:58] Any ideas how I can proceed? Can I extend the timeout or configure this manually? Any help would be appreciated.
[17:54:20] preilly or someone: the vumi machine seems to be unhappy at the moment.
[17:54:42] jerith: you can force restart it in the labs console
[17:55:59] Hrm. It seems responsive enough now that I'm logged in.
[17:56:40] jerith: hmm indeed
[17:56:52] But top doesn't want to do anything.
[17:57:34] The cluster seems to have been under high load most of the day from iowait, everything's a little sluggish still.. hoping Ryan will appear at some point and boot it or the new hardware is done.
[17:58:30] wow. wtf
[17:58:33] the load is crazy
[17:59:49] I've been migrating vms
[18:00:02] I'd expect the load to go *down* though
[18:00:22] ah
[18:00:39] in fact: http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[18:00:46] It's been like it most of the day :( Uber issues with apache -> mysql and generally accessing machines, though my scripts seem to be running fine
[18:01:07] well
[18:01:12] I'll migrate some more
[18:01:18] which ones need to be migrated the most?
[18:01:23] give me recommendations ;)
[18:05:24] It seems to be taking forever to start something for the first time, and then being fine with it after that.
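On the Git::Clone timeout question at 14:33: Puppet's Exec resource takes a timeout parameter (300 seconds by default, 0 for no limit), so the clean fix is to raise it on the git_clone exec in generic-definitions.pp if the manifest exposes it. As a manual stopgap, something like the sketch below pre-seeds the clone so later agent runs can converge; the repository URL and the clone target are assumptions, not values taken from the labs manifests.

    # Pre-clone mediawiki by hand so the puppet Exec has nothing left to time out on.
    # <clone-target> stands for whatever directory Git::Clone[mediawiki] uses in
    # generic-definitions.pp; the gerrit URL is an assumption.
    sudo mkdir -p /srv/mediawiki
    sudo git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git <clone-target>
    # Then re-run the agent as before:
    sudo puppetd -tv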
[18:05:36] Personally I'd say the bots-sql* servers (keep randomly disappearing and dropping connections) or deployment-prep stuff (running like a dog) but my views are relative to my interests
[18:06:35] Also mysql needs to DIAF for 'is blocked because of many connection errors' ... yeah it had a load time out and blocking it doesn't fix the issue :(
[18:07:05] Ooh server is more responsive than earlier, slightly
[18:07:28] jerith: start something?
[18:07:34] have people been creating a shit-ton of vms? :)
[18:07:46] hm. nope
[18:07:54] anyway. I'll move bots and deployment-prep
[18:07:59] there's gonna be downtime
[18:08:14] this is cold migration, since we're moving away from gluster
[18:08:17] to the new hardware
[18:08:29] Shame really, gluster is awesome in theory
[18:08:40] yeah
[18:08:42] in theory
[18:09:22] supervisorctl worked, but then I needed to sudo it.
[18:09:35] * Ryan_Lane nods
[18:09:36] Then sudo was slow.
[18:09:43] I wonder if LDAP is overloaded
[18:09:58] Earlier it got to the point that it wouldn't read my key and bounced me out as unauthorized :(
[18:10:00] nope
[18:10:03] puppet is
[18:10:15] ldap isn't very loaded at all
[18:11:06] I'm probably going to take down deployment-prep for a while
[18:11:08] (sorry)
[18:11:17] I should write an email to the list
[18:12:41] What's the betting it doesn't start back up again? :D
[18:12:41] I've already migrated 20
[18:12:41] Though hopefully gluster won't just eat mysql's data files again
[18:12:41] * Damianz thinks petan moved them to local storage anyway
[18:12:41] he did
[18:14:37] Hmm
[18:14:57] 17:37, 1 July 2012 Hashar (Talk | contribs) deleted page Nova Resource:I-000002b5 < Still shows in instance list, is that related to your queue issues the other day or just random issue?
[18:15:24] from the queue issues
[18:15:29] should be fine if he re-deletes it now
[18:15:40] will do
[18:17:55] You said instances twice :)
[18:18:07] Ryan_Lane: I-000002b5 still has the same issue: Successfully deleted instance, but failed to remove deployment-deb DNS entry.
[18:18:28] failed from dns is normal
[18:18:35] it deleted it from dns the first time
[18:18:37] got deleted anyway ( The requested host does not exist. )
[18:20:09] now it deleted it from nova
[18:20:09] * hashar starts processing his 140 Gerrit notifications
[18:20:18] I might order a take away, it's been a long week, I'm lazy and it's a friday... seems like a good enough reason
[18:21:02] Also heh if you think the kvm migration is slow you should try restoring a Virtuozzo box from incremental backups (gzip'd tars)... took about 40 hours for ~8 containers
[18:21:22] ouch
[18:58:35] well, load is back down again after doing more migrations
[19:00:42] :)
[19:02:45] Does anyone know of an awesome dpkg guide? Like how to package stuff that isn't 'run compile, make, make install' which the init command pretty much does for you? Really should package up some of my scripts and puppetize them.
[19:04:43] the debian guide is pretty much it
[19:05:53] I'll take a read over it, used to packaging rpms and the redhat docs on it were lacking last time I read them lol. Ubuntu/Debian does tend to be a little better with docs tbf
[19:12:06] all packaging docs suck
[19:39:30] so labs is dead again ? : /
[19:39:36] I got I/O errors
[19:41:18] dumps-incr has like 1500 processes !! http://ganglia.wmflabs.org/latest/graph.php?r=4hr&z=xlarge&c=dumps&h=dumps-incr&v=1539&m=proc_total&jr=&js=&vl=+&ti=Total+Processes
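On the dpkg question at 19:02: for script-only projects with no build step, debhelper plus a one-line debian/install file is usually enough. A minimal sketch, assuming the dh-make and debhelper packages are installed; "mytool" and the file names are placeholders, not anything from this channel.

    mkdir mytool-1.0 && cd mytool-1.0
    cp ~/scripts/mytool.sh .           # the script(s) to ship
    dh_make --native --single          # generate the debian/ skeleton (name/version from the dir name)
    echo "mytool.sh usr/bin" > debian/install   # map each file to its install path
    dpkg-buildpackage -us -uc          # builds an unsigned .deb one level up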
[19:41:31] !log dumps dump-incr has skyrocketed to 1500 processes
[19:41:32] Logged the message, Master
[19:42:17] I know ryan is moving some vms, if they are being moved they should be offline though.... rather high load/io generally though
[19:42:37] dumps-incr?
[19:42:51] is that one of hydriz's systems?
[19:44:07] I think yes
[19:44:40] seems based on https://labsconsole.wikimedia.org/wiki/Nova_Resource:Dumps/SAL
[20:35:18] petan: seems that nagios instance is totally screwed
[20:35:45] eh
[20:35:47] really?
[20:35:50] yes
[20:35:56] was it recently upgraded?
[20:35:57] how
[20:36:00] no
[20:36:03] ok
[20:36:06] I didn't touch it for months
[20:36:11] does it work?
[20:36:14] maybe the block migration screwed it up
[20:36:43] the binaries have ELF errors, some services don't work
[20:36:51] it won't start its network
[20:36:53] ooh
[20:36:55] that sucks
[20:36:58] yes
[20:37:04] don't delete it before I recover all scripts
[20:37:11] I really hope the block migrations didn't screw anything else up
[20:37:12] that's fine
[20:37:14] I can mount the disks
[20:37:16] I have backups but I don't want to find them
[20:37:18] what directories would you like?
[20:37:36] I will try to ssh there, /var/nparser is most important
[20:38:01] but it contains some configs in /etc/nagios3
[20:38:06] I would like to recover that as well
[20:39:02] Ryan_Lane: did you run fsck?
[20:39:15] yes
[20:39:20] it had errors
[20:39:24] petrb@bastion1:~$ ssh nagios
[20:39:24] ssh: connect to host nagios port 22: No route to host
[20:39:26] but that didn't seem to solve it
[20:39:28] yeah
[20:39:30] it's down
[20:39:34] I'll need to recover the files for you
[20:39:41] as mentioned it won't start networking
[20:39:47] ok
[20:39:51] my home will stay I guess
[20:39:57] yep
[20:40:01] ok
[20:40:14] seems some other instances are down too
[20:40:18] * Ryan_Lane grumbles
[20:40:18] but keep the disks as backup if I find out there is more to recover
[20:40:44] it sucks to check what is down when nagios is down :)
[20:41:02] maybe we should recover it first
[20:41:08] * Ryan_Lane nods
[20:41:13] well, create a new one ;)
[20:41:18] ok
[20:41:40] but we will lose the nlogin (http admin for users of labs)
[20:41:50] so that we won't be able to control nagios much
[20:42:33] Ryan_Lane did you make any backups before you did that patch?
[20:42:38] patch?
[20:42:39] I guess no
[20:42:40] what patch?
[20:42:49] we don't have disk space for backups
[20:42:55] whatever you did before it happened
[20:43:01] I'm doing migrations
[20:43:04] to the new hardware
[20:43:07] aah
[20:43:14] I thought it's new gluster :)
[20:43:17] seems kvm's block migration support is pretty fucked up
[20:43:56] well, this isn't good
[20:44:03] it seems almost every migrated instance is fucked up
[20:46:01] how fucked? the storage? or it's just down
[20:46:29] well, some virtual machines won't boot
[20:46:41] hm, that could be a problem somewhere else
[20:46:53] I doubt it
[20:47:46] we can try to recover broken storage somehow, nagios isn't so important, but some other instances might be
[20:48:03] I can always mount the storage
[20:48:10] all these are ext4?
[20:48:10] that's fine
[20:48:23] did you fsck?
[20:48:34] ext3 and yes, as I mentioned before ;)
[20:48:52] aha, what was the result of that
[20:49:02] <^demon> Ryan_Lane: If you need an instance to mess around with, feel free to use gerrit.
It's fucked up right now anyway config-wise, so feel free to break it horribly. [20:49:22] well, any instance that migrated poorly will likely need to be rebuilt [20:49:28] that's a lot of them, so far [20:49:35] some of them are running fine [20:49:45] ^demon actually I have some "fuck-me" instances as well - dozens of [20:49:49] <^demon> I was able to ssh into gerrit. [20:50:42] !ping [20:50:42] pong [20:50:54] bots are alive [20:51:08] or at least they still seem to work [20:51:16] I've been migrating deployment-prep [20:51:31] aha, I hope you didn't migrate -sql [20:51:39] or at least -backup [20:51:42] :P [20:51:47] I did migrate backup [20:51:48] it seems [20:51:54] ok and sql [20:52:01] no clue [20:52:04] that is most data critical instance atm [20:52:20] everything else is in puppet except stuff hashar worked on [20:53:49] that is mostly in puppet though :-] [20:53:59] new nagios will be here in few min [20:55:02] the Apaches and Squid conf are only on instance [20:55:07] i need to puppetize them [20:55:07] hm [20:55:24] seems that this instance wants to come up, but can't because of the stupid nfs mount [20:55:41] yup need to migrate to /data/project [20:55:53] would have to move all the upload data first though [20:56:04] and sync with op to apply the puppet change I have not written yet [20:56:44] this really sucks [20:59:50] we never were stable so... :) [20:59:54] people could expect [21:00:00] something like this [21:00:05] strange [21:00:09] the virts on virt8 seem to be fine [21:00:24] the ones on virt6 not so much [21:00:56] I take that back, just found a corrupted one on virt8 too [21:01:13] I wonder if I can stop the instances before I move them [21:01:58] shutdown -h now [21:02:02] :P [21:02:16] I'm not sure the block-migration will happen if the instance isn't running [21:02:52] oh yeah [21:03:10] I think even the ones that came back up are fucked in some way [21:03:10] meh [21:08:29] can you open http://nagios.wmflabs.org/nagios3/ [21:09:57] yep [21:10:05] so I'm going to stop migrations now [21:10:10] this fucking sucks [21:10:36] I guess I'll do the migrations via rsync and modify the database directly [21:20:50] 34 instances are probably going to have to be rebuilt [21:21:08] petan: and as much as you're going to hate me for this, most of them are in deployment-prep and bots [21:21:49] I was trying to be nice and prioritize instances that needed performance :( [21:21:55] * Ryan_Lane sighs [21:22:16] kvm migrate removes the old storage after I take it? [21:23:18] unfortunately it does [21:23:25] sadtimes [21:23:25] openstack nova does, anyway [21:25:21] Hmm someone is playing death metal outside [21:27:47] annoying way seems it'll work without issues [21:28:52] Ryan_Lane I won't make nagios work unless you give me that /etc/nagios3 [21:29:03] there's too many mods I made [21:29:09] yeah, that's fine [21:29:12] I'll pull any file you need [21:29:20] Total Warnings: 0 [21:29:21] Total Errors: 1281 [21:29:22] ***> One or more problems was encountered while running the pre-flight check... [21:29:24] heh [21:29:45] is this new instance in the nagios project? [21:29:50] yes [21:29:53] ok [21:29:57] nagios-main [21:30:13] I think I will make more instance in nagios, like nagios-bot, nagios-apache etc later [21:30:15] * Ryan_Lane nods [21:30:25] because old nagios had troubles with that [21:30:32] load was always 5+ [21:31:03] which files out of etc do you want? 
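For the file recovery being discussed here (/var/nparser and /etc/nagios3, requested at 20:37 and 20:38), one way to do it from the compute host is to attach the dead instance's disk image read-only and copy the paths out. A rough sketch, assuming an image in the usual nova layout under /var/lib/nova/instances/; the exact image path and partition number are guesses, not something stated in this log.

    sudo modprobe nbd max_part=16
    sudo qemu-nbd --connect=/dev/nbd0 /var/lib/nova/instances/instance-00000xxx/disk
    sudo mkdir -p /mnt/rescue /srv/recovered
    sudo mount -o ro /dev/nbd0p1 /mnt/rescue         # first partition, read-only
    sudo rsync -a /mnt/rescue/etc/nagios3 /mnt/rescue/var/nparser /srv/recovered/
    sudo umount /mnt/rescue && sudo qemu-nbd --disconnect /dev/nbd0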
[21:31:13] nagios3 [21:31:16] ok [21:31:16] whole fir [21:31:20] dir [21:31:42] also /var/nparser [21:31:53] I have that but not sure if it's fresh [21:33:16] it's in /root [21:33:18] on the instance [21:33:37] aha, this new client I made doesn't change topic well [21:33:38] :D [21:37:26] Ryan_Lane /etc/nagios3 [21:37:32] you did without 3 [21:37:38] that's nrpe [21:38:06] doh [21:38:08] ok. sec [21:41:36] petan2: ok [21:41:40] it's there no [21:41:43] *now [21:45:35] nagios is back [21:46:14] Ok Warning Unknown Critical Pending [21:46:15] 1 0 0 0 1485 [21:46:19] :D [21:46:25] that's gonna take a while [21:49:13] do the other instances even trust this one? [21:49:19] or do we need to fix it in puppet? [21:49:29] it's by IP, right? [21:49:42] the nrpe config, that is [21:49:49] yes [21:49:51] true [21:49:56] new ip is 120 [21:54:33] I will spam now [21:54:49] petan2: ok, I fixed the nrpe address [21:55:01] Ryan_Lane: could the 'pdbhandler' project get allocated a public IP? i'd like to be able to demo a new extension i've developed (summarized here: http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html). the only instance in the project has ID i-0000030a and name pdbhandler-dev [21:55:50] are you on labs-l mailing list? [21:56:02] yes [21:56:07] I have a little bit of bad news [21:56:12] did you read the migration email? [21:56:17] your instance is likely corrupted [21:56:26] yes. i imagined that might affect this, but wasn't sure [21:56:30] oh. [21:56:48] second one in the list [21:56:53] so, your instance is up [21:57:18] you may be one of the lucky non-corrupted ones [21:57:37] how would i know if it's not corrupt? [21:57:54] the instance doesn't have much on it at the moment. i created it yesterday. [21:57:59] ah ok [21:58:03] I'd rebuild it then [21:58:11] it's better to know for sure it's not corrupted [21:58:13] On the bright side bots-sql2 seems ok :D [21:58:37] test [21:58:38] * Damianz really should get around to taking backups more often of mysql data [21:58:43] labs-morebots: Test failed [21:58:57] Ryan_Lane: ok, and should i re-request to have an IP allocated once i build a new instance? 
[21:59:08] I can likely give you one now [21:59:24] i'd appreciate that a ton [21:59:27] project storage can generally be used for backups [21:59:31] it's gluster [21:59:49] but, as long as you aren't doing direct io (like mysql does) it should work without bugs [22:00:14] PROBLEM Current Load is now: UNKNOWN on aggregator-test1 i-000002bf output: (No output returned from plugin) [22:00:14] PROBLEM Current Users is now: UNKNOWN on aggregator1 i-0000010c output: (No output returned from plugin) [22:00:14] PROBLEM Disk Space is now: UNKNOWN on aggregator2 i-000002c0 output: (No output returned from plugin) [22:00:14] PROBLEM Free ram is now: UNKNOWN on analytics i-000000e2 output: (No output returned from plugin) [22:00:14] PROBLEM Total Processes is now: UNKNOWN on bastion-restricted1 i-0000019b output: (No output returned from plugin) [22:00:19] PROBLEM dpkg-check is now: UNKNOWN on bastion1 i-000000ba output: (No output returned from plugin) [22:00:19] PROBLEM dpkg-check is now: UNKNOWN on blamemaps-s1 i-000002c3 output: (No output returned from plugin) [22:00:19] PROBLEM Current Load is now: UNKNOWN on bots-1 i-000000a9 output: (No output returned from plugin) [22:00:19] PROBLEM Current Users is now: UNKNOWN on bots-2 i-0000009c output: (No output returned from plugin) [22:00:19] PROBLEM Disk Space is now: UNKNOWN on bots-3 i-000000e5 output: (No output returned from plugin) [22:01:33] Emw: ok I upped your quota for floating ips [22:01:40] Damianz can you take a care of nagios bot [22:01:43] Emw: you should be able to allocate one through "Manage addresses" [22:01:54] Ryan_Lane: also, for what it's worth, i noticed an odd thing where i had deleted an instance, then created another instance with the same name (but obviously different ID), and noticed a temporary file in my home directory from the deleted instance in the new instance [22:02:03] Ryan_Lane: thank you [22:02:23] Emw: all instances in a project have shared home directories [22:02:31] Damianz I will be afk a bit, if there was problem just quiet it [22:02:33] ah, ok [22:02:33] additionally, there's large shared storage at /data/project [22:02:36] because it's gonna spam a bit [22:02:45] good to know about /data/project [22:03:00] it doesn't appear till you try to access it [22:03:04] petan2: Sure. 
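Tying together the two points just above, Damianz's note at 21:58 about taking MySQL backups more often and Ryan's point at 21:59 that gluster-backed project storage is fine for anything not doing direct I/O: a logical dump written to /data/project avoids the direct I/O problem because it is a plain sequential file. A minimal sketch; the credentials file and target directory are placeholders.

    # Dump everything to project storage; mysqldump output is a sequential write,
    # so it sidesteps the direct-io trouble mysqld itself hits on gluster.
    sudo mkdir -p /data/project/backups
    mysqldump --defaults-extra-file=/root/.my.cnf --single-transaction --all-databases \
      | gzip > /data/project/backups/mysql-$(date +%F).sql.gz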
[22:03:10] but it'll be there when you try to write to it, or read from it [22:03:13] (it auto-mounts) [22:03:32] there's also public datasets at /public/datasets [22:03:46] PROBLEM Free ram is now: UNKNOWN on bots-4 i-000000e8 output: (No output returned from plugin) [22:03:46] PROBLEM Total Processes is now: UNKNOWN on bots-dev i-00000190 output: (No output returned from plugin) [22:03:50] Ryan_Lane: Good point, forgot about that [22:03:51] PROBLEM dpkg-check is now: UNKNOWN on bots-labs i-0000015e output: (No output returned from plugin) [22:03:52] PROBLEM Current Load is now: UNKNOWN on bots-sql1 i-000000b5 output: (No output returned from plugin) [22:03:52] PROBLEM Current Users is now: UNKNOWN on bots-sql2 i-000000af output: (No output returned from plugin) [22:03:52] PROBLEM Disk Space is now: UNKNOWN on bots-sql3 i-000000b4 output: (No output returned from plugin) [22:03:52] PROBLEM Free ram is now: UNKNOWN on build-precise1 i-00000273 output: (No output returned from plugin) [22:03:52] PROBLEM Total Processes is now: UNKNOWN on building i-0000014d output: (No output returned from plugin) [22:03:57] PROBLEM dpkg-check is now: UNKNOWN on catsort-pub i-000001cc output: (No output returned from plugin) [22:03:57] PROBLEM Current Load is now: UNKNOWN on configtest-main i-000002dd output: (No output returned from plugin) [22:03:57] PROBLEM Current Load is now: UNKNOWN on demo-deployment1 i-00000276 output: (No output returned from plugin) [22:03:57] PROBLEM Current Load is now: UNKNOWN on demo-mysql1 i-00000256 output: (No output returned from plugin) [22:03:57] PROBLEM Current Users is now: UNKNOWN on demo-web1 i-00000255 output: (No output returned from plugin) [22:03:57] PROBLEM Current Users is now: UNKNOWN on demo-web2 i-00000285 output: (No output returned from plugin) [22:03:58] PROBLEM Disk Space is now: UNKNOWN on deployment-apache30 i-000002d3 output: (No output returned from plugin) [22:03:59] PROBLEM Disk Space is now: UNKNOWN on deployment-apache31 i-000002d4 output: (No output returned from plugin) [22:03:59] PROBLEM Free ram is now: UNKNOWN on deployment-bastion i-000002bd output: (No output returned from plugin) [22:03:59] PROBLEM Total Processes is now: UNKNOWN on deployment-cache-upload i-00000263 output: (No output returned from plugin) [22:04:02] PROBLEM dpkg-check is now: UNKNOWN on deployment-dbdump i-000000d2 output: (No output returned from plugin) [22:04:02] PROBLEM Current Load is now: UNKNOWN on deployment-imagescaler01 i-0000025a output: (No output returned from plugin) [22:04:02] PROBLEM Current Users is now: UNKNOWN on deployment-jobrunner05 i-0000028c output: (No output returned from plugin) [22:04:02] PROBLEM Disk Space is now: UNKNOWN on deployment-mc i-0000021b output: (No output returned from plugin) [22:04:11] Ryan_Lane: this is slightly off topic but relevant to public data sets: is the page-view data from http://stats.grok.se on wmflabs? 
[22:04:17] PROBLEM dpkg-check is now: UNKNOWN on en-wiki-db-lucid i-0000023b output: (No output returned from plugin) [22:04:19] hm [22:04:22] PROBLEM Current Load is now: UNKNOWN on exim-test i-00000265 output: (No output returned from plugin) [22:04:22] PROBLEM Current Users is now: UNKNOWN on feeds i-000000fa output: (No output returned from plugin) [22:04:22] PROBLEM Disk Space is now: UNKNOWN on firstinstance i-0000013e output: (No output returned from plugin) [22:04:22] PROBLEM Free ram is now: UNKNOWN on fundraising-civicrm i-00000169 output: (No output returned from plugin) [22:04:39] if it isn't in /public/datasets, we should look at somehow getting it added there [22:05:08] before starting on this new media handling project i began a project using java, hadoop and amazon ec2 to aggregate the hourly data that stats.grok.se uses and getting it into daily data [22:05:13] I think it's just dumps right now [22:05:20] Btw is nagios on the same ip or is nrpe just allowing the whole range? [22:05:27] Damianz: new ip [22:05:30] I added it to puppet [22:05:35] Oh cool [22:05:44] PROBLEM Current Users is now: UNKNOWN on grail i-000002c6 output: (No output returned from plugin) [22:06:05] PROBLEM Total Processes is now: UNKNOWN on opengrok-web i-000001e1 output: (No output returned from plugin) [22:06:10] PROBLEM dpkg-check is now: UNKNOWN on orgcharts-dev i-0000018f output: (No output returned from plugin) [22:06:10] PROBLEM Current Users is now: UNKNOWN on outreacheval i-0000012e output: (No output returned from plugin) [22:06:10] PROBLEM Disk Space is now: UNKNOWN on p-b i-000000ae output: (No output returned from plugin) [22:06:11] PROBLEM Free ram is now: UNKNOWN on pageviews i-000000b2 output: (No output returned from plugin) [22:06:11] PROBLEM Total Processes is now: UNKNOWN on pdbhandler-dev i-0000030a output: (No output returned from plugin) [22:06:15] PROBLEM dpkg-check is now: UNKNOWN on pediapress-ocg1 i-00000233 output: (No output returned from plugin) [22:06:16] PROBLEM Current Load is now: UNKNOWN on pediapress-packager i-000001e4 output: (No output returned from plugin) [22:06:16] PROBLEM Current Users is now: UNKNOWN on precise-test i-00000231 output: (No output returned from plugin) [22:06:16] PROBLEM Disk Space is now: UNKNOWN on psm-precise i-000002f2 output: (No output returned from plugin) [22:06:16] PROBLEM Disk Space is now: UNKNOWN on publicdata-administration i-0000019e output: (No output returned from plugin) [22:06:16] PROBLEM Free ram is now: UNKNOWN on puppet-abogott i-0000030b output: (No output returned from plugin) [22:06:30] !nagios [22:06:31] http://208.80.153.210/nagios3 http://nagios.wmflabs.org/nagios3 [22:07:15] should an instance that i plan to map a public IP to be in availability zone 'nova' or 'pmtpa'? 
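On the NRPE question at 22:05 (and the earlier "it's by IP" exchange around 21:49): each monitored instance lists the Nagios server in the allowed_hosts directive of its nrpe.cfg, which is why the new server's address had to be pushed out, and on Labs that change went through puppet per the 22:05:30 remark. For a single host the change looks roughly like this; <NEW_NAGIOS_IP> is a placeholder, since the log only says the new address ends in 120.

    # Point the NRPE agent at the new Nagios server and restart it.
    sudo sed -i 's/^allowed_hosts=.*/allowed_hosts=<NEW_NAGIOS_IP>/' /etc/nagios/nrpe.cfg
    sudo service nagios-nrpe-server restart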
[22:07:37] PROBLEM Current Users is now: UNKNOWN on aggregator-test1 i-000002bf output: (No output returned from plugin) [22:07:37] PROBLEM Disk Space is now: UNKNOWN on aggregator1 i-0000010c output: (No output returned from plugin) [22:07:37] PROBLEM Free ram is now: UNKNOWN on aggregator2 i-000002c0 output: (No output returned from plugin) [22:07:39] well, destroying 1/6 of the instances sure seems to have brought the load down [22:07:48] PROBLEM Total Processes is now: UNKNOWN on asher1 i-0000003a output: (No output returned from plugin) [22:07:49] :D [22:07:53] PROBLEM dpkg-check is now: UNKNOWN on bastion-restricted1 i-0000019b output: (No output returned from plugin) [22:07:53] PROBLEM Current Load is now: UNKNOWN on blamemaps-s1 i-000002c3 output: (No output returned from plugin) [22:07:53] PROBLEM Current Load is now: UNKNOWN on bob i-0000012d output: (No output returned from plugin) [22:07:53] PROBLEM Current Users is now: UNKNOWN on bots-1 i-000000a9 output: (No output returned from plugin) [22:07:53] PROBLEM Disk Space is now: UNKNOWN on bots-2 i-0000009c output: (No output returned from plugin) [22:07:53] PROBLEM Free ram is now: UNKNOWN on bots-3 i-000000e5 output: (No output returned from plugin) [22:07:59] Emw: "nova" is there due to a really old bug. it doesn't actually matter [22:08:02] they map to the same thing [22:08:08] PROBLEM Total Processes is now: UNKNOWN on bots-cb i-0000009e output: (No output returned from plugin) [22:08:13] PROBLEM dpkg-check is now: UNKNOWN on bots-dev i-00000190 output: (No output returned from plugin) [22:08:13] PROBLEM Current Load is now: UNKNOWN on bots-nfs i-000000b1 output: (No output returned from plugin) [22:08:13] PROBLEM Current Users is now: UNKNOWN on bots-sql1 i-000000b5 output: (No output returned from plugin) [22:08:13] PROBLEM Disk Space is now: UNKNOWN on bots-sql2 i-000000af output: (No output returned from plugin) [22:08:13] PROBLEM Free ram is now: UNKNOWN on bots-sql3 i-000000b4 output: (No output returned from plugin) [22:08:16] I'm really pissed off about this [22:08:23] PROBLEM Total Processes is now: UNKNOWN on build1 i-000002b3 output: (No output returned from plugin) [22:08:28] I checked the first 5 instances I did [22:08:29] PROBLEM dpkg-check is now: UNKNOWN on building i-0000014d output: (No output returned from plugin) [22:08:29] PROBLEM Current Load is now: UNKNOWN on cn-wiki-db-lucid i-00000241 output: (No output returned from plugin) [22:08:29] PROBLEM Current Users is now: UNKNOWN on configtest-main i-000002dd output: (No output returned from plugin) [22:08:29] PROBLEM Current Users is now: UNKNOWN on demo-deployment1 i-00000276 output: (No output returned from plugin) [22:08:29] PROBLEM Current Users is now: UNKNOWN on demo-mysql1 i-00000256 output: (No output returned from plugin) [22:08:30] PROBLEM Disk Space is now: UNKNOWN on demo-web1 i-00000255 output: (No output returned from plugin) [22:08:30] PROBLEM Disk Space is now: UNKNOWN on demo-web2 i-00000285 output: (No output returned from plugin) [22:08:30] PROBLEM Free ram is now: UNKNOWN on deployment-apache30 i-000002d3 output: (No output returned from plugin) [22:08:31] PROBLEM Free ram is now: UNKNOWN on deployment-apache31 i-000002d4 output: (No output returned from plugin) [22:08:43] Ryan_Lane: ok, thanks. 
in that case i'll go with pmtpa [22:08:45] apparently I didn't check well enough :( [22:08:59] PROBLEM Total Processes is now: UNKNOWN on deployment-cache-bits i-00000264 output: (No output returned from plugin) [22:09:05] PROBLEM dpkg-check is now: UNKNOWN on deployment-cache-upload i-00000263 output: (No output returned from plugin) [22:09:05] PROBLEM Current Load is now: UNKNOWN on deployment-feed i-00000118 output: (No output returned from plugin) [22:09:05] PROBLEM Current Users is now: UNKNOWN on deployment-imagescaler01 i-0000025a output: (No output returned from plugin) [22:09:05] PROBLEM Disk Space is now: UNKNOWN on deployment-jobrunner05 i-0000028c output: (No output returned from plugin) [22:09:05] PROBLEM Free ram is now: UNKNOWN on deployment-mc i-0000021b output: (No output returned from plugin) [22:09:20] PROBLEM Total Processes is now: UNKNOWN on deployment-sql i-000000d0 output: (No output returned from plugin) [22:09:25] PROBLEM dpkg-check is now: UNKNOWN on deployment-squid i-000000dc output: (No output returned from plugin) [22:09:25] PROBLEM Current Load is now: UNKNOWN on deployment-wmsearch i-000000e1 output: (No output returned from plugin) [22:09:25] PROBLEM Current Users is now: UNKNOWN on dev-solr i-00000152 output: (No output returned from plugin) [22:09:26] PROBLEM Disk Space is now: UNKNOWN on dumps-1 i-00000170 output: (No output returned from plugin) [22:11:06] Shouldn't Public IP in labsconsole list assigned floating ips on the nova recourse page or are they mapped outside of the recourse knowing about them? [22:12:48] PROBLEM Free ram is now: UNKNOWN on dumps-2 i-000002d8 output: (No output returned from plugin) [22:12:48] PROBLEM Total Processes is now: UNKNOWN on e3 i-00000291 output: (No output returned from plugin) [22:12:53] PROBLEM Total Processes is now: UNKNOWN on ee-prototype i-0000013d output: (No output returned from plugin) [22:13:03] PROBLEM dpkg-check is now: UNKNOWN on embed-sandbox i-000000d1 output: (No output returned from plugin) [22:13:03] PROBLEM Current Load is now: UNKNOWN on etherpad-lite i-000002de output: (No output returned from plugin) [22:13:03] PROBLEM Current Users is now: UNKNOWN on exim-test i-00000265 output: (No output returned from plugin) [22:13:03] PROBLEM Disk Space is now: UNKNOWN on feeds i-000000fa output: (No output returned from plugin) [22:13:03] PROBLEM Free ram is now: UNKNOWN on firstinstance i-0000013e output: (No output returned from plugin) [22:13:04] PROBLEM Total Processes is now: UNKNOWN on fundraising-db i-0000015c output: (No output returned from plugin) [22:13:08] PROBLEM Current Load is now: UNKNOWN on gerrit i-000000ff output: (No output returned from plugin) [22:13:24] PROBLEM Current Load is now: UNKNOWN on nova-gsoc1 i-000001de output: (No output returned from plugin) [22:13:24] PROBLEM Current Users is now: UNKNOWN on nova-ldap1 i-000000df output: (No output returned from plugin) [22:13:24] PROBLEM Disk Space is now: UNKNOWN on nova-ldap2 i-00000238 output: (No output returned from plugin) [22:13:24] PROBLEM Free ram is now: UNKNOWN on nova-precise1 i-00000236 output: (No output returned from plugin) [22:13:24] PROBLEM dpkg-check is now: UNKNOWN on opengrok-web i-000001e1 output: (No output returned from plugin) [22:13:25] PROBLEM Current Load is now: UNKNOWN on otrs-jgreen i-0000015a output: (No output returned from plugin) [22:14:05] Damianz: that's a bug [22:14:13] Ah [22:14:17] I'm not planning on fixing it [22:14:26] page editing is going to be handled by nova [22:14:29] When the new api 
support is done maybe? [22:14:48] andrewbogott has it written as a nova plugin [22:14:55] :) [22:14:58] <3 andrewbogott [22:15:10] So we can get nova keys and use sexy cli tools? :) [22:17:40] eventually that's the idea ;) [22:18:09] dns, puppet, and page editing is all we needed for that, I think [22:20:00] Page editing is a bit of a weird one in regards to sync, dns at least is either there or not [22:20:20] PROBLEM Current Load is now: UNKNOWN on mwreview i-000002ae output: (No output returned from plugin) [22:20:20] PROBLEM Free ram is now: UNKNOWN on nagios-main i-0000030d output: (No output returned from plugin) [22:20:20] PROBLEM Total Processes is now: UNKNOWN on nginx-ffuqua-doom1-3 i-00000196 output: (No output returned from plugin) [22:20:25] PROBLEM Current Load is now: UNKNOWN on nova-dev3 i-000000e9 output: (No output returned from plugin) [22:20:25] PROBLEM Current Users is now: UNKNOWN on nova-gsoc1 i-000001de output: (No output returned from plugin) [22:20:25] PROBLEM Disk Space is now: UNKNOWN on nova-ldap1 i-000000df output: (No output returned from plugin) [22:20:25] PROBLEM Free ram is now: UNKNOWN on nova-ldap2 i-00000238 output: (No output returned from plugin) [22:20:31] PROBLEM Disk Space is now: UNKNOWN on hugglewa-1 i-000001e0 output: (No output returned from plugin) [22:20:31] PROBLEM Free ram is now: UNKNOWN on hugglewa-db i-00000188 output: (No output returned from plugin) [22:20:31] PROBLEM Total Processes is now: UNKNOWN on incubator-bot1 i-00000251 output: (No output returned from plugin) [22:20:36] PROBLEM dpkg-check is now: UNKNOWN on incubator-bot2 i-00000252 output: (No output returned from plugin) [22:20:36] PROBLEM Current Load is now: UNKNOWN on integration-apache1 i-000002eb output: (No output returned from plugin) [22:20:36] PROBLEM Current Load is now: UNKNOWN on ipv6test1 i-00000282 output: (No output returned from plugin) [22:20:36] PROBLEM Current Users is now: UNKNOWN on kripke i-00000268 output: (No output returned from plugin) [22:20:36] PROBLEM Free ram is now: UNKNOWN on labs-build1 i-0000006b output: (No output returned from plugin) [22:20:36] PROBLEM Total Processes is now: UNKNOWN on labs-nfs1 i-0000005d output: (No output returned from plugin) [22:20:41] PROBLEM dpkg-check is now: UNKNOWN on labs-realserver i-00000104 output: (No output returned from plugin) [22:20:41] PROBLEM Current Load is now: UNKNOWN on localpuppet1 i-0000020b output: (No output returned from plugin) [22:20:41] PROBLEM Current Users is now: UNKNOWN on localpuppet2 i-0000029b output: (No output returned from plugin) [22:20:41] PROBLEM Disk Space is now: UNKNOWN on log1 i-00000239 output: (No output returned from plugin) [22:20:41] PROBLEM Free ram is now: UNKNOWN on mailman-01 i-00000235 output: (No output returned from plugin) [22:20:42] PROBLEM Free ram is now: UNKNOWN on maps-test2 i-00000253 output: (No output returned from plugin) [22:20:42] PROBLEM Free ram is now: UNKNOWN on maps-test3 i-0000028f output: (No output returned from plugin) [22:20:43] PROBLEM Total Processes is now: UNKNOWN on maps-tilemill1 i-00000294 output: (No output returned from plugin) [22:20:46] PROBLEM dpkg-check is now: UNKNOWN on master i-0000007a output: (No output returned from plugin) [22:20:47] PROBLEM Current Load is now: UNKNOWN on migration1 i-00000261 output: (No output returned from plugin) [22:20:47] PROBLEM Current Users is now: UNKNOWN on mingledbtest i-00000283 output: (No output returned from plugin) [22:20:47] PROBLEM Disk Space is now: UNKNOWN on mobile-feeds 
i-000000c1 output: (No output returned from plugin) [22:20:47] PROBLEM Free ram is now: UNKNOWN on mobile-testing i-00000271 output: (No output returned from plugin) [22:20:52] Damianz: well, that's why nova should handle editing [22:20:58] it knows when things change [22:21:01] PROBLEM Total Processes is now: UNKNOWN on pybal-precise i-00000289 output: (No output returned from plugin) [22:21:06] PROBLEM dpkg-check is now: UNKNOWN on queue-wiki1 i-000002b8 output: (No output returned from plugin) [22:21:06] PROBLEM Current Load is now: UNKNOWN on redis1 i-000002b6 output: (No output returned from plugin) [22:21:06] PROBLEM Current Users is now: UNKNOWN on reportcard2 i-000001ea output: (No output returned from plugin) [22:21:06] PROBLEM Disk Space is now: UNKNOWN on resourceloader2-apache i-000001d7 output: (No output returned from plugin) [22:21:06] PROBLEM Disk Space is now: UNKNOWN on robh2 i-000001a2 output: (No output returned from plugin) [22:21:06] PROBLEM Free ram is now: UNKNOWN on scribunto i-0000022c output: (No output returned from plugin) [22:21:07] PROBLEM Total Processes is now: UNKNOWN on signwriting-ase5 i-0000030c output: (No output returned from plugin) [22:21:11] PROBLEM dpkg-check is now: UNKNOWN on simplewikt i-00000149 output: (No output returned from plugin) [22:21:11] PROBLEM Current Load is now: UNKNOWN on su-be1 i-000002e7 output: (No output returned from plugin) [22:21:11] PROBLEM Current Users is now: UNKNOWN on su-be2 i-000002e8 output: (No output returned from plugin) [22:21:11] PROBLEM Disk Space is now: UNKNOWN on su-be3 i-000002e9 output: (No output returned from plugin) [22:21:11] PROBLEM Free ram is now: UNKNOWN on su-fe1 i-000002e5 output: (No output returned from plugin) [22:21:12] PROBLEM Total Processes is now: UNKNOWN on swift-aux1 i-0000024b output: (No output returned from plugin) [22:21:17] PROBLEM dpkg-check is now: UNKNOWN on swift-aux2 i-0000024c output: (No output returned from plugin) [22:21:17] PROBLEM Current Load is now: UNKNOWN on swift-be2 i-000001c8 output: (No output returned from plugin) [22:21:17] PROBLEM Current Users is now: UNKNOWN on swift-be3 i-000001c9 output: (No output returned from plugin) [22:21:17] PROBLEM Disk Space is now: UNKNOWN on swift-be4 i-000001ca output: (No output returned from plugin) [22:21:17] PROBLEM Free ram is now: UNKNOWN on swift-fe1 i-000001d2 output: (No output returned from plugin) [22:21:17] PROBLEM Total Processes is now: UNKNOWN on test2 i-0000013c output: (No output returned from plugin) [22:21:22] PROBLEM Current Load is now: UNKNOWN on testforx i-000002f3 output: (No output returned from plugin) [22:21:22] PROBLEM Current Users is now: UNKNOWN on testing-virt6 i-00000302 output: (No output returned from plugin) [22:21:22] PROBLEM Disk Space is now: UNKNOWN on testing-virt7 i-00000308 output: (No output returned from plugin) [22:21:22] PROBLEM Free ram is now: UNKNOWN on testing-virt8 i-00000309 output: (No output returned from plugin) [22:21:42] PROBLEM Total Processes is now: UNKNOWN on translation-memory-2 i-000002d9 output: (No output returned from plugin) [22:21:47] PROBLEM dpkg-check is now: UNKNOWN on tutorial-mysql i-0000028b output: (No output returned from plugin) [22:21:47] PROBLEM Current Load is now: UNKNOWN on udp-filter i-000001df output: (No output returned from plugin) [22:21:47] PROBLEM Current Users is now: UNKNOWN on upload-wizard i-0000021c output: (No output returned from plugin) [22:21:47] PROBLEM Current Users is now: UNKNOWN on utils-abogott i-00000131 output: (No output 
returned from plugin) [22:21:47] PROBLEM Disk Space is now: UNKNOWN on varnish i-000001ac output: (No output returned from plugin) [22:21:48] PROBLEM Free ram is now: UNKNOWN on ve-nodejs i-00000245 output: (No output returned from plugin) [22:22:02] PROBLEM Total Processes is now: UNKNOWN on vivek-puppet i-000000ca output: (No output returned from plugin) [22:22:07] PROBLEM dpkg-check is now: UNKNOWN on vumi i-000001e5 output: (No output returned from plugin) [22:22:08] PROBLEM Current Load is now: UNKNOWN on webserver-lcarr i-00000134 output: (No output returned from plugin) [22:22:08] PROBLEM Current Load is now: UNKNOWN on wep i-000000c2 output: (No output returned from plugin) [22:22:08] PROBLEM Current Users is now: UNKNOWN on wikidata-dev-1 i-0000020c output: (No output returned from plugin) [22:22:08] PROBLEM Disk Space is now: UNKNOWN on wikidata-dev-2 i-00000259 output: (No output returned from plugin) [22:22:08] PROBLEM Free ram is now: UNKNOWN on wikidata-dev-3 i-00000225 output: (No output returned from plugin) [22:22:18] PROBLEM Total Processes is now: UNKNOWN on wikistats-01 i-00000042 output: (No output returned from plugin) [22:22:23] PROBLEM dpkg-check is now: UNKNOWN on wikistats-history-01 i-000002e2 output: (No output returned from plugin) [22:22:24] PROBLEM Current Load is now: UNKNOWN on wmde-test i-000002ad output: (No output returned from plugin) [22:22:24] PROBLEM Current Load is now: UNKNOWN on worker1 i-00000208 output: (No output returned from plugin) [22:22:24] PROBLEM Current Users is now: UNKNOWN on zeromq1 i-000002b7 output: (No output returned from plugin) [22:23:14] PROBLEM Current Load is now: UNKNOWN on maps-tilemill1 i-00000294 output: (No output returned from plugin) [22:23:14] PROBLEM Current Users is now: UNKNOWN on master i-0000007a output: (No output returned from plugin) [22:23:15] PROBLEM Disk Space is now: UNKNOWN on memcache-puppet i-00000153 output: (No output returned from plugin) [22:23:15] PROBLEM Free ram is now: UNKNOWN on migration1 i-00000261 output: (No output returned from plugin) [22:23:15] PROBLEM Total Processes is now: UNKNOWN on mobile-feeds i-000000c1 output: (No output returned from plugin) [22:23:20] PROBLEM dpkg-check is now: UNKNOWN on mobile-testing i-00000271 output: (No output returned from plugin) [22:23:20] PROBLEM Current Users is now: UNKNOWN on mwreview i-000002ae output: (No output returned from plugin) [22:23:20] PROBLEM Total Processes is now: UNKNOWN on nginx-dev1 i-000000f0 output: (No output returned from plugin) [22:23:25] PROBLEM dpkg-check is now: UNKNOWN on nginx-ffuqua-doom1-3 i-00000196 output: (No output returned from plugin) [22:23:25] PROBLEM Disk Space is now: WARNING on nagios 127.0.0.1 output: DISK WARNING - free space: /public/keys 2832 MB (16% inode=74%): /home/petrb 2832 MB (16% inode=74%): /home/autofs_check 2832 MB (16% inode=74%): [22:25:08] PROBLEM dpkg-check is now: UNKNOWN on log1 i-00000239 output: (No output returned from plugin) [22:25:08] PROBLEM dpkg-check is now: UNKNOWN on mailman-01 i-00000235 output: (No output returned from plugin) [22:25:09] PROBLEM dpkg-check is now: UNKNOWN on maps-test2 i-00000253 output: (No output returned from plugin) [22:25:09] PROBLEM Current Users is now: UNKNOWN on maps-tilemill1 i-00000294 output: (No output returned from plugin) [22:25:09] PROBLEM Disk Space is now: UNKNOWN on master i-0000007a output: (No output returned from plugin) [22:25:09] PROBLEM Free ram is now: UNKNOWN on memcache-puppet i-00000153 output: (No output returned from plugin) 
[22:25:09] PROBLEM Total Processes is now: UNKNOWN on mingledbtest i-00000283 output: (No output returned from plugin) [22:25:14] PROBLEM dpkg-check is now: UNKNOWN on mobile-feeds i-000000c1 output: (No output returned from plugin) [22:25:14] PROBLEM Current Load is now: UNKNOWN on mobile-wlm i-000002bc output: (No output returned from plugin) [22:27:07] PROBLEM Current Load is now: UNKNOWN on ganglia-test2 i-00000250 output: (No output returned from plugin) [22:27:07] PROBLEM Disk Space is now: UNKNOWN on grail i-000002c6 output: (No output returned from plugin) [22:27:07] PROBLEM Free ram is now: UNKNOWN on hugglewa-1 i-000001e0 output: (No output returned from plugin) [22:27:07] PROBLEM Total Processes is now: UNKNOWN on incubator-bot0 i-00000296 output: (No output returned from plugin) [22:27:12] PROBLEM dpkg-check is now: UNKNOWN on incubator-bot1 i-00000251 output: (No output returned from plugin) [22:27:12] PROBLEM Current Load is now: UNKNOWN on incubator-common i-00000254 output: (No output returned from plugin) [22:27:12] PROBLEM Current Users is now: UNKNOWN on integration-apache1 i-000002eb output: (No output returned from plugin) [22:27:12] PROBLEM Current Users is now: UNKNOWN on ipv6test1 i-00000282 output: (No output returned from plugin) [22:27:12] PROBLEM Disk Space is now: UNKNOWN on kripke i-00000268 output: (No output returned from plugin) [22:27:12] PROBLEM Total Processes is now: UNKNOWN on labs-lvs1 i-00000057 output: (No output returned from plugin) [22:27:17] PROBLEM dpkg-check is now: UNKNOWN on labs-nfs1 i-0000005d output: (No output returned from plugin) [22:27:17] PROBLEM Current Load is now: UNKNOWN on labs-relay i-00000103 output: (No output returned from plugin) [22:27:17] PROBLEM Current Users is now: UNKNOWN on localpuppet1 i-0000020b output: (No output returned from plugin) [22:27:17] PROBLEM Disk Space is now: UNKNOWN on localpuppet2 i-0000029b output: (No output returned from plugin) [22:27:17] PROBLEM Free ram is now: UNKNOWN on log1 i-00000239 output: (No output returned from plugin) [22:27:22] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: Connection refused [22:27:22] PROBLEM Disk Space is now: UNKNOWN on outreacheval i-0000012e output: (No output returned from plugin) [22:27:22] PROBLEM Free ram is now: UNKNOWN on p-b i-000000ae output: (No output returned from plugin) [22:27:22] PROBLEM Total Processes is now: UNKNOWN on patchtest i-000000f1 output: (No output returned from plugin) [22:27:27] PROBLEM Total Processes is now: UNKNOWN on patchtest2 i-000000fd output: (No output returned from plugin) [22:27:33] PROBLEM Current Load is now: UNKNOWN on pediapress-ocg2 i-00000234 output: (No output returned from plugin) [22:27:33] PROBLEM Current Users is now: UNKNOWN on pediapress-packager i-000001e4 output: (No output returned from plugin) [22:27:33] PROBLEM Disk Space is now: UNKNOWN on precise-test i-00000231 output: (No output returned from plugin) [22:27:33] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: (No output returned from plugin) [22:27:33] PROBLEM Free ram is now: UNKNOWN on publicdata-administration i-0000019e output: (No output returned from plugin) [22:27:34] PROBLEM Total Processes is now: UNKNOWN on puppet-lucid i-00000080 output: (No output returned from plugin) [22:27:48] PROBLEM Current Users is now: UNKNOWN on redis1 i-000002b6 output: (No output returned from plugin) [22:27:48] PROBLEM Disk Space is now: UNKNOWN on reportcard2 i-000001ea output: (No output returned from plugin) [22:27:53] 
PROBLEM Current Load is now: UNKNOWN on labs-build1 i-0000006b output: (No output returned from plugin) [22:27:53] PROBLEM Current Users is now: UNKNOWN on labs-lvs1 i-00000057 output: (No output returned from plugin) [22:27:54] PROBLEM Disk Space is now: UNKNOWN on labs-nfs1 i-0000005d output: (No output returned from plugin) [22:27:54] PROBLEM Free ram is now: UNKNOWN on labs-realserver i-00000104 output: (No output returned from plugin) [22:27:54] PROBLEM Total Processes is now: UNKNOWN on secondinstance i-0000015b output: (No output returned from plugin) [22:27:58] PROBLEM Free ram is now: UNKNOWN on robh2 i-000001a2 output: (No output returned from plugin) [22:27:59] PROBLEM Current Load is now: UNKNOWN on rds i-00000207 output: (No output returned from plugin) [22:27:59] PROBLEM Free ram is now: UNKNOWN on resourceloader2-apache i-000001d7 output: (No output returned from plugin) [22:27:59] PROBLEM dpkg-check is now: UNKNOWN on pybal-precise i-00000289 output: (No output returned from plugin) [22:27:59] PROBLEM Total Processes is now: UNKNOWN on localpuppet1 i-0000020b output: (No output returned from plugin) [22:28:04] PROBLEM dpkg-check is now: UNKNOWN on localpuppet2 i-0000029b output: (No output returned from plugin) [22:28:04] PROBLEM Total Processes is now: UNKNOWN on shop-analytics-main i-000001e6 output: (No output returned from plugin) [22:28:09] PROBLEM dpkg-check is now: UNKNOWN on signwriting-ase5 i-0000030c output: (No output returned from plugin) [22:28:09] PROBLEM Current Load is now: UNKNOWN on su-aux1 i-000002ea output: (No output returned from plugin) [22:28:09] PROBLEM Current Users is now: UNKNOWN on su-be1 i-000002e7 output: (No output returned from plugin) [22:28:09] PROBLEM Disk Space is now: UNKNOWN on su-be2 i-000002e8 output: (No output returned from plugin) [22:28:09] PROBLEM Free ram is now: UNKNOWN on su-be3 i-000002e9 output: (No output returned from plugin) [22:28:10] PROBLEM Current Load is now: UNKNOWN on mailman-01 i-00000235 output: (No output returned from plugin) [22:28:10] PROBLEM Current Load is now: UNKNOWN on maps-test2 i-00000253 output: (No output returned from plugin) [22:28:10] PROBLEM Current Load is now: UNKNOWN on maps-test3 i-0000028f output: (No output returned from plugin) [22:28:19] PROBLEM Total Processes is now: UNKNOWN on su-fe2 i-000002e6 output: (No output returned from plugin) [22:28:24] PROBLEM dpkg-check is now: UNKNOWN on swift-aux1 i-0000024b output: (No output returned from plugin) [22:28:24] PROBLEM Current Load is now: UNKNOWN on swift-be1 i-000001c7 output: (No output returned from plugin) [22:28:24] PROBLEM Current Users is now: UNKNOWN on swift-be2 i-000001c8 output: (No output returned from plugin) [22:28:24] PROBLEM Disk Space is now: UNKNOWN on swift-be3 i-000001c9 output: (No output returned from plugin) [22:28:25] PROBLEM Free ram is now: UNKNOWN on swift-be4 i-000001ca output: (No output returned from plugin) [22:28:40] PROBLEM Total Processes is now: UNKNOWN on test-oneiric i-00000187 output: (No output returned from plugin) [22:28:45] PROBLEM dpkg-check is now: UNKNOWN on test2 i-0000013c output: (No output returned from plugin) [22:29:41] PROBLEM Current Load is now: UNKNOWN on testblog i-00000167 output: (No output returned from plugin) [22:29:41] PROBLEM Current Users is now: UNKNOWN on testforx i-000002f3 output: (No output returned from plugin) [22:29:41] PROBLEM Disk Space is now: UNKNOWN on testing-virt6 i-00000302 output: (No output returned from plugin) [22:29:42] PROBLEM Free ram is now: UNKNOWN on 
testing-virt7 i-00000308 output: (No output returned from plugin) [22:30:52] Sorry nagios but I'm not scrolling up 3 pages to read the last time :( [22:33:17] Some interesting-looking EuroPython videos out from this last week [22:34:05] it'll be a while till it stops spamming, for sure [22:34:15] Yeah... [22:34:37] It probably hasn't checked all the hosts yet, was going to look but I can't see the ip lol [22:34:39] <^demon|away> I /ignored that bot ages ago :p [22:34:44] Load is pretty constant [22:38:58] yeah, load is back to being sane [22:39:18] each host has 1/4 fewer instances [22:39:36] which means less swap, and less io [22:40:17] Well we have more ram now too... or sorta do :D [22:41:44] on the new hosts we have more than 3x the amount of ram [22:42:04] so, the good thing is any new instance will be launched on the new hardware [22:48:22] are the instructions on forcing puppet runs given on https://labsconsole.wikimedia.org/wiki/Help:Instances#Configuring_instance still valid? [22:48:25] You switched the scheduler back to most-free node rather than round robin? Or was that the last switch from round robin to next free, totally can't remember [22:48:44] Emw: Yes [22:50:50] when i try sudo-running a command, e.g. 'sudo puppetd -tv' as instructed at that link, i'm told 'emw is not allowed to run sudo on i-0000030e'. i'm a sysadmin and netadmin on the project that instance is a part of. [22:51:29] Hmm, might not have a sudo policy set up right [22:51:44] when i try running the command mentioned at that linked section without sudo, i.e. 'puppetd -tv', i get the following message: [22:51:46] err: Could not request certificate: getaddrinfo: Name or service not known [22:51:48] Exiting; failed to retrieve certificate and waitforcert is disabled [22:51:55] There's a "Manage Sudo Policies" link that lets you configure sudo access per project [22:52:09] ah, i'll look at that. thanks. [22:52:28] I thought the default was to allow but as all my projects have funky rules I'm not sure :* [22:52:31] :(* [22:53:53] Damianz: last change was next free [22:54:04] so, hosts with fewer instances will get the next instance [22:54:41] I'm not sure why the default doesn't get set [22:54:48] it's a bug in labsconsole [22:55:05] is there any help content on 'Modify Sudo Policy'? site-searching 'sudo' doesn't turn up much and i don't recall seeing anything relevant to that while looking through the other help content. i can probably guess how 'Modify Sudo Policy' works, but ideally i'd like to read over any documentation first before changing project configuration around. [22:55:29] Emw: make a policy called "default" [22:55:30] Basically just set users to all, commands to all if you don't require restricting access [22:55:43] * Damianz adds hosts all in there somewhere [22:55:45] for hosts, set "ALL", for users set "ALL", and for commands, set "ALL" [22:55:53] none of those should have quotes, of course [22:56:49] ok. 'ALL' is a checkbox option for users and hosts. 'Commands' and 'Options' are both text inputs -- i'll put ALL into those too [22:56:57] not for options [22:57:06] whoops [22:57:12] in general you'll never use options [22:57:54] modified policy to have options left blank [22:58:29] I kinda really hate the idea of sudo... DAC is really lame for some things, need to get down with MAC/RBAC some more [22:58:57] 'sudo puppetd -tv' seems to be working now, woo! [23:00:26] oh.
seems OpenStackManager doesn't have a default policy created by default [23:00:27] weird [23:00:45] and my test instance is down [23:00:55] think it'd make sense to add a note to add a new policy 'Manage Sudo Policy' named 'default' with 'ALL' set for users, hosts, and commands (but not options) to https://labsconsole.wikimedia.org/wiki/Help:Instances#Configuring_instance? [23:00:56] because it's one of the migrated ones [23:01:04] if so, i can do that [23:01:11] hm [23:01:14] probably not there [23:01:36] alright [23:01:50] on here somewhere: https://labsconsole.wikimedia.org/wiki/Help:Contents [23:01:59] probably under the interface section [23:02:06] a new page for managing sudo policies would be good [23:03:11] i'll add that basic note there, then, and if it makes sense link to it from https://labsconsole.wikimedia.org/wiki/Help:Instances#Configuring_instance (where i ran into this issue) [23:03:25] * Ryan_Lane nods [23:03:29] sounds good [23:12:24] must a user be a sysadmin for the given project in order to create/modify/delete sudo policies for that project? i'd imagine so, but i just want to check. [23:12:32] Emw: did you delete 30a? [23:12:40] yes [23:12:42] ok [23:12:54] you must be a sysadmin in a project to modify sudo policies, yes [23:32:18] a new page https://labsconsole.wikimedia.org/wiki/Help:Sudo_Policies and an edit: https://labsconsole.wikimedia.org/w/index.php?title=Help:Instances&diff=4661&oldid=3923. reviews for accuracy welcome. [23:40:48] thanks [23:45:38] mutante: I probably broke some of your instances [23:45:45] Ryan_Lane: you killed > i-000000c1 <-- mobile-feeds in mobile project ?!? [23:46:01] kvm block migration did [23:46:06] you need anything saved from it? [23:46:13] I still have access to its data [23:46:30] Ryan_Lane: no it's fine [23:46:34] Ryan_Lane: but damn dude [23:46:43] you sure? not a problem for me to pull crap from it [23:46:51] DUDE [23:46:54] Ryan_Lane: no it's all good [23:46:57] heh [23:47:08] preilly: wasn't on purpose ;) [23:47:25] I tested a bunch of instance before I started migrating en masse [23:47:29] *instances [23:48:00] actually, that instance is up [23:48:47] though it could be corrupted [23:48:55] preilly: so, if it isn't broken, keep on using it [23:50:09] hah [23:50:13] it's broken. for sure [23:50:17] root@i-000000c1:/var/log# file /usr/bin/w.procps [23:50:18] -bash: /usr/bin/file: cannot execute binary file [23:50:18] r [23:50:23] lol [23:50:28] root@i-000000c1:/var/log# ldd /usr/bin/file [23:50:28] -bash: /usr/bin/ldd: cannot execute binary file [23:51:01] when that instance reboots it's a goner [23:51:05] Who needs ldd, I mean meh just static compile everything :D [23:51:57] for the rest of the instances, I'm going to shut them down, rsync their disk files, modify the nova database, and bring them up [23:52:31] I've tested that and it works perfectly [23:54:35] I wonder what performance difference you'd get rsyncing from the data dir vs rsyncing from the gluster mount... should be the same file in theory but the io difference could be interesting. [23:55:16] eww [23:55:17] no [23:55:24] I'm going to go from the gluster mount [23:55:24] heh [23:55:55] lol [23:56:11] I just had corruption issues, I'd prefer not to press my luck [23:56:38] petan2: anything I can do to make you guys' life easier in deployment-prep? [23:56:44] Well yeah but doing it that way and not using the delete flag would be pretty safe :) [23:56:48] you mentioned they should all be puppetized, right?
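(A sketch for orientation: the 'default' sudo policy walked through above at 22:55, with users, hosts, and commands all set to ALL and options left blank, amounts to the classic catch-all sudoers rule, though OpenStackManager stores the policy in LDAP rather than in /etc/sudoers. The lines below are illustrative only; the instance FQDN format is an assumption, and only 'sudo puppetd -tv' itself comes from the log.)

    # Roughly what the "default" labsconsole sudo policy amounts to,
    # shown here as a plain sudoers rule purely for orientation:
    #     ALL ALL=(ALL) ALL
    #
    # With the policy in place, a project sysadmin can force a puppet run
    # on an instance, per Help:Instances#Configuring_instance:
    ssh i-0000030e.pmtpa.wmflabs    # FQDN format assumed for illustration
    sudo puppetd -tv                # one-off verbose puppet agent run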
[23:57:01] I think squid isn't, that would be the main thing :( [23:57:16] I'm not deleting anything until I've verified it's working [23:57:21] /etc is in git though... not pushed anywhere as far as I know [23:57:39] it wasn't in the list [23:57:45] maybe the upload is a squid [23:59:00] I haven't seen hyperon in a while....
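(A sketch for orientation: the per-instance move described at 23:51, shut the instance down, copy its disk files with rsync, repoint the nova database, and bring it back up, might look roughly like the following. The compute host names, paths, database host, and table columns are assumptions rather than details from the log; only the overall shut-down / rsync / edit-DB / start sequence comes from the conversation above.)

    #!/bin/bash
    # Hypothetical outline of the migration steps described at 23:51,
    # assumed to run on the old compute host. Names and paths are made up.
    instance=i-000000c1        # example instance ID borrowed from the log
    newhost=virt9              # hypothetical destination compute host

    # 1. stop the instance cleanly
    virsh shutdown "$instance"

    # 2. copy its disk files across (rsync -S keeps sparse images sparse);
    #    the chat prefers copying via the gluster mount rather than the
    #    per-host data dir, so this path is only illustrative
    rsync -aSP /var/lib/nova/instances/"$instance"/ \
          "$newhost":/var/lib/nova/instances/"$instance"/

    # 3. tell nova which compute host now owns the instance
    #    (database host and column names are assumptions)
    mysql -h nova-db nova -e \
        "UPDATE instances SET host='$newhost' WHERE hostname='$instance';"

    # 4. bring it back up; going through the nova API lets nova recreate
    #    the libvirt domain on the new host
    nova reboot "$instance"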