[02:20:26] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 21% free memory
[02:32:06] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[02:40:19] 05/27/2012 - 02:40:19 - Updating keys for laner at /export/home/deployment-prep/laner
[02:41:18] 05/27/2012 - 02:41:18 - Updating keys for laner at /export/home/deployment-prep/laner
[02:51:21] 05/27/2012 - 02:51:21 - Updating keys for laner at /export/home/deployment-prep/laner
[02:53:21] 05/27/2012 - 02:53:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:03:21] 05/27/2012 - 03:03:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:33:37] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.04, 5.39, 5.13
[03:39:47] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 17% free memory
[03:44:47] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 17% free memory
[03:44:57] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory
[03:48:48] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 4.98, 4.80, 4.92
[03:59:49] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory
[03:59:59] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 15% free memory
[04:05:08] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory
[04:05:08] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory
[04:10:16] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:10:16] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:11:50] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:11:55] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:12:00] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:17:32] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 87 processes
[04:17:37] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 79% free memory
[04:17:38] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in
[04:17:38] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 15.79, 10.99, 6.07
[04:21:23] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[04:21:23] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 14.82, 18.00, 11.14
[04:21:38] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory
[04:23:28] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:23:28] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:23:28] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:08] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: Connection refused or timed out
[04:27:14] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:19] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:19] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:31:28] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:32:31] PROBLEM dpkg-check is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:32:41] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 6.70, 6.60, 6.32
[04:32:51] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 1.38, 3.54, 4.25
[05:12:55] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: CRITICAL - Socket timeout after 10 seconds
[05:20:07] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - load average: 0.81, 10.77, 33.19
[05:31:31] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:31] PROBLEM Disk Space is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:31] PROBLEM Total Processes is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:36] PROBLEM Current Users is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:36] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:36] PROBLEM dpkg-check is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:36] PROBLEM dpkg-check is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:36] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:36] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:41] PROBLEM Current Load is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:41] PROBLEM Current Users is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:42] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:06] PROBLEM HTTP is now: WARNING on mailman-01 i-00000235 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 498 bytes in 0.011 second response time
[05:37:06] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 5.71, 8.32, 9.55
[05:39:23] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory
[05:39:28] PROBLEM Current Load is now: WARNING on worker1 i-00000208 output: WARNING - load average: 6.64, 6.73, 7.69
[05:39:28] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in
[05:39:28] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK
[05:39:28] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 91% free memory
[05:39:28] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 97 processes
[05:39:33] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 3.86, 3.73, 14.27
[05:39:33] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:39:33] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:39:38] PROBLEM Total Processes is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:39:38] PROBLEM Free ram is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:45:54] PROBLEM Current Load is now: WARNING on swift-be4 i-000001ca output: WARNING - load average: 7.22, 8.68, 9.65
[05:45:54] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 8.13, 8.28, 9.41
[05:45:59] PROBLEM Current Users is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:09] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:09] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:09] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:14] PROBLEM dpkg-check is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:14] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 5.58, 5.50, 5.87
[05:46:24] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:24] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:24] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:34] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:49] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:59] PROBLEM Total Processes is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:19] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:19] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:19] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:19] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:19] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds
[05:47:19] PROBLEM Free ram is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:19] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:49:55] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 12.09, 10.13, 8.89
[05:49:55] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 1.30, 4.21, 7.39
[05:49:55] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 7.39, 8.12, 6.91
[05:50:00] PROBLEM Current Load is now: WARNING on ee-prototype i-0000013d output: WARNING - load average: 4.59, 5.93, 6.00
[05:50:00] PROBLEM Current Users is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:05] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:05] PROBLEM SSH is now: CRITICAL on ipv6test1 i-00000282 output: CRITICAL - Socket timeout after 10 seconds
[05:50:15] PROBLEM Total Processes is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:20] PROBLEM Disk Space is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:21] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:25] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:26] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:26] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:26] PROBLEM Current Load is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:26] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:26] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:50:35] RECOVERY Total Processes is now: OK on test3 i-00000093 output: PROCS OK: 78 processes
[05:50:40] RECOVERY Current Users is now: OK on test3 i-00000093 output: USERS OK - 0 users currently logged in
[05:50:40] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory
[05:50:40] RECOVERY dpkg-check is now: OK on test3 i-00000093 output: All packages OK
[05:51:47] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in
[05:51:47] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 6.10, 7.44, 8.81
[05:51:47] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 104 processes
[05:52:59] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 5.87, 8.38, 9.46
[05:53:00] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in
[05:53:00] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK
[05:53:00] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK
[05:53:00] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK
[05:53:00] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 89% free memory
[05:53:00] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK
[05:53:01] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK
[05:53:35] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:35] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:35] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:36] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:41] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:41] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:41] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 4.10, 5.41, 6.47
[05:53:41] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[05:53:48] RECOVERY Current Users is now: OK on ipv6test1 i-00000282 output: USERS OK - 0 users currently logged in
[05:53:48] RECOVERY Current Load is now: OK on ee-prototype i-0000013d output: OK - load average: 0.49, 2.28, 4.32
[05:53:49] RECOVERY SSH is now: OK on ipv6test1 i-00000282 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[05:53:49] RECOVERY Current Load is now: OK on ipv6test1 i-00000282 output: OK - load average: 3.71, 4.26, 4.76
[05:53:49] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK
[05:53:49] PROBLEM Current Load is now: WARNING on reportcard2 i-000001ea output: WARNING - load average: 15.07, 11.56, 8.86
[05:53:49] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 75% free memory
[05:53:55] RECOVERY Disk Space is now: OK on bots-sql2 i-000000af output: DISK OK
[05:53:55] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:22] RECOVERY Current Users is now: OK on bots-sql2 i-000000af output: USERS OK - 0 users currently logged in
[05:55:22] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:22] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:22] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:22] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:59:04] RECOVERY Current Load is now: OK on swift-be4 i-000001ca output: OK - load average: 1.20, 1.88, 4.89
[05:59:04] PROBLEM Current Load is now: CRITICAL on aggregator-test3 i-00000293 output: CRITICAL - load average: 5.14, 31.27, 40.29
[05:59:04] RECOVERY dpkg-check is now: OK on ipv6test1 i-00000282 output: All packages OK
[05:59:04] RECOVERY Total Processes is now: OK on ipv6test1 i-00000282 output: PROCS OK: 93 processes
[05:59:14] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:59:14] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[05:59:24] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:25] RECOVERY Free ram is now: OK on ipv6test1 i-00000282 output: OK: 27% free memory
[06:04:25] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 5.30, 8.14, 8.97
[06:04:25] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 0.98, 1.58, 4.12
[06:04:25] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 89 processes
[06:04:54] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:54] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:06:18] RECOVERY Disk Space is now: OK on pybal-precise i-00000289 output: DISK OK
[06:06:18] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 6.93, 8.27, 9.80
[06:06:18] RECOVERY Current Users is now: OK on pybal-precise i-00000289 output: USERS OK - 0 users currently logged in
[06:06:18] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af output: All packages OK
[06:06:18] RECOVERY Total Processes is now: OK on pybal-precise i-00000289 output: PROCS OK: 93 processes
[06:06:23] RECOVERY Free ram is now: OK on pybal-precise i-00000289 output: OK: 86% free memory
[06:06:23] RECOVERY dpkg-check is now: OK on pybal-precise i-00000289 output: All packages OK
[06:06:27] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:06:27] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:06:27] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:06:43] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 86% free memory
[06:06:43] PROBLEM Current Load is now: WARNING on maps-tilemill1 i-00000294 output: WARNING - load average: 5.16, 8.81, 8.97
[06:06:43] RECOVERY Current Users is now: OK on maps-tilemill1 i-00000294 output: USERS OK - 0 users currently logged in
[06:06:43] RECOVERY Total Processes is now: OK on maps-tilemill1 i-00000294 output: PROCS OK: 112 processes
[06:06:48] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK
[06:06:48] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 1.20, 3.53, 5.77
[06:09:21] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:10:00] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK
[06:10:06] RECOVERY Total Processes is now: OK on bots-sql2 i-000000af output: PROCS OK: 83 processes
[06:12:04] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 71% free memory
[06:12:05] RECOVERY dpkg-check is now: OK on mobile-testing i-00000271 output: All packages OK
[06:12:06] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in
[06:12:06] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 201 processes
[06:12:26] PROBLEM Current Load is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:12:26] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:13:41] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK
[06:13:51] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 3.46, 3.66, 4.75
[06:14:03] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:03] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:03] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:03] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:03] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:08] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:08] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:09] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:09] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:52] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:52] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:52] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:52] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:11] PROBLEM Current Load is now: WARNING on aggregator-test3 i-00000293 output: WARNING - load average: 1.05, 4.71, 17.85
[06:18:53] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 0.56, 5.80, 19.41
[06:18:59] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 40% free memory
[06:18:59] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 130 processes
[06:19:12] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:19:43] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in
[06:19:44] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK
[06:19:44] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory
[06:19:48] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 85% free memory
[06:19:49] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK
[06:19:49] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 81 processes
[06:23:07] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.37, 1.66, 3.70
[06:23:49] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:23:55] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 4.39, 5.86, 5.51
[06:26:06] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:36:50] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:36:50] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:51] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:55] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:34] RECOVERY Current Load is now: OK on aggregator-test3 i-00000293 output: OK - load average: 3.63, 2.73, 4.66
[06:53:01] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:53:26] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:53:26] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:53:26] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:53:26] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:07:55] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in
[07:07:56] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK
[07:07:56] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK
[07:07:56] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 40% free memory
[07:07:56] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 148 processes
[07:21:37] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 1.67, 3.66, 7.06
[07:22:52] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:22:52] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:22:52] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:22:52] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:22:57] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:07] PROBLEM HTTP is now: CRITICAL on ee-prototype i-0000013d output: CRITICAL - Socket timeout after 10 seconds
[07:24:54] PROBLEM Puppet freshness is now: CRITICAL on deployment-jobrunner05 i-0000028c output: Puppet has not run in last 20 hours
[07:24:54] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 7.68, 5.83, 5.67
[07:24:54] PROBLEM Current Load is now: WARNING on firstinstance i-0000013e output: WARNING - load average: 3.57, 4.59, 5.26
[07:24:54] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 8.89, 7.98, 6.86
[07:24:54] PROBLEM Current Load is now: WARNING on swift-be4 i-000001ca output: WARNING - load average: 3.02, 4.68, 5.99
[07:24:54] PROBLEM Current Load is now: WARNING on swift-be2 i-000001c8 output: WARNING - load average: 1.88, 4.49, 5.71
[07:24:54] PROBLEM Current Load is now: WARNING on deployment-apache23 i-00000270 output: WARNING - load average: 7.03, 6.42, 5.83
[07:24:55] PROBLEM Current Load is now: WARNING on wikistats-01 i-00000042 output: WARNING - load average: 6.03, 7.27, 6.67
[07:24:56] PROBLEM Current Load is now: WARNING on memcache-puppet i-00000153 output: WARNING - load average: 4.13, 5.20, 5.24
[07:24:56] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:24:57] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:24:57] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:27:21] PROBLEM Current Load is now: WARNING on deployment-apache22 i-0000026f output: WARNING - load average: 6.65, 5.93, 5.44
[07:27:21] PROBLEM Current Load is now: WARNING on deployment-imagescaler01 i-0000025a output: WARNING - load average: 4.03, 4.87, 5.24
[07:27:41] RECOVERY HTTP is now: OK on ee-prototype i-0000013d output: HTTP OK: HTTP/1.1 200 OK - 1688 bytes in 0.061 second response time
[07:29:15] PROBLEM Current Load is now: WARNING on kripke i-00000268 output: WARNING - load average: 4.19, 4.69, 5.14
[07:29:15] PROBLEM Current Load is now: WARNING on robh2 i-000001a2 output: WARNING - load average: 5.36, 5.54, 5.71
[07:29:20] PROBLEM Current Load is now: WARNING on ee-prototype i-0000013d output: WARNING - load average: 6.93, 6.19, 5.46
[07:29:20] RECOVERY Current Load is now: OK on swift-be2 i-000001c8 output: OK - load average: 1.24, 2.21, 4.32
[07:29:20] RECOVERY Current Load is now: OK on memcache-puppet i-00000153 output: OK - load average: 0.46, 2.26, 3.95
[07:31:33] PROBLEM Current Load is now: WARNING on wikistream-1 i-0000016e output: WARNING - load average: 4.60, 5.19, 5.11
[07:31:33] PROBLEM Current Load is now: WARNING on shop-analytics-main i-000001e6 output: WARNING - load average: 4.10, 4.95, 5.04
[07:31:33] PROBLEM Current Load is now: WARNING on test2 i-0000013c output: WARNING - load average: 5.74, 5.21, 5.37
[07:32:45] RECOVERY Current Load is now: OK on deployment-imagescaler01 i-0000025a output: OK - load average: 0.53, 2.33, 3.99
[07:32:55] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:32:55] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:32:55] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:32:55] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:33:25] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:40:05] RECOVERY Current Load is now: OK on kripke i-00000268 output: OK - load average: 3.84, 4.20, 4.74
[07:40:05] RECOVERY Current Load is now: OK on firstinstance i-0000013e output: OK - load average: 2.28, 3.92, 4.77
[07:40:06] RECOVERY Current Load is now: OK on deployment-apache23 i-00000270 output: OK - load average: 3.62, 4.15, 4.83
[07:40:31] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:40:36] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:40:36] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:40:36] PROBLEM SSH is now: CRITICAL on reportcard2 i-000001ea output: CRITICAL - Socket timeout after 10 seconds
[07:45:00] RECOVERY Current Load is now: OK on deployment-apache22 i-0000026f output: OK - load average: 0.22, 1.28, 3.29
[07:46:16] RECOVERY Current Load is now: OK on robh2 i-000001a2 output: OK - load average: 0.17, 1.76, 3.84
[07:46:20] RECOVERY Current Load is now: OK on wikistats-01 i-00000042 output: OK - load average: 0.17, 1.07, 3.37
[07:46:20] RECOVERY Current Load is now: OK on wikistream-1 i-0000016e output: OK - load average: 0.17, 1.70, 3.52
[07:46:20] RECOVERY Current Load is now: OK on shop-analytics-main i-000001e6 output: OK - load average: 0.20, 1.77, 3.59
[07:46:20] RECOVERY Current Load is now: OK on test2 i-0000013c output: OK - load average: 1.65, 3.31, 4.37
[07:49:32] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory
[07:49:32] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[07:49:50] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:51] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:51] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:49:51] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:33] PROBLEM host: incubator-bot2 is DOWN address: i-00000252 CRITICAL - Host Unreachable (i-00000252)
[07:58:11] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 5.50, 4.15, 4.77
[08:13:20] 05/27/2012 - 08:13:20 - Updating keys for laner at /export/home/deployment-prep/laner
[08:24:22] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in
[08:24:22] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 106 processes
[08:26:04] PROBLEM host: incubator-bot2 is DOWN address: i-00000252 CRITICAL - Host Unreachable (i-00000252)
[08:57:13] PROBLEM host: incubator-bot2 is DOWN address: i-00000252 CRITICAL - Host Unreachable (i-00000252)
[08:59:19] 05/27/2012 - 08:59:19 - Updating keys for laner at /export/home/deployment-prep/laner
[09:07:03] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.00, 0.85, 3.57
[09:07:23] RECOVERY host: incubator-bot2 is UP address: i-00000252 PING OK - Packet loss = 0%, RTA = 0.31 ms
[09:09:35] !log incubator Restarted incubator-bot2, not sure what killed it.
[09:09:37] Logged the message, Master
[09:12:16] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.87, 1.01, 2.89
[09:15:07] !log incubator Rebooted incubator-bot1 and bot2, upgraded some packages and system requires restart
[09:15:09] Logged the message, Master
[12:32:34] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[13:18:09] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[13:18:19] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[13:18:23] wtf...
[13:22:55] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK
[13:22:55] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 116 processes
[13:34:47] PROBLEM dpkg-check is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds.
[13:44:34] RECOVERY dpkg-check is now: OK on bots-cb i-0000009e output: All packages OK
[13:58:59] petan, deployment prep is down (Can't contact the database server: Host 'i-00000270.pmtpa.wmflabs' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts' (deployment-sql))
[14:04:03] That still isn't fixed?
[14:04:15] I was getting that yesterday
[14:13:05] New patchset: Platonides; "Update masters definition in mysql.pp from db.php" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9015
[14:13:20] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9015
[14:14:26] 05/27/2012 - 14:14:26 - Updating keys for hydriz at /export/home/incubator/hydriz
[14:14:33] 05/27/2012 - 14:14:33 - Updating keys for hydriz at /export/home/dumps/hydriz
[14:15:12] 05/27/2012 - 14:15:11 - Updating keys for hydriz at /export/home/bots/hydriz
[14:15:15] 05/27/2012 - 14:15:15 - Updating keys for hydriz at /export/home/bastion/hydriz
[14:20:08] New patchset: Platonides; "Update masters definition in mysql.pp from db.php" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9015
[14:20:23] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9015
[14:21:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9015
[14:21:06] Change merged: Bhartshorne; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9015
[14:24:25] PROBLEM Current Load is now: CRITICAL on incubator-bot0 i-00000296 output: Connection refused by host
[14:24:34] New patchset: Asher; "Revert "Update masters definition in mysql.pp from db.php"" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9016
[14:24:49] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9016
[14:24:53] New review: Asher; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9016
[14:24:56] Change merged: Asher; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9016
[14:25:05] PROBLEM Current Users is now: CRITICAL on incubator-bot0 i-00000296 output: Connection refused by host
[14:25:46] PROBLEM Disk Space is now: CRITICAL on incubator-bot0 i-00000296 output: Connection refused by host
[14:26:25] PROBLEM Free ram is now: CRITICAL on incubator-bot0 i-00000296 output: Connection refused by host
[14:27:35] PROBLEM Total Processes is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:28:15] PROBLEM dpkg-check is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:29:25] RECOVERY Current Load is now: OK on incubator-bot0 i-00000296 output: OK - load average: 0.46, 1.17, 0.91
[14:30:05] RECOVERY Current Users is now: OK on incubator-bot0 i-00000296 output: USERS OK - 1 users currently logged in
[14:30:54] RECOVERY Disk Space is now: OK on incubator-bot0 i-00000296 output: DISK OK
[14:31:15] RECOVERY Free ram is now: OK on incubator-bot0 i-00000296 output: OK: 90% free memory
[14:32:37] RECOVERY Total Processes is now: OK on incubator-bot0 i-00000296 output: PROCS OK: 89 processes
[14:38:07] RECOVERY dpkg-check is now: OK on incubator-bot0 i-00000296 output: All packages OK
[17:15:45] A mysql error a day keeps the spambots away, it would appear
[17:23:39] RECOVERY Puppet freshness is now: OK on deployment-jobrunner05 i-0000028c output: puppet ran at Sun May 27 17:23:34 UTC 2012
[17:24:20] 05/27/2012 - 17:24:20 - Updating keys for laner at /export/home/deployment-prep/laner
[17:50:12] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[20:55:18] beta down
[20:55:19] (Cannot contact the database server: Host 'i-00000270.pmtpa.wmflabs' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts' (deployment-sql))
[20:55:22] Sorry! This site is experiencing technical difficulties.
[20:55:25] http://commons.wikimedia.beta.wmflabs.org/wiki/Special:ListFiles
[20:55:31] beta up
[20:55:32] beta down
[20:55:36] wizziwazro ?
[20:55:39] No backend defined with the name `local-swift`.
[20:55:50] #0 /usr/local/apache/common-local/wmf-config/swift.php(135): FileBackendGroup->get('local-swift')
[20:55:58] hashar, petan
[21:21:02] PROBLEM Free ram is now: CRITICAL on bots-2 i-0000009c output: Critical: 5% free memory
[22:33:14] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[22:41:32] New review: Platonides; "This patchset had been intended for production, wrongly done in the test branch." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9015
[22:43:04] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 19% free memory