[00:23:45] PROBLEM Current Load is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:24:26] PROBLEM Current Users is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:25:05] PROBLEM Disk Space is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:25:45] PROBLEM Free ram is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:26:55] PROBLEM Total Processes is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:27:35] PROBLEM dpkg-check is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:34:55] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 3.18, 3.74, 5.00 [00:36:15] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours [00:44:35] RECOVERY Current Users is now: OK on mwreview-quux i-000002a5 output: USERS OK - 1 users currently logged in [00:45:45] RECOVERY Free ram is now: OK on mwreview-quux i-000002a5 output: OK: 87% free memory [00:46:55] RECOVERY Total Processes is now: OK on mwreview-quux i-000002a5 output: PROCS OK: 94 processes [00:47:35] RECOVERY dpkg-check is now: OK on mwreview-quux i-000002a5 output: All packages OK [00:47:35] RECOVERY Disk Space is now: OK on mwreview-quux i-000002a5 output: DISK OK [00:54:27] PROBLEM Current Users is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:55:23] PROBLEM Disk Space is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:55:42] PROBLEM Free ram is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:56:52] PROBLEM Total Processes is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:57:32] PROBLEM dpkg-check is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:58:52] PROBLEM Current Load is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [01:01:34] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [01:47:54] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 6.53, 6.33, 5.53 [02:18:54] RECOVERY Current Load is now: OK on mwreview-1 i-000002a6 output: OK - load average: 0.75, 0.75, 0.37 [02:19:55] RECOVERY Current Users is now: OK on mwreview-1 i-000002a6 output: USERS OK - 2 users currently logged in [02:20:40] RECOVERY Free ram is now: OK on mwreview-1 i-000002a6 output: OK: 83% free memory [02:20:55] RECOVERY Disk Space is now: OK on mwreview-1 i-000002a6 output: DISK OK [02:22:26] RECOVERY Total Processes is now: OK on mwreview-1 i-000002a6 output: PROCS OK: 106 processes [02:22:41] RECOVERY dpkg-check is now: OK on mwreview-1 i-000002a6 output: All packages OK [02:37:51] New patchset: Andrew Bogott; "Qualify the path to true." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9354 [02:38:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9354 [02:39:25] 05/30/2012 - 02:39:25 - Updating keys for laner at /export/home/deployment-prep/laner [02:56:26] RECOVERY Puppet freshness is now: OK on deployment-apache21 i-0000026d output: puppet ran at Wed May 30 02:55:04 UTC 2012 [02:56:27] 05/30/2012 - 02:56:27 - Updating keys for laner at /export/home/deployment-prep/laner [02:57:54] PROBLEM Total Processes is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [02:58:20] 05/30/2012 - 02:58:20 - Updating keys for laner at /export/home/deployment-prep/laner [02:58:39] PROBLEM dpkg-check is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [03:00:29] PROBLEM Current Load is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [03:00:39] PROBLEM Disk Space is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:39] PROBLEM Current Users is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:49] PROBLEM Free ram is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [03:02:20] 05/30/2012 - 03:02:20 - Updating keys for laner at /export/home/deployment-prep/laner [03:03:19] 05/30/2012 - 03:03:19 - Updating keys for laner at /export/home/deployment-prep/laner [03:05:21] 05/30/2012 - 03:05:20 - Updating keys for laner at /export/home/deployment-prep/laner [03:29:49] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 19% free memory [03:36:05] RECOVERY Current Load is now: OK on ganglia-test5 i-000002a7 output: OK - load average: 0.53, 1.50, 1.79 [03:36:05] RECOVERY Current Users is now: OK on ganglia-test5 i-000002a7 output: USERS OK - 1 users currently logged in [03:36:05] RECOVERY Disk Space is now: OK on ganglia-test5 i-000002a7 output: DISK OK [03:38:00] RECOVERY Total Processes is now: OK on ganglia-test5 i-000002a7 output: PROCS OK: 90 processes [03:38:48] RECOVERY dpkg-check is now: OK on ganglia-test5 i-000002a7 output: All packages OK [03:41:20] RECOVERY Free ram is now: OK on ganglia-test5 i-000002a7 output: OK: 88% free memory [03:53:04] New patchset: Andrew Bogott; "Don't do a git-clone if the rep is already present." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9356 [03:53:20] New patchset: Andrew Bogott; "Rename the wikimedia core git definition." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9357 [03:53:35] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9356 [03:53:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9357 [03:54:13] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9356 [03:54:52] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9357 [03:57:25] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.95, 4.09, 2.84 [04:01:05] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory [04:01:16] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 15% free memory [04:02:31] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 2.13, 2.87, 2.64 [04:02:31] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 16% free memory [04:16:01] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 13% free memory [04:16:11] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory [04:16:11] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [04:21:11] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:22:31] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 5% free memory [04:26:13] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory [04:32:33] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [04:35:56] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:45:56] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [05:13:01] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [05:54:11] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [06:43:46] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:51] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:01] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:01] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:03] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:56] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:56] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:56] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:15] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:16] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:15] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 13.77, 21.83, 17.71 [07:07:45] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:07:50] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:50] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:50] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:35] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:40] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:40] PROBLEM dpkg-check is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:45] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:50] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:50] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:55] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:52] PROBLEM Current Load is now: WARNING on gerrit i-000000ff output: WARNING - load average: 3.69, 6.47, 6.09 [07:11:52] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 6.42, 6.22, 5.73 [07:11:52] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 7.74, 8.22, 8.68 [07:11:52] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CRITICAL - load average: 57.51, 57.88, 37.08 [07:11:52] PROBLEM Current Load is now: WARNING on swift-be3 i-000001c9 output: WARNING - load average: 8.32, 6.86, 5.74 [07:11:52] PROBLEM Current Load is now: WARNING on swift-be2 i-000001c8 output: WARNING - load average: 8.42, 7.63, 6.52 [07:11:53] PROBLEM Current Load is now: WARNING on swift-be4 i-000001ca output: WARNING - load average: 6.43, 6.73, 6.66 [07:11:53] PROBLEM Current Load is now: WARNING on wep i-000000c2 output: WARNING - load average: 14.33, 9.83, 7.03 [07:11:57] PROBLEM dpkg-check is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused or timed out [07:12:19] PROBLEM Disk Space is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:19] PROBLEM Free ram is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:24] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:24] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM dpkg-check is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Current Users is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Disk Space is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Free ram is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Total Processes is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:12:49] PROBLEM dpkg-check is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:59] PROBLEM Current Load is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:59] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:14] PROBLEM Current Load is now: CRITICAL on bots-apache1 i-000000b0 output: CRITICAL - load average: 7.15, 14.11, 24.77 [07:13:44] PROBLEM Disk Space is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:44] PROBLEM Total Processes is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:49] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:50] PROBLEM Current Users is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:50] PROBLEM Total Processes is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:54] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [07:13:55] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:59] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM dpkg-check is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Total Processes is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Current Users is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Current Load is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Disk Space is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Free ram is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:14:14] PROBLEM Current Load is now: WARNING on deployment-jobrunner02 i-00000279 output: WARNING - load average: 10.58, 8.43, 6.45 [07:14:29] PROBLEM host: blamemaps-m1small is DOWN address: i-000002a1 CRITICAL - Host Unreachable (i-000002a1) [07:16:50] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 6.97, 6.49, 5.54 [07:16:50] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 5.99, 6.44, 6.23 [07:16:50] RECOVERY dpkg-check is now: OK on ipv6test1 i-00000282 output: All packages OK [07:16:50] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 88 processes [07:17:00] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: CRITICAL - Socket timeout after 10 seconds [07:17:14] RECOVERY Current Load is now: OK on gerrit i-000000ff output: OK - load average: 1.71, 3.73, 5.00 [07:17:14] RECOVERY Free ram is now: OK on bots-cb i-0000009e output: OK: 78% free memory [07:17:14] RECOVERY Current Users is now: OK on pybal-precise i-00000289 output: USERS OK - 0 users currently logged in [07:17:14] PROBLEM Current Load is now: WARNING on demo-web1 i-00000255 output: WARNING - load average: 4.21, 5.96, 5.20 [07:17:19] RECOVERY Free ram is now: OK on pybal-precise i-00000289 output: OK: 85% free memory [07:17:19] RECOVERY Disk Space is now: OK on pybal-precise i-00000289 output: DISK OK [07:17:19] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 7.00, 7.56, 6.75 [07:17:19] RECOVERY dpkg-check is now: OK on pybal-precise i-00000289 output: All packages OK [07:17:24] RECOVERY Total Processes is now: OK on pybal-precise i-00000289 output: PROCS OK: 95 processes [07:21:34] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 7.26, 9.83, 19.99 [07:21:34] RECOVERY Current Users is now: OK on bots-cb i-0000009e output: USERS OK - 1 users currently logged in [07:21:34] PROBLEM Current Load is now: WARNING on reportcard2 i-000001ea output: WARNING - load average: 10.17, 9.58, 7.94 [07:21:34] RECOVERY Total Processes is now: OK on bots-cb i-0000009e output: PROCS OK: 341 processes [07:21:39] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in [07:21:39] RECOVERY Disk Space is now: OK on reportcard2 i-000001ea output: DISK OK [07:23:38] RECOVERY host: blamemaps-m1small is UP address: i-000002a1 PING OK - Packet loss = 0%, RTA = 3.02 ms [07:23:38] PROBLEM HTTP is now: WARNING on mailman-01 i-00000235 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 498 bytes in 0.011 second response time [07:23:38] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK [07:25:29] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:46] RECOVERY Disk Space is now: OK on bots-cb i-0000009e output: DISK OK [07:26:46] PROBLEM Current Load is now: WARNING on worker1 i-00000208 output: WARNING - load average: 5.42, 6.08, 7.12 [07:27:39] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 7.16, 11.54, 11.16 [07:27:39] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 114 processes [07:27:44] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [07:27:44] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af output: All packages OK [07:27:44] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK [07:27:44] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in [07:27:44] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK [07:27:44] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 89% free memory [07:27:51] !ping [07:27:51] pong [07:28:01] hm [07:28:01] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.05, 10.04, 12.75 [07:28:13] poor io [07:28:18] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 3.54, 5.23, 6.50 [07:28:21] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:48] PROBLEM Current Load is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:08] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:30:30] PROBLEM Current Load is now: WARNING on robh2 i-000001a2 output: WARNING - load average: 6.70, 5.97, 5.45 [07:32:06] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:33:08] RECOVERY Current Load is now: OK on swift-be3 i-000001c9 output: OK - load average: 0.23, 1.73, 4.02 [07:33:08] RECOVERY Current Load is now: OK on wep i-000000c2 output: OK - load average: 0.71, 2.46, 4.90 [07:34:32] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 56% free memory [07:34:32] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [07:34:32] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [07:34:32] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 6.41, 9.56, 10.10 [07:34:32] RECOVERY dpkg-check is now: OK on mwreview-1 i-000002a6 output: All packages OK [07:34:32] RECOVERY Free ram is now: OK on mwreview-1 i-000002a6 output: OK: 70% free memory [07:34:32] RECOVERY Current Users is now: OK on mwreview-1 i-000002a6 output: USERS OK - 0 users currently logged in [07:34:33] PROBLEM Current Load is now: WARNING on mwreview-1 i-000002a6 output: WARNING - load average: 0.51, 5.24, 8.30 [07:34:34] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 121 processes [07:34:37] RECOVERY Disk Space is now: OK on mwreview-1 i-000002a6 output: DISK OK [07:34:37] RECOVERY Total Processes is now: OK on mwreview-1 i-000002a6 output: PROCS OK: 101 processes [07:34:47] RECOVERY Current Load is now: OK on deployment-jobrunner02 i-00000279 output: OK - load average: 0.29, 1.97, 4.25 [07:34:57] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [07:34:58] RECOVERY Disk Space is now: OK on bots-sql2 i-000000af output: DISK OK [07:34:58] RECOVERY Total Processes is now: OK on bots-sql2 i-000000af output: PROCS OK: 84 processes [07:35:03] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 91 processes [07:35:08] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [07:35:08] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [07:35:08] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 7.68, 7.93, 8.50 [07:35:08] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory [07:35:08] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:35:54] helllo [07:36:47] RECOVERY Current Load is now: OK on incubator-bot1 i-00000251 output: OK - load average: 1.24, 2.37, 4.73 [07:36:47] RECOVERY Current Load is now: OK on robh2 i-000001a2 output: OK - load average: 0.43, 2.75, 4.24 [07:36:47] PROBLEM Current Load is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:47] PROBLEM Free ram is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:47] PROBLEM Disk Space is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:47] PROBLEM Current Users is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:48] PROBLEM Total Processes is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:49] PROBLEM Disk Space is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:49] PROBLEM dpkg-check is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:36:49] PROBLEM Total Processes is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:37:49] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 1.41, 4.40, 15.41 [07:37:49] RECOVERY Current Load is now: OK on swift-be4 i-000001ca output: OK - load average: 0.51, 1.79, 4.46 [07:37:49] RECOVERY Current Load is now: OK on demo-web1 i-00000255 output: OK - load average: 0.47, 2.77, 4.30 [07:37:58] PROBLEM Current Users is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:37:58] PROBLEM Free ram is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:17] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 4.34, 3.56, 4.69 [07:41:17] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:37] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:42:21] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 87% free memory [07:42:37] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:42:37] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:42:37] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:43:41] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 1.04, 2.04, 4.16 [07:43:41] RECOVERY Current Load is now: OK on swift-be2 i-000001c8 output: OK - load average: 0.29, 1.28, 3.91 [07:43:41] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:43:43] baaa [07:43:44] spaam [07:43:46] me mroe [07:47:01] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 4.28, 1.88, 15.97 [07:47:01] RECOVERY Current Load is now: OK on mwreview-1 i-000002a6 output: OK - load average: 0.16, 0.72, 3.92 [07:47:38] PROBLEM Current Load is now: WARNING on maps-tilemill1 i-00000294 output: WARNING - load average: 3.20, 6.60, 8.48 [07:47:38] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK [07:47:46] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:47:47] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:47:47] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:48:14] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 1.18, 2.00, 4.09 [07:48:14] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 89 processes [07:48:19] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 85% free memory [07:51:02] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 94 processes [07:52:12] PROBLEM Current Load is now: WARNING on ganglia-test5 i-000002a7 output: WARNING - load average: 4.10, 10.21, 11.03 [07:52:12] RECOVERY Disk Space is now: OK on ganglia-test5 i-000002a7 output: DISK OK [07:52:12] RECOVERY Free ram is now: OK on ganglia-test5 i-000002a7 output: OK: 83% free memory [07:52:12] RECOVERY Current Users is now: OK on ganglia-test5 i-000002a7 output: USERS OK - 0 users currently logged in [07:52:12] RECOVERY Total Processes is now: OK on ganglia-test5 i-000002a7 output: PROCS OK: 180 processes [07:52:17] RECOVERY dpkg-check is now: OK on ganglia-test5 i-000002a7 output: All packages OK [07:57:23] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.30, 1.23, 4.65 [08:01:02] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 3.33, 3.15, 4.57 [08:03:39] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 1.30, 1.13, 4.11 [08:06:09] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.09, 0.20, 4.62 [08:08:07] RECOVERY Current Load is now: OK on ganglia-test5 i-000002a7 output: OK - load average: 1.33, 1.11, 4.34 [08:15:57] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.71, 0.90, 3.30 [08:21:05] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.39, 1.01, 2.74 [08:42:33] !log deployment prep going to upgrade core / extensions to latest master [08:42:37] deployment is not a valid project. [08:42:42] seriously [08:43:13] !log deployment-prep {{bug|37199}} going to upgrade core / extensions to latest master [08:43:16] Logged the message, Master [08:43:25] and breaking stuff cool [08:45:31] hahh [08:49:16] !log deployment-prep hashar: updating core to 8c65834 [08:49:17] Logged the message, Master [08:51:17] so slooow [08:52:50] 05/30/2012 - 08:52:50 - Updating keys for knissen at /export/home/mediawiki-custom-de/knissen [08:52:58] !log bots petrb: patching wm bot [08:52:59] Logged the message, Master [09:15:43] !log deployment-prep hashar: foreachwiki update.php --quiet --quick [09:15:46] Logged the message, Master [09:17:21] !log deployment-prep Restarted update.php in a screen session [09:17:22] Logged the message, Master [09:21:55] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 1.70, 3.90, 4.91 [09:33:14] New review: Hashar; "I do not understand what you would prefer sorry :-( By 'this being done on call site', do you mean ..." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8575 [09:39:55] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.41, 6.69, 5.49 [09:44:29] PROBLEM Puppet freshness is now: CRITICAL on nova-precise1 i-00000236 output: Puppet has not run in last 20 hours [09:49:39] PROBLEM Puppet freshness is now: CRITICAL on nova-essex-test i-000001f9 output: Puppet has not run in last 20 hours [09:58:36] The username Hashar is not registered on this wiki, but it does exist in the unified login system. 
[09:58:37] yeahh [09:58:42] cryptic message [09:59:29] PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours [10:13:34] Bug 37216 - can not reset user password | https://bugzilla.wikimedia.org/show_bug.cgi?id=37216 [10:13:45] if anyone has any idea, feel free to comment on that bug report [10:13:47] I am off for lunch [10:36:25] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours [11:38:15] 05/30/2012 - 11:38:15 - Creating a home directory for knissen at /export/home/bastion/knissen [11:38:16] !log bastion adding Kai Nissen (WMFDE) to bastion [11:38:17] Logged the message, Master [11:50:25] back [12:13:56] PROBLEM Free ram is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:14:55] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 14% free memory [12:15:15] PROBLEM Total Processes is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:15:45] PROBLEM dpkg-check is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:16:55] PROBLEM Current Load is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:17:45] PROBLEM Current Users is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:18:15] PROBLEM Disk Space is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:24:06] hashar: can you modify mw so that we have debug symbol somewhere in source of each page [12:24:14] that one we have in logs [12:24:23] http://oc.wikipedia.beta.wmflabs.org/wiki/Crash_1 [12:24:33] what do you mean by debug symbol ? [12:24:33] I replicated the problem from production but I can't get related error [12:24:47] you might have them in /home/wikipedia/log [12:25:05] that string we have in logs [12:25:10] in prefix [12:25:10] in production I am 99% sure it is just an out of memory error [12:25:20] ok but I would like to see the logs [12:25:27] anyway OOM means it's a bug [12:25:33] there is probably a loop in parser [12:25:38] probably [12:25:40] which allocate so much memory [12:25:49] will have to reproduce it locally to get all the tools [12:25:53] hashar: I mean that string we have in logs [12:26:04] "that string we have in logs" does not mean anything to me [12:26:08] you will have to be more precise [12:26:12] anyway [12:26:16] since it is a fatal error [12:26:22] the only thing you can get is something like: [12:26:27] ooh [12:26:29] you removed it [12:26:42] PHP Fatal Error: trying to allocate 20192bytes (max 128MB) at includes/parser/Parser.php line XXX [12:26:46] which is not really helpful [12:26:52] no it's not [12:26:59] but in past we had very detailed logs [12:27:03] where are they [12:27:35] cat /usr/local/apache/common-local/wmf-config/logs.php [12:27:44] this file was prefixing logs with certain string [12:27:49] that's the string I talk about [12:28:11] $randomHash = $wgDBname . '-' . 
substr( md5(uniqid()), 0, 8 ); [12:28:30] it helped me to see which lines correspond to which execution [12:28:42] because logs were mixed from many various php runs [12:28:50] ohhh [12:28:52] hmm [12:28:54] if I could see that in source code of page [12:28:56] $wgDebugFile I guess [12:29:01] I could grep it out of detailed logs [12:29:10] and see exactly why it crashed [12:29:15] $wgDebugLogFile = "udp://$wmfUdp2logDest/catchall"; [12:29:16] [12:29:17] and for CLI $wgDebugLogFile = "udp://$wmfUdp2logDest/cli"; [12:29:24] ok, where is that [12:29:26] OH MY GOD [12:29:30] o.o [12:29:30] stupid udp2log [12:29:49] that is sent to deployment-feed which has the udp2log daemon [12:29:55] ok [12:30:00] where that one store it [12:30:53] /home/wikipedia/logs [12:31:01] petrb@deployment-dbdump:~$ ls /home/wikipedia/logs/ [12:31:01] archive CentralAuth.log dberror.log slow-parse.log testwiki.log xff.log [12:31:03] so the logs files should be /home/wikipedia/logs/catchall.log [12:31:04] which one [12:31:05] but that is not there [12:31:07] the sucker [12:31:09] grrr [12:31:37] btw, what you think of that idea with debug string in html source code [12:31:46] like [12:31:56] so that I could just grep dafdasfas [12:31:58] from log [12:32:13] I think it would be useful to debug such problems [12:32:19] or is there any better way [12:32:31] hmm [12:32:42] every time the page crash I would open the source code and grab the symbol [12:32:53] so that I could see why it was [12:33:25] that was original reason for that prefixes in logs [12:33:34] I know [12:33:35] I just didn't know any proper way to implement it to mediawiki core [12:33:37] that is not the issue [12:34:09] PROBLEM Free ram is now: UNKNOWN on wmde-test i-000002a8 output: Invalid host name i-000002a8 [12:34:09] PROBLEM Puppet freshness is now: CRITICAL on localpuppet2 i-0000029b output: Puppet has not run in last 20 hours [12:34:53] > var_dump( $wgDebugLogFile ); [12:34:53] string(0) "" [12:34:55] yeahhh [12:38:59] PROBLEM host: wmde-test is DOWN address: i-000002a8 check_ping: Invalid hostname/address - i-000002a8 [12:42:01] ahh /data/project/cli.log [12:42:03] interesting [12:42:41] so I moved that to udp2log [12:42:49] petan|wk: /home/wikipedia/log/cli.log [12:42:56] it is not passing through udp2log [12:43:01] hm [12:43:25] ok, can we have a full log [12:44:10] hashar: are you sure we want to have /home/wikipedia on labs-nfs1 [12:44:16] yeah [12:44:19] I would prefer to get it to gluster [12:44:23] or local nfs [12:44:37] there is a bug to get /home/wikipedia to a better place [12:44:44] ok [12:44:52] for now it is labs-nfs1:/export/home/deployment-prep/wikipedia on /home/wikipedia [12:44:54] PROBLEM Current Load is now: CRITICAL on wmde-test1 i-000002a9 output: Connection refused by host [12:44:57] it's as easy as creating a symlink [12:45:01] naa [12:45:04] hm... [12:45:08] at least symlink for logs [12:45:14] PROBLEM Current Users is now: CRITICAL on wmde-test1 i-000002a9 output: Connection refused by host [12:45:26] because if they get huge, Ryan will eat you [12:45:32] no it is ok [12:45:35] hm [12:45:38] I have negotiated that with Ryan [12:45:48] ok, can we enable full logs for few mins [12:45:52] we have /home/wikipedia being used just in production (i.e. 
with MW checkouts + logs) [12:45:54] these we had in past [12:45:55] PROBLEM Disk Space is now: CRITICAL on wmde-test1 i-000002a9 output: Connection refused by host [12:45:59] for now it is on labs-nfs1 [12:46:04] yes but in prod you don't have it mounted to labs-nfs [12:46:05] :D [12:46:08] ops will find us a better solution later on [12:46:14] but for now they are aware of it [12:46:17] ok [12:46:34] PROBLEM Free ram is now: CRITICAL on wmde-test1 i-000002a9 output: CHECK_NRPE: Error - Could not complete SSL handshake. [12:47:44] PROBLEM Total Processes is now: CRITICAL on wmde-test1 i-000002a9 output: CHECK_NRPE: Error - Could not complete SSL handshake. [12:47:48] petan|wk: and I think one of the issues was to avoid gluster entirely [12:47:54] yes [12:47:57] it was [12:47:59] aka we do not want logs to be written to /data/project (which is gluster) [12:48:02] anyway I want the logs! [12:48:03] :D [12:48:07] yeah working on it [12:48:14] you got cli one in /home/wikipedia/logs/cli.log [12:48:19] looking for the apaches one now [12:48:24] PROBLEM dpkg-check is now: CRITICAL on wmde-test1 i-000002a9 output: CHECK_NRPE: Error - Could not complete SSL handshake. [12:48:27] but I need logs from crash I got in browser [12:48:32] not cli [12:50:00] 'default' => '', [12:50:01] oh yeah [12:50:02] aghahah [12:51:10] :u [12:51:31] :u) that is my new smiley [12:51:42] :n) [12:54:24] stupid labs I/O [12:55:14] moving include ("logs.php"); to CommonSettingsDeploy.php [12:56:45] petan|wk: [12:56:55] ok [12:56:57] you should get logs in /home/wikipedia/log/catchall.log [12:57:35] yay [12:57:56] cool [12:57:58] will rename it apache.log [12:58:00] but it doesn't work :( [12:58:12] I just did tail -f [12:58:14] nothing happens [12:58:33] well web [12:59:41] PROBLEM host: wmde-test1 is DOWN address: i-000002a9 CRITICAL - Host Unreachable (i-000002a9) [13:00:00] ok it does [13:00:03] but not for my page [13:00:05] :/ [13:00:15] why it crash meh [13:00:21] I wan't stack trace [13:00:26] want [13:00:53] hashar: it's gone [13:00:56] file [13:00:59] what it ? [13:01:05] yeah renamed catchall.log to web.log [13:01:07] sorry ;-D [13:01:11] ah [13:01:21] so now we have cli.log and catchall.log [13:01:41] they are written by "udp2log" daemon on "deployment-feed" [13:01:53] hashar: how long does it take for it to be written [13:01:58] so that I can read it [13:02:09] when I refresh the page, I don't see it [13:02:17] ok, now I do [13:02:25] yeah I/O sucks [13:03:39] wtf [13:03:47] in logs it looks like the page was rendered [13:03:53] but in browser I get blank page [13:03:58] ocwiki-ab26d65b [13:04:54] PROBLEM host: wmde-test is DOWN address: i-000002aa CRITICAL - Host Unreachable (i-000002aa) [13:05:14] ocwiki-ab26d65b: 2.0590 126.2M [13:05:18] what is memory limit [13:05:21] no idea [13:05:23] 128M? [13:05:27] I think so [13:05:28] probably [13:05:33] because it crashed at 126 [13:05:39] there must be a fatal error in apaches log [13:05:43] but we have no syslog server yet [13:05:44] :-(( [13:05:47] somewhere 
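The debugging workflow petan and hashar converge on above boils down to two checks: grep the aggregated udp2log output for the per-request prefix that logs.php prepends to every line, and confirm the PHP memory ceiling the parse is hitting. A minimal command-line sketch, assuming the paths and the `<dbname>-<hash>` prefix format quoted above (the log file was renamed during this session, so adjust the names):

```bash
# Pull every debug line for a single request out of the udp2log output.
# REQUEST_ID is the prefix MediaWiki builds from
#   $wgDBname . '-' . substr( md5( uniqid() ), 0, 8 )
# e.g. the "ocwiki-ab26d65b" seen above.
REQUEST_ID='ocwiki-ab26d65b'
LOG_DIR='/home/wikipedia/logs'        # catchall.log vs web.log changed during the session
grep -h -- "$REQUEST_ID" "$LOG_DIR"/catchall.log "$LOG_DIR"/cli.log 2>/dev/null | less

# Confirm the memory ceiling behind the "crashed at 126.2M, limit probably
# 128M" guess; the /etc/php5 path assumes a stock Ubuntu PHP 5 install.
php -r 'echo ini_get("memory_limit"), "\n";'
grep -Rn '^memory_limit' /etc/php5/ 2>/dev/null
```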
[13:16:24] PROBLEM Free ram is now: CRITICAL on wmde-test1 i-000002ab output: Connection refused or timed out [13:17:23] paravoid: ping ping so I am confused about apache::monitor puppet class https://gerrit.wikimedia.org/r/8575 [13:17:33] I am not sure what you are expecting [13:17:37] I was *just* looking at that [13:17:45] \O/ [13:17:47] I clicked the link then got the irssi notify [13:17:54] haha [13:17:57] go figure [13:18:02] telepathy might exist after all [13:18:36] ok, so [13:18:50] first of all, $::cluster is a typo, right? [13:18:55] and you meant $::realm? [13:20:24] RECOVERY host: wmde-test is UP address: i-000002ad PING OK - Packet loss = 0%, RTA = 3.36 ms [13:20:44] PROBLEM dpkg-check is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:21:24] PROBLEM Current Load is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:21:55] paravoid: yup cluster is a typ [13:21:56] o [13:22:04] PROBLEM Current Users is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:22:15] that is cause $cluster is what is used in mw php files [13:22:20] will definitely fix it [13:22:44] PROBLEM Disk Space is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:22:51] okay [13:23:13] the thing about admin:: [13:23:14] PROBLEM Free ram is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:23:21] is that these classes do not work in labs iirc [13:23:38] because they try to instantiate accounts, which fails due to the ldap/nfs setup [13:24:00] so you have to check for the realm there anyway [13:24:04] why would we want them to work? [13:24:06] PROBLEM SSH is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [13:24:34] PROBLEM Total Processes is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:24:36] we don't [13:25:01] * Ryan_Lane nods [13:25:03] why do you talk about admins:: ? [13:25:04] but hashar is adding some labs things to a class that has those, and I'm saying it won't work [13:25:06] ah [13:25:10] * Ryan_Lane nods [13:25:28] I really kind of wish we installed all users on all systems, and managed access via groups [13:25:29] my change is to adapt the apaches::monitoring class which is already deployed on labs AFAIK [13:25:51] then this would be a single change [13:26:14] hashar: have a look at imagescaler / imagescaler::labs [13:26:17] still not ideal [13:26:31] but see how imagescaler has admins & foo and it's for prod [13:26:46] and imagescaler::labs has only the bits needed for labs [13:26:58] so there's no point in adding apache::monitoring::labs to imagescaler [13:27:02] it won't work anyway [13:27:48] I don't get it sorry [13:27:49] :-( [13:28:00] still fail to see how it relate with apache::monitoring [13:28:16] you're adding » » apaches::monitoring::labs to imagescaler [13:28:18] what's the point? [13:28:54] RECOVERY SSH is now: OK on wmde-test i-000002ad output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:29:02] monitoring the imagescaler apache ? [13:29:06] (in labs) [13:29:20] ohh [13:29:26] that should be added to imagescaler::labs [13:29:27] k [13:29:32] that's not the point [13:29:44] it won't work for *any* of those classes [13:29:59] homeless, home-no-service, bits [13:30:36] hashar: regarding my bug from earlier, right now imagescaler extracts thumbs from videos not jobrunner, jobrunner only transcodes videos [13:30:43] i.e. 
if you go to http://commons.wikipedia.beta.wmflabs.org/w/thumb.php?f=Mayday2012-edit-1.ogv&width=200 [13:30:54] where is that image extracted? [13:31:43] j^: that is from the apaches which are indeed still running Lucid [13:32:08] j^: hence they have an old ffmpeg installation. We would have to get the thumb redirected to an imagescaler using Precise [13:32:24] j^: feel free to reopen the bug ;) [13:33:16] paravoid: I am sorry but I don't understand at all [13:33:40] you're adding apaches::monitoring::labs to the class named "bits" [13:33:42] paravoid: the reason I did change 8575 was because the labs Nagios complained about Apache giving a 403 [13:34:00] the root cause being the apaches being monitored were missing the en.wikipedia.org virtual host since they have en.wikipedia.beta.wmflabs.org [13:34:05] (and in the classes named "homeless" and "home-no-serivce") [13:34:10] right? [13:34:22] yes [13:34:31] it won't work [13:34:36] those classes do not work in labs [13:34:44] and that is the class I am using on labsconsole to configure the labs Apaches [13:34:55] which one? [13:35:21] apache20 for example https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&project=deployment-prep&instanceid=i-0000026c [13:35:23] it has: [13:35:35] applicationserver::homeless & imagescaler [13:35:57] puppet will fail with those [13:36:09] it might do some of the things but it will spew a lot of errors [13:36:17] try running puppetd -vt [13:36:20] well it surely fails applying the admins:: accounts::l0nupdate lvs::real server classes (for example) [13:36:28] exactly [13:36:38] but I am pretty sure it does apply apaches::service / apaches::cron / nfs::upload [13:36:53] well, it might, but it's not right nevertheless [13:36:58] we should have separate classes for labs [13:37:02] NOW i understand :-]]]]]]]]] [13:37:07] that do not have the admin:: accounts:: etc. [13:37:13] and that they have the labs monitoring bits [13:38:09] so my next step was to ask how to get ignore admins:: classes on labs ;-] [13:38:10] which is https://gerrit.wikimedia.org/r/#/c/8642/ [13:38:40] which safeguard admins::* so they only do something when on $::realm == 'production' [13:39:46] so as a summary, you on 8575 you were warning me about not using applicationserver::homeless class cause of the include admins::* [13:39:59] well [13:40:00] kind of :) [13:40:04] I don't like 8642 at all [13:40:07] hehe [13:40:14] I have sent it to discuss about it ;D [13:40:14] New patchset: Hashar; "only apply admins::* while on production" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8642 [13:40:22] they are related to each other [13:40:23] (rewrote commit message) [13:40:25] what I'm saying is [13:40:30] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8642 [13:40:37] see imagescaler::labs [13:40:41] New review: Hashar; "redid commit message" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8642 [13:40:54] so basically should I create a new application server::labs ? 
[13:40:59] I think so, yes [13:41:01] applicationserver::labs [13:41:06] that includes the labs monitoring bits [13:41:11] and does not include the admins:: foo [13:41:19] and anything else labs-specific that we might want to do [13:41:38] we shouldn't put if/elses all over the tree [13:41:43] this would mess it badly imho [13:42:09] when you see "include admin::accounts", you expect it to create accounts, not *conditionally* create accounts [13:42:14] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 5.08, 4.64, 4.94 [13:42:30] totally [13:42:34] so [13:43:24] Change abandoned: Hashar; "if / else clutter puppet. We do not want to conditionally create accounts. Instead, we should use a..." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8642 [13:44:31] so, [13:44:43] 1) leave apache::monitoring & applicationserver (and others) as is [13:44:58] 2) create apache::monitoring::labs that do the monitoring in labs [13:45:28] 3) create new applicationserver::labs role classes that include apache::monitoring::labs and any other classes that they should (but not admins:: accounts::) [13:45:45] does it make any sense? [13:46:17] yup :-) [13:46:39] do you also like it? :) [13:47:38] yes ;-) [13:50:32] and at a later point [13:50:49] I'd say that if applicationserver & applicationserver::labs share things [13:51:00] they should be abstracted and have a parent class from which they inherit [13:51:22] either applicationserver::base -> applicationserver & applicationserver::base -> applicationserver::labs [13:51:28] if we want to keep the current notation [13:51:39] or applicationserver -> applicationserver::production & applicationserver -> applicationserver::labs [13:51:45] if we want to change it at some point [13:51:53] (i'd go with the first for starters) [13:52:56] I don't understand why admins:: accounts::l0nupdate isn't a system account [13:56:14] New review: Bhartshorne; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7985 [13:56:18] Change merged: Bhartshorne; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7985 [13:58:18] offtopic: http://www.lucas-nussbaum.net/blog/?p=718 [13:58:29] distribution qa using the cloud [13:58:43] spawn Amazon EC2 instances that do full rebuilds of the Debian archives to find bugs [13:59:09] yes please. [13:59:44] he's found hundreds (or thousands) of bugs [13:59:51] perfect use of spot instances too. [13:59:57] and it especially helps when e.g. moving to a new gcc [14:00:12] "how many breakages should we expect if we switch to gcc-4.7?" [14:00:14] poof [14:01:12] is there a wiki page on how to recover after I break my git repo by running 'git pull' when I have a committed but unmerged change? [14:06:33] New patchset: Hashar; "(bug 37046) fix apache monitoring on deployment-prep" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8575 [14:06:37] paravoid: ^^^ [14:06:48] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8575 [14:07:23] New review: Faidon; "Great!" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8575 [14:07:26] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8575 [14:07:29] maplebed: mind pasting result of git log --decorate --graph --oneline ?? [14:07:31] wow, so much cleaner! [14:07:48] sure! [14:07:53] paravoid: trying on labs now [14:08:44] umm... hashar, despite the --oneline flag, the output is, in fact, 4883 lines. 
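Stepping back before the git tangent continues below: the three-step plan paravoid lays out above (leave the production classes untouched, add a labs monitoring class, and wrap the labs-safe pieces in a new role class) could look roughly like the sketch that follows. The class and include names are the ones quoted in this conversation (monitor_service, apaches::service, apaches::cron, nfs::upload, the en.wikipedia.beta.wmflabs.org vhost); the check parameters are placeholders, not the change that was actually merged.

```puppet
# Labs flavour of the apache monitoring: point the HTTP check at the beta
# virtual host instead of en.wikipedia.org. check_command is a placeholder.
class apaches::monitoring::labs {
    monitor_service { 'appserver http':
        description   => 'Apache HTTP',
        check_command => 'check_http_url!en.wikipedia.beta.wmflabs.org!/',
    }
}

# Role class for the deployment-prep apaches: only the production pieces known
# to apply cleanly on labs, plus the labs monitoring, and deliberately no
# admins:: / accounts:: includes (those fail under the labs LDAP/NFS setup).
class applicationserver::labs {
    include apaches::service
    include apaches::cron
    include nfs::upload
    include apaches::monitoring::labs
}
```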
[14:08:53] how much do you want? all? [14:10:00] --oneline is one line per commit :) [14:10:09] I know, I was making a funny. [14:10:15] I think I know how much to paste though [14:10:22] gimme a sec. [14:11:36] hashar: http://pastebin.com/gcZxfRS5 [14:11:49] it first diverges at * | | 421cc6c cleaning incoming URLs a little bit to increase hit ratio [14:12:10] then I did a pull, made another change, then merged the cleaning change in gerrit, then pulled again. [14:12:33] so [14:13:04] yeah [14:13:09] you should use: git pull --rebase [14:13:20] instead of git pull [14:13:24] or use branches ;-D [14:13:54] yeah, I know I should have used a branch but I forgot the first change hadn't merged yet. [14:14:02] I've never tried git pull --rebase. [14:14:15] will it magically do the right thing or do I need to go find commits and reorder them or somethin? [14:15:04] let me analyze your case ;) [14:17:38] hmm [14:17:49] I am trying to find a nice way to list commits which are in test and not in origin/test [14:18:03] I guess that should be : git log --oneline origin/test..test [14:18:28] a0c2bb2 Merge branch 'test' of ssh://gerrit.wikimedia.org:29418/operations/puppet into test [14:18:28] d386aeb creating second labs swift cluster for testing the swift upgrade from 1.4.3 to 1.4.7+swiftstack [14:18:28] 29f8572 Merge branch 'test' of ssh://gerrit.wikimedia.org:29418/operations/puppet into test [14:18:36] so you could create a new branch: git checkout -b feature -t origin/test [14:18:56] and cherry-pick the commits from `git log --oneline origin/test..test` in the new branch "feature" [14:19:12] though you don't want to cherry pick the merge commit [14:19:20] so maybe that is only d386aeb [14:19:24] yup. [14:19:43] and probably 421cc6c [14:19:55] git log --oneline --no-merges origin/test..test might help [14:20:03] ophohhhhohoh [14:20:08] git cherry origin/test test [14:20:20] that is the magic command that list commits in a branch and not in another [14:20:24] git-cherry - Find commits not merged upstream [14:20:30] discovered that one last week [14:20:46] it's only one line - the d386aeb change. [14:21:23] do you know off hand the syntax for the cherry pick or should I look it up? [14:21:36] git cherry-pick [14:21:44] git-cherry pick 421cc6c [14:21:51] git cherry-pick d386aeb [14:22:04] though you might be able to pass several sha1 as arguments [14:22:05] only the d386, not 421. [14:22:12] 421's already merged in gerrit. [14:22:35] now I try to push to gerrit again? [14:23:05] New patchset: Hashar; "rename monitor_service in apaches::monitoring::labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9391 [14:23:07] just to be sure: git log --oneline --decorate --graph [14:23:21] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9391 [14:23:24] you should see the change in (test) just above a line with (origin/test) [14:23:44] yes! [14:23:44] paravoid: so the apache monitoring add a duplicate entry. https://gerrit.wikimedia.org/r/9391 fix it (commit message contains puppet error) [14:23:52] maplebed: so you should be safe ;) [14:23:58] New patchset: Bhartshorne; "creating second labs swift cluster for testing the swift upgrade from 1.4.3 to 1.4.7+swiftstack" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9392 [14:24:00] hashar: you are amazing! [14:24:01] there might be another way to fix the issue you had [14:24:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9392 [14:24:14] that was SO much easier than the last time I tried to heal that kind of mess. [14:24:18] but creating a new branch + cherry picking is usually the most straightforward [14:24:21] I gotta write this shit down. [14:24:34] http://www.mediawiki.org/wiki/Gerrit ? ;) [14:24:37] (after I merge my change) [14:24:46] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9391 [14:24:50] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9391 [14:24:53] hmm I think we have a GIT FAQ somewhere [14:24:58] paravoid: danke [14:24:58] isn't it https://labsconsole.wikimedia.org/wiki/Git ? [14:25:06] oh there is that one too [14:26:28] maplebed: Oh and I have a nice list of aliases at http://www.mediawiki.org/wiki/Git/aliases [14:27:15] ohh, pretty... [14:27:17] http://www.mediawiki.org/wiki/Git/aliases#log-fancy [14:27:20] that one is a must have [14:27:26] I have it named LG [14:27:29] shorter to write [14:27:32] so I just: git lg [14:27:38] or: git lg --no-merges [14:27:47] or: git lg --no-merges test..production [14:28:24] you also need to use bash completion for git [14:28:55] so you just have to: git lg --no-m te..pr [14:29:15] (it completes on aliases names, git options, branches and refs) [14:30:50] !log deployment-prep hashar: migrating apache boxes from applicationserver::homeless to the new applicationserver::labs [14:30:51] Logged the message, Master [14:31:20] 05/30/2012 - 14:31:20 - Updating keys for laner at /export/home/deployment-prep/laner [14:32:12] hashar: thanks again. I've grabbed the transcript for wikification but my battery's about dead. [14:32:27] maplebed: enjoy sun in park so :-] [14:32:31] I'll pick it up later tonight or something. [14:32:39] ok ;) [14:32:41] see you tomorrow so [14:32:42] sadly no sun. only clouds. 
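The git recovery maplebed and hashar just walked through condenses to a short sequence: list the local commits that never made it upstream, replay them onto a clean branch tracking the remote, and push that for review. A recap of the commands quoted above; the branch name and the d386aeb hash are the examples from the pastebin, and the `lg` alias shown is a simplified stand-in for the fancier one on the linked Git/aliases page.

```bash
# Commits in the local branch but not upstream (merge commits excluded).
git cherry origin/test test
git log --oneline --no-merges origin/test..test   # equivalent view

# Replay only the unmerged work onto a clean branch tracking upstream.
git checkout -b feature -t origin/test
git cherry-pick d386aeb
git log --oneline --decorate --graph              # sanity check before pushing for review

# For next time: rebase on pull (or start a topic branch up front) so the
# local merge commits never appear in the first place.
git pull --rebase
git config --global alias.lg 'log --graph --oneline --decorate'
```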
[14:32:44] :) [14:32:47] cause I will take care of my wife and daughter this evening [14:33:20] 05/30/2012 - 14:33:19 - Updating keys for laner at /export/home/deployment-prep/laner [14:34:20] 05/30/2012 - 14:34:20 - Updating keys for laner at /export/home/deployment-prep/laner [14:34:35] PROBLEM HTTP is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [14:35:20] 05/30/2012 - 14:35:19 - Updating keys for laner at /export/home/deployment-prep/laner [14:36:20] 05/30/2012 - 14:36:20 - Updating keys for laner at /export/home/deployment-prep/laner [14:36:39] next [14:36:43] fix exim :_D [14:36:50] labs machine send all their mails to mchenry [14:37:19] 05/30/2012 - 14:37:19 - Updating keys for laner at /export/home/deployment-prep/laner [14:39:01] !log deployment-prep Migrating apaches from imagescaler class to imagescaler::labs [14:39:02] Logged the message, Master [14:43:05] go hashar go :) [14:43:23] still a lot to do :-D [14:43:32] but at least the job queue should be fine now ;-D [14:43:43] it had a nasty bug that basically killed the beta project [14:44:11] faidon managed to port our wikimedia packages to the latest ubuntu version and so we finally have a ffmpeg version suitable for TMH [14:44:12] hurrah [14:44:23] still have to figure out how to resubmit a video for transcoding [14:44:40] err: /Stage[main]/Nfs::Upload/Mount[/mnt/thumbs]: Could not evaluate: Execution of '/bin/mount -o bg,soft,tcp,timeo=14,intr,nfsvers=3 /mnt/thumbs' returned 32: mount.nfs: mounting deployment-nfs-memc:/mnt/export/thumbs failed, reason given by server: [14:44:41] bah [14:44:45] no thumbs ;-D [14:44:57] paravoid: there is almost no more errors now :-]] [14:48:46] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9354 [14:48:48] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9356 [14:48:49] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9357 [14:48:50] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9354 [14:49:31] New patchset: Andrew Bogott; "Added imagemagick to labsmw." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9394 [14:49:46] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9394 [14:49:48] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9394 [14:49:50] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9394 [14:50:34] Thehelpfulone: Can you tell me a bit about the 'global education' instance? [14:51:23] sure, http://education.wmflabs.org/wiki/Main_Page is the wiki, it's used for testing of a new extension for the Wikipedia Education Program [14:51:35] Thehelpfulone: I'm hoping to puppetize a similar install. So, wondering what extensions you used, what kind of extra config you did in addition to the default wikimedia self-install. [14:51:50] ah, well JeroenDeDauw did all the setup, so he would be the one to ask [14:52:20] ok. What's his irc nick? [14:52:52] JeroenDeDauw ;) [14:52:56] he's in this channel :P [14:53:08] Ok, no doubt he'll appear shortly then. thx [14:53:11] np :) [14:54:42] !log deployment-prep Updating mediawiki/core to 9780085 (aka just https://gerrit.wikimedia.org/r/#/c/9397/ which fix a wrong class name in job system) [14:54:44] Logged the message, Master [14:54:55] 13856 ? 
DN 0:00 php MWScript.php runJobs.php --wiki=?Fatal error: Class 'JobQueue' not found in /usr/local/apache/common-local/php-trunk/maintenance/nextJobDB.php on line 97 --procs=5 [14:54:58] that is not good ;) [14:58:41] fixed [14:58:46] so it is processing job queue again [15:04:42] PROBLEM HTTP is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [15:09:22] PROBLEM dpkg-check is now: CRITICAL on mwreview-1 i-000002a6 output: DPKG CRITICAL dpkg reports broken packages [15:11:22] RECOVERY Current Load is now: OK on wmde-test i-000002ad output: OK - load average: 0.96, 0.36, 0.12 [15:11:22] RECOVERY Free ram is now: OK on wmde-test i-000002ad output: OK: 91% free memory [15:12:02] RECOVERY Current Users is now: OK on wmde-test i-000002ad output: USERS OK - 0 users currently logged in [15:14:32] RECOVERY Total Processes is now: OK on wmde-test i-000002ad output: PROCS OK: 85 processes [15:15:52] RECOVERY Disk Space is now: OK on wmde-test i-000002ad output: DISK OK [15:15:52] RECOVERY dpkg-check is now: OK on wmde-test i-000002ad output: All packages OK [15:19:22] RECOVERY dpkg-check is now: OK on mwreview-1 i-000002a6 output: All packages OK [15:21:13] paravoid: would it be ok to add some if / else $::realm in realm.pp ? [15:21:19] since it hold global configuration [15:21:55] I guess [15:28:12] paravoid: do you know labs SMTP relay server ? [15:28:26] mchenry rejects emails [15:28:32] was going to use smtp.pmtpa.wmnet [15:31:21] 550 Administrative prohibition [15:31:23] doh [15:31:26] asking mark [15:44:23] New patchset: Hashar; "(bug 36996) ability to change exim4 route list" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [15:44:35] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (test); V: -1 - https://gerrit.wikimedia.org/r/9401 [15:46:10] New patchset: Hashar; "(bug 36996) ability to change exim4 route list" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [15:46:27] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9401 [15:49:17] PROBLEM host: mwreview-1 is DOWN address: i-000002a6 check_ping: Invalid hostname/address - i-000002a6 [15:52:39] ugh, is labs broke again? [15:53:12] doesn't look like it... [15:53:40] hrmmm [15:53:45] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:53:47] my ssh hung [15:54:00] and then i tried again and it worked but took too long [15:54:25] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:05] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:15] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [15:55:56] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:56] PROBLEM Total Processes is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:57:46] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. 
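For what it's worth, a raw SMTP session shows exactly at which step a relay answers 550 Administrative prohibition; a sketch using the relay name mentioned above (the instance name and addresses are placeholders, not the actual labs configuration):

    # talk SMTP by hand and watch where the 550 comes back
    {
      sleep 1; printf 'EHLO my-instance.pmtpa.wmflabs\r\n'
      sleep 1; printf 'MAIL FROM:<root@my-instance.pmtpa.wmflabs>\r\n'
      sleep 1; printf 'RCPT TO:<someone@example.org>\r\n'
      sleep 1; printf 'QUIT\r\n'
    } | nc smtp.pmtpa.wmnet 25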
[16:05:16] paravoid: got you a change to alter exim4 conf https://gerrit.wikimedia.org/r/9401 [16:05:33] wtf [16:05:35] paravoid: looks like mchenry does not like spam^Wmails coming from labs :D [16:05:36] hehe [16:05:36] nothing stops you, does it? [16:05:44] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 89% free memory [16:05:44] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [16:06:40] hrm [16:06:42] I did it similar to $nameservers ;) [16:06:45] I see you removing lily [16:06:48] but not adding it back [16:07:35] ohh [16:07:41] that is LILY [16:07:43] I have put lists [16:07:45] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [16:11:39] New patchset: Hashar; "(bug 36996) ability to change exim4 route list" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [16:11:42] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9401 [16:11:54] RECOVERY Total Processes is now: OK on mwreview i-000002ae output: PROCS OK: 106 processes [16:14:14] New review: Hashar; "Patchset3 :" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9401 [16:14:19] paravoid: updated :) [16:14:22] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=5f6f7f59ce471d3411ea13d5c158778b662bbc43;hp=f2cda6040c28b86a7e5ba01ae6444a406c47724b [16:14:24] diff [16:14:44] can i get some help on labs-nfs1? [16:14:48] still have to find out a SMTP server for labs though [16:15:33] hashar: paravoid: ^ ? [16:15:48] help on what? [16:15:53] perms [16:15:56] i'll /msg [16:16:37] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9401 [16:17:00] hashar: I'm adding Mark as a reviewer too [16:17:06] yeah thanks [16:17:41] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [16:17:45] but merged it nevertheless [16:17:56] by mistake ? [16:19:30] nah, it looks good to me [16:19:44] mark will still see it [16:19:54] right? :) [16:21:16] hopefully [16:21:37] though if it is on [test] branch maybe he filter them out [16:21:41] will see [16:25:21] paravoid: I sent mark + ryan a mail requesting a SMTP server for labs [16:25:24] you are in CC: [16:25:33] thanks for the reviews today :-] [16:25:35] I am off for now [16:25:43] bye-bye :) [16:25:46] see you tomorrow I guess? [16:25:52] in person? [16:25:54] Friday [16:26:12] ah [16:26:14] friday then [16:26:21] around 2pm at the venue I guess [16:26:22] I'll be travelling most of tomorrow [16:26:27] well, 8h [16:26:33] and then I plan to see berlin in the afternoon [16:26:36] so I'll be mostly be away [16:27:28] fine fine :) [16:27:30] take your time [16:27:37] we have made good progress anyway ;-]] [16:27:42] see ya friday ! [16:37:16] JeroenDeDauw: What mediawiki extensions did you include? [16:37:17] And, do you do any caching? [16:44:07] so, if a dir is 0777 how could a touch of a new file in that dir fail for permissions? [16:44:18] no matter who i am [16:45:31] paravoid: the root cause was `patch` preserves group but not user after surgery [16:45:44] (user=owner) [16:48:48] grr [16:58:40] . [17:01:22] PROBLEM dpkg-check is now: CRITICAL on ganglia-test5 i-000002a7 output: DPKG CRITICAL dpkg reports broken packages [17:08:43] !log bots jeremyb: [bots-1,bots-nfs] did some IRC log redaction surgery. booted wm-bot (wmib) a couple times. (same way as before. 
kill bot; do surgery; kill sleep; didn' touch restart.sh) had some weird permissions issue that ended up causing #wikimedia-tech's log to lose messages. will restore those from my personal log later (probably after midnight so I don't have to do any more bot killing) [17:08:44] Logged the message, Master [17:20:48] andrewbogott: look at Special:Version ;) [17:37:01] jeremyb: did u touch bot [17:37:08] for some reason it restarted many times [17:39:22] petan|wk: yes, we had to delete some things from the logs [17:42:11] for that you don't need to restart bot [17:42:33] Thehelpfulone: why it's not logged? [17:42:39] and why did u restart it [17:43:04] I didn't restart it, but that's why I imagine it was restarted [17:43:13] which logs [17:43:26] jeremyb did log it [17:44:03] https://labsconsole.wikimedia.org/wiki/Server_Admin_Log#Nova_Resource:Bots.2FSAL [17:44:10] k [17:44:23] jeremyb: what permission issue? [17:46:22] RECOVERY dpkg-check is now: OK on ganglia-test5 i-000002a7 output: All packages OK [17:49:47] petan|wk: back for a min [17:50:00] petan|wk: so, the procedure i used was: [17:50:54] bot is using buffers for IO [17:51:02] ok [17:51:03] so logs are written to disk every minute [17:51:04] i figured [17:51:28] why did u restart it? [17:51:36] hold on [17:52:38] copy the log to a new location, edit the copy, make a diff (with `diff`), recopy it, apply the diff (with patch) and make sure it worked right. then the kill mono; apply patch for real; kill sleep dance [17:52:47] so the bot wasn't down very long [17:52:49] but... [17:53:11] i was doing that as root on bots-1. the files are mounted by nfs from bots-nfs [17:53:39] the patch operation resulted in a file that was owned by nobody:nogroup [17:53:44] PROBLEM Current Load is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:53:51] 30 16:45:31 < jeremyb> paravoid: the root cause was `patch` preserves group but not user after surgery [17:53:54] 30 16:45:43 < jeremyb> (user=owner) [17:54:08] ok, next time just open the log in editor and directly update it [17:54:14] do @logoff before [17:54:22] it's same as turning bot down [17:54:24] PROBLEM Current Users is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:54:26] ok [17:54:29] just it doesn't spam channels so much [17:54:45] but does it handle SIGINT/SIGKILL cleanly? [17:55:04] PROBLEM Disk Space is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:55:05] anyway, the issue was wmib couldn't write to nobody files or something like that. i assume [17:55:15] fixed the perms and then it started writing ok [17:55:39] but then i had to do the rest of the surgery (other channel) [17:55:44] PROBLEM Free ram is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:55:55] "ok" was for @logoff [17:56:24] but i don't agree necessarily for "edit directly" [17:56:43] so when I tried to edit directly petan|wk saving gave me a permission denied [17:56:49] so doing @logoff will solve that? [17:56:54] PROBLEM Total Processes is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:56:58] no [17:57:16] please don't edit directly unless it's a file written by a process you wrote [17:57:33] so maybe petan|wk can. 
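A rough reconstruction of that procedure, with invented paths and account names (the real ones are not in the log):

    # prepare the redaction offline
    cp /data/logs/channel.log /tmp/channel.log.edit          # copy the live log somewhere safe
    "$EDITOR" /tmp/channel.log.edit                          # delete the lines being redacted
    diff -u /data/logs/channel.log /tmp/channel.log.edit > /tmp/redact.diff

    # the "kill bot; apply patch for real; kill sleep" dance
    pkill mono                                               # wm-bot (wmib) runs under mono
    patch /data/logs/channel.log < /tmp/redact.diff
    # as reported above, the patch run left the file owned by nobody:nogroup and the bot
    # could no longer write to it until the ownership was put back:
    chown wmib:wmib /data/logs/channel.log                   # user/group names are guesses
    pkill -x sleep                                           # restart.sh is waiting in a sleep; killing it brings the bot back up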
[17:57:34] PROBLEM dpkg-check is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:57:38] at least not while it's still running [17:57:52] jeremyb: it's never opened by thread [17:57:59] logs are in cache [17:58:09] then the file is open every minute and data are written [17:58:14] anyway, i'm in a rush so i'm not thinking straight and can't stick around [17:58:17] if you do @logoff it's like if u shut it down [17:58:53] editing directly is possibl [17:59:01] Thehelpfulone: you weren't loged in as wmib [17:59:14] that's why you couldn't write [17:59:20] it's simple enough to just kill and start again [17:59:39] it's simple, but not a right way to do that... [17:59:46] why not? [17:59:50] it's bothers a lot of people in many channels [17:59:54] 30 17:54:45 < jeremyb> but does it handle SIGINT/SIGKILL cleanly? [17:59:58] bot is join flooding [18:00:02] what does @logoff do? [18:00:11] i really don't think it was such a bother [18:00:16] @logoff [18:00:35] Channel is now logged [18:00:56] no file descriptors are open [18:01:03] anyway, i'm really running out the door [18:01:03] so you can directly write [18:02:50] Thehelpfulone: btw is killion aware of that? [18:03:22] @logstatus [18:03:31] @log [18:03:36] @commands [18:03:36] Commands: channellist, trusted, trustadd, trustdel, info, infobot-link, infobot-share-trust+, infobot-share-trust-, infobot-share-off, infobot-share-on, infobot-off, refresh, infobot-on, drop, whoami, add, reload, suppress-off, suppress-on, help, RC-, recentchanges-on, language, recentchanges-off, logon, logoff, recentchanges-, recentchanges+, RC+ [18:03:48] @help [18:03:48] Unknown command type @commands for a list of all commands I know [18:04:03] @logon [18:04:03] Channel is already logged [18:04:20] @help suppress-on [18:04:20] Info for suppress-on: Disable output to channel [18:04:25] ahh [18:04:43] why am i still here? ;) [18:06:36] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 3.71, 4.41, 2.40 [18:07:43] petan: yes [18:08:18] @logoff [18:08:18] Permission denied [18:08:23] hah [18:08:26] :( [18:08:33] lol [18:08:34] petan: I can haz permissions? [18:08:38] jeremyb: go disappear :P [18:09:58] @trustadd .*@wikimedia/Thehelpfulone admin [18:09:58] Successfuly added .*@wikimedia/Thehelpfulone [18:10:07] ty [18:11:33] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.45, 2.79, 2.24 [18:19:40] JeroenDeDauw: OK, I'm catching up... which of those extensions are being tested and which are necessary in order to do the testing? 
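The lower-impact alternative being suggested here, spelled out (channel, pattern and path are placeholders):

    # in the channel: tell wm-bot to stop writing the log first
    #     @logoff
    # then edit the file directly on bots-1, as the bot's own account rather than root,
    # so the ownership problem from earlier never comes up:
    sed -i '/text to redact/d' /data/logs/channel.log
    # and re-enable logging in the channel:
    #     @logon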
[18:28:57] RECOVERY Free ram is now: OK on ganglia-test6 i-000002af output: OK: 93% free memory [18:29:27] RECOVERY Current Load is now: OK on ganglia-test6 i-000002af output: OK - load average: 0.12, 0.33, 0.82 [18:29:27] RECOVERY Current Users is now: OK on ganglia-test6 i-000002af output: USERS OK - 0 users currently logged in [18:30:07] RECOVERY Disk Space is now: OK on ganglia-test6 i-000002af output: DISK OK [18:30:17] RECOVERY Total Processes is now: OK on ganglia-test6 i-000002af output: PROCS OK: 80 processes [18:30:57] RECOVERY dpkg-check is now: OK on ganglia-test6 i-000002af output: All packages OK [18:35:07] ssmollett: so ganglia works on Precise :-] http://ganglia.wmflabs.org/latest/?c=deployment-prep&h=deployment-jobrunner05&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:35:11] ssmollett: thanks a ton :-] [18:35:37] !log deployment-prep Sara made ganglia available on Ubuntu Precise and hence jobrunner05 show up http://ganglia.wmflabs.org/latest/?c=deployment-prep&h=deployment-jobrunner05&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:35:40] Logged the message, Master [18:37:44] so now I have to find out why that job runner takes that much CPU doing basically nothing [18:40:33] ohh great [18:40:46] and I was looking at that bug report [18:40:54] but sara got me to it [18:41:03] yeah I sent her an email yesterday [18:41:07] well 20 hours ago [18:41:14] I can't remember if I added you to cc [18:41:30] luckily she was already working on Precise ;-]] [18:42:54] I have forwarded you the message [18:42:58] (might be an HTML mail sorry) [18:43:20] 05/30/2012 - 18:43:19 - Updating keys for laner at /export/home/deployment-prep/laner [18:43:27] hehe [18:43:49] that labs-home-wm message about Ryan_Lane1 key is me running puppetd- on jobrunner05 ;) [18:46:08] PROBLEM Current Load is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:46:33] PROBLEM Current Users is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:47:08] PROBLEM Disk Space is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:47:43] PROBLEM Free ram is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:49:03] PROBLEM Total Processes is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:49:33] PROBLEM dpkg-check is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:52:49] so I got job [18:52:53] that are never pop ed out [18:52:57] which is puzzling me [18:52:59] seriously [18:59:52] hashar: no problem. [19:00:02] ssmollett: thanks a ton seriously :-] [19:00:11] just in time to investigate a bit a side issue [19:00:15] and to have it ready for Berlin [19:00:16] \O/ [19:09:55] !log deployment-prep jobrunner05 CPU usage is due to some job infinite loop. Working on it. [19:09:57] Logged the message, Master [19:23:30] !log deployment-prep updatiing mediawiki/core to master [19:23:32] Logged the message, Master [19:23:53] !log deployment-prep updating mediawiki/core to master 58f390e to finish job loop fix [19:23:55] Logged the message, Master [19:24:07] PROBLEM Current Load is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:24:27] PROBLEM Current Users is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
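One way to chase "jobs that never get popped" is with the maintenance scripts already being used on the instance; a sketch (the wiki id is a placeholder, and the --group option is only present on newer MediaWiki versions):

    # how many jobs are pending on a given wiki
    php MWScript.php showJobs.php --wiki=somewiki
    # per-type breakdown, where the MediaWiki version supports it
    php MWScript.php showJobs.php --wiki=somewiki --group
    # run a few jobs by hand and watch whether they actually leave the queue
    php MWScript.php runJobs.php --wiki=somewiki --maxjobs 10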
[19:25:33] PROBLEM Disk Space is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:25:43] PROBLEM Free ram is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:26:22] !log deployment-prep jobrunner05 is happy again. Hurrah [19:26:24] Logged the message, Master [19:27:13] PROBLEM Total Processes is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:27:53] PROBLEM dpkg-check is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Socket timeout after 10 seconds. [19:31:38] hashar, how many mysqls do we have in labs? [19:32:03] only one [19:32:06] deployment-sql [19:32:12] I am out for today sorry [19:32:15] daughter duty ;-D [19:32:18] she is sick :(( [19:32:19] ++ [19:32:40] take care of her [19:44:24] PROBLEM Total Processes is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:45:04] PROBLEM dpkg-check is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:45:14] PROBLEM Puppet freshness is now: CRITICAL on nova-precise1 i-00000236 output: Puppet has not run in last 20 hours [19:46:14] PROBLEM Current Load is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:46:55] PROBLEM Current Users is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:47:55] PROBLEM Disk Space is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:48:05] PROBLEM Free ram is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:50:05] PROBLEM Puppet freshness is now: CRITICAL on nova-essex-test i-000001f9 output: Puppet has not run in last 20 hours [19:50:58] Why won't wmflabs even load? [19:52:15] try now [19:58:57] Reedy: No, still can't connect [20:00:12] PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours [20:37:14] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours [21:35:03] 05/30/2012 - 21:35:03 - Updating keys for mschon at /export/home/wikistats/mschon [21:35:15] 05/30/2012 - 21:35:15 - Updating keys for mschon at /export/home/bastion/mschon [22:35:10] PROBLEM Puppet freshness is now: CRITICAL on localpuppet2 i-0000029b output: Puppet has not run in last 20 hours [23:29:06] PROBLEM Current Load is now: WARNING on ganglia-test4 i-000002a2 output: WARNING - load average: 0.57, 6.47, 5.04 [23:34:06] RECOVERY Current Load is now: OK on ganglia-test4 i-000002a2 output: OK - load average: 4.91, 3.64, 4.10