[00:23:45] PROBLEM Current Load is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:24:26] PROBLEM Current Users is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:25:05] PROBLEM Disk Space is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:25:45] PROBLEM Free ram is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:26:55] PROBLEM Total Processes is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:27:35] PROBLEM dpkg-check is now: CRITICAL on mwreview-quux i-000002a5 output: Connection refused by host [00:34:55] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 3.18, 3.74, 5.00 [00:36:15] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours [00:44:35] RECOVERY Current Users is now: OK on mwreview-quux i-000002a5 output: USERS OK - 1 users currently logged in [00:45:45] RECOVERY Free ram is now: OK on mwreview-quux i-000002a5 output: OK: 87% free memory [00:46:55] RECOVERY Total Processes is now: OK on mwreview-quux i-000002a5 output: PROCS OK: 94 processes [00:47:35] RECOVERY dpkg-check is now: OK on mwreview-quux i-000002a5 output: All packages OK [00:47:35] RECOVERY Disk Space is now: OK on mwreview-quux i-000002a5 output: DISK OK [00:54:27] PROBLEM Current Users is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:55:23] PROBLEM Disk Space is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:55:42] PROBLEM Free ram is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:56:52] PROBLEM Total Processes is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:57:32] PROBLEM dpkg-check is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [00:58:52] PROBLEM Current Load is now: CRITICAL on mwreview-1 i-000002a6 output: Connection refused by host [01:01:34] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [01:47:54] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 6.53, 6.33, 5.53 [02:18:54] RECOVERY Current Load is now: OK on mwreview-1 i-000002a6 output: OK - load average: 0.75, 0.75, 0.37 [02:19:55] RECOVERY Current Users is now: OK on mwreview-1 i-000002a6 output: USERS OK - 2 users currently logged in [02:20:40] RECOVERY Free ram is now: OK on mwreview-1 i-000002a6 output: OK: 83% free memory [02:20:55] RECOVERY Disk Space is now: OK on mwreview-1 i-000002a6 output: DISK OK [02:22:26] RECOVERY Total Processes is now: OK on mwreview-1 i-000002a6 output: PROCS OK: 106 processes [02:22:41] RECOVERY dpkg-check is now: OK on mwreview-1 i-000002a6 output: All packages OK [02:37:51] New patchset: Andrew Bogott; "Qualify the path to true." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9354 [02:38:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9354 [02:39:25] 05/30/2012 - 02:39:25 - Updating keys for laner at /export/home/deployment-prep/laner [02:56:26] RECOVERY Puppet freshness is now: OK on deployment-apache21 i-0000026d output: puppet ran at Wed May 30 02:55:04 UTC 2012 [02:56:27] 05/30/2012 - 02:56:27 - Updating keys for laner at /export/home/deployment-prep/laner [02:57:54] PROBLEM Total Processes is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [02:58:20] 05/30/2012 - 02:58:20 - Updating keys for laner at /export/home/deployment-prep/laner [02:58:39] PROBLEM dpkg-check is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [03:00:29] PROBLEM Current Load is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [03:00:39] PROBLEM Disk Space is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:39] PROBLEM Current Users is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:49] PROBLEM Free ram is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused by host [03:02:20] 05/30/2012 - 03:02:20 - Updating keys for laner at /export/home/deployment-prep/laner [03:03:19] 05/30/2012 - 03:03:19 - Updating keys for laner at /export/home/deployment-prep/laner [03:05:21] 05/30/2012 - 03:05:20 - Updating keys for laner at /export/home/deployment-prep/laner [03:29:49] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 19% free memory [03:36:05] RECOVERY Current Load is now: OK on ganglia-test5 i-000002a7 output: OK - load average: 0.53, 1.50, 1.79 [03:36:05] RECOVERY Current Users is now: OK on ganglia-test5 i-000002a7 output: USERS OK - 1 users currently logged in [03:36:05] RECOVERY Disk Space is now: OK on ganglia-test5 i-000002a7 output: DISK OK [03:38:00] RECOVERY Total Processes is now: OK on ganglia-test5 i-000002a7 output: PROCS OK: 90 processes [03:38:48] RECOVERY dpkg-check is now: OK on ganglia-test5 i-000002a7 output: All packages OK [03:41:20] RECOVERY Free ram is now: OK on ganglia-test5 i-000002a7 output: OK: 88% free memory [03:53:04] New patchset: Andrew Bogott; "Don't do a git-clone if the rep is already present." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9356 [03:53:20] New patchset: Andrew Bogott; "Rename the wikimedia core git definition." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9357 [03:53:35] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9356 [03:53:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9357 [03:54:13] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9356 [03:54:52] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9357 [03:57:25] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.95, 4.09, 2.84 [04:01:05] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory [04:01:16] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 15% free memory [04:02:31] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 2.13, 2.87, 2.64 [04:02:31] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 16% free memory [04:16:01] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 13% free memory [04:16:11] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory [04:16:11] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [04:21:11] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:22:31] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 5% free memory [04:26:13] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory [04:32:33] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [04:35:56] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:45:56] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [05:13:01] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [05:54:11] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [06:43:46] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:51] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:01] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:01] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:03] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:56] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:56] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:56] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:15] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:16] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:15] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 13.77, 21.83, 17.71 [07:07:45] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:07:50] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:50] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:07:50] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:35] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:40] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:40] PROBLEM dpkg-check is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:45] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:50] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:50] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:10:55] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:11:52] PROBLEM Current Load is now: WARNING on gerrit i-000000ff output: WARNING - load average: 3.69, 6.47, 6.09 [07:11:52] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 6.42, 6.22, 5.73 [07:11:52] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 7.74, 8.22, 8.68 [07:11:52] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CRITICAL - load average: 57.51, 57.88, 37.08 [07:11:52] PROBLEM Current Load is now: WARNING on swift-be3 i-000001c9 output: WARNING - load average: 8.32, 6.86, 5.74 [07:11:52] PROBLEM Current Load is now: WARNING on swift-be2 i-000001c8 output: WARNING - load average: 8.42, 7.63, 6.52 [07:11:53] PROBLEM Current Load is now: WARNING on swift-be4 i-000001ca output: WARNING - load average: 6.43, 6.73, 6.66 [07:11:53] PROBLEM Current Load is now: WARNING on wep i-000000c2 output: WARNING - load average: 14.33, 9.83, 7.03 [07:11:57] PROBLEM dpkg-check is now: CRITICAL on ganglia-test5 i-000002a7 output: Connection refused or timed out [07:12:19] PROBLEM Disk Space is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:19] PROBLEM Free ram is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:24] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:24] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM dpkg-check is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Current Users is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Disk Space is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Free ram is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:44] PROBLEM Total Processes is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:12:49] PROBLEM dpkg-check is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:59] PROBLEM Current Load is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:12:59] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:14] PROBLEM Current Load is now: CRITICAL on bots-apache1 i-000000b0 output: CRITICAL - load average: 7.15, 14.11, 24.77 [07:13:44] PROBLEM Disk Space is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:44] PROBLEM Total Processes is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:49] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:50] PROBLEM Current Users is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:50] PROBLEM Total Processes is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:54] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [07:13:55] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:55] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:59] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:00] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM dpkg-check is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Total Processes is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Current Users is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Current Load is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Disk Space is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:05] PROBLEM Free ram is now: CRITICAL on mwreview-1 i-000002a6 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:14:14] PROBLEM Current Load is now: WARNING on deployment-jobrunner02 i-00000279 output: WARNING - load average: 10.58, 8.43, 6.45 [07:14:29] PROBLEM host: blamemaps-m1small is DOWN address: i-000002a1 CRITICAL - Host Unreachable (i-000002a1) [07:16:50] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 6.97, 6.49, 5.54 [07:16:50] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 5.99, 6.44, 6.23 [07:16:50] RECOVERY dpkg-check is now: OK on ipv6test1 i-00000282 output: All packages OK [07:16:50] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 88 processes [07:17:00] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: CRITICAL - Socket timeout after 10 seconds [07:17:14] RECOVERY Current Load is now: OK on gerrit i-000000ff output: OK - load average: 1.71, 3.73, 5.00 [07:17:14] RECOVERY Free ram is now: OK on bots-cb i-0000009e output: OK: 78% free memory [07:17:14] RECOVERY Current Users is now: OK on pybal-precise i-00000289 output: USERS OK - 0 users currently logged in [07:17:14] PROBLEM Current Load is now: WARNING on demo-web1 i-00000255 output: WARNING - load average: 4.21, 5.96, 5.20 [07:17:19] RECOVERY Free ram is now: OK on pybal-precise i-00000289 output: OK: 85% free memory [07:17:19] RECOVERY Disk Space is now: OK on pybal-precise i-00000289 output: DISK OK [07:17:19] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 7.00, 7.56, 6.75 [07:17:19] RECOVERY dpkg-check is now: OK on pybal-precise i-00000289 output: All packages OK [07:17:24] RECOVERY Total Processes is now: OK on pybal-precise i-00000289 output: PROCS OK: 95 processes [07:21:34] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 7.26, 9.83, 19.99 [07:21:34] RECOVERY Current Users is now: OK on bots-cb i-0000009e output: USERS OK - 1 users currently logged in [07:21:34] PROBLEM Current Load is now: WARNING on reportcard2 i-000001ea output: WARNING - load average: 10.17, 9.58, 7.94 [07:21:34] RECOVERY Total Processes is now: OK on bots-cb i-0000009e output: PROCS OK: 341 processes [07:21:39] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in [07:21:39] RECOVERY Disk Space is now: OK on reportcard2 i-000001ea output: DISK OK [07:23:38] RECOVERY host: blamemaps-m1small is UP address: i-000002a1 PING OK - Packet loss = 0%, RTA = 3.02 ms [07:23:38] PROBLEM HTTP is now: WARNING on mailman-01 i-00000235 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 498 bytes in 0.011 second response time [07:23:38] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK [07:25:29] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:46] RECOVERY Disk Space is now: OK on bots-cb i-0000009e output: DISK OK [07:26:46] PROBLEM Current Load is now: WARNING on worker1 i-00000208 output: WARNING - load average: 5.42, 6.08, 7.12 [07:27:39] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 7.16, 11.54, 11.16 [07:27:39] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 114 processes [07:27:44] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [07:27:44] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af output: All packages OK [07:27:44] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK [07:27:44] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in [07:27:44] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK [07:27:44] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 89% free memory [07:27:51] !ping [07:27:51] pong [07:28:01] hm [07:28:01] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.05, 10.04, 12.75 [07:28:13] poor io [07:28:18] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 3.54, 5.23, 6.50 [07:28:21] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:48] PROBLEM Current Load is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:08] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:30:30] PROBLEM Current Load is now: WARNING on robh2 i-000001a2 output: WARNING - load average: 6.70, 5.97, 5.45 [07:32:06] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:33:08] RECOVERY Current Load is now: OK on swift-be3 i-000001c9 output: OK - load average: 0.23, 1.73, 4.02 [07:33:08] RECOVERY Current Load is now: OK on wep i-000000c2 output: OK - load average: 0.71, 2.46, 4.90 [07:34:32] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 56% free memory [07:34:32] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [07:34:32] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [07:34:32] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 6.41, 9.56, 10.10 [07:34:32] RECOVERY dpkg-check is now: OK on mwreview-1 i-000002a6 output: All packages OK [07:34:32] RECOVERY Free ram is now: OK on mwreview-1 i-000002a6 output: OK: 70% free memory [07:34:32] RECOVERY Current Users is now: OK on mwreview-1 i-000002a6 output: USERS OK - 0 users currently logged in [07:34:33] PROBLEM Current Load is now: WARNING on mwreview-1 i-000002a6 output: WARNING - load average: 0.51, 5.24, 8.30 [07:34:34] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 121 processes [07:34:37] RECOVERY Disk Space is now: OK on mwreview-1 i-000002a6 output: DISK OK [07:34:37] RECOVERY Total Processes is now: OK on mwreview-1 i-000002a6 output: PROCS OK: 101 processes [07:34:47] RECOVERY Current Load is now: OK on deployment-jobrunner02 i-00000279 output: OK - load average: 0.29, 1.97, 4.25 [07:34:57] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [07:34:58] RECOVERY Disk Space is now: OK on bots-sql2 i-000000af output: DISK OK [07:34:58] RECOVERY Total Processes is now: OK on bots-sql2 i-000000af output: PROCS OK: 84 processes [07:35:03] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 91 processes [07:35:08] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [07:35:08] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [07:35:08] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 7.68, 7.93, 8.50 [07:35:08] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory [07:35:08] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:35:54] helllo [07:36:47] RECOVERY Current Load is now: OK on incubator-bot1 i-00000251 output: OK - load average: 1.24, 2.37, 4.73 [07:36:47] RECOVERY Current Load is now: OK on robh2 i-000001a2 output: OK - load average: 0.43, 2.75, 4.24 [07:36:47] PROBLEM Current Load is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:47] PROBLEM Free ram is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:47] PROBLEM Disk Space is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:47] PROBLEM Current Users is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:48] PROBLEM Total Processes is now: CRITICAL on ganglia-test5 i-000002a7 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:49] PROBLEM Disk Space is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:49] PROBLEM dpkg-check is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:36:49] PROBLEM Total Processes is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:37:49] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 1.41, 4.40, 15.41 [07:37:49] RECOVERY Current Load is now: OK on swift-be4 i-000001ca output: OK - load average: 0.51, 1.79, 4.46 [07:37:49] RECOVERY Current Load is now: OK on demo-web1 i-00000255 output: OK - load average: 0.47, 2.77, 4.30 [07:37:58] PROBLEM Current Users is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:37:58] PROBLEM Free ram is now: CRITICAL on pybal-precise i-00000289 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:17] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 4.34, 3.56, 4.69 [07:41:17] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:37] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:42] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:42:21] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 87% free memory [07:42:37] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:42:37] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:42:37] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:43:41] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 1.04, 2.04, 4.16 [07:43:41] RECOVERY Current Load is now: OK on swift-be2 i-000001c8 output: OK - load average: 0.29, 1.28, 3.91 [07:43:41] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:43:43] baaa [07:43:44] spaam [07:43:46] me mroe [07:47:01] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 4.28, 1.88, 15.97 [07:47:01] RECOVERY Current Load is now: OK on mwreview-1 i-000002a6 output: OK - load average: 0.16, 0.72, 3.92 [07:47:38] PROBLEM Current Load is now: WARNING on maps-tilemill1 i-00000294 output: WARNING - load average: 3.20, 6.60, 8.48 [07:47:38] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK [07:47:46] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:47:47] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:47:47] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:48:14] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 1.18, 2.00, 4.09 [07:48:14] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 89 processes [07:48:19] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 85% free memory [07:51:02] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 94 processes [07:52:12] PROBLEM Current Load is now: WARNING on ganglia-test5 i-000002a7 output: WARNING - load average: 4.10, 10.21, 11.03 [07:52:12] RECOVERY Disk Space is now: OK on ganglia-test5 i-000002a7 output: DISK OK [07:52:12] RECOVERY Free ram is now: OK on ganglia-test5 i-000002a7 output: OK: 83% free memory [07:52:12] RECOVERY Current Users is now: OK on ganglia-test5 i-000002a7 output: USERS OK - 0 users currently logged in [07:52:12] RECOVERY Total Processes is now: OK on ganglia-test5 i-000002a7 output: PROCS OK: 180 processes [07:52:17] RECOVERY dpkg-check is now: OK on ganglia-test5 i-000002a7 output: All packages OK [07:57:23] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.30, 1.23, 4.65 [08:01:02] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 3.33, 3.15, 4.57 [08:03:39] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 1.30, 1.13, 4.11 [08:06:09] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.09, 0.20, 4.62 [08:08:07] RECOVERY Current Load is now: OK on ganglia-test5 i-000002a7 output: OK - load average: 1.33, 1.11, 4.34 [08:15:57] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.71, 0.90, 3.30 [08:21:05] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.39, 1.01, 2.74 [08:42:33] !log deployment prep going to upgrade core / extensions to latest master [08:42:37] deployment is not a valid project. [08:42:42] seriously [08:43:13] !log deployment-prep {{bug|37199}} going to upgrade core / extensions to latest master [08:43:16] Logged the message, Master [08:43:25] and breaking stuff cool [08:45:31] hahh [08:49:16] !log deployment-prep hashar: updating core to 8c65834 [08:49:17] Logged the message, Master [08:51:17] so slooow [08:52:50] 05/30/2012 - 08:52:50 - Updating keys for knissen at /export/home/mediawiki-custom-de/knissen [08:52:58] !log bots petrb: patching wm bot [08:52:59] Logged the message, Master [09:15:43] !log deployment-prep hashar: foreachwiki update.php --quiet --quick [09:15:46] Logged the message, Master [09:17:21] !log deployment-prep Restarted update.php in a screen session [09:17:22] Logged the message, Master [09:21:55] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 1.70, 3.90, 4.91 [09:33:14] New review: Hashar; "I do not understand what you would prefer sorry :-( By 'this being done on call site', do you mean ..." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8575 [09:39:55] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.41, 6.69, 5.49 [09:44:29] PROBLEM Puppet freshness is now: CRITICAL on nova-precise1 i-00000236 output: Puppet has not run in last 20 hours [09:49:39] PROBLEM Puppet freshness is now: CRITICAL on nova-essex-test i-000001f9 output: Puppet has not run in last 20 hours [09:58:36] The username Hashar is not registered on this wiki, but it does exist in the unified login system. 
[09:58:37] yeahh [09:58:42] cryptic message [09:59:29] PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours [10:13:34] Bug 37216 - can not reset user password | https://bugzilla.wikimedia.org/show_bug.cgi?id=37216 [10:13:45] if anyone has any idea, feel free to comment on that bug report [10:13:47] I am off for lunch [10:36:25] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours [11:38:15] 05/30/2012 - 11:38:15 - Creating a home directory for knissen at /export/home/bastion/knissen [11:38:16] !log bastion adding Kai Nissen (WMFDE) to bastion [11:38:17] Logged the message, Master [11:50:25] back [12:13:56] PROBLEM Free ram is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:14:55] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 14% free memory [12:15:15] PROBLEM Total Processes is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:15:45] PROBLEM dpkg-check is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:16:55] PROBLEM Current Load is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:17:45] PROBLEM Current Users is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:18:15] PROBLEM Disk Space is now: CRITICAL on wmde-test i-000002a8 output: Connection refused by host [12:24:06] hashar: can you modify mw so that we have debug symbol somewhere in source of each page [12:24:14] that one we have in logs [12:24:23] http://oc.wikipedia.beta.wmflabs.org/wiki/Crash_1 [12:24:33] what do you mean by debug symbol ? [12:24:33] I replicated the problem from production but I can't get related error [12:24:47] you might have them in /home/wikipedia/log [12:25:05] that string we have in logs [12:25:10] in prefix [12:25:10] in production I am 99% sure it is just an out of memory error [12:25:20] ok but I would like to see the logs [12:25:27] anyway OOM means it's a bug [12:25:33] there is probably a loop in parser [12:25:38] probably [12:25:40] which allocate so much memory [12:25:49] will have to reproduce it locally to get all the tools [12:25:53] hashar: I mean that string we have in logs [12:26:04] "that string we have in logs" does not mean anything to me [12:26:08] you will have to be more precise [12:26:12] anyway [12:26:16] since it is a fatal error [12:26:22] the only thing you can get is something like: [12:26:27] ooh [12:26:29] you removed it [12:26:42] PHP Fatal Error: trying to allocate 20192bytes (max 128MB) at includes/parser/Parser.php line XXX [12:26:46] which is not really helpful [12:26:52] no it's not [12:26:59] but in past we had very detailed logs [12:27:03] where are they [12:27:35] cat /usr/local/apache/common-local/wmf-config/logs.php [12:27:44] this file was prefixing logs with certain string [12:27:49] that's the string I talk about [12:28:11] $randomHash = $wgDBname . '-' . 
substr( md5(uniqid()), 0, 8 ); [12:28:30] it helped me to see which lines correspond to which execution [12:28:42] because logs were mixed from many various php runs [12:28:50] ohhh [12:28:52] hmm [12:28:54] if I could see that in source code of page [12:28:56] $wgDebugFile I guess [12:29:01] I could grep it out of detailed logs [12:29:10] and see exactly why it crashed [12:29:15] $wgDebugLogFile = "udp://$wmfUdp2logDest/catchall"; [12:29:16] [12:29:17] and for CLI $wgDebugLogFile = "udp://$wmfUdp2logDest/cli"; [12:29:24] ok, where is that [12:29:26] OH MY GOD [12:29:30] o.o [12:29:30] stupid udp2log [12:29:49] that is sent to deployment-feed which has the udp2log daemon [12:29:55] ok [12:30:00] where that one store it [12:30:53] /home/wikipedia/logs [12:31:01] petrb@deployment-dbdump:~$ ls /home/wikipedia/logs/ [12:31:01] archive CentralAuth.log dberror.log slow-parse.log testwiki.log xff.log [12:31:03] so the logs files should be /home/wikipedia/logs/catchall.log [12:31:04] which one [12:31:05] but that is not there [12:31:07] the sucker [12:31:09] grrr [12:31:37] btw, what you think of that idea with debug string in html source code [12:31:46] like [12:31:56] so that I could just grep dafdasfas [12:31:58] from log [12:32:13] I think it would be useful to debug such problems [12:32:19] or is there any better way [12:32:31] hmm [12:32:42] every time the page crash I would open the source code and grab the symbol [12:32:53] so that I could see why it was [12:33:25] that was original reason for that prefixes in logs [12:33:34] I know [12:33:35] I just didn't know any proper way to implement it to mediawiki core [12:33:37] that is not the issue [12:34:09] PROBLEM Free ram is now: UNKNOWN on wmde-test i-000002a8 output: Invalid host name i-000002a8 [12:34:09] PROBLEM Puppet freshness is now: CRITICAL on localpuppet2 i-0000029b output: Puppet has not run in last 20 hours [12:34:53] > var_dump( $wgDebugLogFile ); [12:34:53] string(0) "" [12:34:55] yeahhh [12:38:59] PROBLEM host: wmde-test is DOWN address: i-000002a8 check_ping: Invalid hostname/address - i-000002a8 [12:42:01] ahh /data/project/cli.log [12:42:03] interesting [12:42:41] so I moved that to udp2log [12:42:49] petan|wk: /home/wikipedia/log/cli.log [12:42:56] it is not passing through udp2log [12:43:01] hm [12:43:25] ok, can we have a full log [12:44:10] hashar: are you sure we want to have /home/wikipedia on labs-nfs1 [12:44:16] yeah [12:44:19] I would prefer to get it to gluster [12:44:23] or local nfs [12:44:37] there is a bug to get /home/wikipedia to a better place [12:44:44] ok [12:44:52] for now it is labs-nfs1:/export/home/deployment-prep/wikipedia on /home/wikipedia [12:44:54] PROBLEM Current Load is now: CRITICAL on wmde-test1 i-000002a9 output: Connection refused by host [12:44:57] it's as easy as creating a symlink [12:45:01] naa [12:45:04] hm... [12:45:08] at least symlink for logs [12:45:14] PROBLEM Current Users is now: CRITICAL on wmde-test1 i-000002a9 output: Connection refused by host [12:45:26] because if they get huge, Ryan will eat you [12:45:32] no it is ok [12:45:35] hm [12:45:38] I have negotiated that with Ryan [12:45:48] ok, can we enable full logs for few mins [12:45:52] we have /home/wikipedia being used just in production (i.e. 
with MW checkouts + logs) [12:45:54] these we had in past [12:45:55] PROBLEM Disk Space is now: CRITICAL on wmde-test1 i-000002a9 output: Connection refused by host [12:45:59] for now it is on labs-nfs1 [12:46:04] yes but in prod you don't have it mounted to labs-nfs [12:46:05] :D [12:46:08] ops will find us a better solution later on [12:46:14] but for now they are aware of it [12:46:17] ok [12:46:34] PROBLEM Free ram is now: CRITICAL on wmde-test1 i-000002a9 output: CHECK_NRPE: Error - Could not complete SSL handshake. [12:47:44] PROBLEM Total Processes is now: CRITICAL on wmde-test1 i-000002a9 output: CHECK_NRPE: Error - Could not complete SSL handshake. [12:47:48] petan|wk: and I think one of the issues was to avoid gluster entirely [12:47:54] yes [12:47:57] it was [12:47:59] aka we do not want logs to be written to /data/project (which is gluster) [12:48:02] anyway I want the logs! [12:48:03] :D [12:48:07] yeah working on it [12:48:14] you got cli one in /home/wikipedia/logs/cli.log [12:48:19] looking for the apaches one now [12:48:24] PROBLEM dpkg-check is now: CRITICAL on wmde-test1 i-000002a9 output: CHECK_NRPE: Error - Could not complete SSL handshake. [12:48:27] but I need logs from crash I got in browser [12:48:32] not cli [12:50:00] 'default' => '', [12:50:01] oh yeah [12:50:02] aghahah [12:51:10] :u [12:51:31] :u) that is my new smiley [12:51:42] :n) [12:54:24] stupid labs I/O [12:55:14] moving include ("logs.php"); to CommonSettingsDeploy.php [12:56:45] petan|wk: [12:56:55] ok [12:56:57] you should get logs in /home/wikipedia/log/catchall.log [12:57:35] yay [12:57:56] cool [12:57:58] will rename it apache.log [12:58:00] but it doesn't work :( [12:58:12] I just did tail -f [12:58:14] nothing happens [12:58:33] well web [12:59:41] PROBLEM host: wmde-test1 is DOWN address: i-000002a9 CRITICAL - Host Unreachable (i-000002a9) [13:00:00] ok it does [13:00:03] but not for my page [13:00:05] :/ [13:00:15] why it crash meh [13:00:21] I wan't stack trace [13:00:26] want [13:00:53] hashar: it's gone [13:00:56] file [13:00:59] what it ? [13:01:05] yeah renamed catchall.log to web.log [13:01:07] sorry ;-D [13:01:11] ah [13:01:21] so now we have cli.log and catchall.log [13:01:41] they are written by "udp2log" daemon on "deployment-feed" [13:01:53] hashar: how long does it take for it to be written [13:01:58] so that I can read it [13:02:09] when I refresh the page, I don't see it [13:02:17] ok, now I do [13:02:25] yeah I/O sucks [13:03:39] wtf [13:03:47] in logs it looks like the page was rendered [13:03:53] but in browser I get blank page [13:03:58] ocwiki-ab26d65b [13:04:54] PROBLEM host: wmde-test is DOWN address: i-000002aa CRITICAL - Host Unreachable (i-000002aa) [13:05:14] ocwiki-ab26d65b: 2.0590 126.2M [13:05:18] what is memory limit [13:05:21] no idea [13:05:23] 128M? [13:05:27] I think so [13:05:28] probably [13:05:33] because it crashed at 126 [13:05:39] there must be a fatal error in apaches log [13:05:43] but we have no syslog server yet [13:05:44] :-(( [13:05:47] somewhere 
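The debugging workflow petan and hashar converge on above boils down to two checks: grep the aggregated udp2log output for the per-request prefix that logs.php prepends to every line, and confirm the PHP memory ceiling the parse is hitting. A minimal command-line sketch, assuming the paths and the `<dbname>-<hash>` prefix format quoted above (the log file was renamed during this session, so adjust the names):

```bash
# Pull every debug line for a single request out of the udp2log output.
# REQUEST_ID is the prefix MediaWiki builds from
#   $wgDBname . '-' . substr( md5( uniqid() ), 0, 8 )
# e.g. the "ocwiki-ab26d65b" seen above.
REQUEST_ID='ocwiki-ab26d65b'
LOG_DIR='/home/wikipedia/logs'        # catchall.log vs web.log changed during the session
grep -h -- "$REQUEST_ID" "$LOG_DIR"/catchall.log "$LOG_DIR"/cli.log 2>/dev/null | less

# Confirm the memory ceiling behind the "crashed at 126.2M, limit probably
# 128M" guess; the /etc/php5 path assumes a stock Ubuntu PHP 5 install.
php -r 'echo ini_get("memory_limit"), "\n";'
grep -Rn '^memory_limit' /etc/php5/ 2>/dev/null
```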
[13:16:24] PROBLEM Free ram is now: CRITICAL on wmde-test1 i-000002ab output: Connection refused or timed out [13:17:23] paravoid: ping ping so I am confused about apache::monitor puppet class https://gerrit.wikimedia.org/r/8575 [13:17:33] I am not sure what you are expecting [13:17:37] I was *just* looking at that [13:17:45] \O/ [13:17:47] I clicked the link then got the irssi notify [13:17:54] haha [13:17:57] go figure [13:18:02] telepathy might exist after all [13:18:36] ok, so [13:18:50] first of all, $::cluster is a typo, right? [13:18:55] and you meant $::realm? [13:20:24] RECOVERY host: wmde-test is UP address: i-000002ad PING OK - Packet loss = 0%, RTA = 3.36 ms [13:20:44] PROBLEM dpkg-check is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:21:24] PROBLEM Current Load is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:21:55] paravoid: yup cluster is a typ [13:21:56] o [13:22:04] PROBLEM Current Users is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:22:15] that is cause $cluster is what is used in mw php files [13:22:20] will definitely fix it [13:22:44] PROBLEM Disk Space is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:22:51] okay [13:23:13] the thing about admin:: [13:23:14] PROBLEM Free ram is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:23:21] is that these classes do not work in labs iirc [13:23:38] because they try to instantiate accounts, which fails due to the ldap/nfs setup [13:24:00] so you have to check for the realm there anyway [13:24:04] why would we want them to work? [13:24:06] PROBLEM SSH is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [13:24:34] PROBLEM Total Processes is now: CRITICAL on wmde-test i-000002ad output: Connection refused by host [13:24:36] we don't [13:25:01] * Ryan_Lane nods [13:25:03] why do you talk about admins:: ? [13:25:04] but hashar is adding some labs things to a class that has those, and I'm saying it won't work [13:25:06] ah [13:25:10] * Ryan_Lane nods [13:25:28] I really kind of wish we installed all users on all systems, and managed access via groups [13:25:29] my change is to adapt the apaches::monitoring class which is already deployed on labs AFAIK [13:25:51] then this would be a single change [13:26:14] hashar: have a look at imagescaler / imagescaler::labs [13:26:17] still not ideal [13:26:31] but see how imagescaler has admins & foo and it's for prod [13:26:46] and imagescaler::labs has only the bits needed for labs [13:26:58] so there's no point in adding apache::monitoring::labs to imagescaler [13:27:02] it won't work anyway [13:27:48] I don't get it sorry [13:27:49] :-( [13:28:00] still fail to see how it relate with apache::monitoring [13:28:16] you're adding » » apaches::monitoring::labs to imagescaler [13:28:18] what's the point? [13:28:54] RECOVERY SSH is now: OK on wmde-test i-000002ad output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:29:02] monitoring the imagescaler apache ? [13:29:06] (in labs) [13:29:20] ohh [13:29:26] that should be added to imagescaler::labs [13:29:27] k [13:29:32] that's not the point [13:29:44] it won't work for *any* of those classes [13:29:59] homeless, home-no-service, bits [13:30:36] hashar: regarding my bug from earlier, right now imagescaler extracts thumbs from videos not jobrunner, jobrunner only transcodes videos [13:30:43] i.e. 
if you go to http://commons.wikipedia.beta.wmflabs.org/w/thumb.php?f=Mayday2012-edit-1.ogv&width=200 [13:30:54] where is that image extracted? [13:31:43] j^: that is from the apaches which are indeed still running Lucid [13:32:08] j^: hence they have an old ffmpeg installation. We would have to get the thumb redirected to an imagescaler using Precise [13:32:24] j^: feel free to reopen the bug ;) [13:33:16] paravoid: I am sorry but I don't understand at all [13:33:40] you're adding apaches::monitoring::labs to the class named "bits" [13:33:42] paravoid: the reason I did change 8575 was because the labs Nagios complained about Apache giving a 403 [13:34:00] the root cause being the apaches being monitored were missing the en.wikipedia.org virtual host since they have en.wikipedia.beta.wmflabs.org [13:34:05] (and in the classes named "homeless" and "home-no-serivce") [13:34:10] right? [13:34:22] yes [13:34:31] it won't work [13:34:36] those classes do not work in labs [13:34:44] and that is the class I am using on labsconsole to configure the labs Apaches [13:34:55] which one? [13:35:21] apache20 for example https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&project=deployment-prep&instanceid=i-0000026c [13:35:23] it has: [13:35:35] applicationserver::homeless & imagescaler [13:35:57] puppet will fail with those [13:36:09] it might do some of the things but it will spew a lot of errors [13:36:17] try running puppetd -vt [13:36:20] well it surely fails applying the admins:: accounts::l0nupdate lvs::real server classes (for example) [13:36:28] exactly [13:36:38] but I am pretty sure it does apply apaches::service / apaches::cron / nfs::upload [13:36:53] well, it might, but it's not right nevertheless [13:36:58] we should have separate classes for labs [13:37:02] NOW i understand :-]]]]]]]]] [13:37:07] that do not have the admin:: accounts:: etc. [13:37:13] and that they have the labs monitoring bits [13:38:09] so my next step was to ask how to get ignore admins:: classes on labs ;-] [13:38:10] which is https://gerrit.wikimedia.org/r/#/c/8642/ [13:38:40] which safeguard admins::* so they only do something when on $::realm == 'production' [13:39:46] so as a summary, you on 8575 you were warning me about not using applicationserver::homeless class cause of the include admins::* [13:39:59] well [13:40:00] kind of :) [13:40:04] I don't like 8642 at all [13:40:07] hehe [13:40:14] I have sent it to discuss about it ;D [13:40:14] New patchset: Hashar; "only apply admins::* while on production" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8642 [13:40:22] they are related to each other [13:40:23] (rewrote commit message) [13:40:25] what I'm saying is [13:40:30] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8642 [13:40:37] see imagescaler::labs [13:40:41] New review: Hashar; "redid commit message" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8642 [13:40:54] so basically should I create a new application server::labs ? 
[13:40:59] I think so, yes [13:41:01] applicationserver::labs [13:41:06] that includes the labs monitoring bits [13:41:11] and does not include the admins:: foo [13:41:19] and anything else labs-specific that we might want to do [13:41:38] we shouldn't put if/elses all over the tree [13:41:43] this would mess it badly imho [13:42:09] when you see "include admin::accounts", you expect it to create accounts, not *conditionally* create accounts [13:42:14] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 5.08, 4.64, 4.94 [13:42:30] totally [13:42:34] so [13:43:24] Change abandoned: Hashar; "if / else clutter puppet. We do not want to conditionally create accounts. Instead, we should use a..." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8642 [13:44:31] so, [13:44:43] 1) leave apache::monitoring & applicationserver (and others) as is [13:44:58] 2) create apache::monitoring::labs that do the monitoring in labs [13:45:28] 3) create new applicationserver::labs role classes that include apache::monitoring::labs and any other classes that they should (but not admins:: accounts::) [13:45:45] does it make any sense? [13:46:17] yup :-) [13:46:39] do you also like it? :) [13:47:38] yes ;-) [13:50:32] and at a later point [13:50:49] I'd say that if applicationserver & applicationserver::labs share things [13:51:00] they should be abstracted and have a parent class from which they inherit [13:51:22] either applicationserver::base -> applicationserver & applicationserver::base -> applicationserver::labs [13:51:28] if we want to keep the current notation [13:51:39] or applicationserver -> applicationserver::production & applicationserver -> applicationserver::labs [13:51:45] if we want to change it at some point [13:51:53] (i'd go with the first for starters) [13:52:56] I don't understand why admins:: accounts::l0nupdate isn't a system account [13:56:14] New review: Bhartshorne; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7985 [13:56:18] Change merged: Bhartshorne; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7985 [13:58:18] offtopic: http://www.lucas-nussbaum.net/blog/?p=718 [13:58:29] distribution qa using the cloud [13:58:43] spawn Amazon EC2 instances that do full rebuilds of the Debian archives to find bugs [13:59:09] yes please. [13:59:44] he's found hundreds (or thousands) of bugs [13:59:51] perfect use of spot instances too. [13:59:57] and it especially helps when e.g. moving to a new gcc [14:00:12] "how many breakages should we expect if we switch to gcc-4.7?" [14:00:14] poof [14:01:12] is there a wiki page on how to recover after I break my git repo by running 'git pull' when I have a committed but unmerged change? [14:06:33] New patchset: Hashar; "(bug 37046) fix apache monitoring on deployment-prep" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8575 [14:06:37] paravoid: ^^^ [14:06:48] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8575 [14:07:23] New review: Faidon; "Great!" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8575 [14:07:26] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8575 [14:07:29] maplebed: mind pasting result of git log --decorate --graph --oneline ?? [14:07:31] wow, so much cleaner! [14:07:48] sure! [14:07:53] paravoid: trying on labs now [14:08:44] umm... hashar, despite the --oneline flag, the output is, in fact, 4883 lines. 
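Stepping back before the git tangent continues below: the three-step plan paravoid lays out above (leave the production classes untouched, add a labs monitoring class, and wrap the labs-safe pieces in a new role class) could look roughly like the sketch that follows. The class and include names are the ones quoted in this conversation (monitor_service, apaches::service, apaches::cron, nfs::upload, the en.wikipedia.beta.wmflabs.org vhost); the check parameters are placeholders, not the change that was actually merged.

```puppet
# Labs flavour of the apache monitoring: point the HTTP check at the beta
# virtual host instead of en.wikipedia.org. check_command is a placeholder.
class apaches::monitoring::labs {
    monitor_service { 'appserver http':
        description   => 'Apache HTTP',
        check_command => 'check_http_url!en.wikipedia.beta.wmflabs.org!/',
    }
}

# Role class for the deployment-prep apaches: only the production pieces known
# to apply cleanly on labs, plus the labs monitoring, and deliberately no
# admins:: / accounts:: includes (those fail under the labs LDAP/NFS setup).
class applicationserver::labs {
    include apaches::service
    include apaches::cron
    include nfs::upload
    include apaches::monitoring::labs
}
```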
[14:08:53] how much do you want? all? [14:10:00] --oneline is one line per commit :) [14:10:09] I know, I was making a funny. [14:10:15] I think I know how much to paste though [14:10:22] gimme a sec. [14:11:36] hashar: http://pastebin.com/gcZxfRS5 [14:11:49] it first diverges at * | | 421cc6c cleaning incoming URLs a little bit to increase hit ratio [14:12:10] then I did a pull, made another change, then merged the cleaning change in gerrit, then pulled again. [14:12:33] so [14:13:04] yeah [14:13:09] you should use: git pull --rebase [14:13:20] instead of git pull [14:13:24] or use branches ;-D [14:13:54] yeah, I know I should have used a branch but I forgot the first change hadn't merged yet. [14:14:02] I've never tried git pull --rebase. [14:14:15] will it magically do the right thing or do I need to go find commits and reorder them or somethin? [14:15:04] let me analyze your case ;) [14:17:38] hmm [14:17:49] I am trying to find a nice way to list commits which are in test and not in origin/test [14:18:03] I guess that should be : git log --oneline origin/test..test [14:18:28] a0c2bb2 Merge branch 'test' of ssh://gerrit.wikimedia.org:29418/operations/puppet into test [14:18:28] d386aeb creating second labs swift cluster for testing the swift upgrade from 1.4.3 to 1.4.7+swiftstack [14:18:28] 29f8572 Merge branch 'test' of ssh://gerrit.wikimedia.org:29418/operations/puppet into test [14:18:36] so you could create a new branch: git checkout -b feature -t origin/test [14:18:56] and cherry-pick the commits from `git log --oneline origin/test..test` in the new branch "feature" [14:19:12] though you don't want to cherry pick the merge commit [14:19:20] so maybe that is only d386aeb [14:19:24] yup. [14:19:43] and probably 421cc6c [14:19:55] git log --oneline --no-merges origin/test..test might help [14:20:03] ophohhhhohoh [14:20:08] git cherry origin/test test [14:20:20] that is the magic command that list commits in a branch and not in another [14:20:24] git-cherry - Find commits not merged upstream [14:20:30] discovered that one last week [14:20:46] it's only one line - the d386aeb change. [14:21:23] do you know off hand the syntax for the cherry pick or should I look it up? [14:21:36] git cherry-pick [14:21:44] git-cherry pick 421cc6c [14:21:51] git cherry-pick d386aeb [14:22:04] though you might be able to pass several sha1 as arguments [14:22:05] only the d386, not 421. [14:22:12] 421's already merged in gerrit. [14:22:35] now I try to push to gerrit again? [14:23:05] New patchset: Hashar; "rename monitor_service in apaches::monitoring::labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9391 [14:23:07] just to be sure: git log --oneline --decorate --graph [14:23:21] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9391 [14:23:24] you should see the change in (test) just above a line with (origin/test) [14:23:44] yes! [14:23:44] paravoid: so the apache monitoring add a duplicate entry. https://gerrit.wikimedia.org/r/9391 fix it (commit message contains puppet error) [14:23:52] maplebed: so you should be safe ;) [14:23:58] New patchset: Bhartshorne; "creating second labs swift cluster for testing the swift upgrade from 1.4.3 to 1.4.7+swiftstack" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9392 [14:24:00] hashar: you are amazing! [14:24:01] there might be another way to fix the issue you had [14:24:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9392 [14:24:14] that was SO much easier than the last time I tried to heal that kind of mess. [14:24:18] but creating a new branch + cherry picking is usually the most straightforward [14:24:21] I gotta write this shit down. [14:24:34] http://www.mediawiki.org/wiki/Gerrit ? ;) [14:24:37] (after I merge my change) [14:24:46] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9391 [14:24:50] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9391 [14:24:53] hmm I think we have a GIT FAQ somewhere [14:24:58] paravoid: danke [14:24:58] isn't it https://labsconsole.wikimedia.org/wiki/Git ? [14:25:06] oh there is that one too [14:26:28] maplebed: Oh and I have a nice list of aliases at http://www.mediawiki.org/wiki/Git/aliases [14:27:15] ohh, pretty... [14:27:17] http://www.mediawiki.org/wiki/Git/aliases#log-fancy [14:27:20] that one is a must have [14:27:26] I have it named LG [14:27:29] shorter to write [14:27:32] so I just: git lg [14:27:38] or: git lg --no-merges [14:27:47] or: git lg --no-merges test..production [14:28:24] you also need to use bash completion for git [14:28:55] so you just have to: git lg --no-m te..pr [14:29:15] (it completes on aliases names, git options, branches and refs) [14:30:50] !log deployment-prep hashar: migrating apache boxes from applicationserver::homeless to the new applicationserver::labs [14:30:51] Logged the message, Master [14:31:20] 05/30/2012 - 14:31:20 - Updating keys for laner at /export/home/deployment-prep/laner [14:32:12] hashar: thanks again. I've grabbed the transcript for wikification but my battery's about dead. [14:32:27] maplebed: enjoy sun in park so :-] [14:32:31] I'll pick it up later tonight or something. [14:32:39] ok ;) [14:32:41] see you tomorrow so [14:32:42] sadly no sun. only clouds. 
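The git recovery maplebed and hashar just walked through condenses to a short sequence: list the local commits that never made it upstream, replay them onto a clean branch tracking the remote, and push that for review. A recap of the commands quoted above; the branch name and the d386aeb hash are the examples from the pastebin, and the `lg` alias shown is a simplified stand-in for the fancier one on the linked Git/aliases page.

```bash
# Commits in the local branch but not upstream (merge commits excluded).
git cherry origin/test test
git log --oneline --no-merges origin/test..test   # equivalent view

# Replay only the unmerged work onto a clean branch tracking upstream.
git checkout -b feature -t origin/test
git cherry-pick d386aeb
git log --oneline --decorate --graph              # sanity check before pushing for review

# For next time: rebase on pull (or start a topic branch up front) so the
# local merge commits never appear in the first place.
git pull --rebase
git config --global alias.lg 'log --graph --oneline --decorate'
```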
[14:32:44] :) [14:32:47] cause I will take care of my wife and daughter this evening [14:33:20] 05/30/2012 - 14:33:19 - Updating keys for laner at /export/home/deployment-prep/laner [14:34:20] 05/30/2012 - 14:34:20 - Updating keys for laner at /export/home/deployment-prep/laner [14:34:35] PROBLEM HTTP is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [14:35:20] 05/30/2012 - 14:35:19 - Updating keys for laner at /export/home/deployment-prep/laner [14:36:20] 05/30/2012 - 14:36:20 - Updating keys for laner at /export/home/deployment-prep/laner [14:36:39] next [14:36:43] fix exim :_D [14:36:50] labs machine send all their mails to mchenry [14:37:19] 05/30/2012 - 14:37:19 - Updating keys for laner at /export/home/deployment-prep/laner [14:39:01] !log deployment-prep Migrating apaches from imagescaler class to imagescaler::labs [14:39:02] Logged the message, Master [14:43:05] go hashar go :) [14:43:23] still a lot to do :-D [14:43:32] but at least the job queue should be fine now ;-D [14:43:43] it had a nasty bug that basically killed the beta project [14:44:11] faidon managed to port our wikimedia packages to the latest ubuntu version and so we finally have a ffmpeg version suitable for TMH [14:44:12] hurrah [14:44:23] still have to figure out how to resubmit a video for transcoding [14:44:40] err: /Stage[main]/Nfs::Upload/Mount[/mnt/thumbs]: Could not evaluate: Execution of '/bin/mount -o bg,soft,tcp,timeo=14,intr,nfsvers=3 /mnt/thumbs' returned 32: mount.nfs: mounting deployment-nfs-memc:/mnt/export/thumbs failed, reason given by server: [14:44:41] bah [14:44:45] no thumbs ;-D [14:44:57] paravoid: there is almost no more errors now :-]] [14:48:46] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9354 [14:48:48] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9356 [14:48:49] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9357 [14:48:50] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9354 [14:49:31] New patchset: Andrew Bogott; "Added imagemagick to labsmw." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9394 [14:49:46] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9394 [14:49:48] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9394 [14:49:50] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9394 [14:50:34] Thehelpfulone: Can you tell me a bit about the 'global education' instance? [14:51:23] sure, http://education.wmflabs.org/wiki/Main_Page is the wiki, it's used for testing of a new extension for the Wikipedia Education Program [14:51:35] Thehelpfulone: I'm hoping to puppetize a similar install. So, wondering what extensions you used, what kind of extra config you did in addition to the default wikimedia self-install. [14:51:50] ah, well JeroenDeDauw did all the setup, so he would be the one to ask [14:52:20] ok. What's his irc nick? [14:52:52] JeroenDeDauw ;) [14:52:56] he's in this channel :P [14:53:08] Ok, no doubt he'll appear shortly then. thx [14:53:11] np :) [14:54:42] !log deployment-prep Updating mediawiki/core to 9780085 (aka just https://gerrit.wikimedia.org/r/#/c/9397/ which fix a wrong class name in job system) [14:54:44] Logged the message, Master [14:54:55] 13856 ? 
DN 0:00 php MWScript.php runJobs.php --wiki=?Fatal error: Class 'JobQueue' not found in /usr/local/apache/common-local/php-trunk/maintenance/nextJobDB.php on line 97 --procs=5 [14:54:58] that is not good ;) [14:58:41] fixed [14:58:46] so it is processing job queue again [15:04:42] PROBLEM HTTP is now: CRITICAL on wmde-test i-000002ad output: CRITICAL - Socket timeout after 10 seconds [15:09:22] PROBLEM dpkg-check is now: CRITICAL on mwreview-1 i-000002a6 output: DPKG CRITICAL dpkg reports broken packages [15:11:22] RECOVERY Current Load is now: OK on wmde-test i-000002ad output: OK - load average: 0.96, 0.36, 0.12 [15:11:22] RECOVERY Free ram is now: OK on wmde-test i-000002ad output: OK: 91% free memory [15:12:02] RECOVERY Current Users is now: OK on wmde-test i-000002ad output: USERS OK - 0 users currently logged in [15:14:32] RECOVERY Total Processes is now: OK on wmde-test i-000002ad output: PROCS OK: 85 processes [15:15:52] RECOVERY Disk Space is now: OK on wmde-test i-000002ad output: DISK OK [15:15:52] RECOVERY dpkg-check is now: OK on wmde-test i-000002ad output: All packages OK [15:19:22] RECOVERY dpkg-check is now: OK on mwreview-1 i-000002a6 output: All packages OK [15:21:13] paravoid: would it be ok to add some if / else $::realm in realm.pp ? [15:21:19] since it hold global configuration [15:21:55] I guess [15:28:12] paravoid: do you know labs SMTP relay server ? [15:28:26] mchenry rejects emails [15:28:32] was going to use smtp.pmtpa.wmnet [15:31:21] 550 Administrative prohibition [15:31:23] doh [15:31:26] asking mark [15:44:23] New patchset: Hashar; "(bug 36996) ability to change exim4 route list" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [15:44:35] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (test); V: -1 - https://gerrit.wikimedia.org/r/9401 [15:46:10] New patchset: Hashar; "(bug 36996) ability to change exim4 route list" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [15:46:27] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9401 [15:49:17] PROBLEM host: mwreview-1 is DOWN address: i-000002a6 check_ping: Invalid hostname/address - i-000002a6 [15:52:39] ugh, is labs broke again? [15:53:12] doesn't look like it... [15:53:40] hrmmm [15:53:45] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:53:47] my ssh hung [15:54:00] and then i tried again and it worked but took too long [15:54:25] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:05] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:15] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [15:55:56] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:56] PROBLEM Total Processes is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:57:46] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Error - Could not complete SSL handshake. 
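For what it's worth, a raw SMTP session shows exactly at which step a relay answers 550 Administrative prohibition; a sketch using the relay name mentioned above (the instance name and addresses are placeholders, not the actual labs configuration):

    # talk SMTP by hand and watch where the 550 comes back
    {
      sleep 1; printf 'EHLO my-instance.pmtpa.wmflabs\r\n'
      sleep 1; printf 'MAIL FROM:<root@my-instance.pmtpa.wmflabs>\r\n'
      sleep 1; printf 'RCPT TO:<someone@example.org>\r\n'
      sleep 1; printf 'QUIT\r\n'
    } | nc smtp.pmtpa.wmnet 25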
[16:05:16] paravoid: got you a change to alter exim4 conf https://gerrit.wikimedia.org/r/9401 [16:05:33] wtf [16:05:35] paravoid: looks like mchenry does not like spam^Wmails coming from labs :D [16:05:36] hehe [16:05:36] nothing stops you, does it? [16:05:44] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 89% free memory [16:05:44] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [16:06:40] hrm [16:06:42] I did it similar to $nameservers ;) [16:06:45] I see you removing lily [16:06:48] but not adding it back [16:07:35] ohh [16:07:41] that is LILY [16:07:43] I have put lists [16:07:45] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [16:11:39] New patchset: Hashar; "(bug 36996) ability to change exim4 route list" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [16:11:42] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9401 [16:11:54] RECOVERY Total Processes is now: OK on mwreview i-000002ae output: PROCS OK: 106 processes [16:14:14] New review: Hashar; "Patchset3 :" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9401 [16:14:19] paravoid: updated :) [16:14:22] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=5f6f7f59ce471d3411ea13d5c158778b662bbc43;hp=f2cda6040c28b86a7e5ba01ae6444a406c47724b [16:14:24] diff [16:14:44] can i get some help on labs-nfs1? [16:14:48] still have to find out a SMTP server for labs though [16:15:33] hashar: paravoid: ^ ? [16:15:48] help on what? [16:15:53] perms [16:15:56] i'll /msg [16:16:37] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9401 [16:17:00] hashar: I'm adding Mark as a reviewer too [16:17:06] yeah thanks [16:17:41] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9401 [16:17:45] but merged it nevertheless [16:17:56] by mistake ? [16:19:30] nah, it looks good to me [16:19:44] mark will still see it [16:19:54] right? :) [16:21:16] hopefully [16:21:37] though if it is on [test] branch maybe he filter them out [16:21:41] will see [16:25:21] paravoid: I sent mark + ryan a mail requesting a SMTP server for labs [16:25:24] you are in CC: [16:25:33] thanks for the reviews today :-] [16:25:35] I am off for now [16:25:43] bye-bye :) [16:25:46] see you tomorrow I guess? [16:25:52] in person? [16:25:54] Friday [16:26:12] ah [16:26:14] friday then [16:26:21] around 2pm at the venue I guess [16:26:22] I'll be travelling most of tomorrow [16:26:27] well, 8h [16:26:33] and then I plan to see berlin in the afternoon [16:26:36] so I'll be mostly be away [16:27:28] fine fine :) [16:27:30] take your time [16:27:37] we have made good progress anyway ;-]] [16:27:42] see ya friday ! [16:37:16] JeroenDeDauw: What mediawiki extensions did you include? [16:37:17] And, do you do any caching? [16:44:07] so, if a dir is 0777 how could a touch of a new file in that dir fail for permissions? [16:44:18] no matter who i am [16:45:31] paravoid: the root cause was `patch` preserves group but not user after surgery [16:45:44] (user=owner) [16:48:48] grr [16:58:40] . [17:01:22] PROBLEM dpkg-check is now: CRITICAL on ganglia-test5 i-000002a7 output: DPKG CRITICAL dpkg reports broken packages [17:08:43] !log bots jeremyb: [bots-1,bots-nfs] did some IRC log redaction surgery. booted wm-bot (wmib) a couple times. (same way as before. 
kill bot; do surgery; kill sleep; didn' touch restart.sh) had some weird permissions issue that ended up causing #wikimedia-tech's log to lose messages. will restore those from my personal log later (probably after midnight so I don't have to do any more bot killing) [17:08:44] Logged the message, Master [17:20:48] andrewbogott: look at Special:Version ;) [17:37:01] jeremyb: did u touch bot [17:37:08] for some reason it restarted many times [17:39:22] petan|wk: yes, we had to delete some things from the logs [17:42:11] for that you don't need to restart bot [17:42:33] Thehelpfulone: why it's not logged? [17:42:39] and why did u restart it [17:43:04] I didn't restart it, but that's why I imagine it was restarted [17:43:13] which logs [17:43:26] jeremyb did log it [17:44:03] https://labsconsole.wikimedia.org/wiki/Server_Admin_Log#Nova_Resource:Bots.2FSAL [17:44:10] k [17:44:23] jeremyb: what permission issue? [17:46:22] RECOVERY dpkg-check is now: OK on ganglia-test5 i-000002a7 output: All packages OK [17:49:47] petan|wk: back for a min [17:50:00] petan|wk: so, the procedure i used was: [17:50:54] bot is using buffers for IO [17:51:02] ok [17:51:03] so logs are written to disk every minute [17:51:04] i figured [17:51:28] why did u restart it? [17:51:36] hold on [17:52:38] copy the log to a new location, edit the copy, make a diff (with `diff`), recopy it, apply the diff (with patch) and make sure it worked right. then the kill mono; apply patch for real; kill sleep dance [17:52:47] so the bot wasn't down very long [17:52:49] but... [17:53:11] i was doing that as root on bots-1. the files are mounted by nfs from bots-nfs [17:53:39] the patch operation resulted in a file that was owned by nobody:nogroup [17:53:44] PROBLEM Current Load is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:53:51] 30 16:45:31 < jeremyb> paravoid: the root cause was `patch` preserves group but not user after surgery [17:53:54] 30 16:45:43 < jeremyb> (user=owner) [17:54:08] ok, next time just open the log in editor and directly update it [17:54:14] do @logoff before [17:54:22] it's same as turning bot down [17:54:24] PROBLEM Current Users is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:54:26] ok [17:54:29] just it doesn't spam channels so much [17:54:45] but does it handle SIGINT/SIGKILL cleanly? [17:55:04] PROBLEM Disk Space is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:55:05] anyway, the issue was wmib couldn't write to nobody files or something like that. i assume [17:55:15] fixed the perms and then it started writing ok [17:55:39] but then i had to do the rest of the surgery (other channel) [17:55:44] PROBLEM Free ram is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:55:55] "ok" was for @logoff [17:56:24] but i don't agree necessarily for "edit directly" [17:56:43] so when I tried to edit directly petan|wk saving gave me a permission denied [17:56:49] so doing @logoff will solve that? [17:56:54] PROBLEM Total Processes is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:56:58] no [17:57:16] please don't edit directly unless it's a file written by a process you wrote [17:57:33] so maybe petan|wk can. 
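A rough reconstruction of that procedure, with invented paths and account names (the real ones are not in the log):

    # prepare the redaction offline
    cp /data/logs/channel.log /tmp/channel.log.edit          # copy the live log somewhere safe
    "$EDITOR" /tmp/channel.log.edit                          # delete the lines being redacted
    diff -u /data/logs/channel.log /tmp/channel.log.edit > /tmp/redact.diff

    # the "kill bot; apply patch for real; kill sleep" dance
    pkill mono                                               # wm-bot (wmib) runs under mono
    patch /data/logs/channel.log < /tmp/redact.diff
    # as reported above, the patch run left the file owned by nobody:nogroup and the bot
    # could no longer write to it until the ownership was put back:
    chown wmib:wmib /data/logs/channel.log                   # user/group names are guesses
    pkill -x sleep                                           # restart.sh is waiting in a sleep; killing it brings the bot back up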
[17:57:34] PROBLEM dpkg-check is now: CRITICAL on ganglia-test6 i-000002af output: Connection refused by host [17:57:38] at least not while it's still running [17:57:52] jeremyb: it's never opened by thread [17:57:59] logs are in cache [17:58:09] then the file is open every minute and data are written [17:58:14] anyway, i'm in a rush so i'm not thinking straight and can't stick around [17:58:17] if you do @logoff it's like if u shut it down [17:58:53] editing directly is possibl [17:59:01] Thehelpfulone: you weren't loged in as wmib [17:59:14] that's why you couldn't write [17:59:20] it's simple enough to just kill and start again [17:59:39] it's simple, but not a right way to do that... [17:59:46] why not? [17:59:50] it's bothers a lot of people in many channels [17:59:54] 30 17:54:45 < jeremyb> but does it handle SIGINT/SIGKILL cleanly? [17:59:58] bot is join flooding [18:00:02] what does @logoff do? [18:00:11] i really don't think it was such a bother [18:00:16] @logoff [18:00:35] Channel is now logged [18:00:56] no file descriptors are open [18:01:03] anyway, i'm really running out the door [18:01:03] so you can directly write [18:02:50] Thehelpfulone: btw is killion aware of that? [18:03:22] @logstatus [18:03:31] @log [18:03:36] @commands [18:03:36] Commands: channellist, trusted, trustadd, trustdel, info, infobot-link, infobot-share-trust+, infobot-share-trust-, infobot-share-off, infobot-share-on, infobot-off, refresh, infobot-on, drop, whoami, add, reload, suppress-off, suppress-on, help, RC-, recentchanges-on, language, recentchanges-off, logon, logoff, recentchanges-, recentchanges+, RC+ [18:03:48] @help [18:03:48] Unknown command type @commands for a list of all commands I know [18:04:03] @logon [18:04:03] Channel is already logged [18:04:20] @help suppress-on [18:04:20] Info for suppress-on: Disable output to channel [18:04:25] ahh [18:04:43] why am i still here? ;) [18:06:36] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 3.71, 4.41, 2.40 [18:07:43] petan: yes [18:08:18] @logoff [18:08:18] Permission denied [18:08:23] hah [18:08:26] :( [18:08:33] lol [18:08:34] petan: I can haz permissions? [18:08:38] jeremyb: go disappear :P [18:09:58] @trustadd .*@wikimedia/Thehelpfulone admin [18:09:58] Successfuly added .*@wikimedia/Thehelpfulone [18:10:07] ty [18:11:33] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.45, 2.79, 2.24 [18:19:40] JeroenDeDauw: OK, I'm catching up... which of those extensions are being tested and which are necessary in order to do the testing? 
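The lower-impact alternative being suggested here, spelled out (channel, pattern and path are placeholders):

    # in the channel: tell wm-bot to stop writing the log first
    #     @logoff
    # then edit the file directly on bots-1, as the bot's own account rather than root,
    # so the ownership problem from earlier never comes up:
    sed -i '/text to redact/d' /data/logs/channel.log
    # and re-enable logging in the channel:
    #     @logon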
[18:28:57] RECOVERY Free ram is now: OK on ganglia-test6 i-000002af output: OK: 93% free memory [18:29:27] RECOVERY Current Load is now: OK on ganglia-test6 i-000002af output: OK - load average: 0.12, 0.33, 0.82 [18:29:27] RECOVERY Current Users is now: OK on ganglia-test6 i-000002af output: USERS OK - 0 users currently logged in [18:30:07] RECOVERY Disk Space is now: OK on ganglia-test6 i-000002af output: DISK OK [18:30:17] RECOVERY Total Processes is now: OK on ganglia-test6 i-000002af output: PROCS OK: 80 processes [18:30:57] RECOVERY dpkg-check is now: OK on ganglia-test6 i-000002af output: All packages OK [18:35:07] ssmollett: so ganglia works on Precise :-] http://ganglia.wmflabs.org/latest/?c=deployment-prep&h=deployment-jobrunner05&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:35:11] ssmollett: thanks a ton :-] [18:35:37] !log deployment-prep Sara made ganglia available on Ubuntu Precise and hence jobrunner05 show up http://ganglia.wmflabs.org/latest/?c=deployment-prep&h=deployment-jobrunner05&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:35:40] Logged the message, Master [18:37:44] so now I have to find out why that job runner takes that much CPU doing basically nothing [18:40:33] ohh great [18:40:46] and I was looking at that bug report [18:40:54] but sara got me to it [18:41:03] yeah I sent her an email yesterday [18:41:07] well 20 hours ago [18:41:14] I can't remember if I added you to cc [18:41:30] luckily she was already working on Precise ;-]] [18:42:54] I have forwarded you the message [18:42:58] (might be an HTML mail sorry) [18:43:20] 05/30/2012 - 18:43:19 - Updating keys for laner at /export/home/deployment-prep/laner [18:43:27] hehe [18:43:49] that labs-home-wm message about Ryan_Lane1 key is me running puppetd- on jobrunner05 ;) [18:46:08] PROBLEM Current Load is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:46:33] PROBLEM Current Users is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:47:08] PROBLEM Disk Space is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:47:43] PROBLEM Free ram is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:49:03] PROBLEM Total Processes is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:49:33] PROBLEM dpkg-check is now: CRITICAL on mwreview-lucid i-000002b0 output: Connection refused by host [18:52:49] so I got job [18:52:53] that are never pop ed out [18:52:57] which is puzzling me [18:52:59] seriously [18:59:52] hashar: no problem. [19:00:02] ssmollett: thanks a ton seriously :-] [19:00:11] just in time to investigate a bit a side issue [19:00:15] and to have it ready for Berlin [19:00:16] \O/ [19:09:55] !log deployment-prep jobrunner05 CPU usage is due to some job infinite loop. Working on it. [19:09:57] Logged the message, Master [19:23:30] !log deployment-prep updatiing mediawiki/core to master [19:23:32] Logged the message, Master [19:23:53] !log deployment-prep updating mediawiki/core to master 58f390e to finish job loop fix [19:23:55] Logged the message, Master [19:24:07] PROBLEM Current Load is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:24:27] PROBLEM Current Users is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
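One way to chase "jobs that never get popped" is with the maintenance scripts already being used on the instance; a sketch (the wiki id is a placeholder, and the --group option is only present on newer MediaWiki versions):

    # how many jobs are pending on a given wiki
    php MWScript.php showJobs.php --wiki=somewiki
    # per-type breakdown, where the MediaWiki version supports it
    php MWScript.php showJobs.php --wiki=somewiki --group
    # run a few jobs by hand and watch whether they actually leave the queue
    php MWScript.php runJobs.php --wiki=somewiki --maxjobs 10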
[19:25:33] PROBLEM Disk Space is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:25:43] PROBLEM Free ram is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:26:22] !log deployment-prep jobrunner05 is happy again. Hurrah [19:26:24] Logged the message, Master [19:27:13] PROBLEM Total Processes is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:27:53] PROBLEM dpkg-check is now: CRITICAL on mwreview-lucid1 i-000002b1 output: CHECK_NRPE: Socket timeout after 10 seconds. [19:31:38] hashar, how many mysqls do we have in labs? [19:32:03] only one [19:32:06] deployment-sql [19:32:12] I am out for today sorry [19:32:15] daughter duty ;-D [19:32:18] she is sick :(( [19:32:19] ++ [19:32:40] take care of her [19:44:24] PROBLEM Total Processes is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:45:04] PROBLEM dpkg-check is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:45:14] PROBLEM Puppet freshness is now: CRITICAL on nova-precise1 i-00000236 output: Puppet has not run in last 20 hours [19:46:14] PROBLEM Current Load is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:46:55] PROBLEM Current Users is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:47:55] PROBLEM Disk Space is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:48:05] PROBLEM Free ram is now: CRITICAL on mwreview-test4 i-000002b2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [19:50:05] PROBLEM Puppet freshness is now: CRITICAL on nova-essex-test i-000001f9 output: Puppet has not run in last 20 hours [19:50:58] Why won't wmflabs even load? [19:52:15] try now [19:58:57] Reedy: No, still can't connect [20:00:12] PROBLEM Puppet freshness is now: CRITICAL on nova-production1 i-0000007b output: Puppet has not run in last 20 hours [20:37:14] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours [21:35:03] 05/30/2012 - 21:35:03 - Updating keys for mschon at /export/home/wikistats/mschon [21:35:15] 05/30/2012 - 21:35:15 - Updating keys for mschon at /export/home/bastion/mschon [22:35:10] PROBLEM Puppet freshness is now: CRITICAL on localpuppet2 i-0000029b output: Puppet has not run in last 20 hours [23:29:06] PROBLEM Current Load is now: WARNING on ganglia-test4 i-000002a2 output: WARNING - load average: 0.57, 6.47, 5.04 [23:34:06] RECOVERY Current Load is now: OK on ganglia-test4 i-000002a2 output: OK - load average: 4.91, 3.64, 4.10