[00:01:42] New review: Krinkle; "- bug 1234" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5
[02:06:55] New patchset: Andrew Bogott; "I long for the day when I can test crap like this without having to publish it for the whole world to see." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9111
[02:07:10] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9111
[02:07:20] New review: Andrew Bogott; "Me too, man. Me too." [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9111
[02:07:22] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9111
[02:19:01] PROBLEM host: mwreview-proto is DOWN address: i-00000295 check_ping: Invalid hostname/address - i-00000295
[02:23:45] PROBLEM Total Processes is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:24:25] PROBLEM dpkg-check is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:25:10] New patchset: Andrew Bogott; "I will be making quite a few more basic mistakes like this one before the night is out." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9112
[02:25:25] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9112
[02:25:44] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9112
[02:25:45] PROBLEM Current Load is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:25:46] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9112
[02:26:15] PROBLEM Current Users is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:26:55] PROBLEM Disk Space is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:27:35] PROBLEM Free ram is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:28:52] New patchset: Andrew Bogott; "As I said." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9113
[02:29:07] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9113
[02:29:09] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9113
[02:29:11] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9113
[02:37:34] New patchset: Andrew Bogott; "I'm pretty much just trying random stuff now." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9114
[02:37:51] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9114
[02:37:59] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9114
[02:38:01] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9114
[02:47:21] 05/28/2012 - 02:47:21 - Updating keys for laner at /export/home/deployment-prep/laner
[02:51:21] 05/28/2012 - 02:51:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:01:21] 05/28/2012 - 03:01:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:03:20] 05/28/2012 - 03:03:19 - Updating keys for laner at /export/home/deployment-prep/laner
[03:05:20] 05/28/2012 - 03:05:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:26:19] 05/28/2012 - 03:26:19 - Updating keys for laner at /export/home/deployment-prep/laner
[03:48:27] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[03:51:07] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[03:59:02] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 16% free memory
[03:59:02] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory
[04:05:19] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 14% free memory
[04:08:38] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 3% free memory
[04:13:56] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory
[04:19:06] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 3% free memory
[04:19:56] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 5% free memory
[04:24:06] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory
[04:24:06] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:29:50] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory
[04:34:10] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[06:07:06] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 71% free memory
[06:10:56] GRRR ...
[06:11:19] LiWa3 can take itself out on bots-2
[06:11:25] making bots-2 inaccessible
[06:30:18] I seem to be unable to ssh from bastion to bots-2 .. while I can get into bots-3. Bots-2 says, after asking for my passphrase for the key, 'Permission denied (publickey).' (did not have that yesterday)
[06:37:55] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:37:55] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:40:08] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.83, 6.03, 3.83
[06:40:43] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:03] PROBLEM Total Processes is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:08] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:08] PROBLEM SSH is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - Socket timeout after 10 seconds
[06:41:08] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:39] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - load average: 20.83, 40.82, 24.57
[06:43:54] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Disk Space is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:59] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:59] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:59] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:46:03] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 12% free memory
[06:47:31] RECOVERY SSH is now: OK on bots-cb i-0000009e output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:47:31] RECOVERY Total Processes is now: OK on bots-cb i-0000009e output: PROCS OK: 115 processes
[06:55:30] RECOVERY Total Processes is now: OK on maps-tilemill1 i-00000294 output: PROCS OK: 104 processes
[06:55:35] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 86% free memory
[06:55:35] PROBLEM Current Load is now: WARNING on maps-tilemill1 i-00000294 output: WARNING - load average: 1.97, 3.87, 5.16
[06:55:35] RECOVERY Current Users is now: OK on maps-tilemill1 i-00000294 output: USERS OK - 0 users currently logged in
[06:55:35] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK
[06:55:35] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK
[06:55:35] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 7.01, 9.03, 14.27
[06:55:40] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK
[06:55:40] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 4.02, 6.43, 6.22
[06:55:40] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 81% free memory
[06:55:40] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 124 processes
[06:55:45] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK
[06:55:45] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in
[06:55:45] RECOVERY dpkg-check is now: OK on ganglia-test2 i-00000250 output: All packages OK
[06:55:45] RECOVERY Disk Space is now: OK on ganglia-test2 i-00000250 output: DISK OK
[06:55:55] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory
[06:55:55] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK
[06:55:55] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 6.25, 5.57, 5.06
[06:55:55] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in
[06:55:55] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 81 processes
[07:08:43] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.05, 1.28, 3.13
[07:08:43] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.45, 1.22, 3.24
[07:08:58] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:28] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:48] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:02] hey
[07:13:20] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:21] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:21] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:30] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in
[07:13:30] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK
[07:14:20] yeah .. something is wrong
[07:15:09] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:24] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:59] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 5.72, 5.13, 6.48
[07:17:20] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:20] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Current Load is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM SSH is now: CRITICAL on ganglia-test2 i-00000250 output: CRITICAL - Socket timeout after 10 seconds
[07:17:59] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:14] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:14] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:59] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:14] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:19] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:19] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM Disk Space is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM Total Processes is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:29] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:29] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:55] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK
[07:19:55] PROBLEM Current Load is now: WARNING on reportcard2 i-000001ea output: WARNING - load average: 9.28, 7.21, 5.23
[07:20:24] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 9.47, 7.38, 6.03
[07:20:24] PROBLEM Current Users is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:24] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:24] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:21:23] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK
[07:21:23] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 99 processes
[07:21:28] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 89% free memory
[07:21:28] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 4.94, 6.67, 5.73
[07:21:28] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 0.41, 4.62, 5.25
[07:21:28] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in
[07:22:01] Beetstra: hm?
[07:22:17] why do you think it looks like typical labs status :D
[07:22:35] PROBLEM Current Load is now: CRITICAL on aggregator-test3 i-00000293 output: CRITICAL - load average: 0.48, 10.74, 22.84
[07:22:42] !ping
[07:22:42] pong
[07:23:04] eh .. is it typical?
[07:23:08] kind of
[07:23:12] Anyways .. I seem to be locked out of bots-2?
[07:23:19] ok
[07:23:23] is there anything running now
[07:23:31] we need to reboot it to fix that
[07:23:36] because no one is able to login
[07:23:52] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 4.88, 7.08, 8.66
[07:23:54] it's not working much
[07:23:57] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 4.33, 8.30, 8.18
[07:23:57] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in
[07:23:57] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK
[07:23:57] PROBLEM Current Load is now: WARNING on worker1 i-00000208 output: WARNING - load average: 3.15, 6.86, 6.20
[07:23:57] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 1 users currently logged in
[07:23:57] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK
[07:24:56] RECOVERY Current Load is now: OK on reportcard2 i-000001ea output: OK - load average: 1.92, 4.88, 4.84
[07:25:09] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK
[07:25:09] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 90 processes
[07:25:14] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 91% free memory
[07:25:19] RECOVERY Current Users is now: OK on ganglia-test2 i-00000250 output: USERS OK - 0 users currently logged in
[07:25:39] I told my bot to die .. but I don't know if it did ..
just reboot it
[07:26:19] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.19, 1.52, 4.48
[07:26:19] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 3.40, 5.32, 5.99
[07:26:19] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 1.51, 3.72, 5.08
[07:26:19] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 85% free memory
[07:26:19] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 83 processes
[07:26:24] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 0.39, 2.75, 4.27
[07:26:24] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.00, 1.69, 3.80
[07:26:24] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 80% free memory
[07:26:24] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 118 processes
[07:26:29] RECOVERY Disk Space is now: OK on reportcard2 i-000001ea output: DISK OK
[07:26:29] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 85% free memory
[07:26:29] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in
[07:26:29] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 83 processes
[07:26:34] RECOVERY SSH is now: OK on ganglia-test2 i-00000250 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:26:34] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 7.08, 9.94, 10.68
[07:27:24] PROBLEM Current Load is now: WARNING on aggregator-test3 i-00000293 output: WARNING - load average: 0.11, 4.06, 16.64
[07:28:24] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in
[07:28:24] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK
[07:28:24] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[07:28:24] RECOVERY Current Load is now: OK on worker1 i-00000208 output: OK - load average: 0.02, 2.53, 4.50
[07:28:24] RECOVERY Total Processes is now: OK on ganglia-test2 i-00000250 output: PROCS OK: 173 processes
[07:29:41] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK
[07:29:42] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in
[07:29:42] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 79% free memory
[07:29:42] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK
[07:29:42] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 87 processes
[07:30:23] RECOVERY Current Load is now: OK on pybal-precise i-00000289 output: OK - load average: 0.16, 1.87, 3.95
[07:31:17] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.09, 2.02, 4.37
[07:31:17] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 0.07, 1.45, 3.72
[07:33:08] AHH hello
[07:33:40] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 0.29, 1.54, 4.58
[07:36:32] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 3.10, 3.38, 4.60
[07:41:25] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.37, 0.98, 4.34
[07:47:31] RECOVERY Current Load is now: OK on aggregator-test3 i-00000293 output: OK - load average: 0.64, 0.64, 4.99
[07:48:21] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.11, 0.79, 3.23
[07:53:22] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.21, 0.76, 2.58
[08:05:02] petan, you managed to restart bots-2?
[08:05:12] should I?
[08:05:26] I think so .. if that is the only way of getting back access to it
[08:06:28] I get: Enter passphrase for key '/home/beetstra/.ssh/id_rsa': \n Permission denied (publickey).
[08:06:43] done
[08:07:07] thanks
[08:11:06] Note to self: I have to work on the code of linkwatcher so it does not take itself out by flooding memory
[08:33:39] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[08:46:56] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 6.07, 6.01, 5.36
[08:52:37] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 13% free memory
[08:58:43] hashar: how do we update the svn-trunk in beta?
[08:58:49] php-trunk
[08:58:50] I mean
[08:59:02] I made a script for that but maybe you have one as well
[08:59:05] mediawiki ? :-D
[08:59:09] everything
[08:59:12] not just mw
[08:59:19] oh with extensions too
[08:59:23] + run update.php
[08:59:29] you know it is going to break stuff ? ;-D
[08:59:34] how
[08:59:37] should be something like:
[08:59:45] cd /home/wikipedia/common/php-trunk
[08:59:50] git pull
[09:00:02] then update submodules using something like:
[09:00:06] cd /home/wikipedia/common/php-trunk/extensions
[09:00:06] that is how it works
[09:00:09] git submodule update
[09:00:20] why not use pull
[09:00:23] and to run update.php: foreachwiki update.php
[09:00:23] for everything
[09:00:45] hashar: ok I know that, but how does it break stuff
[09:01:07] cause that deploys code from master which might be unstable
[09:01:14] and you might have to install DB updates
[09:01:23] so it should be done carefully
[09:02:41] hashar: but that's what we want to do, or not?
[09:02:53] beta is for testing these unstable things to check if they are stable or not...
[09:02:59] yup
[09:03:00] or, what is trunk supposed to be?
[09:03:12] do we have a stable trunk
[09:03:14] eventually we are going to update MediaWiki core + extensions daily
[09:03:15] :)
[09:03:24] yes, I was thinking of putting it in cron
[09:03:26] for now, I would prefer we keep the software as is ;-)
[09:03:30] nooo
[09:03:34] not cron please ;-]]
[09:03:46] script is /usr/local/apache/common-local/bin$ cat updaterepo.sh
[09:03:51] ok
[09:03:53] we really want to update the site manually
[09:04:07] I will run it now, to check if that does break things as you say, or not
[09:04:12] so we know who / why / when something goes wrong
[09:04:15] !log deployment-prep petrb: running update
[09:04:18] but please no
[09:04:19] arhghgg
[09:04:28] :D
[09:04:32] no worries
[09:04:33] I didn't
[09:04:33] the cluster is already broken enough
[09:04:44] you don't like bottie?
[09:04:51] I am not really willing to spend time this way figuring out which new code is breaking it ;-]]
[09:04:53] why do we have another bot for this project :P
[09:05:22] ok, so when are we going to update to trunk
[09:05:24] but updating daily is definitely on the list of stuff to do. Will do that later when the cluster is more stable (aka configuration running from production)
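The update sequence hashar dictates above, collected in one place. A sketch only: the paths and the foreachwiki wrapper are beta/production conventions quoted from the conversation, not standard tools, and as hashar warns, this deploys whatever is on master, including pending schema changes.

```bash
#!/bin/bash
# Manual beta code update, per hashar (08:59-09:00). Not a supported
# tool -- just his steps in order, stopping at the first failure.
set -e

# 1. Update MediaWiki core to current master.
cd /home/wikipedia/common/php-trunk
git pull

# 2. Check out the extension submodules at the commits core records.
#    (This answers "why not use pull": submodule update pins each
#    extension to the revision the superproject expects, instead of
#    drifting past it.)
cd /home/wikipedia/common/php-trunk/extensions
git submodule update

# 3. Apply pending database updates on every wiki.
foreachwiki update.php
```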
[09:05:31] ok
[09:05:35] so later
[09:05:36] ;)
[09:05:37] why do we have another bot
[09:05:42] ah
[09:05:50] that was to remove the /bin/log hack
[09:05:56] which sent the message to some labs instance
[09:06:04] instead I used something which is similar to production
[09:06:14] so you log directly from the -dbdump machine
[09:06:15] I mean there are a lot of projects, I would prefer to use 1 bot for all projects, rather than 50 bots
[09:06:36] ok, what if I type log on -apache20
[09:06:40] does it work?
[09:06:54] ohh you wanna do that
[09:06:55] hmm
[09:07:06] it is probably not going to work ;-D
[09:07:39] anyway, I liked that it said "message logged" so I knew something did happen
[09:07:40] so yeah, you should do everything from -dbdump
[09:07:56] if it doesn't do that on prod, we should fix it :D
[09:08:09] !log deployment-prep hashar: I am the log bot
[09:08:27] so the only difference is that the bot runs locally on db-dump
[09:08:31] so `log` is beta-logmsgbot
[09:08:31] otherwise it's the same?
[09:08:38] I am not sure which bot is reading from there
[09:08:55] yeah it is running on db-dump
[09:08:58] I see
[09:09:02] that should be the only difference
[09:09:06] but I don't know what the advantage of that is
[09:09:20] it also lets us restart it easily whenever needed, without having to get access to the bots project
[09:09:22] it looks quite the same as before to me, just that we have 1 more bot :D
[09:09:31] yeah
[09:09:44] well, there is an instance, bots-labs, which is supposed to host only labs related services
[09:09:48] but one less inter-project dependency
[09:09:57] bottie is supposed to work 24*6
[09:09:59] * 7
[09:10:00] :D
[09:10:01] lol
[09:10:10] typo
[09:11:42] well it does not
[09:11:43] also it was good that you could log it from any machine... but I don't really care
[09:11:47] it doesn't?
[09:11:48] so I prefer having the bot locally on dbdump
[09:11:54] that makes things easier to fix / debug
[09:12:02] hm...
[09:12:28] we could probably add a new package / puppet class to install the log command on all machines
[09:12:40] but I am not sure it is going to be used that much anyway
[09:12:47] the whole idea is to do everything from dbdump
[09:12:52] and almost never connect to the other hosts
[09:13:03] ok, but you sometimes have to
[09:13:41] it's hard to restart squid from dbdump
[09:13:42] etc
[09:13:57] on prod it works from one instance only?
[09:13:59] box
[09:25:03] hmm I am not sure about squid
[09:25:33] but for apaches, we do a dsh to all the apache boxes and use a script named apache-graceful
[09:25:38] which is available locally
[09:25:42] aka on each box
[09:25:57] not sure how it is installed, I guess it is just in /home/wikipedia/bin which is mounted from fenari
[09:26:01] will have to look at it
[10:07:21] !log deployment-prep hashar: cherry picking change 9116
[10:32:38] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds.
[10:37:31] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.48, 6.88, 6.35
[10:38:38] lunchhh
[12:31:19] !log deployment-prep hashar: Running mwscript rebuildLocalisationCache.php --wiki=aawiki
[12:33:13] !log deployment-prep hashar: Running mwscript rebuildLocalisationCache.php --wiki=aawiki --force
[12:40:45] !log hashar synchronizing Wikimedia installation...
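For reference, a sketch of the production fan-out hashar recalls at 09:25:33 for restarting Apaches from a single box. The dsh group name here is an assumption; apache-graceful is the per-host script he names, expected to exist locally on each box.

```bash
# Run the local apache-graceful script on every Apache box, concurrently.
# 'apaches' as the dsh group name is hypothetical.
dsh -g apaches -c -- apache-graceful
```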
[12:48:18] mwscript rebuildLocalisationCache.php
[12:48:18] No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php.
[12:48:20] yeahhh
[13:06:18] hashar: oh forgot to tell you
[13:06:30] the two packages are in precise
[13:06:42] but I need to forward-port some php extensions of ours to precise
[13:06:55] it's on my TODO but I have to finish up something that we absolutely need for the hackathon
[13:07:28] so it was evicted a bit from the top of my todo
[13:07:37] doh
[13:07:47] at least some wikimedia packages moved forward
[13:08:15] I am fighting with the l10n cache meanwhile
[13:08:16] I think the rest is not much work, so I might find some time tomorrow
[13:08:35] apparently there is a debian helper to build packages out of pecl extensions
[13:08:39] yep
[13:08:50] but I don't even need that
[13:08:55] we already have source debian packages for those
[13:09:00] great
[13:09:00] I just need to rebuild them in a precise environment
[13:09:07] which I have to build first
[13:09:22] shouldn't be much work really
[13:09:39] I also have to rebuild php at some point but that's not a blocker for you
[13:09:47] (I presume you read the php thread on ops)
[13:10:02] not yet
[13:10:06] * hashar opens mail client
[13:10:18] * hashar reads the popup [You have 178 new emails]
[13:10:25] * hashar closes mail client
[13:11:06] hahahahaha
[13:11:56] well that ops stuff is well over my head
[13:12:10] stuff like -O3 and --gdb3 are probably funny
[13:13:18] funny?
[13:13:26] sorry
[13:13:33] hmm I meant something but can't remember
[13:13:38] lost my mind while writing the sentence
[13:13:55] anyway, you can get the packages tomorrow
[13:14:29] today, I am busy figuring out the localization cache system for extensions ;-)
[13:51:36] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[14:15:05] hashar: btw did you notice that I made some modifications to that rebuild script?
[14:15:18] so that it rebuilds the cache for extensions as well
[14:15:51] which script?
[14:16:07] rebuild cache
[14:16:14] yeah I am working on it
[14:16:18] seems to be working now ;D
[14:16:27] $ mwscript rebuildLocalisationCache.php --wiki=aawiki --threads 4 --force
[14:16:36] I am waiting for it to complete
[14:16:49] it is really not the cleanest part of our config :-(
[14:16:57] $ echo "print wfMsg( 'timedmedia-ogg-long-multiplexed' );" | mwscript eval.php --wiki=commonswiki
[14:16:58] Ogg multiplexed audio/video file, $1, length $2, $4 × $5 pixels, $3 overall
[14:16:58] yeahhh
[14:17:04] petrb@deployment-dbdump:/usr/local/apache/common-local/bin$ cat ../wmf-config/CommonSettingsDeployment.php << try it
[14:17:28] there I include all extensions we have
[14:17:33] so that it rebuilds the cache for all of them
[14:17:47] but it only happens when you start that script
[14:17:48] oh my god
[14:17:52] that is unneeded ;-]
[14:17:53] really
[14:18:00] let me read that file
[14:18:07] well, I don't think so
[14:18:15] Roan told me that a similar thing is on prod
[14:18:35] because if you run it on aawiki, it only creates the cache for extensions on aa
[14:18:42] !log deployment-prep "Fixed rebuildLocalisationCache system so it now works just like in production. Aka manually trigger it with mwscript rebuildLocalisationCache.php --wiki=aawiki --threads 4. Will be made by scap later.
[14:18:53] but we need the cache for all wikis
[14:19:40] it is shared
[14:19:45] --wiki=aawiki is just for fun
[14:20:07] hashar: try now :P
[14:20:09] !log .
[14:20:09] Message missing. Nothing logged.
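The rebuild-and-verify pair petan pastes above (14:16:27 and 14:16:57), together. Both commands are quoted from the channel; mwscript is the wrapper beta and production use to run a maintenance script in a given wiki's context.

```bash
# Rebuild the shared localisation cache with 4 worker threads.
# Per hashar above, --wiki=aawiki only picks a config context;
# the resulting cache is shared by every wiki.
mwscript rebuildLocalisationCache.php --wiki=aawiki --threads 4 --force

# Smoke test: an extension message should now resolve to its text
# instead of the raw key (petan's check against commonswiki).
echo "print wfMsg( 'timedmedia-ogg-long-multiplexed' );" | \
    mwscript eval.php --wiki=commonswiki
```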
[14:20:17] the list of extensions is in /home/wikipedia/common/wmf-config/extension-list
[14:20:39] hashar: I know that, but we still need to make that script build the cache for all
[14:20:49] not only for the ones enabled on aawiki
[14:20:51] when `scap` is run, it generates a list of includes to be realized. Something like: mwscript mergeMessageFileList.php --wiki=aawiki --list-file=/home/wikipedia/common/wmf-config/extension-list --output=/home/wikipedia/common/wmf-config/ExtensionMessages-trunk.php
[14:21:06] right
[14:21:13] that ExtensionMessages-trunk.php is included in CommonSettings.php for ANY wiki
[14:21:24] so it just works for every project
[14:22:41] oh
[14:22:52] so hmm yeah
[14:22:56] your system would work too
[14:24:06] Platonides: did you start working on that tool for uploading?
[14:24:15] is it somewhere in git yet
[14:26:56] No backend defined with the name `local-swift`.
[14:26:58] doh
[14:28:12] btw hashar is wmf-config somehow managed by puppet now?
[14:28:19] no
[14:28:26] because I made some changes which are not to be reverted
[14:28:27] it is updated manually from gerrit
[14:28:37] or people make changes on fenari and then push to gerrit
[14:28:44] ohh
[14:28:45] on labs
[14:28:47] well
[14:28:47] ok, how do I prevent it from being reverted then
[14:28:50] yes on labs
[14:29:05] if you make changes, please make them in a specific file such as db-wmflabs.php
[14:29:06] I removed lines causing the prod feed to get RC from labs
[14:29:11] oh
[14:29:26] yeah wgRCUdpHost or something
[14:29:28] hmm
[14:29:29] yes
[14:29:34] it needs to stay like that for now
[14:29:44] or people will get mad again
[14:30:00] I have moved it to InitialiseSettingsDeploy.php IIRC
[14:30:09] well, it didn't work
[14:30:27] cause I forgot the dash in front
[14:30:30] ah
[14:30:42] aka need: '-wgRC2UDPAddress' => array( 'default' => false );
[14:30:55] whenever I manage to get the wmf-config on labs in sync with the one in production
[14:31:00] I will make the switch
[14:31:04] you could make it 'default' => 'deployment-feed'
[14:31:08] and drop the current git repo in favor of the one from gerrit
[14:31:29] do you care about receiving RC notifications ?
[14:31:34] sort of
[14:31:43] Platonides wanted to set it up
[14:31:55] go ahead and edit : /home/wikipedia/common/wmf-config/InitialiseSettingsDeploy.php
[14:32:09] then change the wgRC2UDPAddress conf from default => false to deployment-feed
[14:32:13] then git add && git commit :-]
[14:32:20] add?
[14:32:36] /home/wikipedia/common is a local git repository
[14:32:39] it's already in a branch, or not?
[14:32:43] so you can commit locally
[14:32:47] ok, but is that file there or not
[14:32:50] why should I add it
[14:32:52] but that repo is only on deployment-dbdump
[14:32:59] git add stages a change
[14:33:00] not a file
[14:33:06] aha
[14:33:12] git add != svn add
[14:33:13] ;-]
[14:33:27] it is more like: hmm let's pick that change to make a commit
[14:33:38] aha
[14:33:45] the idea of git is:
[14:33:56] 'working repository' ---> 'staging area' --> 'commit'
[14:34:04] I get it, I can keep a working version uncommitted then
[14:34:05] err
[14:34:11] working copy --> staging area --> commit
[14:34:17] whereas svn is working copy -> commit
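A tiny demonstration of the flow hashar just diagrammed, and of his point that git add stages a change, not a file. The repo path and file name are the ones under discussion; the rest is illustrative.

```bash
cd /home/wikipedia/common        # the local git repo on deployment-dbdump

echo '// labs tweak' >> wmf-config/InitialiseSettingsDeploy.php
git add wmf-config/InitialiseSettingsDeploy.php   # stages the change AS IT IS NOW

echo '// later experiment' >> wmf-config/InitialiseSettingsDeploy.php
# this second edit lives only in the working copy -- it is not staged

git status --short                         # "MM": staged change plus an unstaged one
git commit -m "RC feed tweak for labs"     # records only what was staged
```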
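And the override itself, as hashar spells it out at 14:30:42. The leading dash on the key is what forces the labs value over the production default; petan's first attempt failed because the dash was missing. The surrounding array is only a sketch of the InitialiseSettingsDeploy.php shape, not the exact file contents.

```php
<?php
// Labs-only overrides (illustrative shape). '-wgRC2UDPAddress' --
// note the dash -- forcibly replaces the value inherited from the
// shared InitialiseSettings.php.
$wmgLabsSettings = array(
	// Stop labs from feeding recent changes into the production relay:
	'-wgRC2UDPAddress' => array( 'default' => false ),

	// Krinkle's alternative: keep the feed, aim it at a labs relay.
	// '-wgRC2UDPAddress' => array( 'default' => 'deployment-feed' ),
);
```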
[14:45:39] Did I understand correctly that if you want to open a new port in the firewall (even if it is only to bastion), you need to recreate the instance?
[14:48:24] hashar: Is there a bug yet for making beta use wmf-config from gerrit? If not I'll open one and note down a few ideas
[14:51:20] Krinkle: I am not sure there is a bug
[14:51:23] k
[14:51:30] Krinkle: but that is surely a work in progress.
[14:51:34] 05/28/2012 - 14:51:34 - Creating a home directory for kolossos at /export/home/maps/kolossos
[14:52:36] 05/28/2012 - 14:52:35 - Updating keys for kolossos at /export/home/maps/kolossos
[14:52:37] hashar: We should also do something about the growing number of "labs" stuff in the wmf-config repo, and the if-cluster things. Should probably go into a "LocalSettings" kind of thing, that is in the local repo only (just like fenari has a local repo) for PrivateSettings.php etc..
[14:53:00] there is a PrivateSettings.php file already
[14:53:06] I know
[14:53:06] as well as some other overriding stuff
[14:53:10] need to clean that out though :-(
[14:53:24] LocalSettings is relatively clean on the cluster
[14:53:28] ie no config
[14:53:40] I said "kind of thing" not the one that is for mediawiki
[14:53:51] nothing should go in there other than include wmf-config probably
[14:54:12] but right now there is some labs stuff in the gerrit repo, some labs stuff in InitialiseSettingsDeploy, and then there is some inline with if-cluster/this/that else constructs
[14:56:36] what would be great is a way to detect the cluster
[14:56:49] for now I have added a $cluster = 'wmflabs' at top of InitialiseSettings.php
[14:57:50] I'd say we use 2 files that are not in the repo: PrivateSettings.php and ClusterSettings.php. And both exist in the local repos only. And to there we move all the cluster specific stuff.
[14:58:20] maybe put ClusterSettings.php in gerrit though, in a sub dir so that they can be maintained by the community more easily
[14:58:31] (both beta and production)
[15:00:28] what I did is an if / else require
[15:00:37] so we require different files based on cluster
[15:00:42] see db.php / db-wmflabs.php
[15:00:42] Yep
[15:00:48] or mc.php mc-wmflabs.php
[15:00:57] since the files are heavily diverging anyway
[15:02:14] is there an environment variable that can be used maybe?
[15:02:20] we'd still need a way to guide that if
[15:02:45] I am not aware of any env var on the cluster
[15:02:51] not sure how apache will get it anyway
[15:03:01] maybe through /etc/profile ? :-(
[15:03:39] or /etc/cluster
[15:03:59] but then we will have to add a stat() on /etc/cluster
[15:04:16] I am pretty sure ops will veto the idea -;]]]
[15:06:43] hashar: see #mediawiki new bug
[15:07:40] happily skipping
[15:16:21] hmm
[15:17:11] the l10n cache system is really crazy
[15:17:15] as well as our overall config
[15:18:34] 05/28/2012 - 15:18:33 - Updating keys for kolossos at /export/home/maps/kolossos
[15:19:12] if( $cluster = 'pmtpa' ) {
[15:19:13] yeah
[15:22:12] https://gerrit.wikimedia.org/r/9128
[15:23:39] !log deployment Fixed a nasty override of $cluster in MediaWiki configuration which caused some interesting issue on labs. See https://gerrit.wikimedia.org/r/9128
[15:23:51] finally I can see something worthwhile at http://commons.wikimedia.beta.wmflabs.org/wiki/File:Mayday2012-edit-1.ogv
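Why the line quoted at 15:19:12 earned a "nasty override" in the !log: a single = in the condition both always succeeds and silently reassigns the variable. A minimal, self-contained illustration of the failure mode (the actual fix is in gerrit change 9128, which may differ in detail):

```php
<?php
// The bug pattern: '=' assigns instead of comparing.
$cluster = 'wmflabs';

if ( $cluster = 'pmtpa' ) {      // always true -- and $cluster is overwritten
	echo "pmtpa-only config now loads on a labs host\n";
}
var_dump( $cluster );            // string(5) "pmtpa" -- silently clobbered

// What the condition should do: compare, never assign.
$cluster = 'wmflabs';
if ( $cluster === 'pmtpa' ) {    // false on labs, as intended
	echo "never reached on labs\n";
}
```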
[15:34:35] 05/28/2012 - 15:34:35 - Updating keys for kolossos at /export/home/maps/kolossos
[15:53:04] hashar: got me (bug 37135). I'm always so serious :D
[15:53:38] that's gonna cost us a beer in Berlin (or two) ;-)
[15:53:44] !bug 37135
[15:53:45] https://bugzilla.wikimedia.org/show_bug.cgi?id=37135
[15:53:49] !bug
[15:53:49] https://bugzilla.wikimedia.org/show_bug.cgi?id=$1
[15:53:54] !delete bug
[15:53:57] !rm bug
[15:53:59] pff
[15:54:04] !bug del
[15:54:05] Successfully removed bug
[15:54:19] Krinke-away: yeah definitely going to have some beers ;-]
[15:54:26] !bug is https://bugzilla.wikimedia.org/$1
[15:54:26] Key was added
[15:54:31] !bug 37135
[15:54:32] https://bugzilla.wikimedia.org/37135
[15:54:33] Krinke-away: thanks!
[15:54:52] Krinke-away: I was honestly just teasing you cause you opened a support request ;D
[15:55:04] Krinke-away: that traceback was really confusing
[15:55:11] It was, yeah
[15:59:32] -if ( file_exists( '/etc/wikimedia-transcoding' ) ) {
[15:59:33] aahhhh
[15:59:38] stat()
[16:22:33] solved by safeguarding it with $cluster == 'wmflabs' && file_exists()
[16:22:34] yeah
[16:22:35] \O/
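The shape of the guard hashar lands on at 16:22:33, sketched below. Short-circuit evaluation means production hosts never pay the stat() he groans about at 15:59:38, and labs hosts without the marker file skip the transcoding config entirely. Everything beyond the quoted condition is illustrative.

```php
<?php
// Before (15:59:32): an unconditional stat() on every request,
// on every cluster:
//     if ( file_exists( '/etc/wikimedia-transcoding' ) ) { ... }

// After (16:22:33): only labs hosts ever touch the filesystem,
// because && stops at the first false operand.
if ( $cluster == 'wmflabs' && file_exists( '/etc/wikimedia-transcoding' ) ) {
	// transcoding-specific configuration goes here
}
```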
[16:45:01] New patchset: Hashar; "revert unmounting /dev/vdb to mount it on /tmp" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8610
[16:45:17] New patchset: Hashar; "only apply admins::* while on labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8642
[16:45:32] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8610
[16:45:32] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8642
[16:59:08] New review: Hashar; "Patchset is a rebase" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8610
[16:59:14] New review: Hashar; "Patchset is a rebase" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8642
[17:01:15] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8610
[17:01:18] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8610
[17:59:40] Ryan_Lane: howdy
[17:59:45] howdy
[17:59:52] seems today is a holiday :D
[18:00:33] okay
[18:00:36] I'll stop then :)
[18:00:40] heh
[18:03:08] but
[18:03:10] New patchset: Faidon; "Add ssh parameter to git::clone" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9146
[18:03:20] heh
[18:03:26] New patchset: Faidon; "puppetmaster: abstract SSL setup into a subclass" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9147
[18:03:32] got to push that last change in eh? :D
[18:03:42] New patchset: Faidon; "Add a puppetmaster::self class" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9148
[18:03:43] you might be interested in that^
[18:03:48] (finally, the bot took a while)
[18:03:56] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9146
[18:03:57] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9147
[18:03:57] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9148
[18:04:15] sweet
[18:04:24] I'd like your review obviously
[18:04:34] but I guess it can wait until tomorrow :)
[18:04:34] * Ryan_Lane nods
[18:05:20] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9146
[18:05:23] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9146
[18:06:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9147
[18:06:42] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9147
[18:06:47] or not?
[18:06:51] ?
[18:06:57] wait until tomorrow
[18:06:58] easy enough to review now
[18:07:10] you'll make me feel bad for making you work on a holiday :-)
[18:07:55] I'm doing some consulting work today too
[18:09:23] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9148
[18:09:25] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9148
[18:09:31] oh wow
[18:09:35] you just merged it
[18:09:40] I guess I should commit the private part too then
[18:09:50] oh. I guess I should have just +2'd
[18:09:56] bad habit
[18:10:04] nah, it's fine
[18:11:08] so, I'm using salt for another project, and like it quite a bit
[18:11:18] (for more than just remote execution)
[18:11:43] it needs parameterized classes, though
[18:12:25] it basically forces you into using something like modules
[18:13:08] New patchset: Faidon; "Add labs-puppet-key SSH key" [labs/private] (master) - https://gerrit.wikimedia.org/r/9149
[18:13:17] are you suggesting we replace puppet with salt?
[18:13:44] New review: Faidon; "(no comment)" [labs/private] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9149
[18:14:09] puppet has a big community, ready-made modules and momentum
[18:14:31] plus, we have tons of stuff in our puppet already that would take ages to convert to something else
[18:14:39] nah, just saying it's nice
[18:14:45] and the road in-between (having two tools manage configurations) will be hell
[18:15:14] heh, okay
[18:15:35] I'll keep it in mind
[18:15:39] competition is good too
[18:15:44] hopefully puppet will get better
[18:15:52] have a look at the puppet 3.0 release notes
[18:16:03] they merged a thing called Hiera
[18:16:05] I hate most things about puppet, honestly
[18:16:11] which might be useful to us
[18:16:14] for labs especially
[18:16:23] how so? it's just a new backend
[18:16:26] not sure yet, haven't thought of the possibilities
[18:16:29] I don't see how it changes much
[18:16:33] hiera?
[18:16:44] it's meant to be able to override settings
[18:16:54] variables, configurations etc.
[18:17:02] which /might/ make sense for labs
[18:17:06] * Ryan_Lane nods
[18:17:12] instead of the whole if ($labs) { } else { }
[18:17:17] realm even :)
[18:17:39] anyway
[18:17:41] 9pm here
[18:17:45] drinks time
[18:18:08] hmm, any idea why gerrit 9149 didn't get verified?
[18:18:12] and hence I can't submit?
[18:18:30] New review: Faidon; "(no comment)" [labs/private] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9149
[18:18:32] Change merged: Faidon; [labs/private] (master) - https://gerrit.wikimedia.org/r/9149
[18:22:45] stupid hooks are probably broken in some way
[18:29:40] well, one thing I'm disliking about salt is the lack of line numbers for errors
[18:34:42] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[18:38:28] It would be kinda cool if it validated on push and just rejected you there and then if you failed.
[18:39:42] Hm.. gerrit serves a nice and innocent 404 error for those labs/private urls
[18:39:52] but the irc bot spits them out anyway
[18:39:58] so those comments can leak through ?
[18:40:08] (the first sentence anyway)
[18:40:56] I believe we're supposed to be able to see them when logged in, but ldap search/gerrit/life sucks
[18:41:51] oh. salt gives the state, name, and function that failed and why
[18:42:02] that's almost as good as a line number
[18:43:53] Line and column number ftw.
[19:08:47] https://gerrit.wikimedia.org/r/#/c/9148/ This allows labs instances to be able to run their own puppetmaster <-
[19:08:49] ops rocks :-]
[19:09:28] hashar: thank paravoid ;)
[19:09:43] now if we just fix the IO problems labs will become way more usable
[19:09:57] we had a veryyy big IO issue last monday
[19:10:01] yeah
[19:10:09] heard about it
[19:10:09] was because some user was reading giant files off glusterFS
[19:10:13] yep
[19:10:16] dumps project
[19:10:19] exactly
[19:10:23] now it is way better
[19:10:30] one problem right now is all instances go through one network node
[19:10:44] we need to have a network node per host
[19:10:45] if the dumps project really needs massive I/O, it might use a dedicated / real hard disk instead of gluster
[19:11:11] he should be reading from the public datasets NFS share
[19:11:12] by network node do you mean a virtual switch in Nova/OpenStack ?
[19:11:19] MongoDB.
[19:11:21] yes
[19:11:23] Reedy: heh
[19:12:24] I am still wondering if we could get memcached to be replaced by redis / MongoDB / postgre
[19:12:29] would have to ask domas
[19:12:50] (he will for sure tell me about the mysql hack that makes it work fast and able to replace memcached)
[19:13:06] facebook uses memcache
[19:13:20] most places use memcache. there's nothing wrong with it
[19:13:33] mediawiki's memcache implementation is slightly crappy, though
[19:13:59] tim has added one for the pecl version
[19:14:08] so that's not so much of an issue ;)
[19:14:15] the thing I dislike is the hash being based off the number of servers in $wgMemcachedServer or something
[19:14:18] pecl/pear/php extension/whatever
[19:14:35] and the lack of built-in redundancy between datacenters
[19:14:42] though we could just write to both DCs ahah
[19:24:16] Memcache is awesome
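An illustration of Ryan_Lane's gripe at 19:14:15 about keys being hashed against the server count: with naive modulo placement, resizing the pool remaps almost every key, and every remapped key is a cache miss. This is a generic sketch of the failure mode, not MediaWiki's actual client code.

```php
<?php
// Naive placement: key -> server via modulo over the pool size.
function naiveServer( $key, array $servers ) {
	return $servers[ crc32( $key ) % count( $servers ) ];
}

$pool3 = array( 'mc1', 'mc2', 'mc3' );
$pool4 = array( 'mc1', 'mc2', 'mc3', 'mc4' );   // one server added

$moved = 0;
for ( $i = 0; $i < 1000; $i++ ) {
	$key = "user:$i";
	if ( naiveServer( $key, $pool3 ) !== naiveServer( $key, $pool4 ) ) {
		$moved++;
	}
}
// Growing the pool from 3 to 4 remaps roughly 3 keys in 4;
// consistent hashing would keep the remapped share near 1 in 4.
echo "$moved of 1000 keys moved\n";
```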
[19:48:33] 05/28/2012 - 19:48:33 - Updating keys for kolossos at /export/home/maps/kolossos
[21:17:09] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.54, 12.42, 8.40
[21:27:09] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.50, 2.06, 4.58
[21:58:31] 05/28/2012 - 21:58:31 - Updating keys for kolossos at /export/home/maps/kolossos
[22:02:31] 05/28/2012 - 22:02:31 - Updating keys for kolossos at /export/home/maps/kolossos
[22:23:05] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:05] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:05] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:30] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:30] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:27:52] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK
[22:27:52] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in
[22:27:52] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[22:28:20] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 0.34, 1.49, 1.30
[22:28:20] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in
[22:31:35] PROBLEM Puppet freshness is now: CRITICAL on mwreview-test1 i-00000297 output: Puppet has not run in last 20 hours
[23:52:35] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[23:55:45] PROBLEM HTTP is now: CRITICAL on deployment-apache21 i-0000026d output: CRITICAL - Socket timeout after 10 seconds