[00:01:42] New review: Krinkle; "- bug 1234" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5
[02:06:55] New patchset: Andrew Bogott; "I long for the day when I can test crap like this without having to publish it for the whole world to see." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9111
[02:07:10] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9111
[02:07:20] New review: Andrew Bogott; "Me too, man. Me too." [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9111
[02:07:22] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9111
[02:19:01] PROBLEM host: mwreview-proto is DOWN address: i-00000295 check_ping: Invalid hostname/address - i-00000295
[02:23:45] PROBLEM Total Processes is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:24:25] PROBLEM dpkg-check is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:25:10] New patchset: Andrew Bogott; "I will be making quite a few more basic mistakes like this one before the night is out." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9112
[02:25:25] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9112
[02:25:44] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9112
[02:25:45] PROBLEM Current Load is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:25:46] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9112
[02:26:15] PROBLEM Current Users is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:26:55] PROBLEM Disk Space is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:27:35] PROBLEM Free ram is now: CRITICAL on mwreview-test1 i-00000297 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:28:52] New patchset: Andrew Bogott; "As I said." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9113
[02:29:07] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9113
[02:29:09] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9113
[02:29:11] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9113
[02:37:34] New patchset: Andrew Bogott; "I'm pretty much just trying random stuff now." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9114
[02:37:51] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9114
[02:37:59] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9114
[02:38:01] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9114
[02:47:21] 05/28/2012 - 02:47:21 - Updating keys for laner at /export/home/deployment-prep/laner
[02:51:21] 05/28/2012 - 02:51:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:01:21] 05/28/2012 - 03:01:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:03:20] 05/28/2012 - 03:03:19 - Updating keys for laner at /export/home/deployment-prep/laner
[03:05:20] 05/28/2012 - 03:05:20 - Updating keys for laner at /export/home/deployment-prep/laner
[03:26:19] 05/28/2012 - 03:26:19 - Updating keys for laner at /export/home/deployment-prep/laner
[03:48:27] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[03:51:07] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[03:59:02] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 16% free memory
[03:59:02] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory
[04:05:19] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 14% free memory
[04:08:38] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 3% free memory
[04:13:56] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory
[04:19:06] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 3% free memory
[04:19:56] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 5% free memory
[04:24:06] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory
[04:24:06] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:29:50] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory
[04:34:10] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[06:07:06] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 71% free memory
[06:10:56] GRRR ...
[06:11:19] LiWa3 can take itself out on bots-2
[06:11:25] making bots-2 inaccessible
[06:30:18] I seem to be unable to ssh from bastion to bots-2 .. while I can get into bots-3. Bots-2 says, after asking for my passphrase for the key, 'Permission denied (publickey).' (did not have that yesterday)
[06:37:55] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:37:55] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:29] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:34] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:40:08] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.83, 6.03, 3.83
[06:40:43] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:03] PROBLEM Total Processes is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:08] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:08] PROBLEM SSH is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - Socket timeout after 10 seconds
[06:41:08] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:39] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - load average: 20.83, 40.82, 24.57
[06:43:54] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Disk Space is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:54] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:59] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:59] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:59] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:46:03] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 12% free memory
[06:47:31] RECOVERY SSH is now: OK on bots-cb i-0000009e output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:47:31] RECOVERY Total Processes is now: OK on bots-cb i-0000009e output: PROCS OK: 115 processes
[06:55:30] RECOVERY Total Processes is now: OK on maps-tilemill1 i-00000294 output: PROCS OK: 104 processes
[06:55:35] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 86% free memory
[06:55:35] PROBLEM Current Load is now: WARNING on maps-tilemill1 i-00000294 output: WARNING - load average: 1.97, 3.87, 5.16
[06:55:35] RECOVERY Current Users is now: OK on maps-tilemill1 i-00000294 output: USERS OK - 0 users currently logged in
[06:55:35] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK
[06:55:35] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK
[06:55:35] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 7.01, 9.03, 14.27
[06:55:40] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK
[06:55:40] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 4.02, 6.43, 6.22
[06:55:40] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 81% free memory
[06:55:40] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 124 processes
[06:55:45] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK
[06:55:45] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in
[06:55:45] RECOVERY dpkg-check is now: OK on ganglia-test2 i-00000250 output: All packages OK
[06:55:45] RECOVERY Disk Space is now: OK on ganglia-test2 i-00000250 output: DISK OK
[06:55:55] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory
[06:55:55] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK
[06:55:55] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 6.25, 5.57, 5.06
[06:55:55] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in
[06:55:55] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 81 processes
[07:08:43] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.05, 1.28, 3.13
[07:08:43] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.45, 1.22, 3.24
[07:08:58] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:58] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:28] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:43] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:48] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:02] hey
[07:13:20] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:21] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:21] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:30] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in
[07:13:30] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK
[07:14:20] yeah .. something is wrong
[07:15:09] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:24] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:59] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 5.72, 5.13, 6.48
[07:17:20] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:20] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Current Load is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:47] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM SSH is now: CRITICAL on ganglia-test2 i-00000250 output: CRITICAL - Socket timeout after 10 seconds
[07:17:59] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:17:59] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:14] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:14] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:59] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:14] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:19] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:19] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM Disk Space is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:24] PROBLEM Total Processes is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:29] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:29] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:39] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:55] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK
[07:19:55] PROBLEM Current Load is now: WARNING on reportcard2 i-000001ea output: WARNING - load average: 9.28, 7.21, 5.23
[07:20:24] PROBLEM Current Load is now: WARNING on pybal-precise i-00000289 output: WARNING - load average: 9.47, 7.38, 6.03
[07:20:24] PROBLEM Current Users is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:24] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:24] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:29] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:21:23] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK
[07:21:23] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 99 processes
[07:21:28] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 89% free memory
[07:21:28] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 4.94, 6.67, 5.73
[07:21:28] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 0.41, 4.62, 5.25
[07:21:28] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in
[07:22:01] Beetstra: hm?
[07:22:17] why do you think it looks like typical labs status :D
[07:22:35] PROBLEM Current Load is now: CRITICAL on aggregator-test3 i-00000293 output: CRITICAL - load average: 0.48, 10.74, 22.84
[07:22:42] !ping
[07:22:42] pong
[07:23:04] eh .. is it typical?
[07:23:08] kind of
[07:23:12] Anyways .. I seem to be locked out of bots-2?
[07:23:19] ok
[07:23:23] is there anything running now
[07:23:31] we need to reboot it to fix that
[07:23:36] because no one is able to login
[07:23:52] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 4.88, 7.08, 8.66
[07:23:54] it's not working much
[07:23:57] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 4.33, 8.30, 8.18
[07:23:57] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in
[07:23:57] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK
[07:23:57] PROBLEM Current Load is now: WARNING on worker1 i-00000208 output: WARNING - load average: 3.15, 6.86, 6.20
[07:23:57] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 1 users currently logged in
[07:23:57] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK
[07:24:56] RECOVERY Current Load is now: OK on reportcard2 i-000001ea output: OK - load average: 1.92, 4.88, 4.84
[07:25:09] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK
[07:25:09] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 90 processes
[07:25:14] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 91% free memory
[07:25:19] RECOVERY Current Users is now: OK on ganglia-test2 i-00000250 output: USERS OK - 0 users currently logged in
[07:25:39] I told my bot to die .. but I don't know if it did ..
just reboot it
[07:26:19] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.19, 1.52, 4.48
[07:26:19] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 3.40, 5.32, 5.99
[07:26:19] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 1.51, 3.72, 5.08
[07:26:19] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 85% free memory
[07:26:19] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 83 processes
[07:26:24] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 0.39, 2.75, 4.27
[07:26:24] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.00, 1.69, 3.80
[07:26:24] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 80% free memory
[07:26:24] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 118 processes
[07:26:29] RECOVERY Disk Space is now: OK on reportcard2 i-000001ea output: DISK OK
[07:26:29] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 85% free memory
[07:26:29] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in
[07:26:29] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 83 processes
[07:26:34] RECOVERY SSH is now: OK on ganglia-test2 i-00000250 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:26:34] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 7.08, 9.94, 10.68
[07:27:24] PROBLEM Current Load is now: WARNING on aggregator-test3 i-00000293 output: WARNING - load average: 0.11, 4.06, 16.64
[07:28:24] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in
[07:28:24] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK
[07:28:24] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[07:28:24] RECOVERY Current Load is now: OK on worker1 i-00000208 output: OK - load average: 0.02, 2.53, 4.50
[07:28:24] RECOVERY Total Processes is now: OK on ganglia-test2 i-00000250 output: PROCS OK: 173 processes
[07:29:41] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK
[07:29:42] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in
[07:29:42] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 79% free memory
[07:29:42] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK
[07:29:42] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 87 processes
[07:30:23] RECOVERY Current Load is now: OK on pybal-precise i-00000289 output: OK - load average: 0.16, 1.87, 3.95
[07:31:17] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.09, 2.02, 4.37
[07:31:17] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 0.07, 1.45, 3.72
[07:33:08] AHH hello
[07:33:40] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 0.29, 1.54, 4.58
[07:36:32] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 3.10, 3.38, 4.60
[07:41:25] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.37, 0.98, 4.34
[07:47:31] RECOVERY Current Load is now: OK on aggregator-test3 i-00000293 output: OK - load average: 0.64, 0.64, 4.99
[07:48:21] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.11, 0.79, 3.23
[07:53:22] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.21, 0.76, 2.58
[08:05:02] petan, you managed to restart bots-2?
[08:05:12] should I?
[08:05:26] I think so .. if that is the only way of getting back access to it
[08:06:28] I get: Enter passphrase for key '/home/beetstra/.ssh/id_rsa': \n Permission denied (publickey).
[08:06:43] done
[08:07:07] thanks
[08:11:06] Note to self: I have to work on the code of linkwatcher so it does not take itself out by flooding memory
[08:33:39] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[08:46:56] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 6.07, 6.01, 5.36
[08:52:37] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 13% free memory
[08:58:43] hashar: how do we update the svn-trunk in beta?
[08:58:49] php-trunk
[08:58:50] I mean
[08:59:02] I made a script for that but maybe you have one as well
[08:59:05] mediawiki ? :-D
[08:59:09] everything
[08:59:12] not just mw
[08:59:19] oh with extensions too
[08:59:23] + run update.php
[08:59:29] you know it is going to break stuff ? ;-D
[08:59:34] how
[08:59:37] should be something like:
[08:59:45] cd /home/wikipedia/common/php-trunk
[08:59:50] git pull
[09:00:02] then update submodules using something like:
[09:00:06] cd /home/wikipedia/common/php-trunk/extensions
[09:00:06] that is how it works
[09:00:09] git submodule update
[09:00:20] why not use pull
[09:00:23] and to run update.php: foreachwiki update.php
[09:00:23] for everything
[09:00:45] hashar: ok I know that, but how does it break stuff
[09:01:07] cause that deploys code from master which might be unstable
[09:01:14] and you might have to install DB updates
[09:01:23] so it should be done carefully
[09:02:41] hashar: but that's what we want to do, or not?
[09:02:53] beta is for testing these unstable things to check if they are stable or not...
[09:02:59] yup
[09:03:00] or, what is trunk supposed to be?
[09:03:12] do we have a stable trunk
[09:03:14] eventually we are going to update MediaWiki core + extensions daily
[09:03:15] :)
[09:03:24] yes, I was thinking of putting it in cron
[09:03:26] for now, I would prefer we keep the software as is ;-)
[09:03:30] nooo
[09:03:34] not cron please ;-]]
[09:03:46] script is /usr/local/apache/common-local/bin$ cat updaterepo.sh
[09:03:51] ok
[09:03:53] we really want to update the site manually
[09:04:07] I will run it now, to check if that does break things as you say, or not
[09:04:12] so we know who / why / when something goes wrong
[09:04:15] !log deployment-prep petrb: running update
[09:04:18] but please no
[09:04:19] arhghgg
[09:04:28] :D
[09:04:32] no worries
[09:04:33] I didn't
[09:04:33] the cluster is already broken enough
[09:04:44] you don't like bottie?
[09:04:51] I am not really willing to spend time this way figuring out which new code is breaking it ;-]]
[09:04:53] why do we have another bot for this project :P
[09:05:22] ok, so when are we going to update to trunk
[09:05:24] but updating daily is definitely on the list of stuff to do. Will do that later when the cluster is more stable (aka configuration running from production)
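The update sequence hashar dictates above, collected in one place. A sketch only: the paths and the foreachwiki wrapper are beta/production conventions quoted from the conversation, not standard tools, and as hashar warns, this deploys whatever is on master, including pending schema changes.

```bash
#!/bin/bash
# Manual beta code update, per hashar (08:59-09:00). Not a supported
# tool -- just his steps in order, stopping at the first failure.
set -e

# 1. Update MediaWiki core to current master.
cd /home/wikipedia/common/php-trunk
git pull

# 2. Check out the extension submodules at the commits core records.
#    (This answers "why not use pull": submodule update pins each
#    extension to the revision the superproject expects, instead of
#    drifting past it.)
cd /home/wikipedia/common/php-trunk/extensions
git submodule update

# 3. Apply pending database updates on every wiki.
foreachwiki update.php
```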
[09:05:31] ok
[09:05:35] so later
[09:05:36] ;)
[09:05:37] why do we have another bot
[09:05:42] ah
[09:05:50] that was to remove the /bin/log hack
[09:05:56] which sent the message to some labs instance
[09:06:04] instead I used something which is similar to production
[09:06:14] so you log directly from the -dbdump machine
[09:06:15] I mean there are a lot of projects, I would prefer to use 1 bot for all projects, rather than 50 bots
[09:06:36] ok, what if I type log on -apache20
[09:06:40] does it work?
[09:06:54] ohh you wanna do that
[09:06:55] hmm
[09:07:06] it is probably not going to work ;-D
[09:07:39] anyway, I liked that it said "message logged" so I knew something did happen
[09:07:40] so yeah, you should do everything from -dbdump
[09:07:56] if it doesn't do that on prod, we should fix it :D
[09:08:09] !log deployment-prep hashar: I am the log bot
[09:08:27] so the only difference is that the bot runs locally on db-dump
[09:08:31] so `log` is beta-logmsgbot
[09:08:31] otherwise it's the same?
[09:08:38] I am not sure which bot is reading from there
[09:08:55] yeah it is running on db-dump
[09:08:58] I see
[09:09:02] that should be the only difference
[09:09:06] but I don't know what the advantage of that is
[09:09:20] it also lets us restart it easily whenever needed, without having to get access to the bots project
[09:09:22] it looks quite the same as before to me, just that we have 1 more bot :D
[09:09:31] yeah
[09:09:44] well, there is an instance, bots-labs, which is supposed to host only labs related services
[09:09:48] but one less inter-project dependency
[09:09:57] bottie is supposed to work 24*6
[09:09:59] * 7
[09:10:00] :D
[09:10:01] lol
[09:10:10] typo
[09:11:42] well it does not
[09:11:43] also it was good that you could log it from any machine... but I don't really care
[09:11:47] it doesn't?
[09:11:48] so I prefer having the bot locally on dbdump
[09:11:54] that makes things easier to fix / debug
[09:12:02] hm...
[09:12:28] we could probably add a new package / puppet class to install the log command on all machines
[09:12:40] but I am not sure it is going to be used that much anyway
[09:12:47] the whole idea is to do everything from dbdump
[09:12:52] and almost never connect to the other hosts
[09:13:03] ok, but you sometimes have to
[09:13:41] it's hard to restart squid from dbdump
[09:13:42] etc
[09:13:57] on prod it works from one instance only?
[09:13:59] box
[09:25:03] hmm I am not sure about squid
[09:25:33] but for apaches, we do a dsh to all the apache boxes and use a script named apache-graceful
[09:25:38] which is available locally
[09:25:42] aka on each box
[09:25:57] not sure how it is installed, I guess it is just in /home/wikipedia/bin which is mounted from fenari
[09:26:01] will have to look at it
[10:07:21] !log deployment-prep hashar: cherry picking change 9116
[10:32:38] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds.
[10:37:31] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.48, 6.88, 6.35
[10:38:38] lunchhh
[12:31:19] !log deployment-prep hashar: Running mwscript rebuildLocalisationCache.php --wiki=aawiki
[12:33:13] !log deployment-prep hashar: Running mwscript rebuildLocalisationCache.php --wiki=aawiki --force
[12:40:45] !log hashar synchronizing Wikimedia installation...
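For reference, a sketch of the production fan-out hashar recalls at 09:25:33 for restarting Apaches from a single box. The dsh group name here is an assumption; apache-graceful is the per-host script he names, expected to exist locally on each box.

```bash
# Run the local apache-graceful script on every Apache box, concurrently.
# 'apaches' as the dsh group name is hypothetical.
dsh -g apaches -c -- apache-graceful
```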
[12:48:18] mwscript rebuildLocalisationCache.php
[12:48:18] No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php.
[12:48:20] yeahhh
[13:06:18] hashar: oh forgot to tell you
[13:06:30] the two packages are in precise
[13:06:42] but I need to forward-port some php extensions of ours to precise
[13:06:55] it's on my TODO but I have to finish up something that we absolutely need for the hackathon
[13:07:28] so it was evicted a bit from the top of my todo
[13:07:37] doh
[13:07:47] at least some wikimedia packages moved forward
[13:08:15] I am fighting with the l10n cache meanwhile
[13:08:16] I think the rest is not much work, so I might find some time tomorrow
[13:08:35] apparently there is a debian helper to build packages out of pecl extensions
[13:08:39] yep
[13:08:50] but I don't even need that
[13:08:55] we already have source debian packages for those
[13:09:00] great
[13:09:00] I just need to rebuild them in a precise environment
[13:09:07] which I have to build first
[13:09:22] shouldn't be much work really
[13:09:39] I also have to rebuild php at some point but that's not a blocker for you
[13:09:47] (I presume you read the php thread on ops)
[13:10:02] not yet
[13:10:06] * hashar opens mail client
[13:10:18] * hashar reads the popup [You have 178 new emails]
[13:10:25] * hashar closes mail client
[13:11:06] hahahahaha
[13:11:56] well that ops stuff is well over my head
[13:12:10] stuff like -O3 and --gdb3 are probably funny
[13:13:18] funny?
[13:13:26] sorry
[13:13:33] hmm I meant something but can't remember
[13:13:38] lost my mind while writing the sentence
[13:13:55] anyway, you can get the packages tomorrow
[13:14:29] today, I am busy figuring out the localization cache system for extensions ;-)
[13:51:36] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[14:15:05] hashar: btw did you notice that I made some modifications to that rebuild script?
[14:15:18] so that it rebuilds the cache for extensions as well
[14:15:51] which script?
[14:16:07] rebuild cache
[14:16:14] yeah I am working on it
[14:16:18] seems to be working now ;D
[14:16:27] $ mwscript rebuildLocalisationCache.php --wiki=aawiki --threads 4 --force
[14:16:36] I am waiting for it to complete
[14:16:49] it is really not the cleanest part of our config :-(
[14:16:57] $ echo "print wfMsg( 'timedmedia-ogg-long-multiplexed' );" | mwscript eval.php --wiki=commonswiki
[14:16:58] Ogg multiplexed audio/video file, $1, length $2, $4 × $5 pixels, $3 overall
[14:16:58] yeahhh
[14:17:04] petrb@deployment-dbdump:/usr/local/apache/common-local/bin$ cat ../wmf-config/CommonSettingsDeployment.php << try it
[14:17:28] there I include all extensions we have
[14:17:33] so that it rebuilds the cache for all of them
[14:17:47] but it only happens when you start that script
[14:17:48] oh my god
[14:17:52] that is unneeded ;-]
[14:17:53] really
[14:18:00] let me read that file
[14:18:07] well, I don't think so
[14:18:15] Roan told me that a similar thing is on prod
[14:18:35] because if you run it on aawiki, it only creates the cache for extensions on aa
[14:18:42] !log deployment-prep "Fixed rebuildLocalisationCache system so it now works just like in production. Aka manually trigger it with mwscript rebuildLocalisationCache.php --wiki=aawiki --threads 4. Will be made by scap later.
[14:18:53] but we need the cache for all wikis
[14:19:40] it is shared
[14:19:45] --wiki=aawiki is just for fun
[14:20:07] hashar: try now :P
[14:20:09] !log .
[14:20:09] Message missing. Nothing logged.
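The rebuild-and-verify pair petan pastes above (14:16:27 and 14:16:57), together. Both commands are quoted from the channel; mwscript is the wrapper beta and production use to run a maintenance script in a given wiki's context.

```bash
# Rebuild the shared localisation cache with 4 worker threads.
# Per hashar above, --wiki=aawiki only picks a config context;
# the resulting cache is shared by every wiki.
mwscript rebuildLocalisationCache.php --wiki=aawiki --threads 4 --force

# Smoke test: an extension message should now resolve to its text
# instead of the raw key (petan's check against commonswiki).
echo "print wfMsg( 'timedmedia-ogg-long-multiplexed' );" | \
    mwscript eval.php --wiki=commonswiki
```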
[14:20:17] the list of extensions is in /home/wikipedia/common/wmf-config/extension-list
[14:20:39] hashar: I know that, but we still need to make that script build the cache for all
[14:20:49] not only for the ones enabled on aawiki
[14:20:51] when `scap` is run, it generates a list of includes to be realized. Something like: mwscript mergeMessageFileList.php --wiki=aawiki --list-file=/home/wikipedia/common/wmf-config/extension-list --output=/home/wikipedia/common/wmf-config/ExtensionMessages-trunk.php
[14:21:06] right
[14:21:13] that ExtensionMessages-trunk.php is included in CommonSettings.php for ANY wiki
[14:21:24] so it just works for every project
[14:22:41] oh
[14:22:52] so hmm yeah
[14:22:56] your system would work too
[14:24:06] Platonides: did you start working on that tool for uploading?
[14:24:15] is it somewhere in git yet
[14:26:56] No backend defined with the name `local-swift`.
[14:26:58] doh
[14:28:12] btw hashar is wmf-config somehow managed by puppet now?
[14:28:19] no
[14:28:26] because I made some changes which are not to be reverted
[14:28:27] it is updated manually from gerrit
[14:28:37] or people make changes on fenari and then push to gerrit
[14:28:44] ohh
[14:28:45] on labs
[14:28:47] well
[14:28:47] ok, how do I prevent it from being reverted then
[14:28:50] yes on labs
[14:29:05] if you make changes, please make them in a specific file such as db-wmflabs.php
[14:29:06] I removed lines causing the prod feed to get RC from labs
[14:29:11] oh
[14:29:26] yeah wgRCUdpHost or something
[14:29:28] hmm
[14:29:29] yes
[14:29:34] it needs to stay like that for now
[14:29:44] or people will get mad again
[14:30:00] I have moved it to InitialiseSettingsDeploy.php IIRC
[14:30:09] well, it didn't work
[14:30:27] cause I forgot the dash in front
[14:30:30] ah
[14:30:42] aka need: '-wgRC2UDPAddress' => array( 'default' => false );
[14:30:55] whenever I manage to get the wmf-config on labs in sync with the one in production
[14:31:00] I will make the switch
[14:31:04] you could make it 'default' => 'deployment-feed'
[14:31:08] and drop the current git repo in favor of the one from gerrit
[14:31:29] do you care about receiving RC notifications ?
[14:31:34] sort of
[14:31:43] Platonides wanted to set it up
[14:31:55] go ahead and edit : /home/wikipedia/common/wmf-config/InitialiseSettingsDeploy.php
[14:32:09] then change the wgRC2UDPAddress conf from default => false to deployment-feed
[14:32:13] then git add && git commit :-]
[14:32:20] add?
[14:32:36] /home/wikipedia/common is a local git repository
[14:32:39] it's already in a branch, or not?
[14:32:43] so you can commit locally
[14:32:47] ok, but is that file there or not
[14:32:50] why should I add it
[14:32:52] but that repo is only on deployment-dbdump
[14:32:59] git add stages a change
[14:33:00] not a file
[14:33:06] aha
[14:33:12] git add != svn add
[14:33:13] ;-]
[14:33:27] it is more like: hmm let's pick that change to make a commit
[14:33:38] aha
[14:33:45] the idea of git is:
[14:33:56] 'working repository' ---> 'staging area' --> 'commit'
[14:34:04] I get it, I can keep a working version uncommitted then
[14:34:05] err
[14:34:11] working copy --> staging area --> commit
[14:34:17] whereas svn is working copy -> commit
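A tiny demonstration of the flow hashar just diagrammed, and of his point that git add stages a change, not a file. The repo path and file name are the ones under discussion; the rest is illustrative.

```bash
cd /home/wikipedia/common        # the local git repo on deployment-dbdump

echo '// labs tweak' >> wmf-config/InitialiseSettingsDeploy.php
git add wmf-config/InitialiseSettingsDeploy.php   # stages the change AS IT IS NOW

echo '// later experiment' >> wmf-config/InitialiseSettingsDeploy.php
# this second edit lives only in the working copy -- it is not staged

git status --short                         # "MM": staged change plus an unstaged one
git commit -m "RC feed tweak for labs"     # records only what was staged
```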
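And the override itself, as hashar spells it out at 14:30:42. The leading dash on the key is what forces the labs value over the production default; petan's first attempt failed because the dash was missing. The surrounding array is only a sketch of the InitialiseSettingsDeploy.php shape, not the exact file contents.

```php
<?php
// Labs-only overrides (illustrative shape). '-wgRC2UDPAddress' --
// note the dash -- forcibly replaces the value inherited from the
// shared InitialiseSettings.php.
$wmgLabsSettings = array(
	// Stop labs from feeding recent changes into the production relay:
	'-wgRC2UDPAddress' => array( 'default' => false ),

	// Krinkle's alternative: keep the feed, aim it at a labs relay.
	// '-wgRC2UDPAddress' => array( 'default' => 'deployment-feed' ),
);
```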
[14:45:39] Did I understand correctly that if you want to open a new port in the firewall (even if it is only to bastion), you need to recreate the instance?
[14:48:24] hashar: Is there a bug yet for making beta use wmf-config from gerrit? If not I'll open one and note down a few ideas
[14:51:20] Krinkle: I am not sure there is a bug
[14:51:23] k
[14:51:30] Krinkle: but that is surely a work in progress.
[14:51:34] 05/28/2012 - 14:51:34 - Creating a home directory for kolossos at /export/home/maps/kolossos
[14:52:36] 05/28/2012 - 14:52:35 - Updating keys for kolossos at /export/home/maps/kolossos
[14:52:37] hashar: We should also do something about the growing number of "labs" stuff in the wmf-config repo, and the if-cluster things. Should probably go into a "LocalSettings" kind of thing, that is in the local repo only (just like fenari has a local repo) for PrivateSettings.php etc..
[14:53:00] there is a PrivateSettings.php file already
[14:53:06] I know
[14:53:06] as well as some other overriding stuff
[14:53:10] need to clean that out though :-(
[14:53:24] LocalSettings is relatively clean on the cluster
[14:53:28] ie no config
[14:53:40] I said "kind of thing" not the one that is for mediawiki
[14:53:51] nothing should go in there other than include wmf-config probably
[14:54:12] but right now there is some labs stuff in the gerrit repo, some labs stuff in InitialiseSettingsDeploy, and then there is some inline with if-cluster/this/that else constructs
[14:56:36] what would be great is a way to detect the cluster
[14:56:49] for now I have added a $cluster = 'wmflabs' at top of InitialiseSettings.php
[14:57:50] I'd say we use 2 files that are not in the repo: PrivateSettings.php and ClusterSettings.php. And both exist in the local repos only. And to there we move all the cluster specific stuff.
[14:58:20] maybe put ClusterSettings.php in gerrit though, in a sub dir so that they can be maintained by the community more easily
[14:58:31] (both beta and production)
[15:00:28] what I did is an if / else require
[15:00:37] so we require different files based on cluster
[15:00:42] see db.php / db-wmflabs.php
[15:00:42] Yep
[15:00:48] or mc.php mc-wmflabs.php
[15:00:57] since the files are heavily diverging anyway
[15:02:14] is there an environment variable that can be used maybe?
[15:02:20] we'd still need a way to guide that if
[15:02:45] I am not aware of any env var on the cluster
[15:02:51] not sure how apache will get it anyway
[15:03:01] maybe through /etc/profile ? :-(
[15:03:39] or /etc/cluster
[15:03:59] but then we will have to add a stat() on /etc/cluster
[15:04:16] I am pretty sure ops will veto the idea -;]]]
[15:06:43] hashar: see #mediawiki new bug
[15:07:40] happily skipping
[15:16:21] hmm
[15:17:11] the l10n cache system is really crazy
[15:17:15] as well as our overall config
[15:18:34] 05/28/2012 - 15:18:33 - Updating keys for kolossos at /export/home/maps/kolossos
[15:19:12] if( $cluster = 'pmtpa' ) {
[15:19:13] yeah
[15:22:12] https://gerrit.wikimedia.org/r/9128
[15:23:39] !log deployment Fixed a nasty override of $cluster in MediaWiki configuration which caused some interesting issue on labs. See https://gerrit.wikimedia.org/r/9128
[15:23:51] finally I can see something worthwhile at http://commons.wikimedia.beta.wmflabs.org/wiki/File:Mayday2012-edit-1.ogv
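Why the line quoted at 15:19:12 earned a "nasty override" in the !log: a single = in the condition both always succeeds and silently reassigns the variable. A minimal, self-contained illustration of the failure mode (the actual fix is in gerrit change 9128, which may differ in detail):

```php
<?php
// The bug pattern: '=' assigns instead of comparing.
$cluster = 'wmflabs';

if ( $cluster = 'pmtpa' ) {      // always true -- and $cluster is overwritten
	echo "pmtpa-only config now loads on a labs host\n";
}
var_dump( $cluster );            // string(5) "pmtpa" -- silently clobbered

// What the condition should do: compare, never assign.
$cluster = 'wmflabs';
if ( $cluster === 'pmtpa' ) {    // false on labs, as intended
	echo "never reached on labs\n";
}
```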
[15:34:35] 05/28/2012 - 15:34:35 - Updating keys for kolossos at /export/home/maps/kolossos
[15:53:04] hashar: got me (bug 37135). I'm always so serious :D
[15:53:38] that's gonna cost us a beer in Berlin (or two) ;-)
[15:53:44] !bug 37135
[15:53:45] https://bugzilla.wikimedia.org/show_bug.cgi?id=37135
[15:53:49] !bug
[15:53:49] https://bugzilla.wikimedia.org/show_bug.cgi?id=$1
[15:53:54] !delete bug
[15:53:57] !rm bug
[15:53:59] pff
[15:54:04] !bug del
[15:54:05] Successfully removed bug
[15:54:19] Krinke-away: yeah definitely going to have some beers ;-]
[15:54:26] !bug is https://bugzilla.wikimedia.org/$1
[15:54:26] Key was added
[15:54:31] !bug 37135
[15:54:32] https://bugzilla.wikimedia.org/37135
[15:54:33] Krinke-away: thanks!
[15:54:52] Krinke-away: I was honestly just teasing you cause you opened a support request ;D
[15:55:04] Krinke-away: that traceback was really confusing
[15:55:11] It was, yeah
[15:59:32] -if ( file_exists( '/etc/wikimedia-transcoding' ) ) {
[15:59:33] aahhhh
[15:59:38] stat()
[16:22:33] solved by safeguarding it with $cluster == 'wmflabs' && file_exists()
[16:22:34] yeah
[16:22:35] \O/
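The shape of the guard hashar lands on at 16:22:33, sketched below. Short-circuit evaluation means production hosts never pay the stat() he groans about at 15:59:38, and labs hosts without the marker file skip the transcoding config entirely. Everything beyond the quoted condition is illustrative.

```php
<?php
// Before (15:59:32): an unconditional stat() on every request,
// on every cluster:
//     if ( file_exists( '/etc/wikimedia-transcoding' ) ) { ... }

// After (16:22:33): only labs hosts ever touch the filesystem,
// because && stops at the first false operand.
if ( $cluster == 'wmflabs' && file_exists( '/etc/wikimedia-transcoding' ) ) {
	// transcoding-specific configuration goes here
}
```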
[16:45:01] New patchset: Hashar; "revert unmounting /dev/vdb to mount it on /tmp" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8610
[16:45:17] New patchset: Hashar; "only apply admins::* while on labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8642
[16:45:32] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8610
[16:45:32] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8642
[16:59:08] New review: Hashar; "Patchset is a rebase" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8610
[16:59:14] New review: Hashar; "Patchset is a rebase" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8642
[17:01:15] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8610
[17:01:18] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8610
[17:59:40] Ryan_Lane: howdy
[17:59:45] howdy
[17:59:52] seems today is a holiday :D
[18:00:33] okay
[18:00:36] I'll stop then :)
[18:00:40] heh
[18:03:08] but
[18:03:10] New patchset: Faidon; "Add ssh parameter to git::clone" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9146
[18:03:20] heh
[18:03:26] New patchset: Faidon; "puppetmaster: abstract SSL setup into a subclass" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9147
[18:03:32] got to push that last change in eh? :D
[18:03:42] New patchset: Faidon; "Add a puppetmaster::self class" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9148
[18:03:43] you might be interested in that^
[18:03:48] (finally, the bot took a while)
[18:03:56] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9146
[18:03:57] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9147
[18:03:57] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/9148
[18:04:15] sweet
[18:04:24] I'd like your review obviously
[18:04:34] but I guess it can wait until tomorrow :)
[18:04:34] * Ryan_Lane nods
[18:05:20] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9146
[18:05:23] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9146
[18:06:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9147
[18:06:42] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9147
[18:06:47] or not?
[18:06:51] ?
[18:06:57] wait until tomorrow
[18:06:58] easy enough to review now
[18:07:10] you'll make me feel bad for making you work on a holiday :-)
[18:07:55] I'm doing some consulting work today too
[18:09:23] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9148
[18:09:25] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/9148
[18:09:31] oh wow
[18:09:35] you just merged it
[18:09:40] I guess I should commit the private part too then
[18:09:50] oh. I guess I should have just +2'd
[18:09:56] bad habit
[18:10:04] nah, it's fine
[18:11:08] so, I'm using salt for another project, and like it quite a bit
[18:11:18] (for more than just remote execution)
[18:11:43] it needs parameterized classes, though
[18:12:25] it basically forces you into using something like modules
[18:13:08] New patchset: Faidon; "Add labs-puppet-key SSH key" [labs/private] (master) - https://gerrit.wikimedia.org/r/9149
[18:13:17] are you suggesting we replace puppet with salt?
[18:13:44] New review: Faidon; "(no comment)" [labs/private] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9149
[18:14:09] puppet has a big community, ready-made modules and momentum
[18:14:31] plus, we have tons of stuff in our puppet already that would take ages to convert to something else
[18:14:39] nah, just saying it's nice
[18:14:45] and the road in-between (having two tools manage configurations) will be hell
[18:15:14] heh, okay
[18:15:35] I'll keep it in mind
[18:15:39] competition is good too
[18:15:44] hopefully puppet will get better
[18:15:52] have a look at the puppet 3.0 release notes
[18:16:03] they merged a thing called Hiera
[18:16:05] I hate most things about puppet, honestly
[18:16:11] which might be useful to us
[18:16:14] for labs especially
[18:16:23] how so? it's just a new backend
[18:16:26] not sure yet, haven't thought of the possibilities
[18:16:29] I don't see how it changes much
[18:16:33] hiera?
[18:16:44] it's meant to be able to override settings
[18:16:54] variables, configurations etc.
[18:17:02] which /might/ make sense for labs
[18:17:06] * Ryan_Lane nods
[18:17:12] instead of the whole if ($labs) { } else { }
[18:17:17] realm even :)
[18:17:39] anyway
[18:17:41] 9pm here
[18:17:45] drinks time
[18:18:08] hmm, any idea why gerrit 9149 didn't get verified?
[18:18:12] and hence I can't submit?
[18:18:30] New review: Faidon; "(no comment)" [labs/private] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9149
[18:18:32] Change merged: Faidon; [labs/private] (master) - https://gerrit.wikimedia.org/r/9149
[18:22:45] stupid hooks are probably broken in some way
[18:29:40] well, one thing I'm disliking about salt is the lack of line numbers for errors
[18:34:42] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[18:38:28] It would be kinda cool if it validated on push and just rejected you there and then if you failed.
[18:39:42] Hm.. gerrit serves a nice and innocent 404 error for those labs/private urls
[18:39:52] but the irc bot spits them out anyway
[18:39:58] so those comments can leak through ?
[18:40:08] (the first sentence anyway)
[18:40:56] I believe we're supposed to be able to see them when logged in, but ldap search/gerrit/life sucks
[18:41:51] oh. salt gives the state, name, and function that failed and why
[18:42:02] that's almost as good as a line number
[18:43:53] Line and column number ftw.
[19:08:47] https://gerrit.wikimedia.org/r/#/c/9148/ This allows labs instances to be able to run their own puppetmaster <-
[19:08:49] ops rocks :-]
[19:09:28] hashar: thank paravoid ;)
[19:09:43] now if we just fix the IO problems labs will become way more usable
[19:09:57] we had a veryyy big IO issue last monday
[19:10:01] yeah
[19:10:09] heard about it
[19:10:09] was because some user was reading giant files off glusterFS
[19:10:13] yep
[19:10:16] dumps project
[19:10:19] exactly
[19:10:23] now it is way better
[19:10:30] one problem right now is all instances go through one network node
[19:10:44] we need to have a network node per host
[19:10:45] if the dumps project really needs massive I/O, it might use a dedicated / real hard disk instead of gluster
[19:11:11] he should be reading from the public datasets NFS share
[19:11:12] by network node do you mean a virtual switch in Nova/OpenStack ?
[19:11:19] MongoDB.
[19:11:21] yes
[19:11:23] Reedy: heh
[19:12:24] I am still wondering if we could get memcached to be replaced by redis / MongoDB / postgre
[19:12:29] would have to ask domas
[19:12:50] (he will for sure tell me about the mysql hack that makes it work fast and able to replace memcached)
[19:13:06] facebook uses memcache
[19:13:20] most places use memcache. there's nothing wrong with it
[19:13:33] mediawiki's memcache implementation is slightly crappy, though
[19:13:59] tim has added one for the pecl version
[19:14:08] so that's not so much of an issue ;)
[19:14:15] the thing I dislike is the hash being based off the number of servers in $wgMemcachedServer or something
[19:14:18] pecl/pear/php extension/whatever
[19:14:35] and the lack of built-in redundancy between datacenters
[19:14:42] though we could just write to both DCs ahah
[19:24:16] Memcache is awesome
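An illustration of Ryan_Lane's gripe at 19:14:15 about keys being hashed against the server count: with naive modulo placement, resizing the pool remaps almost every key, and every remapped key is a cache miss. This is a generic sketch of the failure mode, not MediaWiki's actual client code.

```php
<?php
// Naive placement: key -> server via modulo over the pool size.
function naiveServer( $key, array $servers ) {
	return $servers[ crc32( $key ) % count( $servers ) ];
}

$pool3 = array( 'mc1', 'mc2', 'mc3' );
$pool4 = array( 'mc1', 'mc2', 'mc3', 'mc4' );   // one server added

$moved = 0;
for ( $i = 0; $i < 1000; $i++ ) {
	$key = "user:$i";
	if ( naiveServer( $key, $pool3 ) !== naiveServer( $key, $pool4 ) ) {
		$moved++;
	}
}
// Growing the pool from 3 to 4 remaps roughly 3 keys in 4;
// consistent hashing would keep the remapped share near 1 in 4.
echo "$moved of 1000 keys moved\n";
```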
[19:48:33] 05/28/2012 - 19:48:33 - Updating keys for kolossos at /export/home/maps/kolossos
[21:17:09] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.54, 12.42, 8.40
[21:27:09] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.50, 2.06, 4.58
[21:58:31] 05/28/2012 - 21:58:31 - Updating keys for kolossos at /export/home/maps/kolossos
[22:02:31] 05/28/2012 - 22:02:31 - Updating keys for kolossos at /export/home/maps/kolossos
[22:23:05] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:05] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:05] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:30] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:23:30] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[22:27:52] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK
[22:27:52] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in
[22:27:52] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[22:28:20] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 0.34, 1.49, 1.30
[22:28:20] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in
[22:31:35] PROBLEM Puppet freshness is now: CRITICAL on mwreview-test1 i-00000297 output: Puppet has not run in last 20 hours
[23:52:35] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[23:55:45] PROBLEM HTTP is now: CRITICAL on deployment-apache21 i-0000026d output: CRITICAL - Socket timeout after 10 seconds