[00:09:48] New patchset: Bhartshorne; "adding a per-project debian repo for labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11969 [00:10:08] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/11969 [00:30:26] RECOVERY Free ram is now: OK on aggregator-test1 i-000002bf output: OK: 91% free memory [00:32:06] RECOVERY Free ram is now: OK on ganglia-test2 i-00000250 output: OK: 86% free memory [02:40:46] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 24% free memory [02:41:30] 06/19/2012 - 02:41:30 - User laner may have been modified in LDAP or locally, updating key in project(s): deployment-prep [02:48:37] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: DISK CRITICAL - free space: / 35 MB (2% inode=57%): [02:52:46] RECOVERY dpkg-check is now: OK on translation-memory-2 i-000002d9 output: All packages OK [02:53:36] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%): [02:54:16] RECOVERY Current Load is now: OK on translation-memory-2 i-000002d9 output: OK - load average: 0.33, 0.34, 0.14 [02:54:26] RECOVERY Current Users is now: OK on translation-memory-2 i-000002d9 output: USERS OK - 1 users currently logged in [02:54:36] RECOVERY Disk Space is now: OK on translation-memory-2 i-000002d9 output: DISK OK [02:55:36] RECOVERY Free ram is now: OK on translation-memory-2 i-000002d9 output: OK: 84% free memory [02:56:56] RECOVERY Total Processes is now: OK on translation-memory-2 i-000002d9 output: PROCS OK: 85 processes [02:57:46] RECOVERY Total Processes is now: OK on integration-apache1 i-000002dc output: PROCS OK: 80 processes [02:58:16] RECOVERY dpkg-check is now: OK on integration-apache1 i-000002dc output: All packages OK [02:58:46] RECOVERY Current Load is now: OK on integration-apache1 i-000002dc output: OK - load average: 0.05, 0.13, 0.09 [02:59:26] RECOVERY Current Users is now: OK on 
integration-apache1 i-000002dc output: USERS OK - 0 users currently logged in [03:00:06] RECOVERY Disk Space is now: OK on integration-apache1 i-000002dc output: DISK OK [03:00:46] RECOVERY Free ram is now: OK on integration-apache1 i-000002dc output: OK: 89% free memory [03:45:46] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory [03:49:46] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 15% free memory [03:54:46] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 12% free memory [03:59:36] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory [04:05:46] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 3% free memory [04:09:46] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory [04:10:46] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:14:35] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory [04:14:45] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [04:14:45] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 4% free memory [04:19:35] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory [04:19:45] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [04:19:45] PROBLEM Free ram is now: WARNING on test3 i-00000093 output: Warning: 13% free memory [04:24:45] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory [05:06:55] RECOVERY Total Processes is now: OK on etherpad-lite-testing i-000002da output: PROCS OK: 96 processes [05:07:54] RECOVERY dpkg-check is now: OK on etherpad-lite-testing i-000002da output: All packages OK [05:08:45] RECOVERY Current Load is now: OK on etherpad-lite-testing i-000002da output: OK - load average: 
0.38, 0.74, 0.48 [05:09:25] RECOVERY Current Users is now: OK on etherpad-lite-testing i-000002da output: USERS OK - 0 users currently logged in [05:09:45] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [05:10:35] PROBLEM Free ram is now: WARNING on etherpad-lite-testing i-000002da output: Warning: 12% free memory [05:11:25] RECOVERY Disk Space is now: OK on etherpad-lite-testing i-000002da output: DISK OK [05:19:55] PROBLEM SSH is now: CRITICAL on mobile-testing i-00000271 output: CRITICAL - Socket timeout after 10 seconds [05:24:45] RECOVERY SSH is now: OK on mobile-testing i-00000271 output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:54:25] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [05:59:15] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 280 processes [06:40:50] RECOVERY Free ram is now: OK on etherpad-lite-testing i-000002da output: OK: 30% free memory [08:40:20] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:20] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:45:10] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK [08:45:10] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in [09:05:46] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:09:55] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [09:37:45] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: Critical: 5% free memory [10:11:37] 06/19/2012 - 10:11:37 - Updating keys for danwe at /export/keys/danwe [10:14:40] 06/19/2012 - 10:14:39 - Updating keys for danwe at /export/keys/danwe [10:46:11] New patchset: Ryan Lane; "Add second master for labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11987 [10:46:33] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11987 [10:46:33] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11987 [10:49:33] 06/19/2012 - 10:49:33 - Deleting home directory for robin in project(s): commons-dev [10:50:21] 06/19/2012 - 10:50:21 - Deleting home directory for robin in project(s): commons-dev [10:51:22] 06/19/2012 - 10:51:21 - Deleting home directory for robin in project(s): commons-dev [10:52:03] *deleting* home directory? [10:52:16] labs-home-wm: you, miss, should never be saying such a thing [10:52:21] 06/19/2012 - 10:52:21 - Deleting home directory for robin in project(s): commons-dev [10:52:35] hm, or should it? [10:52:48] I guess it should, if the person isn't in the project anymore [10:53:22] 06/19/2012 - 10:53:21 - Deleting home directory for robin in project(s): commons-dev [10:53:33] hm. something wrong with ldap on the server [10:54:35] I *hate* nslcd [10:55:23] and no, unless the person has been deleted from LDAP, the home directory should never be deleted. [10:55:39] of course, broken nslcd makes the home directories look like they have no owners [10:55:44] * Ryan_Lane groans [11:13:19] petrb@bastion1:~$ ssh bots-1 [11:13:19] You don't exist, go away! [11:13:23] Ryan_Lane: ? 
[11:13:42] petrb@bastion1:~$ whoami [11:13:42] whoami: cannot find name for user ID 2078 [11:20:11] paravoid: what is ur email [11:20:56] faidon@ but why do you want to send me a personal email? :) [11:21:34] paravoid: not personal [11:22:24] paravoid: you are that guy with debian shirt? [11:22:27] from hackaton? [11:22:46] I hope I didn't send it wrong then [11:22:53] I guess you are from ops? [11:23:55] wm-bot: ping [11:23:55] Hi petan, there is some error, I am a stupid bot and I am not intelligent enough to hold a conversation with you :-) [11:25:27] yes, yes, yes :) [11:42:20] paravoid: ok, read the mail then [11:42:27] fix it :) [11:43:04] I got no mail [11:49:58] petan: ^ [11:53:23] paravoid: from bugzilla [11:53:26] it's fixed btw [11:54:15] fixed by itself? [11:57:44] bugzilla mails sometimes take ages to be delivered [11:58:11] ugh [11:58:17] bastion's nslcd screwed up too? [11:58:43] petan: you said it fixed itself? [11:58:52] no [11:58:56] it was broken now it's fixed [11:59:02] who fixed it I have no idea [11:59:10] it must have fixed itself [11:59:13] nslcd had issues [11:59:13] maybe [11:59:17] aha [11:59:27] it happened when I added the second master [11:59:32] it forced an nslcd restart [11:59:34] which should be find [11:59:35] right [11:59:36] *fine [11:59:51] I bet nslcd negatively cached things when it happened [11:59:56] err [11:59:57] nscd [12:00:03] which is absurd [12:00:05] hm. no [12:00:34] because labs-nfs1 also had this issue, and the cron for making homedirs purges nsc [12:00:35] *nscd [12:04:08] Ryan_Lane: but I can't ssh anywhere [12:04:20] ok, not anywhere, just to bots project [12:04:29] to any instance of bots [12:06:14] it's likely having the same issue [12:06:46] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 3.42, 14.65, 8.39 [12:07:37] petan: bots-1 isn't letting you in? 
[12:08:30] nscd had issues on bots-2 [12:08:37] lemme do a nscd restart across all instances [12:08:37] now it's ok [12:08:56] I just did an nscd restart [12:10:07] maybe I need to always trigger an nscd restart on an nslcd restart [12:16:53] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.38, 2.27, 4.54 [13:11:26] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:26] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 282 processes [14:32:20] New review: Ottomata; "Looks good, but you should set up require dependencies for your resources in the misc::labsdebrepo c..." [operations/puppet] (test) C: 0; - https://gerrit.wikimedia.org/r/11969 [14:33:05] !ping [14:33:05] pong [14:33:08] @help [14:33:09] Type @commands for list of commands. This bot is running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 1.8.2 source code licensed under GPL and located at https://github.com/benapetr/wikimedia-bot [14:33:26] <3 [14:33:35] patching on demand [14:33:38] :D [14:38:35] New patchset: Ryan Lane; "Merge remote-tracking branch 'origin/production' into test" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12006 [14:39:04] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (test); V: -1 - https://gerrit.wikimedia.org/r/12006 [14:40:47] petan: patching on demand? Was it introduced in 1.8.1? Just asking. [14:40:59] no [14:41:13] then, how is it done? 
[14:41:20] secret [14:41:26] :( [14:41:27] :P [14:41:35] only wise programmers know that [14:41:43] ancient masters [14:42:18] sigh, got to stick with the old way then :( [14:42:43] I wonder if its possible to transfer files across projects [14:42:50] yes [14:43:06] deployment-backup is different project [14:43:13] it's backup of deployment :P [14:43:24] so [14:43:27] using scp [14:43:28] but no docs or anything? [14:43:35] I see... [14:43:57] why not rsync? [14:44:13] hm, because I am lazy to set it up [14:44:22] we actually need to backup only sql [14:44:28] since everything is in git [14:44:39] !petan [14:44:39] Petr Bena - http://enwp.org/User:Petrb [14:44:41] eh [14:44:53] try !petan in #mediawiki [14:45:25] New patchset: Ryan Lane; "Merge remote-tracking branch 'origin/production' into test" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12006 [14:45:58] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/12006 [14:48:49] \o/ [14:48:57] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12006 [14:49:00] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12006 [14:49:13] so, let's see what I just broke [14:52:22] yep, puppet's broken: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Cannot reassign variable aptpref at /etc/puppet/manifests/base.pp:66 on node i-00000276.pmtpa.wmflabs [14:53:22] New patchset: Ryan Lane; "Fixing duplicate variable definition" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12015 [14:53:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/12015 [14:54:10] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12015 [14:54:13] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12015 [14:55:35] (Stage[main] => Stage[last] => Class[Base::Instance-finish] => File[/etc/rsyslog.d/60-puppet.conf] => Service[rsyslog] => Class[Base::Remote-syslog] => Class[Base::Instance-finish]) [14:55:40] err: Could not apply complete catalog: Found 1 dependency cycle [14:56:14] PROBLEM Free ram is now: CRITICAL on nginx-dev1 i-000000f0 output: NRPE: Command check_ram not defined [14:59:02] New patchset: Ryan Lane; "Removing dependency cycle" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12016 [14:59:35] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/12016 [14:59:47] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12016 [14:59:49] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12016 [15:01:46] New patchset: Ryan Lane; "Trying again to fix the dependency cycle" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12018 [15:02:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/12018 [15:02:24] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12018 [15:02:26] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12018 [15:04:36] well, puppet is running [15:06:46] overall minor changes [15:10:04] PROBLEM Free ram is now: CRITICAL on ganglia-test2 i-00000250 output: NRPE: Command check_ram not defined [15:10:49] PROBLEM Disk Space is now: CRITICAL on nginx-ffuqua-doom1-3 i-00000196 output: Connection refused by host [15:10:49] PROBLEM Current Users is now: CRITICAL on fundraising-db i-0000015c output: Connection refused by host [15:10:49] PROBLEM Free ram is now: CRITICAL on demo-web2 i-00000285 output: NRPE: Command check_ram not defined [15:10:59] PROBLEM Current Load is now: CRITICAL on nginx-ffuqua-doom1-3 i-00000196 output: Connection refused by host [15:10:59] PROBLEM Disk Space is now: CRITICAL on kripke i-00000268 output: Connection refused by host [15:10:59] PROBLEM Current Users is now: CRITICAL on kripke i-00000268 output: Connection refused by host [15:10:59] PROBLEM Free ram is now: CRITICAL on kripke i-00000268 output: Connection refused by host [15:10:59] PROBLEM Free ram is now: CRITICAL on demo-deployment1 i-00000276 output: NRPE: Command check_ram not defined [15:11:00] PROBLEM Free ram is now: CRITICAL on feeds i-000000fa output: NRPE: Command check_ram not defined [15:11:00] PROBLEM Free ram is now: CRITICAL on nginx-ffuqua-doom1-3 i-00000196 output: Connection refused by host [15:11:24] PROBLEM Current Users is now: CRITICAL on nginx-ffuqua-doom1-3 i-00000196 output: Connection refused by host [15:11:39] PROBLEM Total Processes is now: CRITICAL on fundraising-db i-0000015c output: Connection refused by host [15:11:44] PROBLEM Free ram is now: CRITICAL on ubuntu1-pgehres i-000000fb output: NRPE: Command check_ram not defined [15:11:44] lol wut [15:11:54] PROBLEM 
Free ram is now: CRITICAL on bastion1 i-000000ba output: NRPE: Command check_ram not defined [15:11:54] PROBLEM Free ram is now: CRITICAL on building i-0000014d output: NRPE: Command check_ram not defined [15:11:54] PROBLEM Total Processes is now: CRITICAL on kripke i-00000268 output: Connection refused by host [15:11:58] BEEP [15:11:59] PROBLEM Free ram is now: CRITICAL on outreacheval i-0000012e output: NRPE: Command check_ram not defined [15:11:59] PROBLEM Total Processes is now: CRITICAL on nginx-ffuqua-doom1-3 i-00000196 output: Connection refused by host [15:12:04] PROBLEM Free ram is now: CRITICAL on fundraising-db i-0000015c output: Connection refused by host [15:12:04] PROBLEM Free ram is now: CRITICAL on demo-web1 i-00000255 output: NRPE: Command check_ram not defined [15:12:04] PROBLEM Free ram is now: CRITICAL on resourceloader2-apache i-000001d7 output: NRPE: Command check_ram not defined [15:12:04] PROBLEM Free ram is now: CRITICAL on swift-aux2 i-0000024c output: NRPE: Command check_ram not defined [15:12:04] PROBLEM Free ram is now: CRITICAL on opengrok-web i-000001e1 output: NRPE: Command check_ram not defined [15:12:14] PROBLEM Total Processes is now: WARNING on ganglia-test2 i-00000250 output: PROCS WARNING: 184 processes [15:12:19] PROBLEM Current Load is now: CRITICAL on demo-mysql1 i-00000256 output: Connection refused by host [15:12:19] PROBLEM Current Users is now: CRITICAL on demo-mysql1 i-00000256 output: Connection refused by host [15:12:19] PROBLEM Disk Space is now: CRITICAL on demo-mysql1 i-00000256 output: Connection refused by host [15:12:19] PROBLEM Total Processes is now: CRITICAL on demo-mysql1 i-00000256 output: Connection refused by host [15:12:24] PROBLEM Disk Space is now: CRITICAL on fundraising-db i-0000015c output: Connection refused by host [15:12:24] PROBLEM Current Load is now: CRITICAL on fundraising-db i-0000015c output: Connection refused by host [15:12:24] PROBLEM Current Load is now: CRITICAL on kripke i-00000268 
output: Connection refused by host [15:12:39] PROBLEM dpkg-check is now: CRITICAL on kripke i-00000268 output: Connection refused by host [15:12:39] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: NRPE: Command check_ram not defined [15:12:39] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: NRPE: Command check_ram not defined [15:12:39] PROBLEM dpkg-check is now: CRITICAL on nginx-ffuqua-doom1-3 i-00000196 output: Connection refused by host [15:13:54] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: Connection refused by host [15:13:54] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002dc output: NRPE: Command check_ram not defined [15:13:54] PROBLEM Free ram is now: CRITICAL on ve-nodejs i-00000245 output: NRPE: Command check_ram not defined [15:14:14] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: Connection refused by host [15:14:25] bots should have their own channel [15:14:44] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: Connection refused by host [15:14:44] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: Connection refused by host [15:15:54] PROBLEM Free ram is now: CRITICAL on analytics i-000000e2 output: NRPE: Command check_ram not defined [15:15:54] PROBLEM Free ram is now: CRITICAL on catsort-pub i-000001cc output: NRPE: Command check_ram not defined [15:15:54] PROBLEM Free ram is now: CRITICAL on deployment-mc i-0000021b output: NRPE: Command check_ram not defined [15:15:54] PROBLEM Free ram is now: CRITICAL on cn-wiki-db-lucid i-00000241 output: NRPE: Command check_ram not defined [15:15:54] PROBLEM Free ram is now: CRITICAL on gerrit i-000000ff output: NRPE: Command check_ram not defined [15:16:14] PROBLEM Free ram is now: CRITICAL on wmde-test i-000002ad output: NRPE: Command check_ram not defined [15:16:14] PROBLEM Free ram is now: CRITICAL on secondinstance i-0000015b output: NRPE: 
Command check_ram not defined [15:16:24] PROBLEM Free ram is now: CRITICAL on blamemaps-s1 i-000002c3 output: NRPE: Command check_ram not defined [15:16:24] PROBLEM Free ram is now: CRITICAL on build1 i-000002b3 output: NRPE: Command check_ram not defined [15:16:24] PROBLEM Free ram is now: CRITICAL on demo-mysql1 i-00000256 output: Connection refused by host [15:16:24] PROBLEM Free ram is now: CRITICAL on swift-fe1 i-000001d2 output: NRPE: Command check_ram not defined [15:16:44] PROBLEM Free ram is now: CRITICAL on exim-test i-00000265 output: NRPE: Command check_ram not defined [15:16:44] PROBLEM Free ram is now: CRITICAL on labs-nfs1 i-0000005d output: NRPE: Command check_ram not defined [15:16:44] PROBLEM dpkg-check is now: CRITICAL on demo-mysql1 i-00000256 output: Connection refused by host [15:16:54] PROBLEM Free ram is now: CRITICAL on bots-1 i-000000a9 output: NRPE: Command check_ram not defined [15:16:54] PROBLEM Free ram is now: CRITICAL on bots-labs i-0000015e output: NRPE: Command check_ram not defined [15:16:54] PROBLEM Free ram is now: CRITICAL on memcache-puppet i-00000153 output: NRPE: Command check_ram not defined [15:16:54] PROBLEM Free ram is now: CRITICAL on p-b i-000000ae output: NRPE: Command check_ram not defined [15:17:04] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: Connection refused by host [15:17:04] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: Connection refused by host [15:17:10] Ryan_Lane: minor? [15:17:14] PROBLEM Free ram is now: CRITICAL on bots-cb i-0000009e output: NRPE: Command check_ram not defined [15:17:14] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: NRPE: Command check_ram not defined [15:17:14] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: NRPE: Command check_ram not defined [15:17:21] hm [15:17:27] well, puppet may have removed the check? [15:17:32] maybe? [15:17:32] probably yes [15:17:36] or... 
[15:17:39] maybe it screwed up nrpe? [15:17:54] I think it removed the command [15:18:03] let me check [15:18:14] allowed_hosts=127.0.0.1 [15:18:17] that doesn't look right [15:18:31] oh [15:18:40] allowed_hosts=10.4.0.34 [15:19:11] check_ram is indeed gone [15:19:54] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: NRPE: Command check_ram not defined [15:19:54] hm. I wonder how that happened [15:20:49] PROBLEM Current Users is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [15:20:54] PROBLEM Free ram is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [15:20:54] PROBLEM Free ram is now: CRITICAL on firstinstance i-0000013e output: NRPE: Command check_ram not defined [15:20:54] PROBLEM Total Processes is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [15:20:59] PROBLEM Free ram is now: CRITICAL on fundraising-civicrm i-00000169 output: NRPE: Command check_ram not defined [15:20:59] PROBLEM Free ram is now: CRITICAL on bots-dev i-00000190 output: NRPE: Command check_ram not defined [15:20:59] PROBLEM Free ram is now: CRITICAL on redis1 i-000002b6 output: NRPE: Command check_ram not defined [15:21:14] PROBLEM Free ram is now: CRITICAL on mobile-feeds i-000000c1 output: NRPE: Command check_ram not defined [15:21:14] PROBLEM Current Load is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [15:21:14] PROBLEM Free ram is now: CRITICAL on deployment-cache-upload i-00000263 output: NRPE: Command check_ram not defined [15:21:14] PROBLEM Free ram is now: CRITICAL on hugglewa-1 i-000001e0 output: NRPE: Command check_ram not defined [15:21:24] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host [15:21:24] PROBLEM Free ram is now: CRITICAL on deployment-transcoding i-00000105 output: NRPE: Command check_ram not defined [15:21:24] Ryan_Lane: yes [15:21:34] Ryan_Lane: I think it's because we are using template for nagios 
[15:21:41] there is different nrpe config [15:21:44] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host [15:21:45] ah [15:21:49] you pushed production version to labs [15:21:49] PROBLEM dpkg-check is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [15:21:49] PROBLEM Free ram is now: CRITICAL on swift-be3 i-000001c9 output: NRPE: Command check_ram not defined [15:21:49] PROBLEM Free ram is now: CRITICAL on wikistream-1 i-0000016e output: NRPE: Command check_ram not defined [15:21:49] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host [15:21:49] PROBLEM Current Users is now: WARNING on bastion-restricted1 i-0000019b output: USERS WARNING - 10 users currently logged in [15:21:50] PROBLEM Disk Space is now: CRITICAL on dumps-1 i-00000170 output: Connection refused by host [15:21:54] PROBLEM Free ram is now: CRITICAL on bastion-restricted1 i-0000019b output: NRPE: Command check_ram not defined [15:21:54] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host [15:21:54] PROBLEM Free ram is now: CRITICAL on bots-nfs i-000000b1 output: NRPE: Command check_ram not defined [15:21:54] PROBLEM Free ram is now: CRITICAL on pediapress-ocg1 i-00000233 output: NRPE: Command check_ram not defined [15:21:54] PROBLEM Free ram is now: CRITICAL on test2 i-0000013c output: NRPE: Command check_ram not defined [15:22:14] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host [15:22:14] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host [15:22:50] I don't really say it's wrong I like all these checks [15:22:58] but we should insert this one back [15:26:23] petan: yeah. 
that's fine [15:26:34] ok tell me when you do that :) [15:26:35] let's wait till I'm finished merging the branches, though [15:26:39] ok [15:26:43] btw why we merge them? [15:27:04] because we're going to get rid of the test branch [15:27:07] and we want to keep our changes [15:27:07] ooh [15:27:13] how we are going to test stuff [15:27:19] with no test branch [15:27:24] https://labsconsole.wikimedia.org/wiki/Help:SelfHostedPuppet [15:27:34] I thought we first test in labs then push to prod [15:27:44] yeah, that's possible with that [15:27:49] that link [15:28:42] yay [15:28:45] cool [16:40:39] 06/19/2012 - 16:40:38 - Updating keys for cneubauer at /export/keys/cneubauer [16:42:39] Change on mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=551571 edit summary: /* Proposals */ [17:13:32] 06/19/2012 - 17:13:32 - User danny_b may have been modified in LDAP or locally, updating key in project(s): bastion,configtest [17:13:38] 06/19/2012 - 17:13:38 - Updating keys for danny_b at /export/keys/danny_b [17:39:43] Change on mediawiki a page Wikimedia Labs/Per-project saltstack remote execution was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=551592 edit summary: /* Problems with this solution */ [17:47:28] Change on mediawiki a page Wikimedia Labs/Per-project saltstack remote execution was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=551594 edit summary: [17:47:45] Change on mediawiki a page Wikimedia Labs/Per-project saltstack remote execution was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=551595 edit summary: [21:04:06] New patchset: Bhartshorne; "adding a per-project debian repo for labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11969 [21:04:24] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (test); V: -1 - https://gerrit.wikimedia.org/r/11969 [21:05:24] New patchset: Bhartshorne; "adding a per-project debian repo for labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11969 [21:06:13] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/11969 [21:10:57] Ryan_Lane: I think that's right to implement https://labsconsole.wikimedia.org/wiki/User:Bhartshorne/Using_debs_in_labs. [21:11:00] interested in reviewing it? [21:11:14] I already did [21:11:15] (I haven't tested the syntax for the sources.list.d/ file yet. [21:11:18] and said you should move it [21:11:31] no, I mean the puppet configs. [21:11:59] puppet configs? [21:12:12] that gerrit comment right above my message. [21:12:17] https://gerrit.wikimedia.org/r/11969 [21:12:22] oh [21:12:53] why /srv? [21:12:59] rather than /data/project? [21:13:04] isn't that where the shared storage is? [21:13:10] no [21:13:17] it's /data/project [21:13:27] excellent! I will change that. [21:14:22] New patchset: Bhartshorne; "adding a per-project debian repo for labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11969 [21:14:25] changed. [21:14:42] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/11969 [21:16:22] Ryan_Lane: So I guess Erik wants the public DNS for etherpad after all, did you get that email? [21:17:54] marktraceur: emailed back to him [21:18:31] 06/19/2012 - 21:18:31 - User ben may have been modified in LDAP or locally, updating key in project(s): testlabs,bastion,deployment-prep,mobile-stats,leslie,swift,swift3,packaging [21:18:37] 06/19/2012 - 21:18:37 - Creating a home directory for ben at /export/keys/ben [21:19:40] 06/19/2012 - 21:19:40 - Updating keys for ben at /export/keys/ben [21:25:58] marktraceur: which project is this again? [21:26:56] Ryan_Lane: so i heard you are the master of labs public ip's? 
[21:27:25] Danny_B|backup: well, anyone on ops technically is, but what did you need one for? [21:27:28] marktraceur: etherpad? [21:27:35] or another project? [21:28:09] Ryan_Lane: to enable the testing wikis to be accessible for ppl without labs account [21:28:33] ok. which project is this? [21:29:07] Ryan_Lane: Yes, Etherpad on labs [21:29:13] ok [21:29:16] gimme a sec [21:29:23] Thanks [21:29:36] marktraceur: done. you can allocate an ip now [21:31:20] Ryan_Lane: configtest. we've just set up the first testwiki and need to test it and then we'll create those others i told you about in berlin [21:31:30] * Ryan_Lane nods [21:31:55] Danny_B|backup: ok. I upped your quota, you can allocate an ip now [21:41:47] No Nova credentials found for your account. [21:42:14] New patchset: Bhartshorne; "adding a per-project debian repo for labs" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11969 [21:42:34] Danny_B|backup: log out and back in [21:42:36] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/11969 [21:43:19] oh [21:43:20] crap [21:43:25] maplebed: did you read the topic? [21:43:26] heh [21:43:41] no... [21:43:43] I'm trying to kill off the test branch [21:44:04] maplebed: https://labsconsole.wikimedia.org/wiki/Help:SelfHostedPuppet [21:44:22] Ryan_Lane: I want the change I'm submitting to be available for all projects, not just mine. [21:44:30] !puppet::self is https://labsconsole.wikimedia.org/wiki/Help:SelfHostedPuppet [21:44:31] Key was added [21:44:58] we'll be moving to the production branch immanently [21:45:13] that's why I wanted a review of that merge for swift [21:45:16] the test branch is going away [21:45:31] people will test locally, then push into production for review [21:46:17] I can do a second merge [21:46:20] oomph. [21:46:34] I'll probably need to anyway, since some stuff is going to get yanked from the merge anyway [21:46:36] ok. 
so I should push that change against the prod branch? [21:46:40] Ryan_Lane: move it to front in topic [21:48:16] !puppet::self del [21:48:17] Ryan_Lane: can I sneak this one in since you'll have to do another merge anyways? [21:48:17] Successfully removed puppet::self [21:48:24] maplebed: yeah, that's what I meant [21:48:29] ok. [21:48:31] !puppetmaster::self is https://labsconsole.wikimedia.org/wiki/Help:SelfHostedPuppet [21:48:32] Key was added [21:48:39] New review: Bhartshorne; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11969 [21:48:40] in it goes. [21:48:45] Change merged: Bhartshorne; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/11969 [21:48:50] !puppet::self alias puppetmaster::self [21:48:50] Created new alias for this key [21:48:52] !puppet::self [21:48:52] https://labsconsole.wikimedia.org/wiki/Help:SelfHostedPuppet [21:50:14] maplebed: move the docs to the main namespace, and link to it from https://labsconsole.wikimedia.org/wiki/Help:Contents#Puppetization.2C_packaging.2C_and_moving_to_production [21:50:21] err [21:50:22] sorry [21:50:25] to the help namespace [21:51:34] https://labsconsole.wikimedia.org/wiki/Help:Using_debs_in_labs [21:52:31] and linked. [21:52:59] thanks [21:57:24] it works! [21:57:47] I need to have a doc sprint one of these days :D [21:58:19] should I add the class to an existing puppet group or create a new one? [21:58:28] it sorta fits in 'building' but only kinda. [21:58:48] it sorta fits in puppet, but maybe less so. [22:00:19] Ryan_Lane: unless you have an opinion, I think I'll put it in building. [22:04:39] I think building makes sense [22:04:49] because they tend to go hand and hand [22:05:39] *hand in hand [22:10:01] Ryan_Lane: how's the big fat merge? 
[22:10:10] I'm waiting on people to review [22:10:16] maplebed: *cough* *cough* [22:10:16] heh [22:10:23] ma rk as well [22:10:52] only ganglia and swift really worry me [22:11:12] overall the changes were fairly minimal [22:11:32] which means people were actually pretty good about moving changes into production! [22:11:41] just not from production to test [22:19:16] Ryan_Lane: so, no one should use test anymore? [22:19:27] Ryan_Lane: can we just revoke privileges from gerrit? [22:19:30] it'll be easier if they don't [22:19:32] can't yet [22:19:46] until the merge is through, we need to keep it open [22:20:00] because we need to fix things in test, then re-merge [22:20:13] once that happens, I'll make it read-only [22:20:23] then switch virt0 to use production [22:20:43] then I can delete the branch [22:21:24] and puppetmaster::self! [22:21:38] eh? [22:21:49] puppetmaster::self is in the merge ;) [22:40:53] Ryan_Lane: I'm trying to look at https://gerrit.wikimedia.org/r/#/c/12021/ [22:41:12] it's a merge change [22:41:13] do I want to cherry-pick it or pull it? [22:41:17] so you need to fetch it and do a diff [22:41:18] I haven't done a merge change before. [22:41:24] it's annoying [22:41:34] this is one feature I'd like added in gerrit very badly [22:41:45] so from the gerrit interface, there's the 'pull' tab; that's what I want? [22:41:46] the ability to review a merge commit via the interface [22:41:52] I think so [22:42:00] do it in a new local branch [22:42:05] based on origin/production [22:42:21] of course, some new crap went into production since my merge, so it's going to get a little hairy [22:42:43] ideally, you'd diff it against the commit right before it [22:42:52] then it would ignore everything else [22:43:19] git diff 391d540b14968a07906748262d48022b7f2e093c..HEAD [22:43:37] where head is the merge's commit [22:44:12] that line isn't showing me what I expected. (it's giving me much more) [22:44:18] ah.
here we go [22:44:20] sec [22:44:31] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=a7f05f30b54e60205e6445c7f84c46609939c736;hp=391d540b14968a07906748262d48022b7f2e093c [22:44:39] lemme add that as a comment [22:45:02] but that's got all sorts of shit like files/planet/es_config.ini... [22:45:06] that has nothing to do with swift [22:45:12] yes [22:45:17] it's a merge of test into production [22:45:31] oh, so you don't need me to review the entire thing, just the swift portions? [22:45:32] you'll need to look through the whole thing for the swift stuff [22:45:39] ahhhh..... [22:45:53] I thought your merge was specifically the swift stuff getting pulled in. [22:46:03] this makes much more sense now. [22:53:30] Ryan_Lane: to make sure I'm looking at this right, would you mind asking for the diff to manifests/swift.pp from that merge so that I can verify that what I'm seeing for my diff is the right one? [22:53:58] this is what I see: http://pastebin.com/TjbQq8bE [22:54:51] is that not correct? [22:55:03] there's a possibility I fucked up some things when merging production into test [22:55:25] I suppose I can check by hand (by looking at a fresh checkout of both test and production) [22:55:32] that's how I've been doing it [22:55:37] I don't know if it's correct; I just wanted to make sure I'm looking at the same thing you are. [22:55:46] I gave you a link [22:55:53] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=a7f05f30b54e60205e6445c7f84c46609939c736;hp=391d540b14968a07906748262d48022b7f2e093c [22:56:10] oh, I missed the patch link. (I only looked at blob). [22:56:20] excellent. [22:56:24] thanks. [22:56:26] yw [23:06:00] Ryan_Lane: all swift-related changes in that merge are ok except for one. [23:06:12] there's one config change that shouldn't be merged. [23:06:12] can you remove that from test? [23:06:21] then I can re-merge [23:06:27] yeah... ok, that'll be easy.
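[editor's note] The merge review discussed above comes down to diffing the merge commit against the commit right before it and, since only the swift portions mattered, narrowing the diff to one path. A minimal sketch of that in Python follows. The two commit hashes and the manifests/swift.pp path are the ones quoted in the chat; the function names `merge_diff_args` and `merge_diff` are hypothetical helpers made up for illustration, and the sketch assumes a local clone in which both commits have already been fetched.

```python
import subprocess
from typing import List, Sequence


def merge_diff_args(base: str, merge: str, paths: Sequence[str] = ()) -> List[str]:
    """Build the argv for `git diff base..merge`, optionally limited to paths."""
    args = ["git", "diff", f"{base}..{merge}"]
    if paths:
        # "--" separates revisions from path arguments in git.
        args.append("--")
        args.extend(paths)
    return args


def merge_diff(base: str, merge: str, paths: Sequence[str] = (), repo: str = ".") -> str:
    """Run the diff inside the clone at `repo` and return its text output."""
    result = subprocess.run(
        merge_diff_args(base, merge, paths),
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hashes as quoted in the chat: the commit before the merge, then the merge.
    print(merge_diff(
        "391d540b14968a07906748262d48022b7f2e093c",
        "a7f05f30b54e60205e6445c7f84c46609939c736",
        ["manifests/swift.pp"],
    ))
```

Limiting the diff to a path is one way to avoid wading through unrelated files such as files/planet/es_config.ini; whether it catches every swift-related change depends on where those changes live, so the full diff may still be worth a skim.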
[23:07:27] New patchset: Bhartshorne; "production doesn't write thumbs anymore." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12122 [23:07:48] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/12122 [23:08:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12122 [23:08:29] ok, merged. [23:08:30] Change merged: Bhartshorne; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/12122 [23:09:05] great [23:09:06] thanks [23:09:58] Is it safe to guess that "nc: getaddrinfo: Name or service not known" is a familiar error message with SSH into labs instances? If so, what is the solution? [23:10:03] I'd like to double check when you create the new merge, since if that config option gets changed in production, it'll ... ummm... well, I don't actually know what will happen but I'm pretty sure it won't be good. [23:10:33] marktraceur: are you using an ssh ProxyCommand? [23:10:49] Yup [23:10:59] maplebed: And it was running fine until this new instance [23:11:00] that looks like the spot the error is coming from. [23:11:12] I'm not sure how to diagnose, but that's the spot to look at. [23:11:25] maybe the instance DNS entry hasn't propagated yet? [23:11:35] Might be [23:11:52] Ah well, I'll wait for it [23:11:58] Thanks much [23:12:17] that's just a stab in the dark... I wouldn't put too much faith in it being right. [23:13:00] It seems accurate--anyway, given half an hour or so without success, a reboot ought to solve the problem [23:13:18] If not, it's on my end [23:22:29] marktraceur: did you delete/recreate an instance? [23:26:25] Ryan_Lane: I deleted one, and created one with a different name [23:26:35] (I realized "tiny" was too tiny) [23:26:40] that should work perfectly fine [23:26:44] what's its name?
[23:27:00] Ryan_Lane: etherpad-lite [23:27:33] I was able to get in [23:27:43] But not by the name, by the ip address [23:27:52] that's odd [23:27:58] that shouldn't work [23:28:02] Haha [23:28:14] it should have failed creation if it didn't go into dns [23:28:26] *nod* very strange indeed [23:29:05] wtf [23:29:11] hm [23:29:38] OK, twilight zone [23:29:43] The files are still here [23:29:51] in /data/project? [23:29:59] In my home directory [23:30:04] of course [23:30:14] OK, I just haven't experienced it before [23:30:17] Cool feature [23:30:20] home directories are shared to every instance inside of a project [23:30:24] ah [23:30:24] crap [23:30:28] jobs aren't running [23:30:33] Aha! [23:30:42] now the instance has DNS [23:30:48] Thanks [23:30:49] * Ryan_Lane grumbles [23:31:07] I need to add a cron for that
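
[editor's note] The "nc: getaddrinfo: Name or service not known" error above turned out to be a missing DNS entry: the instance was reachable by IP address but not by name until the DNS job ran. A rough way to tell those cases apart is to check whether the name resolves at all before blaming the ProxyCommand. The sketch below uses Python's standard `socket.getaddrinfo`; the helper name `resolves` is hypothetical, and the exact hostname to test (the chat only gives the short name etherpad-lite) depends on the local resolver setup.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the local resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Same class of failure nc reported: the name is not known.
        return False
    return True


if __name__ == "__main__":
    # Short instance name from the chat; a fully qualified name may be
    # needed depending on the search domains configured locally.
    name = "etherpad-lite"
    print(f"{name}: {'resolves' if resolves(name) else 'does not resolve'}")
```

If the name does not resolve but the IP address works, the problem is on the DNS side, which matches what happened here.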