[07:40:34] petan: ryan upgraded the cluster overnight :)
[07:40:36] upgraded aha
[07:40:37] :D
[07:40:37] !ping
[07:40:38] pong
[07:40:38] you must have missed the labs-l post ;)
[07:40:40] I just came to office :P
[07:40:41] will check my personal mail soon
[07:41:22] !beta restarted apache process on Apaches boxes
[07:41:22] !log deployment-prep restarted apache process on Apaches boxes
[07:41:25] Logged the message, Master
[07:41:40] let me know if you run into bugs
[07:41:44] it's freshly upgraded
[07:42:38] deployment-prep likely needs to be booted in the proper order
[07:42:53] would really be nice to kill off the boot order dependencies
[07:43:38] do you have any dependency in mind?
[07:43:54] some of the instances fail because there's an nfs mount in the fstab
[07:44:06] ahh
[07:44:32] when I migrated from our NFS instance to /data/project, I did not have the mount ensure=>absent snippets in puppet :(
[07:44:58] heh
[07:45:25] RECOVERY HTTP is now: OK on deployment-apache33 i-0000031b output: HTTP OK: HTTP/1.1 200 OK - 27256 bytes in 0.087 second response time
[07:46:22] nagios works? o.o
[07:46:26] that's news
[07:46:34] heh
[07:46:38] who fixed it
[07:46:44] <-
[07:46:47] yay
[07:46:50] how?
[07:46:57] petan: still broken : http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?hostgroup=deployment-prep&style=detail
[07:47:04] nrpe had the wrong ip on a number of instances
[07:47:08] ooh
[07:47:25] though yeah, still has issues
[07:47:27] not sure why
[07:47:36] (No output returned from plugin)
[07:47:39] hm
[07:47:40] not sure why it's doing that
[07:47:42] that's weird
[07:47:50] haven't really investigated much, though
[07:47:58] is apache puppetized on that instance?
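For context on the NRPE fix mentioned above ("nrpe had the wrong ip on a number of instances"): NRPE refuses connections from any host not listed in its `allowed_hosts` directive, so an instance whose config carries the wrong monitoring-server address fails its remote checks. A sketch of the relevant fragment — the 10.0.0.5 address here is purely illustrative, not the actual labs Nagios server:

```
# /etc/nagios/nrpe_local.cfg — local overrides for nrpe.cfg
# Comma-separated list of hosts allowed to talk to this NRPE daemon.
# 127.0.0.1 permits local testing; 10.0.0.5 stands in for the
# (hypothetical) Nagios server address. A stale or wrong IP here is
# the kind of misconfiguration being fixed in the conversation above.
allowed_hosts=127.0.0.1,10.0.0.5
```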
[07:47:58] we need to have a bug for that :P
[07:48:05] because / needs to redirect
[07:48:06] surely not
[07:48:15] I fixed it and somehow that disappeared
[07:48:28] there is a lot of non puppetized stuff
[07:48:34] * Ryan_Lane nods
[07:48:50] !beta migrated Apaches boxes from applicationserver::labs to role::applicationserver
[07:48:50] !log deployment-prep migrated Apaches boxes from applicationserver::labs to role::applicationserver
[07:48:52] Logged the message, Master
[07:49:00] I need to fix some of my scripts
[07:49:05] I'm going to do that tomorrow
[07:49:13] instance creation, for instance, works
[07:49:14] but...
[07:49:15] PROBLEM Puppet freshness is now: CRITICAL on robh-spl i-00000369 output: (Return code of 127 is out of bounds - plugin may be missing)
[07:49:25] home directories aren't being created
[07:49:39] keys should still be updated, though
[07:49:56] and gluster shares aren't being updated either
[07:50:19] both are easy fixes, though :)
[07:51:02] allowed_hosts=127.0.0.1
[07:51:08] nrpe_local.cfg
[07:51:46] PROBLEM host: deployment-feed is DOWN address: i-00000118 CRITICAL - Host Unreachable (i-00000118)
[07:51:46] PROBLEM host: gerrit-puppet-andrewhaul is DOWN address: i-000003c8 CRITICAL - Host Unreachable (i-000003c8)
[07:51:46] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2)
[07:52:06] PROBLEM host: ve-nodejs is DOWN address: i-00000245 CRITICAL - Host Unreachable (i-00000245)
[07:52:14] ok. i need to go to bed
[07:52:15] * Ryan_Lane waves
[07:52:16] PROBLEM host: configtest-main is DOWN address: i-000002dd CRITICAL - Host Unreachable (i-000002dd)
[07:52:16] PROBLEM host: conventionextension-test is DOWN address: i-000003c0 CRITICAL - Host Unreachable (i-000003c0)
[07:52:16] PROBLEM host: deployment-cache-upload is DOWN address: i-00000263 CRITICAL - Host Unreachable (i-00000263)
[07:52:26] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1)
[07:52:32] Ryan_Lane: have a good night :)
[07:52:47] thanks. see you guys tomorrow
[07:52:56] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2)
[07:52:56] PROBLEM host: deployment-backup is DOWN address: i-000000f8 CRITICAL - Host Unreachable (i-000000f8)
[07:55:26] RECOVERY HTTP is now: OK on deployment-apache32 i-0000031a output: HTTP OK: HTTP/1.1 200 OK - 27256 bytes in 0.085 second response time
[07:57:16] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae)
[07:58:56] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2)
[08:19:16] PROBLEM Puppet freshness is now: CRITICAL on php-packaging i-000003ae output: (Return code of 127 is out of bounds - plugin may be missing)
[08:22:16] PROBLEM Puppet freshness is now: CRITICAL on mobile-sphinx i-00000364 output: (Return code of 127 is out of bounds - plugin may be missing)
[08:22:16] PROBLEM Puppet freshness is now: CRITICAL on syslogcol-ac i-00000362 output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:16] PROBLEM Puppet freshness is now: CRITICAL on cvn-apache2 i-00000339 output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-apache32 i-0000031a output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-apache33 i-0000031b output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-cache-bits02 i-0000031c output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-jobrunner06 i-0000031d output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:17] PROBLEM Puppet freshness is now: CRITICAL on greensmw1 i-0000032c output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:17] PROBLEM Puppet freshness is now: CRITICAL on incubator-bot3 i-00000340 output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:18] PROBLEM Puppet freshness is now: CRITICAL on jesusaurus-cleanup i-0000038a output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:18] PROBLEM Puppet freshness is now: CRITICAL on pdbhandler-1 i-0000030e output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:19] PROBLEM Puppet freshness is now: CRITICAL on signwriting-ase10 i-00000322 output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:19] PROBLEM Puppet freshness is now: CRITICAL on signwriting-ase9 i-00000316 output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:20] PROBLEM Puppet freshness is now: CRITICAL on sultest1 i-0000032d output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:20] PROBLEM Puppet freshness is now: CRITICAL on sultest2 i-00000330 output: (Return code of 127 is out of bounds - plugin may be missing)
[08:29:21] PROBLEM Puppet freshness is now: CRITICAL on sultestdb i-0000032f output: (Return code of 127 is out of bounds - plugin may be missing)
[09:27:16] PROBLEM Puppet freshness is now: CRITICAL on dumps-1 i-00000355 output: (Return code of 127 is out of bounds - plugin may be missing)
[09:52:16] PROBLEM Puppet freshness is now: CRITICAL on mediawiki-dev-1 i-0000039c output: (Return code of 127 is out of bounds - plugin may be missing)
[10:12:16] PROBLEM Puppet freshness is now: CRITICAL on dumps-incr i-0000035d output: (Return code of 127 is out of bounds - plugin may be missing)
[10:12:16] PROBLEM Puppet freshness is now: CRITICAL on translation-memory-3 i-00000358 output: (Return code of 127 is out of bounds - plugin may be missing)
[10:12:25] hashar
[10:12:40] meh
[10:14:19] !log deployment-prep root: test
[10:14:21] Logged the message, Master
[10:20:09] !log deployment-prep root: rebooting bastion
[10:20:10] Logged the message, Master
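The boot-order failures petan described earlier (instances hanging because a stale NFS mount is still in the fstab after the migration to /data/project) are exactly what the "mount ensure=>absent snippets" he mentions would clean up. A hedged sketch with a hypothetical mount point — the real manifests and paths are not quoted in this log:

```
# Hypothetical cleanup snippet: after migrating off the NFS instance,
# a resource like this unmounts the stale share and removes its fstab
# entry, so instances no longer hang at boot waiting on a dead NFS server.
mount { '/home':
    ensure => absent,   # unmount if mounted, and delete the fstab line
}
```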
[10:44:16] PROBLEM Puppet freshness is now: CRITICAL on hildisvini i-000003ac output: (Return code of 127 is out of bounds - plugin may be missing)
[10:59:16] PROBLEM Puppet freshness is now: CRITICAL on gerrit-db i-0000038b output: (Return code of 127 is out of bounds - plugin may be missing)
[10:59:16] PROBLEM Puppet freshness is now: CRITICAL on wikiminiatlas i-0000038c output: (Return code of 127 is out of bounds - plugin may be missing)
[11:01:16] PROBLEM Puppet freshness is now: CRITICAL on gerrit-build i-00000387 output: (Return code of 127 is out of bounds - plugin may be missing)
[11:01:16] PROBLEM Puppet freshness is now: CRITICAL on puppet-abogott i-00000389 output: (Return code of 127 is out of bounds - plugin may be missing)
[11:23:16] PROBLEM Puppet freshness is now: CRITICAL on apachemxetc i-00000348 output: (Return code of 127 is out of bounds - plugin may be missing)
[11:23:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-cache-upload03 i-0000034b output: (Return code of 127 is out of bounds - plugin may be missing)
[11:23:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-integration i-0000034a output: (Return code of 127 is out of bounds - plugin may be missing)
[11:23:16] PROBLEM Puppet freshness is now: CRITICAL on extrev1 i-00000346 output: (Return code of 127 is out of bounds - plugin may be missing)
[11:23:16] PROBLEM Puppet freshness is now: CRITICAL on rocsteady-cleanup i-00000349 output: (Return code of 127 is out of bounds - plugin may be missing)
[11:37:16] PROBLEM Puppet freshness is now: CRITICAL on aggregator-test1 i-000002bf output: (Return code of 127 is out of bounds - plugin may be missing)
[11:37:16] PROBLEM Puppet freshness is now: CRITICAL on aggregator1 i-0000010c output: (Return code of 127 is out of bounds - plugin may be missing)
[11:37:16] PROBLEM Puppet freshness is now: CRITICAL on aggregator2 i-000002c0 output: (Return code of 127 is out of bounds - plugin may be missing)
[11:38:16] PROBLEM Puppet freshness is now: CRITICAL on asher1 i-0000003a output: (Return code of 127 is out of bounds - plugin may be missing)
[11:38:16] PROBLEM Puppet freshness is now: CRITICAL on bastion-restricted1 i-0000019b output: (Return code of 127 is out of bounds - plugin may be missing)
[11:38:16] PROBLEM Puppet freshness is now: CRITICAL on bastion1 i-000000ba output: (Return code of 127 is out of bounds - plugin may be missing)
[11:38:16] PROBLEM Puppet freshness is now: CRITICAL on bob i-0000012d output: (Return code of 127 is out of bounds - plugin may be missing)
[11:38:16] PROBLEM Puppet freshness is now: CRITICAL on bots-1 i-000000a9 output: (Return code of 127 is out of bounds - plugin may be missing)
[14:47:07] <^demon> andrewbogott: So, on a fresh install I'm getting mostly a success, barring http://p.defau.lt/?ItbWWAn87WYxjnsMoctl0g
[14:47:39] ^demon: Yeah, I was wondering what to do about that private reference, just when labs stopped working yesterday.
[14:47:48] We don't need the private key for labs/gerrit do we?
[14:47:59] <^demon> Yes we will.
[14:48:09] <^demon> So we can test replication.
[14:48:51] <^demon> We can generate a separate one for labs
[14:52:32] So, hm, I think I don't know how puppet/private is handled in labs at all. Are labs instances ever allowed to access it?
[14:52:54] <^demon> They use a different repo, labs/private, which is in gerrit.
[14:54:19] Oh, that should be straightforward then. We just need to move that reference into the role so it can be switched accordingly...
[14:54:32] I am moderately distracted atm, but am happy to look at that in a bit.
[14:55:48] <^demon> The private key can sit in the same path, iirc. Puppet on labs resolves that to the same repo.
[14:55:53] <^demon> Need to double check though
[14:56:04] <^demon> Only public key will need to change in gerrit.pp
[14:58:12] !log deployment-prep petrb: fixed mounts
[14:58:14] Logged the message, Master
[14:58:22] !log deployment-prep petrb: we have a new bastion :D
[14:58:23] Logged the message, Master
[14:58:44] <^demon> Subsequent runs seem to make the page-bkg.jpg error disappear and it looks sane.
[14:58:48] <^demon> Just some weird error on a first run
[14:58:55] * andrewbogott nods
[15:00:24] <^demon> The service error is a problem with how java's installed.
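One way to "move that reference into the role so it can be switched accordingly", as discussed above, is to select the key source per realm — Wikimedia's puppet distinguishes labs from production via `$::realm`. This is only a sketch; the class name, file paths, and variable names are hypothetical, not the actual contents of gerrit.pp:

```
# Illustrative only: pick the replication key from labs/private on labs
# and from the production private repo otherwise. All names below are
# made up for the example.
class role::gerrit::replicationkey {
    $key_source = $::realm ? {
        'labs'  => 'puppet:///private/gerrit/replication_key_labs',
        default => 'puppet:///private/gerrit/replication_key',
    }

    file { '/var/lib/gerrit2/.ssh/id_rsa':
        ensure => present,
        owner  => 'gerrit2',
        mode   => '0600',
        source => $key_source,
    }
}
```

Note that ^demon's point above is the simpler alternative: because puppet on labs resolves the private path to labs/private, the same reference can work in both realms with no conditional at all.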
[15:00:26] <^demon> "Cannot find a JRE or JDK. Please set JAVA_HOME to a >=1.6 JRE"
[15:00:46] <^demon> Prolly cuz we're using openjdk...I had this same problem.
[15:00:55] <^demon> Oh, how did I fix it...
[15:03:39] !log deployment-prep petrb: updated puppet
[15:03:40] Logged the message, Master
[15:30:21] <^demon> andrewbogott: Argh, it's because java's installed in different places on labs & production.
[15:30:29] <^demon> So the production config is breaking labs
[15:30:52] Hm, is there any good reason not to just normalize the labs install so it's the same as production?
[15:34:45] <^demon> Yeah, something needs fixing either in prod or labs
[15:34:53] <^demon> No need to have different config here.
[15:36:20] Well, maybe this is a way we can parallelize? Want me to work on refactoring java whilst you continue with gerrit?
[15:37:34] <^demon> Yeah, if you'd take a look at java I'll take a look at the ssh key.
[15:42:10] 'k
[15:56:36] ^demon: Did you have to build a new instance because your old one wouldn't come up after the upgrade? Or were you just sprucing up?
[15:57:33] <^demon> Been sprucing up every so often.
[15:57:54] <^demon> Running it repetitively in various states of brokenness ends up in a very confused puppet.
[16:01:05] Mine seems to've perished in the deluge.
[16:02:00] <^demon> I just deleted gerrit-puppet-overhaul and gerrit-puppet-overhaulz
[16:02:06] <^demon> I didn't touch the andrew one
[16:02:25] I deleted it because I couldn't reach it. Nothing was there that isn't in gerrit anyway.
[16:16:18] Hm. ^demon, have you built an instance since yesterday?
[16:16:35] <^demon> One this morning.
[16:16:53] Did /home get mounted correctly?
[16:17:12] <^demon> As far as I know.
[16:17:15] <^demon> Haven't had any problems
[16:17:55] hm
[16:18:45] <^demon> I fixed the apache restart problem. Wrapped the SSL commands in
[16:18:49] <^demon> PS30 ^
[16:19:51] <^demon> Oh, ServerName is still probably wrong :\
[16:25:08] <^demon> And 31 fixes $url.
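The "explicit directive" that later gets removed from gerrit.config is Gerrit's `container.javaHome` setting: when it pins production's JVM path, the labs init script fails with the JAVA_HOME error quoted above. A sketch of the fragment in question — the user and path here are illustrative, not the deployed values:

```
# gerrit.config — the [container] section can pin the JVM location.
# Hard-coding a production path breaks hosts where openjdk lands in a
# different /usr/lib/jvm/ directory; omitting javaHome entirely lets
# /etc/init.d/gerrit discover the JVM on its own, which is the fix
# settled on later in this log.
[container]
    user = gerrit2
    # javaHome = /usr/lib/jvm/java-6-openjdk/jre   <- directive to drop
```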
[16:28:55] And I still don't have an instance to work on… I'm going to vanish briefly while I wait for my latest attempt to start up.
[16:36:37] <^demon> Other than /home missing, I'm not having any problems with the new instance.
[16:36:38] <^demon> Hmm
[16:55:07] <^demon> Whoo, fixed private key :)
[17:01:02] * Damianz pokes petan
[17:01:22] :P
[17:01:35] o.0
[17:02:07] petan: Any chance of getting added to the nagios project to a) fix the command syntax on some checks to stop it spamming and possibly b) somewhat puppetize it
[17:02:18] ok
[17:21:26] <^demon> andrewbogott: Well, I fixed the key issue, but somewhere I introduced a syntax error that I'm just not seeing.
[17:21:51] ^demon: OK, I'll look. btw, does it turn out that $HOME is broken for you, same as it was for me?
[17:23:11] <^demon> On your new instance, yes.
[17:23:16] <^demon> Works on the one I made earlier today though
[17:33:18] <^demon> Time to play spot the syntax error everyone :)
[17:33:24] <^demon> https://gerrit.wikimedia.org/r/#/c/13484/31..34/manifests/gerrit.pp -- where is the mistake?
[17:33:29] You're missing some commas, for starters...
[17:33:40] <^demon> I don't see it.
[17:33:53] lines 25 and 26
[17:33:57] But there's something else as well
[17:34:17] Oh, wait, hang on --
[17:34:23] the commans are missing in the role class, not the base class.
[17:34:26] commas
[17:36:22] Actually, I think that's it… I'm getting past that now.
[17:36:51] <^demon> Oh whoops, missed the role.
[17:36:53] <^demon> in the path
[17:36:58] Dammit, I got all set up yesterday to use a local puppet repo, and now I don't have a homedir anymore
[17:38:07] <^demon> Extra quote as well. PS36 :p
[17:38:32] So where/when is puppet installing java?
[17:39:12] <^demon> In gerrit::jetty we install openjdk-jre
[17:41:02] <^demon> Which is installing it in /usr/lib/jvm/something
[17:41:09] <^demon> A different something from production
[17:43:28] <^demon> Actually, easiest way I think is just removing the explicit directive in gerrit.config.
[17:43:39] <^demon> /etc/init.d/gerrit is smart enough to find it
[17:52:16] andrewbogott: can you help out Steve Slevinski?
[17:52:27] he's having issues with mysql and the data directory
[17:52:54] sure; what channel is he in?
[17:53:03] no clue :)
[17:53:07] right here
[17:53:09] ah
[17:53:10] sweet
[17:53:35] Ryan_Lane: Are you interested in hearing about post-upgrade things that don't work, or do you already have a long list?
[17:53:46] I'd like to know them
[17:53:50] bugs would be great
[17:54:20] adding bugs to bugzilla would be great, I don't actively want bugs :)
[17:54:24] Damianz log changes
[17:54:24] I have an instance (i-00000245) that is in 'shutoff' state since yesterday noon - how can I start it again?
[17:54:36] gwicke: reboot it
[17:54:46] if it doesn't come up I'll take a look at it
[17:54:49] tried that without success
[17:54:52] ok
[17:54:55] instance id?
[17:54:56] ^demon: Does removing that config line work? Does that mean that fixing puppet is temporarily moot?
[17:55:09] Ryan_Lane: i-00000245
[17:55:32] <^demon> andrewbogott: Yeah, I think it'll work
[17:56:12] slevinski: I'm multi-tasking, but… what's the problem? (Or, is there an email somewhere in my inbox that already describes it?)
[17:57:27] mysql was looking in /mnt/mysql for the database, but the database was in /var/lib/mysql
[17:58:34] slevinski: What puppet class are you using, and when did you last try?
[17:59:34] https://labsconsole.wikimedia.org/wiki/Nova_Resource:I-00000322
[18:00:38] I tried to manually fix it. Moved the /var/lib/mysql to /mnt/mysql and tried to fix the security rights
[18:01:02] slevinski: I cannot, by default, see into your instance. What puppet class?
[18:01:21] role::mediawiki-install::labs
[18:01:34] And, you first set it up a month or so ago?
[18:01:41] yes
[18:02:37] slevinski: I believe that I've fixed that particular bug, although that class is still not especially stable. If you aren't invested in that instance you could start a fresh one. Otherwise if you want to add me to your project as an admin I can go in and see about fixing it by hand.
[18:03:11] gwicke: ok. it's up
[18:03:20] Another option is that you could just remove both directories, uninstall mysql, and do a new puppet run. That /might/ fix things.
[18:03:24] Or it might break things worse.
[18:03:25] Ryan_Lane: thanks!
[18:03:38] when I did a reboot of all instances, it was already in the "rebooting" state, so it didn't happen
[18:03:49] The database has some valuable information I'd like to keep if possible.
[18:04:05] ah, that was probably because I had tried to reboot it yesterday after it went down
[18:04:08] <^demon> andrewbogott: I think I got it working beginning to end :D
[18:04:17] I tried to get mysqldump to work, but it couldn't find the data
[18:04:19] <^demon> I'll fire up a fresh instance (and pray)
[18:04:28] ^demon: Right on! Does it actually do something useful, or just not throw errors?
[18:04:49] slevinski: Ah, so the instance was up and working and then subsequently broke?
[18:04:50] <^demon> Just not throw any errors :)
[18:04:56] <^demon> Might still be some issues to iron out
[18:06:16] It was broke. I managed to get it up and running. I tried to sqldump the data but failed. Rebooted the instances and it's broke again.
[18:06:41] Could not chdir to home directory /home/damian: No such file or directory interesting
[18:07:06] Ryan_Lane: https://bugzilla.wikimedia.org/show_bug.cgi?id=39741
[18:07:10] gwicke: ah, yeah. likely
[18:07:26] andrewbogott: ah. yep. knew about this one
[18:07:48] this is because the script for this needs to be modified for the new project DIT in LDAP
[18:07:56] same with gluster
[18:08:09] I'll be taking care of that today
[18:08:45] Ryan_Lane: OK! I figured you were on top of it.
[18:09:35] slevinski: Do you mind if I add myself to your project, temporarily?
[18:09:48] I do not mind.
[18:11:10] <^demon> andrewbogott: Still got one failure, but it's because gerrit is stupid and explodes if you configure one thing wrong :p
[18:11:15] <^demon> I'm gonna fix that upstream.
[18:12:10] ok… slevinski, I have added the puppet var 'mysql_datadir' to your project. If you change the value of that setting to point at the directory where your actual sql data is and rerun puppet, it may fix the problem.
[18:12:22] Want to give that a try? If it fails then I'll log into the instance directly.
[18:12:33] Thanks
[18:13:16] slevinski: Sorry that we broke your system… it was due to an obvious rearrangement in the sql code that had a typo in it. You just happened to catch a puppet update on a bad day.
[18:13:35] (by 'obvious' I mean 'shouldn't have broken anything')
[18:19:49] !log nagios chmod 644 /etc/nagios3/resource.cfg so nagios can read it on reload.
[18:19:52] Logged the message, Master
[18:20:34] slevinski: Any luck?
[18:20:50] !log nagios Copied /etc/nagios3/conf.d to /etc/nagios3/conf.d.backup and sed -i 's/check_nrpe/check_nrpe_1arg/g' /etc/nagios3/conf.d/* to fix nrpe checks, need to check the parser.
[18:20:51] Logged the message, Master
[18:22:23] ^demon: this patchset is making me die of laughter https://gerrit.wikimedia.org/r/#/c/13484/
[18:22:25] :)
[18:22:37] <^demon> Well it's almost done now :)
[18:22:38] err
[18:22:39] the change
[18:22:46] 37 patchsets! :D
[18:22:55] is this the record?
[18:22:57] How many times did you have to rebase that heh
[18:23:01] thats my lucky number!
[18:23:04] <^demon> Not many rebases.
[18:23:07] this must be the record
[18:23:16] 1337 problems!
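The 'mysql_datadir' puppet variable andrewbogott added above is a per-project global that a class can consume with a guarded default. A sketch only — the actual role::mediawiki-install::labs code is not quoted in this log, and the resource names below are made up:

```
# Hypothetical consumption of the per-project mysql_datadir variable:
# fall back to the stock Debian location when a project doesn't set it,
# so instances like slevinski's can point mysqld back at the directory
# that actually holds their data.
$datadir = $::mysql_datadir ? {
    undef   => '/var/lib/mysql',
    default => $::mysql_datadir,
}

file { '/etc/mysql/conf.d/datadir.cnf':
    content => "[mysqld]\ndatadir = ${datadir}\n",
    notify  => Service['mysql'],
}
```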
[18:23:18] <^demon> It would've been less if Ryan had just reviewed it 3 weeks ago when I asked ;-)
[18:23:20] So calling that
[18:23:23] we need a change hall of fame
[18:23:35] ^demon: can you add that to the interface? changes with the most patchsets ever?
[18:23:37] Can I be listed for "most inline comments (75)"? ;)
[18:23:39] :)
[18:23:44] RoanKattouw: hahaha
[18:23:45] indeed
[18:23:50] Not sure how to update the variable. From the manage puppet groups page, when I click the modify link next to the variable, the new special page doesn't have any way to change the value.
[18:23:54] mine would be in the hall of fame in that situation too
[18:23:57] <^demon> I've got about 80 other things to do.
[18:24:02] <^demon> None of which involve making a hall of fame.
[18:24:05] ^demon: :)
[18:24:20] slevinski: https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&project=signwriting&instanceid=4732cf8f-4f1b-463b-990d-6e8793239d07&region=pmtpa
[18:24:40] * Damianz thinks petan needs to lern to love python
[18:24:43] RoanKattouw: hey, thankfully, I'm not on the openstack api, and there won't ever be a change that large to the extension ever again
[18:24:46] *now
[18:25:12] Did you actually ever get it all reviewed or just merge the last bits? :P
[18:25:19] <^demon> Ryan_Lane: Also, if you could put your gerrit package in ops/debs/gerrit, that'd be super-cool ;-)
[18:25:29] I didn't get the minor fixes changes
[18:25:33] err
[18:25:39] I didn't get the minor fixes reviewed
[18:25:45] because I forgot about it
[18:25:46] slevinski: It's a bit confusing, I think you were on the page to modify puppet options rather than to apply specific puppet options to an instance.
[18:26:00] yes, you are correct.
[18:26:07] ^demon: my package is garbage
[18:26:15] <^demon> Mehhhh :\
[18:26:20] ^demon: I can put it there, but you'll likely need to put some work into it
[18:26:23] <^demon> I know that.
[18:26:26] right now it just wraps the war [18:26:33] <^demon> I need to remove gitweb dependency, for one. [18:26:37] <^demon> (gerrit does not require gitweb) [18:27:06] ah ok [18:27:10] <^demon> Also, I want init to run automatically on upgrade. It runs on install, so upgrade makes sense too. [18:27:16] hm [18:27:23] where do I even have the package? [18:27:31] <^demon> Good question :) [18:27:35] * Damianz goes to make food while nagios catches up. [18:28:29] ^demon: So, you are off and running, and I should now find something else to do, right? [18:28:41] <^demon> Yeah, I think I'm good now. [18:28:45] <^demon> Thanks for all your help :) [18:28:55] ^demon: I probably want to rearrange and squash a few more global references, but it'll be safest to do that once things are working and merged... [18:29:04] (Because then it'll be easier to tell what I broke.) [18:29:35] ^demon: I'm always happy to help, although I'm not convinced that I did, exactly, in this case :( [18:29:49] But I learned more puppetese. [18:30:29] * andrewbogott only contributed three or four of those 38 patch versions. Barely counts. [18:31:30] Question; are custom debs that are not in gerrit around as source packages anywhere? (if debs so such things like srpm's). [18:35:52] Damianz: some .deb might still be in subversion [18:36:21] Damianz: https://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/ [18:36:42] Damianz: of course some are now in Gerrit and others are no more used and should be deleted [18:37:15] Ah, awesome [18:38:04] !log nagios parser re-breaks configs, removing -a $ARG2$ from check_nrpe now as nothing gets an arg passed anyway. This should be fixed in a better way so we /can/ use args later. [18:38:05] Logged the message, Master [18:40:39] Hmm, graphite debs aren't in there either. Maybe they are just pulled straight from the bug on lp and stuck on the repo. [18:41:03] * jeremyb waves Ryan_Lane. 
just wondering how the timing is for that user creation/rename [18:45:47] RECOVERY dpkg-check is now: OK on wlmpuppet i-0000035c output: All packages OK [18:45:47] RECOVERY dpkg-check is now: OK on wmde-test i-000002ad output: All packages OK [18:45:47] RECOVERY Disk Space is now: OK on apachemxetc i-00000348 output: DISK OK [18:45:48] RECOVERY Free ram is now: OK on bastion-restricted1 i-0000019b output: OK: 89% free memory [18:45:48] RECOVERY Total Processes is now: OK on bastion1 i-000000ba output: PROCS OK: 116 processes [18:45:51] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 0.00, 0.00, 0.00 [18:45:51] RECOVERY Disk Space is now: OK on bots-3 i-000000e5 output: DISK OK [18:46:01] RECOVERY Free ram is now: OK on demo-mysql1 i-00000256 output: OK: 94% free memory [18:46:01] RECOVERY Disk Space is now: OK on demo-deployment1 i-00000276 output: DISK OK [18:46:02] RECOVERY Current Users is now: OK on deployment-apache33 i-0000031b output: USERS OK - 0 users currently logged in [18:46:02] RECOVERY Total Processes is now: OK on deployment-cache-bits02 i-0000031c output: PROCS OK: 81 processes [18:46:06] RECOVERY Current Users is now: OK on deployment-cache-upload03 i-0000034b output: USERS OK - 0 users currently logged in [18:46:07] RECOVERY Current Users is now: OK on dumps-1 i-00000355 output: USERS OK - 0 users currently logged in [18:46:07] RECOVERY Disk Space is now: OK on embed-sandbox i-000000d1 output: DISK OK [18:46:07] RECOVERY Current Users is now: OK on extrev1 i-00000346 output: USERS OK - 0 users currently logged in [18:46:07] RECOVERY Current Users is now: OK on gerrit i-000000ff output: USERS OK - 0 users currently logged in [18:46:07] RECOVERY Disk Space is now: OK on grail i-000003aa output: DISK OK [18:46:07] RECOVERY Total Processes is now: OK on greensmw1 i-0000032c output: PROCS OK: 91 processes [18:46:12] RECOVERY Current Users is now: OK on hugglewa-1 i-000001e0 output: USERS OK - 0 users currently logged in 
[18:46:12] PROBLEM Current Users is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [18:46:12] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [18:46:12] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 95 processes [18:46:17] RECOVERY Disk Space is now: OK on kripke i-00000268 output: DISK OK [18:46:22] PROBLEM Total Processes is now: CRITICAL on mobile-wlm i-000002bc output: Connection refused by host [18:46:27] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK [18:46:27] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [18:46:27] RECOVERY Disk Space is now: OK on nova-ldap1 i-000000df output: DISK OK [18:46:27] PROBLEM dpkg-check is now: CRITICAL on orgcharts-dev i-0000018f output: DPKG CRITICAL dpkg reports broken packages [18:46:27] PROBLEM Free ram is now: CRITICAL on psm-precise i-000002f2 output: Connection refused by host [18:46:28] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [18:46:28] RECOVERY Disk Space is now: OK on pediapress-ocg2 i-00000234 output: DISK OK [18:46:29] RECOVERY dpkg-check is now: OK on puppet-abogott i-00000389 output: All packages OK [18:46:29] RECOVERY Free ram is now: OK on redis1 i-000002b6 output: OK: 93% free memory [18:46:30] RECOVERY Current Load is now: OK on shop-analytics-main i-000001e6 output: OK - load average: 0.00, 0.00, 0.00 [18:46:30] RECOVERY Total Processes is now: OK on search-test i-000000cb output: PROCS OK: 83 processes [18:46:32] RECOVERY Current Users is now: OK on signwriting-ase10 i-00000322 output: USERS OK - 3 users currently logged in [18:46:37] * Damianz apologies for the spam *whistles* [18:46:37] RECOVERY Current Users is now: OK on su-fe2 i-000002e6 output: USERS OK - 0 users currently logged in [18:46:37] RECOVERY dpkg-check is now: OK on su-be3 i-000002e9 
output: All packages OK [18:46:42] RECOVERY Current Users is now: OK on swift-be4 i-000001ca output: USERS OK - 0 users currently logged in [18:46:42] RECOVERY dpkg-check is now: OK on swift-be2 i-000001c8 output: All packages OK [18:46:47] RECOVERY Current Users is now: OK on translation-memory-3 i-00000358 output: USERS OK - 0 users currently logged in [18:46:47] RECOVERY Disk Space is now: OK on testing-singer-puppetization i-00000331 output: DISK OK [18:46:48] RECOVERY Total Processes is now: OK on tutopuppet i-00000336 output: PROCS OK: 80 processes [18:46:52] RECOVERY Total Processes is now: OK on varnish i-000001ac output: PROCS OK: 82 processes [18:46:57] RECOVERY Current Load is now: OK on ve-nodejs i-00000245 output: OK - load average: 0.05, 0.11, 0.10 [18:46:58] RECOVERY Disk Space is now: OK on ve-parsoid3 i-00000345 output: DISK OK [18:46:58] RECOVERY Current Users is now: OK on wikibits-mysql i-00000341 output: USERS OK - 0 users currently logged in [18:46:58] RECOVERY Free ram is now: OK on wikidata-dev-1 i-0000020c output: OK: 90% free memory [18:47:03] RECOVERY Current Load is now: OK on wikiminiatlas i-0000038c output: OK - load average: 0.00, 0.01, 0.05 [18:47:03] RECOVERY Current Users is now: OK on wikisource-web i-000000fe output: USERS OK - 0 users currently logged in [18:47:03] RECOVERY Current Load is now: OK on wmde-test i-000002ad output: OK - load average: 0.00, 0.00, 0.00 [18:47:03] RECOVERY Disk Space is now: OK on zeromq1 i-000002b7 output: DISK OK [18:47:03] RECOVERY Current Load is now: OK on worker1 i-00000208 output: OK - load average: 0.12, 0.07, 0.06 [18:47:03] RECOVERY dpkg-check is now: OK on bastion1 i-000000ba output: All packages OK [18:47:04] RECOVERY Current Users is now: OK on bots-2 i-0000009c output: USERS OK - 0 users currently logged in [18:47:13] RECOVERY Current Load is now: OK on build-precise1 i-00000273 output: OK - load average: 0.06, 0.04, 0.05 [18:47:13] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af 
output: All packages OK [18:47:13] RECOVERY Current Users is now: OK on build1 i-000002b3 output: USERS OK - 1 users currently logged in [18:47:13] RECOVERY Free ram is now: OK on catsort-pub i-000001cc output: OK: 91% free memory [18:47:13] RECOVERY Disk Space is now: OK on building i-0000014d output: DISK OK [18:47:14] RECOVERY Current Load is now: OK on conventionextension-trial i-000003bf output: OK - load average: 0.00, 0.00, 0.00 [18:47:14] RECOVERY Free ram is now: OK on demo-deployment1 i-00000276 output: OK: 88% free memory [18:47:18] RECOVERY Current Users is now: OK on deployment-apache32 i-0000031a output: USERS OK - 0 users currently logged in [18:47:18] RECOVERY dpkg-check is now: OK on deployment-cache-bits02 i-0000031c output: All packages OK [18:47:18] RECOVERY Free ram is now: OK on deployment-sql i-000000d0 output: OK: 37% free memory [18:47:18] RECOVERY Total Processes is now: OK on deployment-squid i-000000dc output: PROCS OK: 85 processes [18:47:23] RECOVERY Current Load is now: OK on dev-solr i-00000152 output: OK - load average: 0.14, 0.06, 0.02 [18:47:23] RECOVERY Free ram is now: OK on embed-sandbox i-000000d1 output: OK: 93% free memory [18:47:23] RECOVERY Total Processes is now: OK on en-wiki-db-lucid i-0000023b output: PROCS OK: 79 processes [18:47:28] RECOVERY Current Load is now: OK on follow01-dev i-000003c6 output: OK - load average: 0.16, 0.10, 0.07 [18:47:33] RECOVERY Current Load is now: OK on gerrit-puppet-overhaulz i-000003d2 output: OK - load average: 0.48, 0.19, 0.16 [18:47:34] RECOVERY Total Processes is now: OK on hume i-000003cc output: PROCS OK: 78 processes [18:47:38] RECOVERY Total Processes is now: OK on incubator-apache i-00000211 output: PROCS OK: 130 processes [18:47:44] RECOVERY Current Users is now: OK on jesusaurus-cleanup i-0000038a output: USERS OK - 0 users currently logged in [18:47:44] RECOVERY Free ram is now: OK on kripke i-00000268 output: OK: 96% free memory [18:47:44] RECOVERY Total Processes is now: OK 
on labs-nfs1 i-0000005d output: PROCS OK: 102 processes [18:47:49] RECOVERY Current Load is now: OK on mailman-01 i-00000235 output: OK - load average: 0.00, 0.00, 0.01 [18:47:49] RECOVERY Total Processes is now: OK on master i-0000007a output: PROCS OK: 95 processes [18:47:54] PROBLEM dpkg-check is now: CRITICAL on mobile-wlm i-000002bc output: Connection refused by host [18:47:54] RECOVERY Disk Space is now: OK on mobile-sphinx i-00000364 output: DISK OK [18:47:54] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 97% free memory [18:47:54] RECOVERY Current Load is now: OK on nova-precise1 i-00000236 output: OK - load average: 0.37, 0.25, 0.24 [18:47:54] RECOVERY Current Load is now: OK on otrs-jgreen i-0000015a output: OK - load average: 0.00, 0.00, 0.00 [18:47:55] RECOVERY Free ram is now: OK on patchtest i-000000f1 output: OK: 90% free memory [18:47:55] RECOVERY Free ram is now: OK on pediapress-ocg2 i-00000234 output: OK: 91% free memory [18:47:56] RECOVERY Current Load is now: OK on queue-wiki1 i-000002b8 output: OK - load average: 0.00, 0.00, 0.00 [18:47:56] RECOVERY dpkg-check is now: OK on redis1 i-000002b6 output: All packages OK [18:47:58] RECOVERY Current Load is now: OK on su-fe1 i-000002e5 output: OK - load average: 0.02, 0.02, 0.00 [18:47:58] RECOVERY Disk Space is now: OK on su-fe2 i-000002e6 output: DISK OK [18:47:58] RECOVERY Total Processes is now: OK on su-be2 i-000002e8 output: PROCS OK: 120 processes [18:48:03] PROBLEM dpkg-check is now: CRITICAL on sultest1 i-0000032d output: CHECK_NRPE: Error - Could not complete SSL handshake. [18:48:03] PROBLEM dpkg-check is now: CRITICAL on sultest2 i-00000330 output: DPKG CRITICAL dpkg reports broken packages [18:48:03] PROBLEM Total Processes is now: CRITICAL on swift-be1 i-000001c7 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[18:48:08] RECOVERY Disk Space is now: OK on swift-be4 i-000001ca output: DISK OK [18:48:08] RECOVERY Current Load is now: OK on swift-be3 i-000001c9 output: OK - load average: 0.00, 0.02, 0.00 [18:48:08] RECOVERY Current Load is now: OK on test2 i-0000013c output: OK - load average: 0.00, 0.00, 0.00 [18:48:08] RECOVERY Disk Space is now: OK on translation-memory-3 i-00000358 output: DISK OK [18:48:08] RECOVERY Total Processes is now: OK on venus i-000000ea output: PROCS OK: 91 processes [18:48:13] RECOVERY Current Users is now: OK on wikibits-apache i-00000342 output: USERS OK - 0 users currently logged in [18:48:13] RECOVERY Total Processes is now: OK on scribunto i-0000022c output: PROCS OK: 105 processes [18:48:18] RECOVERY Disk Space is now: OK on wikibits-mysql i-00000341 output: DISK OK [18:48:18] RECOVERY Current Users is now: OK on wikiminiatlas i-0000038c output: USERS OK - 0 users currently logged in [18:48:18] PROBLEM dpkg-check is now: CRITICAL on wikidata-dev-2 i-00000259 output: DPKG CRITICAL dpkg reports broken packages [18:48:18] RECOVERY dpkg-check is now: OK on search-test i-000000cb output: All packages OK [18:48:18] RECOVERY Disk Space is now: OK on wikisource-web i-000000fe output: DISK OK [18:48:19] RECOVERY Total Processes is now: OK on wikistream-1 i-0000016e output: PROCS OK: 80 processes [18:48:23] RECOVERY Current Load is now: OK on wlmpuppet i-0000035c output: OK - load average: 0.05, 0.09, 0.08 [18:48:23] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in [18:48:23] RECOVERY Current Users is now: OK on aggregator2 i-000002c0 output: USERS OK - 0 users currently logged in [18:48:23] RECOVERY Free ram is now: OK on robh2 i-000001a2 output: OK: 87% free memory [18:48:23] RECOVERY Total Processes is now: OK on bastion-restricted1 i-0000019b output: PROCS OK: 103 processes [18:48:33] RECOVERY Total Processes is now: OK on bots-sql1 i-000000b5 output: PROCS OK: 80 processes [18:48:44] 
RECOVERY Current Load is now: OK on demo-web2 i-00000285 output: OK - load average: 0.00, 0.00, 0.00 [18:48:49] RECOVERY dpkg-check is now: OK on deployment-squid i-000000dc output: All packages OK [18:48:49] RECOVERY Free ram is now: OK on ee-prototype i-0000013d output: OK: 86% free memory [18:48:49] RECOVERY Total Processes is now: OK on dumps-1 i-00000355 output: PROCS OK: 91 processes [18:48:54] RECOVERY Current Users is now: OK on follow01-dev i-000003c6 output: USERS OK - 0 users currently logged in [18:48:54] RECOVERY Free ram is now: OK on gerrit i-000000ff output: OK: 82% free memory [18:48:54] RECOVERY Current Load is now: OK on hildisvini i-000003ac output: OK - load average: 0.08, 0.03, 0.05 [18:48:59] RECOVERY Free ram is now: OK on hugglewa-1 i-000001e0 output: OK: 91% free memory [18:49:09] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK [18:49:09] RECOVERY Disk Space is now: OK on nginx-ffuqua-doom1-3 i-00000196 output: DISK OK [18:49:09] RECOVERY Current Users is now: OK on nova-precise1 i-00000236 output: USERS OK - 0 users currently logged in [18:49:09] RECOVERY Current Users is now: OK on otrs-jgreen i-0000015a output: USERS OK - 0 users currently logged in [18:49:09] RECOVERY Disk Space is now: OK on pediapress-ocg1 i-00000233 output: DISK OK [18:49:14] RECOVERY Current Users is now: OK on queue-wiki1 i-000002b8 output: USERS OK - 0 users currently logged in [18:49:14] RECOVERY Disk Space is now: OK on robh-spl i-00000369 output: DISK OK [18:49:14] RECOVERY Current Load is now: OK on secondinstance i-0000015b output: OK - load average: 0.00, 0.01, 0.00 [18:49:14] RECOVERY dpkg-check is now: OK on scribunto i-0000022c output: All packages OK [18:49:15] RECOVERY Current Users is now: OK on su-fe1 i-000002e5 output: USERS OK - 0 users currently logged in [18:49:15] RECOVERY dpkg-check is now: OK on su-be2 i-000002e8 output: All packages OK [18:49:15] RECOVERY Current Users is now: OK on swift-be3 i-000001c9 output: 
USERS OK - 0 users currently logged in [18:49:16] RECOVERY Free ram is now: OK on swift-be4 i-000001ca output: OK: 88% free memory [18:49:16] RECOVERY Current Load is now: OK on translation-memory-1 i-0000013a output: OK - load average: 0.00, 0.00, 0.00 [18:49:17] RECOVERY Current Users is now: OK on translation-memory-2 i-000002d9 output: USERS OK - 0 users currently logged in [18:49:19] RECOVERY Current Users is now: OK on tutorial-mysql i-0000028b output: USERS OK - 0 users currently logged in [18:49:20] RECOVERY Free ram is now: OK on ubuntu1-pgehres i-000000fb output: OK: 89% free memory [18:49:20] RECOVERY Disk Space is now: OK on ve-nodejs i-00000245 output: DISK OK [18:49:20] RECOVERY Current Load is now: OK on webserver-lcarr i-00000134 output: OK - load average: 0.00, 0.00, 0.00 [18:49:20] RECOVERY dpkg-check is now: OK on venus i-000000ea output: All packages OK [18:49:32] RECOVERY Free ram is now: OK on build1 i-000002b3 output: OK: 88% free memory [18:49:32] RECOVERY Total Processes is now: OK on catsort-pub i-000001cc output: PROCS OK: 79 processes [18:49:35] Damianz: yeahhhhhh ;-] [18:49:40] RECOVERY Disk Space is now: OK on conventionextension-trial i-000003bf output: DISK OK [18:49:40] RECOVERY Current Load is now: OK on demo-web1 i-00000255 output: OK - load average: 0.00, 0.02, 0.04 [18:49:40] RECOVERY dpkg-check is now: OK on deployment-bastion i-00000390 output: All packages OK [18:49:40] RECOVERY Disk Space is now: OK on deployment-dbdump i-000000d2 output: DISK OK [18:49:40] RECOVERY dpkg-check is now: OK on deployment-cache-upload03 i-0000034b output: All packages OK [18:49:41] RECOVERY Current Users is now: OK on deployment-jobrunner06 i-0000031d output: USERS OK - 0 users currently logged in [18:49:41] RECOVERY Total Processes is now: OK on deployment-sql i-000000d0 output: PROCS OK: 81 processes [18:49:50] RECOVERY Total Processes is now: OK on extrev1 i-00000346 output: PROCS OK: 77 processes [18:49:51] Damianz: I got a few bug reports 
somewhere about that [18:49:55] RECOVERY Current Load is now: OK on firstinstance i-0000013e output: OK - load average: 0.00, 0.00, 0.00 [18:50:00] RECOVERY Disk Space is now: OK on follow01-dev i-000003c6 output: DISK OK [18:50:05] RECOVERY Current Users is now: OK on hildisvini i-000003ac output: USERS OK - 0 users currently logged in [18:50:06] RECOVERY Current Load is now: OK on jawiki-demo i-000003cf output: OK - load average: 0.00, 0.02, 0.05 [18:50:06] PROBLEM dpkg-check is now: CRITICAL on localpuppet2 i-0000029b output: CHECK_NRPE: Error - Could not complete SSL handshake. [18:50:06] PROBLEM Current Users is now: CRITICAL on maps-test2 i-00000253 output: CHECK_NRPE: Error - Could not complete SSL handshake. [18:50:06] PROBLEM dpkg-check is now: CRITICAL on log1 i-00000239 output: DPKG CRITICAL dpkg reports broken packages [18:50:06] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 95% free memory [18:50:06] RECOVERY dpkg-check is now: OK on maps-osmrails i-00000373 output: All packages OK [18:50:20] RECOVERY Current Users is now: OK on nginx-dev1 i-000000f0 output: USERS OK - 0 users currently logged in [18:50:25] RECOVERY Current Load is now: OK on nova-osm-keystone i-00000359 output: OK - load average: 0.01, 0.03, 0.05 [18:50:25] RECOVERY Current Load is now: OK on orgcharts-dev i-0000018f output: OK - load average: 0.00, 0.01, 0.05 [18:50:25] RECOVERY Current Load is now: OK on pdbhandler-1 i-0000030e output: OK - load average: 0.01, 0.08, 0.07 [18:50:25] PROBLEM Total Processes is now: CRITICAL on psm-precise i-000002f2 output: Connection refused by host [18:50:30] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000234 output: All packages OK [18:50:30] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in [18:50:30] RECOVERY Disk Space is now: OK on resourceloader2-apache i-000001d7 output: DISK OK [18:50:30] RECOVERY Total Processes is now: OK on robh2 i-000001a2 output: 
PROCS OK: 88 processes [18:50:35] RECOVERY Current Load is now: OK on search-test i-000000cb output: OK - load average: 0.00, 0.00, 0.00 [18:50:35] RECOVERY Current Users is now: OK on simplewikt i-00000149 output: USERS OK - 0 users currently logged in [18:50:35] RECOVERY Disk Space is now: OK on solr-ci i-00000391 output: DISK OK [18:50:35] RECOVERY Disk Space is now: OK on su-fe1 i-000002e5 output: DISK OK [18:50:35] RECOVERY dpkg-check is now: OK on su-be1 i-000002e7 output: All packages OK [18:50:40] PROBLEM Current Load is now: CRITICAL on sultest1 i-0000032d output: CHECK_NRPE: Error - Could not complete SSL handshake. [18:50:40] RECOVERY Current Users is now: OK on sultest2 i-00000330 output: USERS OK - 0 users currently logged in [18:50:40] RECOVERY Free ram is now: OK on swift-aux1 i-0000024b output: OK: 91% free memory [18:50:40] RECOVERY Total Processes is now: OK on testing-singer-puppetization i-00000331 output: PROCS OK: 80 processes [18:50:45] RECOVERY Current Users is now: OK on translation-memory-1 i-0000013a output: USERS OK - 0 users currently logged in [18:50:45] RECOVERY Current Load is now: OK on tutopuppet i-00000336 output: OK - load average: 0.49, 0.16, 0.09 [18:50:45] RECOVERY Disk Space is now: OK on tutorial-mysql i-0000028b output: DISK OK [18:50:45] RECOVERY Disk Space is now: OK on translation-memory-2 i-000002d9 output: DISK OK [18:50:46] RECOVERY Current Users is now: OK on varnish-precise i-00000311 output: USERS OK - 0 users currently logged in [18:50:46] RECOVERY Current Load is now: OK on vumi i-000001e5 output: OK - load average: 0.01, 0.03, 0.03 [18:50:46] RECOVERY Current Users is now: OK on webserver-lcarr i-00000134 output: USERS OK - 0 users currently logged in [18:50:47] RECOVERY Total Processes is now: OK on ve-parsoid3 i-00000345 output: PROCS OK: 78 processes [18:50:50] RECOVERY Total Processes is now: OK on wikibits-mysql i-00000341 output: PROCS OK: 80 processes [18:50:55] RECOVERY Current Load is now: OK on 
wikidata-dev-2 i-00000259 output: OK - load average: 0.46, 0.31, 0.18 [18:50:56] PROBLEM Current Users is now: CRITICAL on wikidata-dev-3 i-00000225 output: CHECK_NRPE: Error - Could not complete SSL handshake. [18:50:56] RECOVERY Current Load is now: OK on wikiversity-sandbox-frontend i-0000033a output: OK - load average: 0.17, 0.10, 0.07 [18:50:56] RECOVERY dpkg-check is now: OK on wikistats-history-01 i-000002e2 output: All packages OK [18:50:56] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 95% free memory [18:50:56] RECOVERY Total Processes is now: OK on zeromq1 i-000002b7 output: PROCS OK: 78 processes [18:51:32] hashar: Could be interesting - I'm thinking about suggesting having a graphite instance for labs where we can push random stats from each project to it. Could even be interesting pushing nagios perfdata in there for if/when we can support custom checks for people. [18:51:53] Considering only carbon IIRC is packaged in the general repos looking over the source for what's in prod use would be nice [18:54:05] do we need to quiet labs-nagios-wm ? ;-) [18:54:59] jeremyb: naahhh it is finally recovering :-D [18:55:22] jeremyb: It should be ok now, there where like 2-3k of checks due to a bad config [18:55:23] Damianz: that looks better http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?hostgroup=deployment-prep&style=detail [18:55:33] hashar: and will it break again?? 
;) [18:55:36] Damianz: the free ram one still has an issue though [18:55:45] Yeah, looking at that atm - we actually lost the plugin [18:55:59] looks like it is available on some of my instances [18:56:07] deployment-mc reports OK free memory [18:56:10] PROBLEM host: deployment-backup is DOWN address: i-000000f8 CRITICAL - Host Unreachable (i-000000f8) [18:56:10] PROBLEM host: configtest-main is DOWN address: i-000002dd CRITICAL - Host Unreachable (i-000002dd) [18:56:10] PROBLEM host: conventionextension-test is DOWN address: i-000003c0 CRITICAL - Host Unreachable (i-000003c0) [18:56:10] PROBLEM host: deployment-cache-upload is DOWN address: i-00000263 CRITICAL - Host Unreachable (i-00000263) [18:56:10] PROBLEM host: deployment-feed is DOWN address: i-00000118 CRITICAL - Host Unreachable (i-00000118) [18:56:15] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [18:56:32] hmmm [18:56:33] * jeremyb detaches [18:58:00] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [18:58:07] Weird... [19:00:20] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [19:01:30] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [19:03:00] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [19:03:10] PROBLEM Total Processes is now: WARNING on wikistats-01 i-00000042 output: PROCS WARNING: 188 processes [19:09:27] You know what would be cool - having labsconsole changes for the Help namespace relayed in here [19:10:10] PROBLEM host: gerrit-puppet-andrewhauly is DOWN address: i-000003da check_ping: Invalid hostname/address - i-000003da [19:11:23] slevinski: Better? [19:12:01] andrewbogott: You know much about labs puppet or is Ryan a better person? [19:12:26] Damianz: It depends :) [19:12:28] What's up?
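The graphite idea floated above (a labs instance that projects push arbitrary stats into) comes down to carbon's plaintext protocol, which is just `metric value timestamp` lines over TCP port 2003. A minimal sketch; the hostname and metric path are hypothetical, since no such labs instance exists yet:

```shell
#!/bin/sh
# Sketch of pushing a stat to a graphite/carbon listener over the plaintext
# protocol ("metric value timestamp" lines on TCP port 2003). The host
# graphite.wmflabs.org and the metric names are assumptions for illustration.

format_metric() {
    # $1 = metric path, $2 = value, $3 = unix timestamp (defaults to now)
    ts="${3:-$(date +%s)}"
    printf '%s %s %s\n' "$1" "$2" "$ts"
}

send_metric() {
    # Fire one line at a carbon-cache (nc flavour/flags may vary by distro).
    format_metric "$1" "$2" | nc -q1 graphite.wmflabs.org 2003
}
```

Nagios perfdata could be fed through the same one-liner shape, which is what makes the "push perfdata into graphite" idea cheap to try.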
[19:13:41] Damianz: In general Ryan knows more, and I'm more available to actually troubleshoot. [19:14:12] nrpe seems to be handled via puppet on the hosts, but as far as I can see none of them have nagios::monitor or such in ldap as a puppet class. Is that a default for everything or is there something else funky? [19:14:27] Basically I want to know why check_ram.sh (which is in puppet) isn't on some hosts running puppet [19:14:41] Yet nrpe_local.cfg seems to be updated as expected (which is in the same class) [19:14:44] * Damianz confuzzled [19:15:16] Ryan_Lane may be able to answer that question instantly, but in the meantime I will dig in the source. [19:15:38] No huge rush, just making an attempt at making nagios useful heh [19:15:54] Damianz: you know what would be cool? having an IRC feed at all for labsconsole. anywhere. period [19:16:08] Damianz: or pubsubhubbub [19:16:09] jeremyb: That would also be cool. [19:16:13] or XMPP pubsub [19:16:31] * jeremyb runs away [19:16:40] Really I'd love all feeds to go into a huge queue where you can subscribe with history for bots. Like disconnecting from irc for 30min loses shedloads of potentially useful data. [19:17:10] i chatted with krinkle about this at wikimania [19:17:26] we're building something like this for event tracking for e3 experiments (zeromq + redis) [19:17:50] but you'd be surprised how hard it is to come up with a better solution than irc [19:17:58] ZeroMQ is awesome, not totally sure it's useful for this case hmm. Unless you know what endpoints are subscribing I guess. [19:18:00] PROBLEM Free ram is now: CRITICAL on deployment-apache32 i-0000031a output: Connection refused by host [19:18:18] Not working yet. I really messed things up trying to fix it this morning. I'm going to try to put it back the way that it was this morning and retry your idea.
[19:18:57] ZeroMQ could be used in a lot of places where udp is now, for things (like payments) that require a slightly higher guarantee that the logs get to the destination. [19:19:10] slevinski: Ok! I hope I am not making things worse :) [19:19:18] Damianz: wikitech.wikimedia.org/view/Event_tracking [19:20:10] Now that looks useful [19:21:00] PROBLEM Disk Space is now: CRITICAL on deployment-apache32 i-0000031a output: Connection refused by host [19:23:00] Damianz: i hope so! feedback very welcome -- i'm ori@wikimedia.org, in case you don't have a wikitech account [19:23:00] PROBLEM Free ram is now: UNKNOWN on deployment-apache32 i-0000031a output: NRPE: Unable to read output [19:23:05] Damianz: Sorry, reading the backscroll I missed one of your questions. Yes, there is a default for everything. I don't know where it comes from though. [19:23:19] And, you're seeing per-instance differences in that default, which surprises me. [19:24:07] Well some are new instances, others are from back when we ran test - I'd suspect only new ones are missing it. [19:24:21] ori-l: I don't but I'll have a read and let you know if I have any thoughts :) [19:25:29] deployment-apache32 - 10 July 2012 17:35:53 - Missing the plugin [19:25:29] deployment-mc - 23 April 2012 14:02:43 - Has the plugin [19:25:44] For example - I'm just curious as to why it's not getting re-added when nrpe_local gets fixed if you mangle it.
[19:26:00] RECOVERY Disk Space is now: OK on deployment-apache32 i-0000031a output: DISK OK [19:26:10] PROBLEM host: configtest-main is DOWN address: i-000002dd CRITICAL - Host Unreachable (i-000002dd) [19:26:10] PROBLEM host: deployment-cache-upload is DOWN address: i-00000263 CRITICAL - Host Unreachable (i-00000263) [19:26:11] PROBLEM host: conventionextension-test is DOWN address: i-000003c0 CRITICAL - Host Unreachable (i-000003c0) [19:26:20] PROBLEM host: deployment-backup is DOWN address: i-000000f8 CRITICAL - Host Unreachable (i-000000f8) [19:26:20] PROBLEM host: deployment-feed is DOWN address: i-00000118 CRITICAL - Host Unreachable (i-00000118) [19:26:20] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [19:27:20] PROBLEM Puppet freshness is now: CRITICAL on dumps-1 i-00000355 output: (Return code of 127 is out of bounds - plugin may be missing) [19:27:28] Any chance that puppet isn't running at all on the newer instance? [19:28:00] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [19:28:02] I've just force-run it on one after I changed nrpe_local; it fixed nrpe_local.cfg but didn't add the plugin, so I don't think it's that. [19:28:38] Do we have a pastebin around? [19:29:27] http://etherpad.wikimedia.org/9RFl8QljDZ < [19:29:31] Pastebin enough [19:30:16] (the allowed_hosts change was me to check it was actually applying it as I thought) [19:30:30] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [19:32:30] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [19:33:00] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [19:33:43] Ahhh [19:34:12] what happens when you service nagios-nrpe-server stop? [19:34:20] That might be cutting short the party somehow.
[19:35:43] It just exits cleanly, I wonder if it's just including like nrpe::packages and not nagios::monitor - the former just sets up nrpe with some base plugins, the latter adds the free_ram plugin and does some collectfiles stuff for prod from the look of it. [19:36:44] I don't see nagios::monitor getting included anywhere at all. [19:37:53] That would explain why new hosts are broken then [19:38:17] Not sure why https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/nagios.pp;h=0ae31287a726a1a7266ba4d8338eb194c2093bd8;hb=refs/heads/production#l474 that stuff wouldn't be with https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/nrpe.pp#l73 that stuff [19:38:31] Probably can just move the plugins over and it should work for prod and labs I think [19:38:50] Since nagios::monitor requires nrpe [19:40:10] PROBLEM host: gerrit-puppet-andrewhauly is DOWN address: i-000003da check_ping: Invalid hostname/address - i-000003da [19:41:31] puppetd -tv --noop --debug | grep nrpe shows nrpe being applied but grepping nagios doesn't show ::monitor. Who knew --debug was actually useful [19:43:19] Ah... [19:43:30] Labs stuff is in /usr/lib/nagios/plugins/, prod stuff is in /usr/local/nagios/libexec/ [19:43:34] meh [19:51:37] Thanks for your help andrewbogott, I'll poke puppet to get the file included on labs :) [19:51:56] um… ok.
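The fix sketched above — shipping the plugin from the nrpe class that every host already includes, instead of from nagios::monitor, which nothing on labs includes — would look roughly like this in puppet. The class name, source path, and mode here are illustrative assumptions, not the real operations/puppet values:

```puppet
# Hypothetical sketch: manage check_ram.sh alongside nrpe_local.cfg so that
# any host getting nrpe (labs instances included) also gets the plugin,
# rather than relying on nagios::monitor, which labs never includes.
class nrpe::plugins {
    file { '/usr/lib/nagios/plugins/check_ram.sh':
        owner  => 'root',
        group  => 'root',
        mode   => '0755',
        source => 'puppet:///files/nagios/check_ram.sh',
    }
}
```

Since nagios::monitor already requires nrpe, moving the file resource this direction keeps prod working while fixing the labs path.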
[19:52:13] * andrewbogott sure is getting thanked for not helping a lot today [19:52:20] PROBLEM Puppet freshness is now: CRITICAL on mediawiki-dev-1 i-0000039c output: (Return code of 127 is out of bounds - plugin may be missing) [19:55:09] Hey, ignoring my ranting is sometimes helpful :P [19:56:04] hmm [19:56:13] let's see if Nagios still understands the puppet freshness stuff [19:56:20] PROBLEM host: configtest-main is DOWN address: i-000002dd CRITICAL - Host Unreachable (i-000002dd) [19:56:20] PROBLEM host: conventionextension-test is DOWN address: i-000003c0 CRITICAL - Host Unreachable (i-000003c0) [19:56:20] PROBLEM host: deployment-cache-upload is DOWN address: i-00000263 CRITICAL - Host Unreachable (i-00000263) [19:56:20] PROBLEM host: deployment-feed is DOWN address: i-00000118 CRITICAL - Host Unreachable (i-00000118) [19:56:20] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [19:56:21] PROBLEM host: deployment-backup is DOWN address: i-000000f8 CRITICAL - Host Unreachable (i-000000f8) [19:57:26] Damianz: I am not sure how the puppet freshness works (apparently it's a passive check that waits for some SNMP trap), that is still broken :) [19:57:37] but at least a lot of stuff is working again so thanks for that! [19:57:49] oh [19:58:00] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [19:58:11] and if you have full access to the nagios on labs, you might want to try out updating to icinga, the community fork [19:58:12] I think it's a plugin that does sudo and checks the mtime somewhere, petan wrote a plugin. [19:58:22] Personally I like the snmp version more [19:58:41] oh [19:59:19] Hopefully I'll get some time to puppetize this up and improve it generally. There's some icinga stuff in puppet, not sure if we're heading that direction atm.
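The mtime-style freshness plugin mentioned above could be as simple as comparing the age of puppet's state file against a threshold. A hedged sketch — the state file path and threshold are assumptions, not whatever petan's actual plugin uses:

```shell
#!/bin/sh
# Hedged sketch of an active, mtime-based puppet freshness check, as an
# alternative to the SNMP-trap passive check. The state file path and the
# threshold are illustrative assumptions.

check_puppet_freshness() {
    statefile="$1"   # e.g. /var/lib/puppet/state/state.yaml (assumed)
    threshold="$2"   # max age in seconds before going CRITICAL

    if [ ! -f "$statefile" ]; then
        echo "UNKNOWN: $statefile missing - has puppet ever run?"
        return 3
    fi

    now=$(date +%s)
    mtime=$(stat -c %Y "$statefile")   # GNU stat; use -f %m on BSD
    age=$((now - mtime))

    if [ "$age" -gt "$threshold" ]; then
        echo "CRITICAL: puppet last ran ${age}s ago (threshold ${threshold}s)"
        return 2
    fi
    echo "OK: puppet last ran ${age}s ago"
    return 0
}
```

Dropped into /usr/lib/nagios/plugins/ and wired into nrpe_local.cfg (with sudo if the state file isn't world-readable), this is the "plugin that does sudo and checks the mtime" shape described in the chat.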
[19:59:53] I am pretty sure icinga got set up by LeslieCarr [20:00:06] at least I hinted to her about using Icinga instead of Nagios [20:00:30] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [20:01:42] I'd rather improve the config building stuff so we can include custom checks tbh but yeah it would be interesting. [20:03:00] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [20:03:10] PROBLEM Total Processes is now: CRITICAL on wikistats-01 i-00000042 output: PROCS CRITICAL: 281 processes [20:03:30] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [20:07:39] https://gerrit.wikimedia.org/r/#/c/21822/ I think that fixes it [20:10:10] PROBLEM host: gerrit-puppet-andrewhauly is DOWN address: i-000003da check_ping: Invalid hostname/address - i-000003da [20:11:44] andrewbogott: (if you're bored) do you think that's more merge friendly with an if($::realm == "labs") around it? I don't see why it existing in prod is too bad considering it's in the nrpe file anyway. Not really anything to do with you but heh you're around :P [20:12:20] PROBLEM Puppet freshness is now: CRITICAL on dumps-incr i-0000035d output: (Return code of 127 is out of bounds - plugin may be missing) [20:12:20] PROBLEM Puppet freshness is now: CRITICAL on translation-memory-3 i-00000358 output: (Return code of 127 is out of bounds - plugin may be missing) [20:12:32] Damianz: I'm not sure. Is that component getting installed in prod via a different route? [20:12:51] Or do you think that production regressed in exactly the same way? [20:13:33] nagios::monitor installs it in /usr/local/nagios/libexec/ - not sure if that's for the server rather than nrpe though. [20:14:09] OK. I think this is out of my depth now. Probably best to mark it labs only for the moment… or find someone in the operations channel who knows what's going on in production.
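The realm guard Damianz asks about above would read something like the fragment below. A hedged sketch only: whether production should get the file at all is exactly what is unresolved in this exchange, and the resource details are assumptions.

```puppet
# Sketch: confine the plugin to labs, so production (where nagios::monitor
# installs its copy under /usr/local/nagios/libexec/) is left untouched.
if $::realm == 'labs' {
    file { '/usr/lib/nagios/plugins/check_ram.sh':
        mode   => '0555',
        source => 'puppet:///files/nagios/check_ram.sh',
    }
}
```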
[20:15:09] Hosts I can see in site.pp don't have that check in nagios - I'll poke ops in a while [20:18:31] Hmm, how do you make an inline comment not draft anymore o.0 [20:21:21] Ah, I have to review the change [20:22:49] Yeah [20:23:10] PROBLEM Total Processes is now: WARNING on wikistats-01 i-00000042 output: PROCS WARNING: 188 processes [20:26:20] PROBLEM host: configtest-main is DOWN address: i-000002dd CRITICAL - Host Unreachable (i-000002dd) [20:26:20] PROBLEM host: conventionextension-test is DOWN address: i-000003c0 CRITICAL - Host Unreachable (i-000003c0) [20:26:20] PROBLEM host: deployment-cache-upload is DOWN address: i-00000263 CRITICAL - Host Unreachable (i-00000263) [20:26:20] PROBLEM host: deployment-feed is DOWN address: i-00000118 CRITICAL - Host Unreachable (i-00000118) [20:26:20] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [20:26:21] PROBLEM host: deployment-backup is DOWN address: i-000000f8 CRITICAL - Host Unreachable (i-000000f8) [20:28:00] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [20:28:12] Gitweb is blummin slow [20:29:22] <^demon> Gitweb sucks. [20:30:10] PROBLEM Free ram is now: CRITICAL on nagios-main i-0000030d output: NRPE: Unable to read output [20:30:30] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [20:31:10] PROBLEM Free ram is now: CRITICAL on deployment-cache-bits02 i-0000031c output: NRPE: Unable to read output [20:31:10] PROBLEM Free ram is now: CRITICAL on deployment-bastion i-00000390 output: NRPE: Unable to read output [20:31:57] Don't know about sucks but hell 12 seconds to do a wget of a file [20:32:36] Ryan_Lane: Any reason most of the instances want to do a fsck on the next boot?
Assume you soft restarted them rather than forcefully [20:33:00] PROBLEM Free ram is now: CRITICAL on deployment-apache33 i-0000031b output: NRPE: Unable to read output [20:33:00] PROBLEM Free ram is now: CRITICAL on deployment-apache32 i-0000031a output: NRPE: Unable to read output [20:33:00] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [20:33:06] it's not simple to do a soft shutdown [20:33:14] <^demon> Damianz: It sucks. Gitweb's authors don't know what caching is. [20:33:20] Ah =/ [20:34:30] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [20:35:10] PROBLEM Free ram is now: CRITICAL on deployment-cache-upload04 i-00000357 output: NRPE: Unable to read output [20:35:10] PROBLEM Free ram is now: CRITICAL on deployment-cache-upload03 i-0000034b output: NRPE: Unable to read output [20:35:20] PROBLEM Free ram is now: CRITICAL on bots-cb-dev i-000003d1 output: NRPE: Unable to read output [20:35:20] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner06 i-0000031d output: NRPE: Unable to read output [20:36:10] PROBLEM Free ram is now: CRITICAL on deployment-integration i-0000034a output: NRPE: Unable to read output [20:40:10] RECOVERY Free ram is now: OK on deployment-cache-upload04 i-00000357 output: 204356 [20:40:10] PROBLEM host: gerrit-puppet-andrewhauly is DOWN address: i-000003da check_ping: Invalid hostname/address - i-000003da [20:40:10] RECOVERY Free ram is now: OK on deployment-cache-upload03 i-0000034b output: 333584 [20:40:10] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [20:40:20] RECOVERY Free ram is now: OK on bots-cb-dev i-000003d1 output: 177224 [20:41:10] RECOVERY Free ram is now: OK on deployment-cache-bits02 i-0000031c output: 291460 [20:41:10] RECOVERY Free ram is now: OK on deployment-bastion i-00000390 output: 279940 [20:41:21] I don't know if this is a good time, but i just went on the Labs site and i get "No Nova credentials
found for your account." User: Jasonspriggs [20:41:32] Yeah this'll be because of the upgrade [20:41:41] That error message appears sometimes, the workaround is logging out and back in [20:41:43] kk; just wondering :) [20:42:06] Errr... that should be fixed I believe, but you'll need to login again at least once [20:42:07] But a side-effect of the recent upgrade is that everyone is now getting this until they log out and back in [20:42:14] Oh, was it fixed in the upgrade? [20:42:19] Should be [20:42:27] I think Ryan_Lane restarted memcache to clear everyone's login [20:42:42] all is well now; thanks :) [20:42:46] hm [20:42:52] I thought I forced re-auth [20:43:00] RECOVERY Free ram is now: OK on deployment-apache33 i-0000031b output: 676920 [20:43:06] it's weird that it logged you in at all [20:43:28] RoanKattouw: are long-lived tokens stored in the database? [20:43:54] All the instances I can ssh to are fixed free ram wise now [20:44:02] (included beta, hashar) [20:44:05] Damianz: how are you fixing them? [20:44:07] s/ed/ing/ [20:44:13] Ryan_Lane: sudo wget -O /usr/lib/nagios/plugins/check_ram.sh "https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob_plain;f=files/nagios/check_ram.sh;hb=HEAD"; sudo chmod 0555 /usr/lib/nagios/plugins/check_ram.sh [20:44:16] won't puppet just set that back? [20:44:20] PROBLEM Puppet freshness is now: CRITICAL on hildisvini i-000003ac output: (Return code of 127 is out of bounds - plugin may be missing) [20:44:21] Apparently not [20:44:30] that should get installed via puppet, not wget [20:44:40] yeah, well I have a puppet change in for review :P [20:44:46] link?
[20:44:57] https://gerrit.wikimedia.org/r/#/c/21822/ [20:45:15] See my comment (or scrollback a while ago) for the hmm about labs vs prod [20:45:20] RECOVERY Free ram is now: OK on deployment-jobrunner06 i-0000031d output: 249248 [20:46:10] RECOVERY Free ram is now: OK on deployment-integration i-0000034a output: 290496 [20:47:40] Quick question as I did have an idea for a labs project; has anyone here heard of OpenSim or Second Life? [20:48:07] heard of, never used [20:48:33] yeah. what about them? [20:50:20] kk; cause i remember overhearing during wikimania that online meetings would be useful for like mini-hackathons and things such as that; idk tho if that would be a good idea to try to see if labs would be a good platform to host something like opensim for that purpose; idk if people would be interested though in the project [20:56:20] PROBLEM host: configtest-main is DOWN address: i-000002dd CRITICAL - Host Unreachable (i-000002dd) [20:56:20] PROBLEM host: conventionextension-test is DOWN address: i-000003c0 CRITICAL - Host Unreachable (i-000003c0) [20:56:20] PROBLEM host: deployment-cache-upload is DOWN address: i-00000263 CRITICAL - Host Unreachable (i-00000263) [20:56:20] PROBLEM host: deployment-feed is DOWN address: i-00000118 CRITICAL - Host Unreachable (i-00000118) [20:56:20] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [20:56:21] PROBLEM host: deployment-backup is DOWN address: i-000000f8 CRITICAL - Host Unreachable (i-000000f8) [20:58:10] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [20:58:39] !log nagios re-setup snmpd/snmptt for puppet freshness checks.
Used config from puppet, plugin in /usr/lib/nagios/plugins/eventhandlers [20:58:41] Logged the message, Master [20:59:20] PROBLEM Puppet freshness is now: CRITICAL on gerrit-db i-0000038b output: (Return code of 127 is out of bounds - plugin may be missing) [20:59:20] PROBLEM Puppet freshness is now: CRITICAL on wikiminiatlas i-0000038c output: (Return code of 127 is out of bounds - plugin may be missing) [20:59:57] warning: Could not load fact file /var/lib/puppet/lib/facter/default_interface.rb: ./default_interface.rb:43: syntax error, unexpected kELSE, expecting kEND < puppet's currently throwing that on some servers [21:00:30] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [21:01:20] PROBLEM Puppet freshness is now: CRITICAL on puppet-abogott i-00000389 output: (Return code of 127 is out of bounds - plugin may be missing) [21:03:00] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [21:03:10] PROBLEM Total Processes is now: CRITICAL on wikistats-01 i-00000042 output: PROCS CRITICAL: 281 processes [21:05:20] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [21:08:42] Ryan_Lane: Sorry for the spam I probably just sent you, but if you could look at 21831/21822 when free (heh) it would be awesome. [21:08:51] no worries [21:08:57] I'm sending a bitchy email to the openstack list [21:09:11] Heh [21:09:20] Migration pains related? 
[21:09:36] yes [21:10:16] PROBLEM host: gerrit-puppet-andrewhauly is DOWN address: i-000003da check_ping: Invalid hostname/address - i-000003da [21:14:56] PROBLEM host: signwriting-ase11 is DOWN address: i-000003dc check_ping: Invalid hostname/address - i-000003dc [21:15:30] Change on 12mediawiki a page OAuth/status was modified, changed by Jdforrester (WMF) link https://www.mediawiki.org/w/index.php?diff=576990 edit summary: +2012-08-monthly [21:18:03] not sure if I should be doing this right now, but I created a new instance. Having trouble connecting with ssh: Could not resolve hostname i-000003dc.pmtpa.wmflabs: Name or service not known [21:24:16] PROBLEM Puppet freshness is now: CRITICAL on apachemxetc i-00000348 output: (Return code of 127 is out of bounds - plugin may be missing) [21:24:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-cache-upload03 i-0000034b output: (Return code of 127 is out of bounds - plugin may be missing) [21:24:16] PROBLEM Puppet freshness is now: CRITICAL on deployment-integration i-0000034a output: (Return code of 127 is out of bounds - plugin may be missing) [21:24:16] PROBLEM Puppet freshness is now: CRITICAL on extrev1 i-00000346 output: (Return code of 127 is out of bounds - plugin may be missing) [21:24:16] PROBLEM Puppet freshness is now: CRITICAL on rocsteady-cleanup i-00000349 output: (Return code of 127 is out of bounds - plugin may be missing) [21:24:58] slevinski: Look at the console log - has it finished building? [21:25:09] Usually takes a couple of min, hopefully there should be 0 errors now [21:26:01] Ewww, need to turn my auto add signature off on this thunderbird install. [21:26:15] * Damianz stabs vcard off that message [21:26:26] that email was kind of long [21:26:54] I could have made a link, but people don't click links. Anyway I heard it's the size that matters [21:27:52] * Damianz wonders if Ryan_Lane was talking about his email or his email... wait that doesn't make sense, but it does. 
Either way [21:28:06] PROBLEM host: analytics is DOWN address: i-000000e2 CRITICAL - Host Unreachable (i-000000e2) [21:28:06] PROBLEM host: configtest-main is DOWN address: i-000002dd CRITICAL - Host Unreachable (i-000002dd) [21:28:06] PROBLEM host: deployment-cache-upload is DOWN address: i-00000263 CRITICAL - Host Unreachable (i-00000263) [21:28:06] PROBLEM host: deployment-feed is DOWN address: i-00000118 CRITICAL - Host Unreachable (i-00000118) [21:28:06] PROBLEM host: deployment-backup is DOWN address: i-000000f8 CRITICAL - Host Unreachable (i-000000f8) [21:28:07] PROBLEM host: conventionextension-test is DOWN address: i-000003c0 CRITICAL - Host Unreachable (i-000003c0) [21:28:52] drdee_: meeting? [21:29:04] Damianz: my email to openstack [21:29:12] Ryan_Lane; yes yes [21:29:15] Ah, I really should subscribe to that list [21:29:25] You'll probably feel the same about my email to labs-l though :P [21:29:29] Ryan_Lane: https://plus.google.com/hangouts/_/0730e1adef864fc028ebcbf9005f48c4c6c9a59d [21:29:48] sec [21:30:16] PROBLEM host: wep is DOWN address: i-000000c2 CRITICAL - Host Unreachable (i-000000c2) [21:31:06] PROBLEM host: p-b is DOWN address: i-000000ae CRITICAL - Host Unreachable (i-000000ae) [21:33:06] PROBLEM host: mobile-feeds is DOWN address: i-000000c1 CRITICAL - Host Unreachable (i-000000c1) [21:34:15] * jeremyb sees ryan's mail [21:34:53] Ryan_Lane: not *too* long [21:35:00] * jeremyb runs away [21:35:26] PROBLEM host: pageviews is DOWN address: i-000000b2 CRITICAL - Host Unreachable (i-000000b2) [21:36:23] Emails with "plea" in the subject are always fun [21:38:16] PROBLEM Puppet freshness is now: CRITICAL on aggregator-test1 i-000002bf output: (Return code of 127 is out of bounds - plugin may be missing) [21:38:16] PROBLEM Puppet freshness is now: CRITICAL on aggregator1 i-0000010c output: (Return code of 127 is out of bounds - plugin may be missing) [21:38:16] PROBLEM Puppet freshness is now: CRITICAL on aggregator2 i-000002c0 output: (Return
code of 127 is out of bounds - plugin may be missing) [21:39:16] PROBLEM Puppet freshness is now: CRITICAL on asher1 i-0000003a output: (Return code of 127 is out of bounds - plugin may be missing) [21:39:16] PROBLEM Puppet freshness is now: CRITICAL on bastion-restricted1 i-0000019b output: (Return code of 127 is out of bounds - plugin may be missing) [21:39:16] PROBLEM Puppet freshness is now: CRITICAL on bastion1 i-000000ba output: (Return code of 127 is out of bounds - plugin may be missing) [21:39:16] PROBLEM Puppet freshness is now: CRITICAL on bob i-0000012d output: (Return code of 127 is out of bounds - plugin may be missing) [21:39:16] PROBLEM Puppet freshness is now: CRITICAL on bots-1 i-000000a9 output: (Return code of 127 is out of bounds - plugin may be missing) [21:41:11] Instance i-000003dc status active. Console output: puppet-agent[1072]: Finished catalog run in 627.38 seconds. Webserver available: http://ase.wikipedia.wmflabs.org/index.html . ssh: Could not resolve hostname i-000003dc.pmtpa.wmflabs [21:42:07] indeed [21:42:25] ssh signwriting-ase11 should work [21:42:49] Not so sure why the FQDN has no A record.... might be a ryan thing [21:43:00] Availability Zone Unimplemented [21:43:01] Region Unimplemented [21:43:03] really? [21:43:17] Success. Thanks. [21:43:18] I assume that's just an OSM not yet written code thing [21:43:26] * Damianz hopes [21:44:04] If the wiki's right it has no security group either... so you might have issues. It could have the default as needed and I just can't see it on the resource page though heh [21:46:01] eh?
that has an A record [21:46:15] that said, wmflabs isn't a top level domain [21:46:26] Hmm [21:46:29] you can only resolve it via our resolvers [21:46:34] or if you are using a socks proxy [21:46:42] /etc/resolv.conf has 10.4.0.1 [21:46:45] yes [21:46:50] damian@bastion1:~$ host i-000003dc.pmtpa.wmflabs [21:46:50] Host i-000003dc.pmtpa.wmflabs not found: 3(NXDOMAIN) [21:46:51] which is the network node [21:47:10] damian@bastion1:~$ dig +short i-000003dc.pmtpa.wmflabs @10.4.0.1 [21:47:10] damian@bastion1:~$ [21:47:16] Maybe I'm just being dumb [21:48:45] hm [21:48:47] that should work [21:49:04] does that instance actually exist? [21:49:11] according to the wiki [21:49:26] It's possible he tried sshing before it hit ldap and got a negative result in pdns's packetcache [21:49:29] Unlikely I'd suggest [21:49:37] hm [21:50:25] ah [21:50:31] the A record is indeed missing [21:50:32] oh [21:50:33] crap [21:50:35] heh [21:50:43] lol [21:50:52] I forgot I turned job running off [21:51:03] that gets populated by a job [21:51:06] Wait.. that's a background job!? [21:51:16] it has to be [21:51:20] I really should add `ldapsearch -D $(grep binddn /etc/ldap.conf | awk '{print $2}') -w $(grep bindpw /etc/ldap.conf | awk '{print $2}')` to my bash file as an alias [21:51:31] Gets added after the instance dn is added to the ou? [21:51:42] yes [21:51:51] because networking occurs after creation [21:52:07] Makes sense... openstack really needs a lovely hook system with multiple stages available [21:52:07] so rather than block, it sticks it into a job [21:52:19] the job re-inserts itself if the instance hasn't networked yet [21:52:23] Damianz: we have that [21:52:30] we also have dns support in ldap [21:52:39] I haven't switched to it yet [21:52:48] Ah [21:53:02] Yeah I recall you mentioning that [21:53:20] ok.
fixed [21:53:25] I also turned the jobs back on [21:53:31] I need to purge dns cache for that instance, though [21:53:33] I'm still not 100% sure I understand openstack services/modules/plugins w/e fully as half the 'services' seem to run on top of, not as part of, the core and other stuff's just confusing documentation wise. [21:53:53] heh [21:53:56] yeah, it can be confusing [21:54:40] Like hey this is written in java... but is a service for something in python... with no bindings... so yeah let's just nod [21:58:15] Damianz, your email said that one of my instances (gerrit-puppet-andrewhauly) is down but it isn't. Not a problem for me, but maybe that means nagios is lying to you? [21:59:03] Hmm nagios says it's been up 2m [21:59:22] Possibly it just hadn't re-checked it due to the huge backlog, apologies if that's the case. [21:59:37] The others are down though :) [22:00:24] Not a problem, just thought it might be a useful data point. [22:02:22] Yay for Dzahn merging the snmp change, should be down to 1 broken check and about 10 down hosts after puppet runs on everything (well apart from puppetmaster::self hosts). [22:04:27] Aug 28 21:51:16 i-0000030d nagios3: HOST ALERT: gerrit-puppet-andrewhauly;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.63 ms < weird but I'll not fret as there's bigger stuff to sort out for the long term. [22:06:24] Hmm [22:06:50] Aug 28 17:33:45 i-0000030d nagios3: HOST ALERT: gerrit-puppet-andrewhauly;DOWN;SOFT;4;check_ping: Invalid hostname/address - i-000003da < I wonder if that's to do with Ryan_Lane stopping the dns runner stuff, not sure how old the instance is though. [22:07:02] Would explain the lack of dns and therefore a 'help I'm down' [22:07:41] I can pretty much only access things by IP lately; figured that's part of Ryan switching to a new DNS system. [22:07:56] Damianz: the first one seems normal?
alerting you that the downtime is over [22:08:11] mutante: Yeah, but if andrewbogott didn't restart it then it shouldn't have been down [22:08:23] I only created it this morning. [22:08:29] We could probably check using the ip and add a "dns check" if needed. Though hopefully dns should be stable long term [22:08:37] So maybe it took ~8 hours to notice. [22:09:05] was the Nagios service really running ? or did puppet just start it up [22:09:18] They've been down a while tbf, I'd probably put it down to the same issue slevinski was having due to the runner being stopped. Not an expert on the Ryanness of how that works though heh [22:09:51] Nagios has been running, it's not managed via puppet atm. Hence me running around fixing it with a plan to make puppet do it. Been semi-broken since the instances got corrupted though. [22:11:18] Damianz: "Use a FQDN in host_name and the IP in address." is recommended [22:11:49] people keep wondering if they should rely on DNS in Nagios config, but yeah http://lusislog.blogspot.com/2008/04/hostname-vs-ip-address-for-host.html [22:12:12] and thanks for trying to puppetize it, cool [22:13:29] well, and that part just applies if you really want to insist on IP addresses [22:13:36] Yeah - the generation is a bit 'dodgy' atm, would like to do that and make it possible for people to add custom checks to their project without doing the horrid puppet nagios config stuff like prod. [22:18:00] if you try to apply the prod. nagios classes to a labs instance you might notice you are getting duplicate service definitions [22:18:22] if you find a way to prevent any duplicates that would be great [22:18:35] it does not really hurt, as Nagios will just skip them with a warning..but still [22:18:44] it bloats the auto-generated configs [22:19:17] that said i would not totally try to avoid the puppet nagios resources [22:20:44] I sorta like the puppet handled monitoring, it ensures stuff is monitored. 
A few people were against that due to the mess of having to tidy them up, hassle updating them etc. Currently stuff is just pulled from the wiki and there's 1 config for everything. [22:21:44] I'd rather like to scrap the wiki, pull the hosts straight from ldap using the proxyagent login. From there for custom checks either a) in ldap b) on wiki somewhere c) custom interface or d) puppet. Really needs discussion on goals for labs monitoring, which will become more important when we roll the 'production' side out for bots/tools etc. [22:22:12] It does the job for basic stuff now, but can be improved after the base setup is all nicely in puppet and replicable/testable. [22:25:51] !log nagios Changed snmp host for trap in base::puppet - that should fix Puppet freshness checks on anything not running puppetmaster::self [22:25:53] Logged the message, Master [22:26:02] Damianz: a pro for using puppet is that Labs Nagios could actually be used to test improvements that can later be used in the actual production nagios, which does also need improvement [22:26:18] !log nagios Pending change to fix the Free ram checks (21822) [22:26:19] Logged the message, Master [22:26:34] Damianz: but i figured these _might_ be 2 separate things, one actually monitoring labs and one being a testbed for prod. nagios.. [22:27:29] mutante: Agreed, but it also depends somewhat on the setup of puppet which labs is not quite production like. Also some bits of prod are running different paths etc. Due to the setup of labs I don't think they will ever be 100% compatible but yes we should strive to improve and test prod. [22:29:20] one problem i see is that it is very easy to break a complete Nagios by making a minor mistake when adding a new check or host manually [22:29:51] like you introduce a new servicegroup or hostgroup, but forget to define it.. and bam..
Nagios is completely down [22:30:05] happened to prod before [22:30:23] You totally haven't seen my ex-work's nagios script I wrote to handle that stuff, 4k lines of perl goodness :D [22:31:08] But yeah, puppet can help there. I somewhat hate the way storedconfig stuff is used on prod currently anyway, it's still not as abstracted as it could be. [22:31:10] heh, wow [22:31:32] you know there are also some projects that tried to put a web GUI on top of Nagios, to let people define services [22:31:55] i disliked them somehow, but maybe it is a way to make it easier for labs users [22:31:57] ewww [22:32:17] I'd rather put it in ldap and write a page for OSM before that [22:32:32] http://sourceforge.net/projects/nawui/ [22:33:28] http://www.debianhelp.co.uk/nagiosweb.htm [22:33:34] My thought around this was for example nrpe shouldn't be a static file, all checks should be based on for example nagios::monitor::disk_usage which is an abstraction of check_nrpe!disk_usage or such. But I think the way puppet can handle that prevents it from being possible, and I'm not that good with puppet [22:33:43] andrewbogott: I forgot to turn the mediawiki jobs back on [22:33:46] they add the records [22:34:03] I added that back a little bit ago [22:34:16] Ryan_Lane: You mean, regarding my earlier comment about missing dns? [22:34:33] It would be useful if we could auto insert the puppet classes into labsconsole and remove a step of getting stuff added for things like this - currently puppetmaster::self + something new/requires args = nearly impossible to test. [22:35:15] Damianz: what do you mean? [22:35:31] Damianz: Doesn't https://labsconsole.wikimedia.org/wiki/Special:NovaPuppetGroup do that? [22:35:43] ah, you mean read from the repo and auto-generate? [22:35:50] Yeah [22:35:50] andrewbogott: yeah.
about dns [22:35:52] earlier [22:36:04] And I know how hard/annoying that would be so am in no way suggesting it [22:36:16] it would be easier with roles [22:36:18] You could totally do it with regex though! [22:37:20] Ideally I'd like to push my change to my branch, have the option in osm to add my awesome new class to my instance because it's in my project with 0 input from ops. That requires gerrit/puppet/modules (I believe) and some osm magic at a minimum I'd have thought. We can all wish though [22:37:54] so [22:37:59] there's an easier way [22:38:03] that I thought of recently [22:38:30] and it would work globally and per-branch (if we ever get to do per-branch) [22:38:57] we can add keywords into class and variable documentation [22:39:19] then you just parse the comments [22:39:40] that would allow descriptions for the variables and the classes, too [22:40:47] That would be pretty awesome [22:40:52] It would! [22:41:03] Much easier to parse too [22:45:02] Damianz: about the ram_check.sh, i know there is this one: https://gerrit.wikimedia.org/r/#/c/12164/ [22:45:18] !log nagios Commented out free_ram check for now. [22:45:19] Logged the message, Master [22:45:20] but i am not sure anymore why Petr wanted to add a different one from prod [22:46:28] mutante: I think it's the same file, it's currently in the prod config and the file is in the prod repo - it's just missing from the nrpe class, production uses nagios::monitor which adds it after so new labs instances never get it. [22:46:48] Or that's what I think...
really want someone who knows if that's used in prod to comment on the change to ensure I don't borke anything heh [22:48:44] Damianz: ok, hmm, it is indeed on the production server, the file is there, it is also in NRPE [22:48:53] Damianz: but...i do not see a single service actually using that command [22:49:01] Neither do I [22:49:31] but i do see it in nrpe_local.cfg [22:49:32] Actually I need to fix https://gerrit.wikimedia.org/r/#/c/21822/2/manifests/nrpe.pp, I'm adding the file twice for production hosts atm =/ [22:49:51] I'm thinking it's a labs only check but not really sure [22:50:24] Damianz: https://gerrit.wikimedia.org/r/#/c/1638/ [22:50:46] Wrong branch [22:50:59] well, yeah, but has been merged [22:51:01] it's old [22:51:05] Yeah [22:53:21] That sorta says labs specific check to me thinking about it [22:53:51] Damianz: so..conclusion.. yes, it is just used on labs. i don't know what we want less, another "if $realm" or just not care, because it also does not hurt us to have that file [22:53:54] Doesn't really help if it needs wrapping in an if or not, really hate using that sort of logic in the middle of a class... makes things un-usable outside of here and breaks contributions back. [22:54:05] ack [22:54:07] heh [22:55:09] I'd say screw it and leave it like that (now I've removed it from the other class) and eventually move this to something that sucks less for portability. Depends if that gets ops love or not... I'll wait until Ryan gets bored enough to review it. [22:55:43] mutante: it may be missing from the merge [22:56:07] it's fine if it's not wrapped in realm [22:56:13] it doesn't hurt if the script exists on production systems [22:56:16] it isn't being used [22:57:00] It already exists on them, on the wrong path for nrpe to use though (going off nagios::monitor) [22:57:29] Also I've just thought.. why did I remove that file again BAH [22:58:21] * Damianz facepalms at the path and reverts that [22:58:26] yeah, it is in there.
..and now we all agreed to not bother about the $realm check.. ok [23:01:42] Ok, so if that's fine now I just un-did my derpyness I'll just badger someone to merge it then :D *looks at ryan* [23:02:15] I think petan is going to stab me in the morning for spamming his email [23:05:41] labs-nfs1 - Disk space - DISK WARNING - free space: /export 856 MB (4% inode=66%). Might want to clean up some bots stuff soonish. [23:16:15] Ryan_Lane: can this be the new error page for labs https://sphotos-a.xx.fbcdn.net/hphotos-ash3/527920_352703064812991_742496159_n.jpg [23:17:21] LOL [23:17:45] need to photoshop a horn in [23:17:50] I wonder what the license on this photo is [23:18:21] Don't the mwf have a talented team of designers to create random unicorn images for us all day? :D [23:18:32] volunteer designers? [23:19:05] Everyone likes unicorns, volunteer or not [23:19:09] * Damianz rm -f /var/lib/puppet/state/puppetdlock *sigh* [23:20:42] warning: Could not load fact file /var/lib/puppet/lib/facter/default_interface.rb: ./default_interface.rb:43: syntax error, unexpected kELSE, expecting kEND < Is that important at all? [23:22:24] Damianz: not really [23:22:37] http://commons.wikimedia.org/wiki/Category:Unicorns ! [23:22:51] tons of subcategories:) [23:23:00] mutante is actually somewhat awesome [23:24:12] !log nagios free ram check merged in, un-commenting service. Not reloading, should reload on the next parser run giving time for puppet to run on the instances. [23:24:14] Logged the message, Master [23:26:07] http://commons.wikimedia.org/wiki/File:The_Lion_and_the_Unicorn_2.jpg [23:27:48] mutante: is that the new 500 error page http://upload.wikimedia.org/wikipedia/commons/b/bf/The_Lion_and_the_Unicorn_2.jpg [23:28:26] heh, yeah:) [23:33:19] <^demon> Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/21848/ <3 [23:36:24] Ryan_Lane: merge it….
merge it good [23:49:45] ^demon: this is prone to breakage [23:49:53] ^demon: make that a config option that can be passed in [23:50:15] <^demon> Blergh. I was afraid you'd say that [23:50:22] lol
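Closing the loop on Ryan's idea from earlier in the log (keywords in class and variable documentation that labsconsole could parse to auto-generate its class list), a minimal prototype might look like this. The `@labs-ui` keyword and the sample manifest are invented purely for illustration; no such convention exists in the repo.

```python
import re

# Invented sample manifest: "@labs-ui" marks classes a UI should expose.
SAMPLE_MANIFEST = """\
# @labs-ui
# Installs the free_ram nrpe plugin.
class nrpe::check_ram {
}

# Internal helper, not user-facing.
class nrpe::packages {
}
"""

def ui_classes(manifest, keyword="@labs-ui"):
    """Return names of classes whose doc comment carries the keyword."""
    found = []
    pending = False  # keyword seen in the comment block directly above
    for line in manifest.splitlines():
        line = line.strip()
        if line.startswith("#"):
            if keyword in line:
                pending = True
        else:
            match = re.match(r"class\s+([\w:]+)", line)
            if match and pending:
                found.append(match.group(1))
            if line:  # any non-comment, non-blank line ends the block
                pending = False
    return found

print(ui_classes(SAMPLE_MANIFEST))  # → ['nrpe::check_ram']
```

The same comment walk could collect the description lines too, which is what would make the parsed output useful as UI help text rather than just a class list.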