[00:00:19] Tool-Labs: Some grid jobs are in odd state - https://phabricator.wikimedia.org/T95094#1180258 (yuvipanda) hmm, I do see a few with just the 's', but what else is being affected? why set to 'High'? (Just curious)
[01:11:52] Tool-Labs: Some grid jobs are in odd state - https://phabricator.wikimedia.org/T95094#1180303 (scfc) Because a) it may be a symptom of a bug in our grid system (all the affected jobs date from before the last outage, so "best case" is that it is related to that and our recovery process/monitoring does not co...
[01:13:57] Tool-Labs: Some grid jobs are in odd state - https://phabricator.wikimedia.org/T95094#1180304 (yuvipanda) Fair enough :) Assuming that the processes are all dead I guess we can just remove them from the grid and have the status page be an accurate reflection of reality. Should probably wait for @Coren to com...
[01:14:21] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180305 (yuvipanda) Beginnings of this going on at https://github.com/yuvipanda/tools-manager
[01:15:24] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180306 (yuvipanda) (will be moved to Gerrit soon, and renamed, etc)
[01:16:09] Labs, Tool-Labs, ToolLabs-Goals-Q4: Ensure that all running webservices have a services.manifest file - https://phabricator.wikimedia.org/T95095#1180309 (yuvipanda) NEW
[01:28:15] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180321 (yuvipanda) As for software design, it will be implemented as multiple stateless deamons that can easily be run in multiple hosts for redunda...
[01:55:17] Tool-Labs: Some grid jobs are in odd state - https://phabricator.wikimedia.org/T95094#1180339 (scfc) It's definitely not a Saturday night emergency :-), much less non-Greek/Greek Easter holiday stuff.
[02:04:59] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180341 (scfc) Well, if the `bigbrother` replacement would use sudo, we could just use monit :-). I don't think you to fork child processes just for...
[02:06:11] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180342 (yuvipanda) @scfc it will need to use sudo / fork/setuid at some point to start the webservice, no?
[02:21:36] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180345 (scfc) Yes, for the purpose of starting a job/web service (and that'd be no reason to use monit). But your comment above sounded like you th...
[02:40:04] (PS1) Tim Landscheidt: Bump Debian Policy compliance to 3.9.6 [labs/toollabs] - https://gerrit.wikimedia.org/r/201919
[02:42:58] (CR) Tim Landscheidt: [C: -2] Bump Debian Policy compliance to 3.9.6 [labs/toollabs] - https://gerrit.wikimedia.org/r/201919 (owner: Tim Landscheidt)
[03:16:34] (CR) Tim Landscheidt: "This doesn't work as I thought.
lintian will complain if the compliance statement is not the same as the "current" one, so on Precise ins" [labs/toollabs] - https://gerrit.wikimedia.org/r/201919 (owner: Tim Landscheidt)
[04:07:46] (PS2) Tim Landscheidt: Bump Debian Policy compliance to 3.9.5 [labs/toollabs] - https://gerrit.wikimedia.org/r/201919
[04:11:41] (PS3) Tim Landscheidt: Bump Debian Policy compliance to 3.9.5 [labs/toollabs] - https://gerrit.wikimedia.org/r/201919
[04:20:27] (PS4) Tim Landscheidt: Bump Debian Policy compliance to 3.9.5 [labs/toollabs] - https://gerrit.wikimedia.org/r/201919
[04:23:12] (CR) Tim Landscheidt: [C: 2 V: 2] Bump Debian Policy compliance to 3.9.5 [labs/toollabs] - https://gerrit.wikimedia.org/r/201919 (owner: Tim Landscheidt)
[04:26:25] (PS1) Tim Landscheidt: Fix lintian warning about maintainer also being in uploaders [labs/toollabs] - https://gerrit.wikimedia.org/r/201922
[04:30:03] (CR) Tim Landscheidt: [C: 2 V: 2] Fix lintian warning about maintainer also being in uploaders [labs/toollabs] - https://gerrit.wikimedia.org/r/201922 (owner: Tim Landscheidt)
[04:33:58] (PS1) Tim Landscheidt: WIP: Fix undefined variable warnings in status.php [labs/toollabs] - https://gerrit.wikimedia.org/r/201923
[04:34:09] Tool-Labs: webservice/webservice2 have no man pages - https://phabricator.wikimedia.org/T95097#1180355 (scfc) NEW
[04:38:03] Tool-Labs: Make lintian warnings voting errors in labs/toollabs repository - https://phabricator.wikimedia.org/T95098#1180363 (scfc) NEW
[04:38:22] Tool-Labs: webservice/webservice2 have no man pages - https://phabricator.wikimedia.org/T95097#1180372 (scfc)
[04:38:24] Tool-Labs: Make lintian warnings voting errors in labs/toollabs repository - https://phabricator.wikimedia.org/T95098#1180371 (scfc)
[04:39:16] (CR) Tim Landscheidt: [C: -2] "WIP."
[labs/toollabs] - https://gerrit.wikimedia.org/r/201923 (owner: Tim Landscheidt)
[04:44:49] the manifest syntax itself is going to be same as Heroku's procfile.
[04:44:52] gah, sorry
[05:09:00] Tool-Labs: webservice/webservice2 have no man pages - https://phabricator.wikimedia.org/T95097#1180385 (yuvipanda) We should / can kill webservice from the repo now, it isn't used anywhere (webservice2 is the default webservice now). We can just put help files in --help there maybe?
[05:57:41] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180391 (yuvipanda) Aha, so I just found out that some of my problems were because I was trying to parse 'qstat -j "*" -xml' instead of 'qstat -u "*"...
[06:17:24] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180403 (yuvipanda) Ok, with the latest commit it kindof works \o/ Still a lot to be done before we can turn this one: # Documentation, both interna...
[06:17:44] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180405 (yuvipanda) (just bigbrother replacement so far, btw)
[06:47:46] PROBLEM - Free space - all mounts on tools-webgrid-04 is CRITICAL: CRITICAL: tools.tools-webgrid-04.diskspace._var.byte_percentfree.value (<22.22%)
[06:59:45] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:03:19] Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1180492 (scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org Date:...
[10:14:08] Nemo_bis: Intuition is frozen until Siebrand and I finish localisation conversion to JSON
[10:14:23] I've finished my end, but waiting for siebrand at the moment
[10:22:55] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180502 (scfc) Regarding rate limiting, the problem with `bigbrother` is that in situations where the infrastructure fails and thus job never endure,...
[10:40:25] Labs, Puppet: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1180505 (Tgr) NEW
[10:45:36] Tool-Labs: webservice/webservice2 have no man pages - https://phabricator.wikimedia.org/T95097#1180514 (scfc) I know, but I want to put `webservice{,2}` back in the repository. IIRC you pulled it out of there to hot-fix some issues, and now that these have been worked out and after you'll build in the `bigb...
[12:08:59] I'm trying to install postgresql on labs, on a jessie box, and I get "Unable to locate package postgresql-client-9.1" errors from puppet
[12:09:08] is that something I m not supposed to be doing?
[12:43:38] tgr: It seems the package was dropped a few years ago
[12:43:46] Not on trusty or jessie
[12:43:53] http://packages.ubuntu.com/precise/postgresql-client-9.1
[12:44:08] https://packages.debian.org/wheezy/postgresql-client-9.1
[12:44:52] You'd have to find a PPA that provides it on Debian, or forward-port perhaps.
[12:48:49] Krinkle: I would be fine with 9.4, the postgresql module in operations/puppet specifically requires 9.1 though
[12:49:20] Jessie is very new still. Just like with Trusty, most existing manifests can be assumed to be precise or trusty only.
[12:49:45] by now anything not trusty compatible has been if-guarded accordingly
[12:50:00] and vice versa for trusty-only.
[12:50:03] But not for Jessie yet.
[12:50:55] in most cases this is mitigated by a small abstraction class that code would depend on itself, which in turn would require the appropiate package based on platform
[13:11:32] YuviPanda: You around? Wdq seems to be acting up. Maybe one of the two servers is not completely healthy?
[16:10:51] Labs, Puppet: geoipupdate package not found for jessie - https://phabricator.wikimedia.org/T95117#1180714 (Tgr) NEW
[16:15:55] Labs, Puppet: geoipupdate package not found for jessie - https://phabricator.wikimedia.org/T95117#1180733 (Tgr)
[17:03:09] tgr|away: i can fix that for you
[17:06:41] ori: you mean geoipupdate?
[17:06:46] postgres
[17:07:02] haven't looked at geoipupdate yet
[17:09:15] tgr: the class takes a pgversion parameter
[17:09:20] have you tried to set that to 9.4?
[17:12:06] ori: not yet
[17:12:46] I'm trying to create a role for sentry
[17:13:22] specifyong the pgsql version there seemed like bad form
[17:35:42] Labs: /etc/ssh/userkeys/ubuntu notices for every puppet run on labs instances - https://phabricator.wikimedia.org/T94866#1180804 (Krinkle)
[19:28:48] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180887 (yuvipanda) Yeah rate limiting is going to be interesting but I think I would rather start with no rate limiting and implement it than the ot...
[19:51:21] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180903 (scfc) I look at it the other way: If you start jobs in parallel, you need to take measures to not overload the server that the "launcher" ru...
[19:56:12] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1180918 (yuvipanda) Fair enough :) /me strikes asyncio from list.
[19:57:01] Labs, Continuous-Integration, operations: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1180919 (Krinkle) As long as the "number of failures" (e.g. errors/warnings, not the complete failures) are counted in Graphite and result in Shinken e-mails. Is that...
[19:59:56] Labs, Continuous-Integration, operations: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1180922 (yuvipanda) They are counted in graphite, and present in the emails just not particularly clearly.
[21:31:31] How do I have jsub write somewhere other than $HOME?
[21:32:24] -e and -o, I guess.
[21:42:05] Labs, Continuous-Integration, operations: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1180998 (Krinkle) Open>Resolved a:Krinkle
[22:01:21] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:34:36] Is there something up with the labs-l mailing list? Mail doesn't seem to be getting through.
[22:41:44] Sorry, mobile data issue there. Is there something up with the making list?
[22:42:11] *mailing list