[00:02:48] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1933483 (10Reedy) 3NEW [00:03:34] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1933491 (10Legoktm) > Or we just undeploy it all [00:04:16] I have a problem with puppet [00:04:44] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1933492 (10Reedy) Wikitech will be rolled back to 1.27.0-wmf.9 so stuff isn't just broken We should have a good look around and see what we'll lose. And then, if anything we want to... [00:08:05] The platform can not connect to MySQL by the puppet [00:08:18] See: https://wmve.wmflabs.org/ [00:08:41] I using a local mysql [00:12:04] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1933514 (10Reedy) See T123599 [00:22:42] 6Labs, 10Tool-Labs, 6Phabricator: move tool user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933529 (10Dzahn) 3NEW [00:22:51] Anyone have tips on what software one would use for creating temporary mediawiki VMs within a tool for phantomjs? [00:23:39] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1933538 (10Dzahn) suggesting to do T123601 whether we keep using SMW or not [00:24:03] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933540 (10Dzahn) [00:24:34] Abin_Sur: asking in #wikimedia-releng about phabricator will probably produce better results (assuming that's a phabricator install) [00:24:50] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933529 (10Dzahn) [00:25:05] OH-: asking the CI people what they use might be a good idea, although I don't think you can programatically create VMs anyway, and most definitely not rom inside tools [00:25:06] *from [00:25:47] 6Labs, 10wikitech.wikimedia.org: Move tool labs signup to phab - https://phabricator.wikimedia.org/T123603#1933551 (10Reedy) 3NEW [00:26:06] Hmm, OK. I was thinking of making a bot that automatically installed MW skins and screenshotted them for mw.org and was concerned about security [00:26:13] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933557 (10Dzahn) [00:26:21] 6Labs, 10wikitech.wikimedia.org: Move tool labs signup to phab - https://phabricator.wikimedia.org/T123603#1933551 (10Reedy) [00:26:22] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933559 (10Reedy) [00:26:52] OH-: I think the CI people did something like that [00:27:12] 6Labs, 10wikitech.wikimedia.org: Move tool labs signup to phab - https://phabricator.wikimedia.org/T123603#1933551 (10Reedy) [00:27:14] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933529 (10Reedy) [00:30:39] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933570 (10Reedy) [00:32:48] What if I programatically created and destroyed vagrant machines? That would require a project instead of a tool, right? [00:33:16] depends on what you mean by vagrant machines. you can do that with docker / lxc, and that would require a project [00:33:18] yes [00:33:27] OH-: I also think someone did something like this not too long ago [00:33:30] for the i18n team I think [00:34:43] Do you remember any people that operated the bot? don't want to reinvent the wheel :P [00:35:50] OH-: do new skins show up so often that a bot would be useful for this? [00:37:19] eh, I figured it would be a convenient utility. mostly was just thinking of little personal projects to do :P [00:38:03] :) [00:38:32] https://xkcd.com/1319/ [00:39:38] ha. that reminds me, new xkcd \o/ [00:39:40] OH-: you can explore tools.wmflabs.org/paws if you're looking for fun little personal projects :D [00:42:03] OH-: if you are interested in MediaWiki install automation there is the MediaWiki-Vagrant project and the relatively new https://github.com/wikimedia/mediawiki-containers project [00:42:30] bd808: I wonder if someone adding a Dockerfile to mediawiki/core.git would be controversial [00:42:34] and if so, *how* controversial :) [00:44:09] heh. probably not as controversial as adding Composer support was but ... [00:44:32] I don't know if it'll be useful th [00:44:41] to be useful it'll have to bundle in extensions as well [00:44:52] since you need a db, a web server and and php container it's kind of hard to smush into a singel Dockerfile [00:45:07] yeah, but you'd run a Dockerfile just for mw [00:45:18] and a docker-compose.yaml for the whole thing [00:45:23] YuviPanda, not results. [00:45:24] I'll take a look, thanks [00:45:25] that's what https://github.com/wikimedia/mediawiki-containers is about [00:46:27] * darkblue_b queues for YuviPanda [00:47:03] Abin_Sur: we can't really help though, once you get your own labs project you are kind of on your own. we simply don't have the manpower for debugging application issues :( sorry! [00:47:44] Abin_Sur: the first thing to verify is that you can use mysql's cli to connect to the db using that user and password [00:47:53] hi YuviPanda -good day .. I was reading the gitter backlog for Jupyter and you are there, so here I am :-) [00:48:11] did you get a Jupyterhub up ? which version and what kernals.... [00:48:19] darkblue_b: tools.wmflabs.org/paws [00:48:42] I just need to know how to fix the problem. No intervention is required [00:48:47] aha - nice [00:49:26] how did you install .. build from source or pypi or ... [00:50:07] .. setting aside the LDAP portions [00:51:26] darkblue_b: github.com/yuvipanda/paws [00:51:29] kubernetes [00:52:06] oof [00:52:23] what I am getting at is.. the versions of the ipython and notebook parts [00:53:42] I used pypi just now `pip install ipython; pip install notebook --upgrade` [00:54:02] that gives ipython 4,02 and a synched set of jupyter whatever [00:54:15] Maybe I should work on my twitter bot first [00:54:17] so I was wondering if you are happy with that, with your vast experience [00:54:23] :-) [00:54:45] I wanted to make a twitter bot that simulates TCP [00:54:55] so you could go SYN and it would go SYN ACK ACK ACK ACK ACK ACK [00:55:21] OH-: nice [00:55:41] darkblue_b: I wrote my own custom authentication backend (OAuth) and spawner (kubernetes) and it's dpeloyed via the dockerfiles in that repo [00:56:19] hmm - so custom that I wonder if what I am asking applies... [00:56:33] I am wondering what versions to rely on [00:56:55] of ipython and notebook [00:57:29] you should ask the jupyter folks :) [00:57:36] I just use whatever's in pip [00:57:50] yes pypi pip seems good right now [00:58:28] bd808: there's now an eventlogging schema collecting data about people using webservice commands :) [00:58:34] bd808: going to add it to jsub/jstart now [00:58:53] ok - onward then.. you may have seen this.. the list of available kernals to run Notebooks with ... https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages [00:59:07] YuviPanda: oh cool! actual measurements! [00:59:11] bd808: inorite [00:59:21] bd808: I'm wondering what's a nice generic way to 'wrap' some executables [00:59:27] I'll get my -labs stuff in order and log into the PAWS real soon now.. thx ! [00:59:29] bd808: like qsub I wanna measure, but that's coming from a deb repo [00:59:56] darkblue_b: :) I have py3 kernel now and bash kernel, will add R soon (and addshore wants to add PHP) [01:00:16] ugh, I gotta write perl now [01:01:14] bd808: can I just use system() to call out in perl? [01:01:54] YuviPanda: yup -- http://perlmaven.com/running-external-programs-from-perl [01:02:35] hah i'm at the same page [01:03:31] bd808: $commandline = join " ", $0, @ARGV; [01:03:36] to get the full commandline? [01:03:42] sorry am poking you with newb perl shit [01:04:01] 6Labs, 10Labs-Infrastructure, 10Salt: update salt key monitoring scripts for labs to new nova api version - https://phabricator.wikimedia.org/T123607#1933624 (10ArielGlenn) 3NEW a:3ArielGlenn [01:04:10] I think that would work. you know how to write test scripts right? ;) [01:05:38] :D [01:06:11] * bd808 hasn't written anything of substance in perl for ... a really long time [01:06:34] * YuviPanda really likes perl6 [01:14:03] (03PS1) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [01:14:14] bd808: can you take a sanity look? [01:14:21] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [01:14:38] jerkins says hell no :) [01:15:16] (03PS2) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [01:15:32] bd808: yeah rebased [01:16:51] YuviPanda: where does /usr/local/bin/log-command-invocation come from? [01:17:13] bd808: puppet [01:17:19] bd808: I want to move all of these packages into puppet too [01:17:22] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [01:20:02] 6Labs, 10Tool-Labs: Provide resource for db access in grid - https://phabricator.wikimedia.org/T70881#1933699 (10Merl) Today dewiki has high replag since about 12 hours (>3hours replag). Many of my sge jobs are currently testing replag and rescheduling themselves (return code 99) since hours now. This is the... [01:22:11] YuviPanda: looks like it should work to me if you can make jerkins happy [01:22:15] ok [01:22:18] thanks :D [01:22:22] also found out that there's replag now [01:22:29] we should have icinga checks for these [01:26:25] YuviPanda: https://tools.wmflabs.org/replag/ [01:26:55] bd808: yeah, that's how I found out [01:26:58] but that isn't alerting tho [01:27:25] *nod* the logic I used for that is pretty simple [01:27:27] yeah [01:27:34] we could make a check out of it I think [01:27:43] yeah [01:27:59] for an alery you would only care about the shards [01:28:01] *alert [01:28:35] where do we keep custom icinga checks? In ops/puppet somewhere I assume? [01:32:00] bd808: ya [01:32:44] !log tools stopped erwin85's tools since it was causing replag on labsdb1002 [01:55:00] bd808: hmm, I don't even know what is failing on that patch [01:56:15] !log tools rm service.manifest for wikiviewstats to prevent it from constantly trying to start up and fail webservice [01:59:54] YuviPanda: replag is dropping. only 56m behind now [02:00:07] yeah [02:00:09] it's basically [02:00:13] 'kill all the queries' [02:00:20] 'look at tendril and stop tool running bigg queries' [02:08:22] (03PS3) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [02:08:24] bd808: ah, had missed a semicolon [02:08:26] should pass now [02:08:28] hopefully [02:10:07] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:10:53] wtf [02:27:43] (03PS4) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [02:28:23] the answer to the wtf was 'yuvi can not differentiate between perl and bash' [02:29:27] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:30:30] well [02:30:33] this works fine on tools [02:30:45] (03CR) 10Yuvipanda: [C: 032] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:33:53] (03CR) 10Yuvipanda: [V: 032] "Works for me when I build it on toollabs" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:39:19] 6Labs, 10Tool-Labs, 5Patch-For-Review: Instrument jsub/jstart/webservices usage - https://phabricator.wikimedia.org/T123444#1933788 (10yuvipanda) Okay, so now we've stats for jsub, jstart, job, webservice and jstop. Now to add them for qsub and qstat (which are harder, since they're deb package provided bina... [04:18:21] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933809 (10Negative24) Could this possibly use another Phabricator form? (Is there a way to remove the visibility of the form from the global drop down to just be use... [05:33:40] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933843 (10greg) Phabricator forms (editable and non-editable pre-fillable fields, only show relevant fields, something relatively new that upstream implemented, kind... [07:30:32] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1933878 (10Beetstra) @valhallasw - I have added 2 more parsers (total now 12) - the bot is creating a backlog, likely during the American daytime, which it does not munch away at night. [07:58:45] 10Tool-Labs-tools-Erwin's-tools: Kill huge query to avoid killing all erwin85 tools - https://phabricator.wikimedia.org/T123613#1933895 (10Nemo_bis) 3NEW [08:06:50] 10Tool-Labs-tools-Erwin's-tools: Kill huge query to avoid killing all erwin85 tools - https://phabricator.wikimedia.org/T123613#1933910 (10Nemo_bis) 5Open>3Resolved a:3Nemo_bis I guess the tool is https://tools.wmflabs.org/erwin85/contribs.php , currently marked red because we aren't even sure it works. I... [09:53:20] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934049 (10WMDE-leszek) That is a bit odd then. Neither me (WMDE-leszek) neither any of my fellow WMDE colleagues (Jakob, WMDE-Fisch) can log in. What we're doing here is ssh login to bastion... [10:22:53] (03PS59) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 [10:29:32] (03CR) 10Ricordisamoa: "PS59 kills the hard-coded getGenericType() in favor of a generalType property on Section classes" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa) [11:04:41] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1934127 (10scfc) [11:04:44] 6Labs, 10Tool-Labs: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#1934128 (10scfc) [11:08:42] 6Labs: beta swift labs instances requirements - https://phabricator.wikimedia.org/T123512#1934132 (10fgiunchedi) thanks @hashar ! I'd like to have some wiggle room just in case anyways I don't seem to be able to add large/xlarge instances to deployment-prep ATM, quotas have been hit perhaps? [11:11:37] godog: hi! you should get the quota via https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=displayquotas&projectname=deployment-prep [11:11:53] godog: also andrew created specific instances for ci purposes with different set of (cpu,met,disk) [11:12:26] hashar: oohh thanks a lot! indeed instance limits are hit, 55/55 [11:12:34] :-( [11:12:47] hashar: out of curiosity how did you reach the quota page? [11:12:53] magic? :-} [11:13:05] the link is from the manage project page [11:13:50] i.e. https://wikitech.wikimedia.org/wiki/Special:NovaProject , select your project [11:14:08] on each tables, the right most columns has a bunch of actions link, one of them is 'Display quotas' [11:15:18] I dont think there is any instance we can delete [11:15:26] hah! thanks :D yeah I was thinking large instance with 2x or 3x the disk would be enough [11:22:25] godog: and if you guys have plan to migrate Swift to Jessie, maybe beta can start straight withJessie [11:23:08] yeah not so sure about that now but good point [11:36:31] 6Labs, 7Graphite: graphite.wmflabs.org API is unreliable - https://phabricator.wikimedia.org/T123566#1934141 (10fgiunchedi) graphite.wmflabs.org IIRC is backed by labmon1001 which is the default destination for metrics in labs. anyways I've upgraded labmon to the same graphite version as production (0.9.13) ma... [11:37:13] 6Labs, 7Graphite: graphite.wmflabs.org API is unreliable - https://phabricator.wikimedia.org/T123566#1934142 (10fgiunchedi) graphite.wmflabs.org IIRC is backed by labmon1001 which is the default destination for metrics in labs. anyways I've upgraded labmon to the same graphite version as production (0.9.13) ma... [11:38:18] 6Labs: beta swift labs instances requirements - https://phabricator.wikimedia.org/T123512#1934143 (10fgiunchedi) indeed instance limits (55) have been hit, https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=displayquotas&projectname=deployment-prep can we bump that to +5 ? [11:59:00] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934161 (10scfc) No, you should use the bastion `bastion.wmflabs.org` (and similar); cf. https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_public_and_private_instances. Also note that... [12:12:18] 6Labs, 5Patch-For-Review: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1934166 (10MoritzMuehlenhoff) 5Open>3Resolved This has been enabled on our openldap servers [12:42:50] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934182 (10WMDE-leszek) Right, that has been a silly mistake of me (although I am sure accessing the instance the way I described above used to work some weeks ago). Anyway, I tried getting th... [12:50:23] 6Labs, 10Tool-Labs, 10labs-sprint-119, 10Diffusion: Figure out a git hosting solution for tools/kubernetes - https://phabricator.wikimedia.org/T117071#1934186 (10Joe) [12:50:25] 6Labs, 10Tool-Labs: Initial Deployment of Kubernetes to Tool Labs (Tracking) - https://phabricator.wikimedia.org/T111885#1934185 (10Joe) [13:11:52] 6Labs, 7Graphite: graphite.wmflabs.org API is unreliable - https://phabricator.wikimedia.org/T123566#1934203 (10Tgr) 5Open>3Resolved a:3Tgr Seems fixed now, thanks! (Feels slightly faster, too.) [13:14:26] 6Labs, 10Tool-Labs: Install a docker registry to be used by kubernetes - https://phabricator.wikimedia.org/T123628#1934208 (10Joe) 3NEW [14:01:13] Can someone point me in the right direction where I can fetch the transclusion count of User:MiszaBot/config? [14:01:24] Or can someone run a quick DB query for me? [14:01:41] YuviPanda, ^ [14:03:05] legoktm, ^ [14:04:28] Guess not. [14:32:46] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934284 (10valhallasw) Try connecting with `ssh -vv `. This should report which keys are tried and whether they succeed or not. If agent forwarding is used, you should be able to do... [15:24:41] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934370 (10WMDE-leszek) As suggested I've used agent forwarding and first logged into bastion. Then I am trying to log into `phragile.phragile.eqiad.wmflabs`. Looking at debug messages it trie... [15:25:17] 10PAWS, 6Revision-Scoring-As-A-Service: Install revscoring inside PAWS - https://phabricator.wikimedia.org/T120317#1934376 (10Halfak) yuvipanda, it won't pull in models. It will be up to the user to acquire those as necessary. They are pretty easy to pull from github with a wget. Do you think that is good e... [15:34:49] !log wikilabels deployed 4c643e8 with wikilabels:79b0cad [16:44:26] YuviPanda: re: qsub, we can just add a /usr/local/bin/qsub that calls the dpkg'ed qsub I think [16:49:04] valhallasw`cloud: could we use acct to get a historical picture, i.e. the last time certain tools were run etc? [16:49:07] I have a few tools I'm curious about [16:49:12] mainly :) [16:49:15] chasemp: more or less [16:49:36] chasemp: you can search by job name, but not by command [16:49:45] and by user [16:53:01] 10PAWS, 6Revision-Scoring-As-A-Service: Install revscoring inside PAWS - https://phabricator.wikimedia.org/T120317#1934554 (10yuvipanda) I definitely think we should find an easy way to include them [16:53:30] valhallasw`cloud: when something shows up for continuous queue [16:53:31] like [16:53:31] continuous:tools-exec-01.eqiad.wmflabs:tools.whymbot:tools.whymbot:jawikiclaim3.sh:6702940:sge:10:1419304075:1419304077:1419335945:0:0:31868:76.724795:7.696481:44892.000000:0:0:0:0:37401:9:0:14960.000000:11440:0:0:0:115337:7558:NONE:defaultdepartment:NONE:1:0:84.550000:12.647128:0.055227:-u tools.whymbot -q continuous -l h_vmem=256M,release=precise:0.000000:NONE:206929920.000000:0:0 [16:53:36] is that...the last restart? [16:53:52] trying to understand how continuous jobs are viewed in teh accounting file I guess [16:55:01] chasemp: accounting entries are written when the job finishes. The status will tell you why (for continuous jobs, this is typically a restart, which has error code 19 [16:55:10] chasemp: see /home/valhallasw/accountingtools/accounting.py [16:55:15] sorry, 25 is restart [16:57:36] chasemp: this one has status 0 'success', which means that the job finished by itself, I think [16:57:50] or it might have been qdel'ed [16:57:51] accounting.py is meant to be used as a lib not a cli tool yah? [16:57:56] ok [16:58:23] chasemp: right. "for entry in accounting.parse(open('filteredaccount')):" would be the typical usage [16:58:34] does a finite lifetime job that runs and completes in continuous just keep going in theory with restart after restart? [16:58:54] also thanks valhallasw`cloud :) [16:59:06] I think the rule for continuous jobs is 'restart on failure, stop on graceful exit', but let me check. It's defined in `jstart` somewhere [16:59:40] "while ! " . shell_quote($prog, @ARGV) . "; do\n" . " sleep 5\n" . "done\n"; [17:00:12] so the other way around -- if the program exits with an error, stop the job, otherwise restart [17:00:39] huh [17:01:06] I'm trying to think of a sane job that follow that mechanism and not like an internal sleep [17:01:39] chasemp: I think something like a bot that runs over all pages on a wiki [17:01:51] so the bot runtime is much longer than the restart time [17:02:05] I guess it makes sense as it would get rescheduled per resourcing on restart [17:02:17] maybe [17:02:22] although then on restart it would start at A again, so it doesn't really make sense [17:02:31] ok, I'm confused [17:02:34] the docs say 'Continuous jobs are not restarted if they end normally (with the exit status 0)' [17:02:35] ha me too [17:04:20] a long running job that keeps state and is restarted on failure only to run to some defined end and never run again [17:04:27] bsaically a supervised task until exit 0 [17:04:28] I guess [17:04:39] yes, that makes sense [17:04:50] I know why I'm confused -- bash uses 0 for true and 1 for false [17:04:50] which isn't what I think of when I think of a continuous job [17:05:09] no, I think it makes sense. Think of an IRC bot [17:05:24] if it crashes (disconnect not handled), it restarts [17:05:27] i.e. it never finishes with exit 0 so it alwys get restarted [17:05:32] but if you say !quit in irc it actually exits [17:05:36] but if you had like irc commands and chose to shut it down [17:05:37] yeah [17:05:48] we were going to teh same place at hte same time there :) [17:05:54] ok I'm with it [17:06:58] valhallasw`cloud: YuviPand.a was telling me you have other grid engine deployments on your radar, do they use berkely db? just curious [17:07:20] I'm reading up on classic spooling and a lot of the old school reasoning seems to say unless you are doing 100's of jobs a second submission wise [17:07:33] etc don't bother but then again it's the default for like the ubuntu package [17:07:51] the alternative is flat files or something? [17:07:57] chasemp: lemme think. At my labs cluster, they decided SGE was too difficult for users and it's now a free-for-all ssh-in-and-do-what-you-want [17:08:19] mark: yeah there is a flat file "classic" spool scheme (ala mail etc) [17:08:27] right [17:08:30] would that live on nfs too? [17:08:39] it wouldn't have too [17:08:57] but the on nfs / not on nfs question is mostly separate from the mechanism itself [17:09:16] from what I gather the old school spooling method is easier to debug and troubleshoot (all things being equal) [17:09:23] yes, and probably more NFS-safe [17:09:28] is it easy to convert? [17:09:49] that I'm not sure of, I can't find a conversion guide or docs [17:09:57] other than someone said on a mailing list "reinstall" [17:10:08] but I'm guessing that's an allusion to wiping out the queue or not carring over jobs [17:10:27] chasemp, yes, one of surfsara's clusters uses SGE [17:11:09] ...except not, it's just another cluster software that uses qsub >_< [17:11:17] :) [17:11:25] qsub being that loveable I find hard to believe [17:13:30] they run http://www.adaptivecomputing.com/products/open-source/torque/ + http://www.adaptivecomputing.com/products/open-source/maui [17:15:07] ah, I meant the /name/ qsub, not the exact same tool [17:15:58] oh ha [17:22:14] valhallasw`cloud: I emailed you about sutff, btw :) [17:29:55] chasemp: so most recent clusters seem to have converged on TORQUE, but there's one that still uses SGE, but very differently from how we are using it (max runtime 15 mins) [17:30:09] interesting [17:30:17] small odd jobs I guess [17:30:48] right, but on a high-power CPU/GPU cluster [17:40:59] btw [17:41:01] https://phabricator.wikimedia.org/P2474 [17:41:03] valhallasw`cloud: chasemp ^ [17:41:07] preliminary data [17:41:14] from command invocation setup [17:41:16] *stats [17:41:27] most of the usage is cron [17:41:30] that's a lot of webservice restarts by webservicechecker [17:41:44] yah [17:41:50] and then -services is bigbrother, right? [17:41:51] need to find the ones churning and kill them I guess [17:41:57] bigbrother is also submit [17:41:58] I think [17:42:18] wait [17:42:19] I didn't know jstop was a thing [17:42:21] services is both [17:42:23] webservicemonitor [17:42:27] and bigbrother [17:42:29] then what's on checker? [17:42:30] checker is just catchpoint [17:42:32] ohh [17:42:37] of course :D [17:42:41] there's a thing that submits a webservice and checks to see if it succeeds [17:42:43] over how much time is this? [17:42:54] you should divide by #seconds ;-) [17:43:13] 20160114004524 [17:43:17] is first event [17:43:36] max - min timestamp [17:43:36] so that's roughly 17 hours [17:43:38] is [17:43:40] 169787 [17:43:41] that is more jsubs that would ahve thought [17:43:46] we think it's mostly the cron stuff? [17:44:01] we know it's mostly cron [17:44:05] all of tools-submit is cron [17:44:13] right gotcah [17:44:17] yeah, it's 50k in 17h for cron = slightly less than 1 per second [17:44:28] probably because some people are doing something every minute or every few minutes [17:44:39] yeah [17:44:59] I think our outlien says something like more often than every 5m is frowned upon but yes :) [17:45:09] really? [17:45:21] I didn't know that [17:45:24] YuviPanda: and catchpoint checks every minute? [17:45:30] valhallasw`cloud: 5min I think? [17:45:40] but the check might be doing two webservice calls and not one [17:45:48] "Scheduling a command more often than every five minutes (for example * * * * * command) is highly discouraged," [17:45:51] because 1000/17 is about 60 [17:45:56] chasemp: aah, that makes sense [17:46:05] yeah, that's what I thought, but then the numbers are a factor 2.5 off [17:46:42] ok, I can confim it's doing two calls [17:46:44] so tools checker itself is a submit host and that canary check does the whole thing of submitting a webservice and looking for it [17:47:19] I really need to like, look at tools-checker sometime :) [17:47:28] YuviPanda: also, I think many of them are -once invocations that don't actually start a job [17:47:37] valhallasw`cloud: aaah [17:47:45] valhallasw`cloud: so toolschecker counts also include jsub [17:47:48] and job [17:47:49] and stuff like that [17:47:51] because with the outage we had a few hundred jobs queued after an hour [17:47:54] not just webservice [17:48:12] which suggests ~10 per minute rather than 60 [17:49:01] 244 webservice calls [17:49:04] can someone help me understand the -once not starting a job thing [17:49:04] 715 jsubs [17:49:11] I thought -once did start a job just not a continuous job [17:49:15] from checker [17:49:28] chasemp: -once will check if job is running and if it is not start it [17:49:28] chasemp: no, -once checks if there is a running job, and doesn't start it if there is [17:49:30] job with same name [17:49:30] ^ [17:49:42] ohh [17:49:56] I see so this could be longer running things that don't finish within $cron window [17:50:03] and make the execution but are an essential noop [17:50:08] yeah [17:50:33] 21321 [17:50:35] of the invocations [17:50:37] of jsub [17:50:39] have -once [17:50:41] so that's like half [17:50:54] huh, any thoughts on how to track the noop's? [17:50:55] we can add more instrumentation to jsub itself to see how many times it actually submits vs bails [17:51:01] heh was just saying it [17:51:02] that would be cool [17:51:06] :) [17:51:13] right [17:51:22] so we should collect all the ideas for instrumentation [17:51:26] and then set it up [17:51:53] so jsub no-op or not [17:51:56] what else? [17:52:05] so as an aside, a continuous not-once job will pile on in parallel with the same job name [17:52:17] yeah [17:52:25] hmm if they have teh same name [17:52:27] is the -once mechanism a jsub thing or a sge thing [17:52:30] I wonder if gridengine will allow that [17:52:44] I think it's a jsub thing but to be fully sure I need to read the jsub perl [17:52:51] in theory unique id's makes it possible but it would be sorta nuts [17:53:24] I want to say I've submitting jobs with teh same name directly with jsub that piled on but will have to dig a bit to see how that relates to jsub [17:53:26] yeah :) [17:53:56] too early in the day to read perl [17:53:59] yup! [17:54:02] some brave soul has to rewrite jsub into python [17:54:07] I hope it doesn't have to be me [17:54:41] I'm not ready to commit to it yet but I think I have a general feel for the work [17:55:39] chasemp: so I'm looking at SGE spooling in a bit more detail. BDB is explicitly not safe for multi-user access, and one should use BDB RPC (where there's a BDB server running that's communicated wiht) instead. So it's indeed likely the two-masters-running-situation was the cause of the corruption in december [17:55:57] well that's itneresting [17:56:02] are you sure that's not jsut pre-nfsv4 [17:56:15] I saw a lot of chatter from nfsv2 and v3 that indicated that [17:56:32] but I /thought/ that nfv4 was supposed to be the new light and way for local bdb [17:56:39] but there is a ton of noise on this so idk [17:56:52] from what I understand, you can mount bdb as 'single user' or as 'multi user', and SGE does the first -- but not 100% sure on that [17:57:15] interesting, let me know if you see any converting from bdb to classic stuff [17:57:22] it's all very sparse other than "reinstall thanks" [17:57:33] https://arc.liv.ac.uk/SGE/howto/backup.html [17:57:47] ^ this mentions inst_sge -upd for that [17:58:56] (still reading) [18:00:03] this is interesting though, I haven't seen this page [18:00:58] that's the son of grid engine page [18:01:32] ah [18:02:07] that utility doesn't seem to exist in the debian packages [18:02:15] but ...I doubt they chagned the bdb format [18:02:20] I wonder if I could steal the good stuff [18:03:48] the best grid engine guide I see refernces microsoft services for unix [18:04:56] .. just to clarify .. I have a password based login at wikitech, but to use -labs I need to file ssh keys in the OpenStack tab of my prefs.. [18:05:14] .. but all of that has nothing to do with a login on wikipedia ? true ? [18:05:19] dbb: correct [18:14:47] chasemp: from what I can see, SGEs idea of 'upgrading' from one spooling system to another is 'delete spooling data and start over' [18:15:39] I'm trying to get a clone of the git repo pushed to github, but it's so slow :-p [18:15:45] I think if you populate /etc/gridengine/bootstrap with teh right params and start up [18:15:48] it may start up "fresh" [18:15:54] but what those options are I'm not sure of yet [18:16:13] I'm also wavering on the idea of changing too many things at once but woudl like to poke at both methods for sure [18:18:11] it's not winning me over that the db-util package for bdb is not great [18:18:14] man: warning: /usr/share/man/man1/db_dump.1.gz is a dangling symlink [18:18:14] No manual entry for db_dump [18:18:15]