[00:02:48] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1933483 (10Reedy) 3NEW [00:03:34] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1933491 (10Legoktm) > Or we just undeploy it all [00:04:16] I have a problem with puppet [00:04:44] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1933492 (10Reedy) Wikitech will be rolled back to 1.27.0-wmf.9 so stuff isn't just broken. We should have a good look around and see what we'll lose. And then, if anything we want to... [00:08:05] The platform cannot connect to MySQL via puppet [00:08:18] See: https://wmve.wmflabs.org/ [00:08:41] I'm using a local mysql [00:12:04] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1933514 (10Reedy) See T123599 [00:22:42] 6Labs, 10Tool-Labs, 6Phabricator: move tool user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933529 (10Dzahn) 3NEW [00:22:51] Anyone have tips on what software one would use for creating temporary mediawiki VMs within a tool for phantomjs? [00:23:39] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1933538 (10Dzahn) suggesting to do T123601 whether we keep using SMW or not [00:24:03] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933540 (10Dzahn) [00:24:34] Abin_Sur: asking in #wikimedia-releng about phabricator will probably produce better results (assuming that's a phabricator install) [00:24:50] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933529 (10Dzahn) [00:25:05] OH-: asking the CI people what they use might be a good idea, although I don't think you can programmatically create VMs anyway, and most definitely not from inside tools [00:25:47] 6Labs, 10wikitech.wikimedia.org: Move tool labs signup to phab - https://phabricator.wikimedia.org/T123603#1933551 (10Reedy) 3NEW [00:26:06] Hmm, OK.
I was thinking of making a bot that automatically installed MW skins and screenshotted them for mw.org and was concerned about security [00:26:13] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933557 (10Dzahn) [00:26:21] 6Labs, 10wikitech.wikimedia.org: Move tool labs signup to phab - https://phabricator.wikimedia.org/T123603#1933551 (10Reedy) [00:26:22] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933559 (10Reedy) [00:26:52] OH-: I think the CI people did something like that [00:27:12] 6Labs, 10wikitech.wikimedia.org: Move tool labs signup to phab - https://phabricator.wikimedia.org/T123603#1933551 (10Reedy) [00:27:14] 6Labs, 10Tool-Labs, 6Phabricator: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933529 (10Reedy) [00:30:39] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933570 (10Reedy) [00:32:48] What if I programmatically created and destroyed vagrant machines? That would require a project instead of a tool, right? [00:33:16] depends on what you mean by vagrant machines. you can do that with docker / lxc, and that would require a project [00:33:18] yes [00:33:27] OH-: I also think someone did something like this not too long ago [00:33:30] for the i18n team I think [00:34:43] Do you remember anyone who operated the bot? don't want to reinvent the wheel :P [00:35:50] OH-: do new skins show up so often that a bot would be useful for this? [00:37:19] eh, I figured it would be a convenient utility. mostly was just thinking of little personal projects to do :P [00:38:03] :) [00:38:32] https://xkcd.com/1319/ [00:39:38] ha. that reminds me, new xkcd \o/ [00:39:40] OH-: you can explore tools.wmflabs.org/paws if you're looking for fun little personal projects :D [00:42:03] OH-: if you are interested in MediaWiki install automation there is the MediaWiki-Vagrant project and the relatively new https://github.com/wikimedia/mediawiki-containers project [00:42:30] bd808: I wonder if someone adding a Dockerfile to mediawiki/core.git would be controversial [00:42:34] and if so, *how* controversial :) [00:44:09] heh. probably not as controversial as adding Composer support was but ... [00:44:32] I don't know if it'll be useful though [00:44:41] to be useful it'll have to bundle in extensions as well [00:44:52] since you need a db, a web server and a php container it's kind of hard to smush into a single Dockerfile [00:45:07] yeah, but you'd run a Dockerfile just for mw [00:45:18] and a docker-compose.yaml for the whole thing [00:45:23] YuviPanda, not results. [00:45:24] I'll take a look, thanks [00:45:25] that's what https://github.com/wikimedia/mediawiki-containers is about [00:46:27] * darkblue_b queues for YuviPanda [00:47:03] Abin_Sur: we can't really help though, once you get your own labs project you are kind of on your own. we simply don't have the manpower for debugging application issues :( sorry! [00:47:44] Abin_Sur: the first thing to verify is that you can use mysql's cli to connect to the db using that user and password [00:47:53] hi YuviPanda - good day .. I was reading the gitter backlog for Jupyter and you are there, so here I am :-) [00:48:11] did you get a Jupyterhub up ? which version and what kernels....
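The multi-container split bd808 and YuviPanda sketch above (one Dockerfile just for MediaWiki, a composition for the rest) can be illustrated with plain docker commands. This is a rough sketch only: the my-mediawiki image name and the credentials are hypothetical placeholders, not anything from mediawiki-containers.

    # Database container, using the stock mariadb image.
    docker run -d --name mw-db \
        -e MYSQL_ROOT_PASSWORD=secret \
        -e MYSQL_DATABASE=wiki \
        mariadb:10

    # Web container: "my-mediawiki" stands in for an image built from a
    # hypothetical Dockerfile bundling Apache, PHP and MediaWiki itself.
    docker run -d --name mw-web --link mw-db:db -p 8080:80 my-mediawiki

A docker-compose.yaml would express the same two services declaratively, which is roughly what the mediawiki-containers project is about.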
[00:48:19] darkblue_b: tools.wmflabs.org/paws [00:48:42] I just need to know how to fix the problem. No intervention is required [00:48:47] aha - nice [00:49:26] how did you install .. build from source or pypi or ... [00:50:07] .. setting aside the LDAP portions [00:51:26] darkblue_b: github.com/yuvipanda/paws [00:51:29] kubernetes [00:52:06] oof [00:52:23] what I am getting at is.. the versions of the ipython and notebook parts [00:53:42] I used pypi just now `pip install ipython; pip install notebook --upgrade` [00:54:02] that gives ipython 4.0.2 and a synched set of jupyter whatever [00:54:15] Maybe I should work on my twitter bot first [00:54:17] so I was wondering if you are happy with that, with your vast experience [00:54:23] :-) [00:54:45] I wanted to make a twitter bot that simulates TCP [00:54:55] so you could go SYN and it would go SYN ACK ACK ACK ACK ACK ACK [00:55:21] OH-: nice [00:55:41] darkblue_b: I wrote my own custom authentication backend (OAuth) and spawner (kubernetes) and it's deployed via the dockerfiles in that repo [00:56:19] hmm - so custom that I wonder if what I am asking applies... [00:56:33] I am wondering what versions to rely on [00:56:55] of ipython and notebook [00:57:29] you should ask the jupyter folks :) [00:57:36] I just use whatever's in pip [00:57:50] yes pypi pip seems good right now [00:58:28] bd808: there's now an eventlogging schema collecting data about people using webservice commands :) [00:58:34] bd808: going to add it to jsub/jstart now [00:58:53] ok - onward then.. you may have seen this.. the list of available kernels to run Notebooks with ... https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages [00:59:07] YuviPanda: oh cool! actual measurements! [00:59:11] bd808: inorite [00:59:21] bd808: I'm wondering what's a nice generic way to 'wrap' some executables [00:59:27] I'll get my -labs stuff in order and log into the PAWS real soon now.. thx ! [00:59:29] bd808: like qsub I wanna measure, but that's coming from a deb repo [00:59:56] darkblue_b: :) I have py3 kernel now and bash kernel, will add R soon (and addshore wants to add PHP) [01:00:16] ugh, I gotta write perl now [01:01:14] bd808: can I just use system() to call out in perl? [01:01:54] YuviPanda: yup -- http://perlmaven.com/running-external-programs-from-perl [01:02:35] hah i'm at the same page [01:03:31] bd808: $commandline = join " ", $0, @ARGV; [01:03:36] to get the full commandline? [01:03:42] sorry am poking you with newb perl shit [01:04:01] 6Labs, 10Labs-Infrastructure, 10Salt: update salt key monitoring scripts for labs to new nova api version - https://phabricator.wikimedia.org/T123607#1933624 (10ArielGlenn) 3NEW a:3ArielGlenn [01:04:10] I think that would work. you know how to write test scripts right? ;) [01:05:38] :D [01:06:11] * bd808 hasn't written anything of substance in perl for ... a really long time [01:06:34] * YuviPanda really likes perl6 [01:14:03] (03PS1) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [01:14:14] bd808: can you take a sanity look?
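The debugging advice to Abin_Sur above boils down to one command: prove the credentials work outside the application before blaming puppet or the platform. A minimal check, with placeholder host, user, password and database names:

    # All values here are placeholders; use the ones from your wiki's config.
    mysql -h localhost -u wikiuser -p'secret' -e 'SELECT 1;' wikidb
    # If this fails, fix the grants/credentials first; if it succeeds,
    # the problem is in the application's configuration, not in MySQL.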
[01:14:21] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [01:14:38] jerkins says hell no :) [01:15:16] (03PS2) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [01:15:32] bd808: yeah rebased [01:16:51] YuviPanda: where does /usr/local/bin/log-command-invocation come from? [01:17:13] bd808: puppet [01:17:19] bd808: I want to move all of these packages into puppet too [01:17:22] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [01:20:02] 6Labs, 10Tool-Labs: Provide resource for db access in grid - https://phabricator.wikimedia.org/T70881#1933699 (10Merl) Today dewiki has had high replag for about 12 hours (>3 hours replag). Many of my sge jobs are currently testing replag and rescheduling themselves (return code 99) for hours now. This is the... [01:22:11] YuviPanda: looks like it should work to me if you can make jerkins happy [01:22:15] ok [01:22:18] thanks :D [01:22:22] also found out that there's replag now [01:22:29] we should have icinga checks for these [01:26:25] YuviPanda: https://tools.wmflabs.org/replag/ [01:26:55] bd808: yeah, that's how I found out [01:26:58] but that isn't alerting tho [01:27:25] *nod* the logic I used for that is pretty simple [01:27:27] yeah [01:27:34] we could make a check out of it I think [01:27:43] yeah [01:27:59] for an alert you would only care about the shards [01:28:35] where do we keep custom icinga checks? In ops/puppet somewhere I assume? [01:32:00] bd808: ya [01:32:44] !log tools stopped erwin85's tools since it was causing replag on labsdb1002 [01:55:00] bd808: hmm, I don't even know what is failing on that patch [01:56:15] !log tools rm service.manifest for wikiviewstats to prevent it from constantly trying to start up and fail webservice [01:59:54] YuviPanda: replag is dropping.
only 56m behind now [02:00:07] yeah [02:00:09] it's basically [02:00:13] 'kill all the queries' [02:00:20] 'look at tendril and stop tools running big queries' [02:08:22] (03PS3) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [02:08:24] bd808: ah, had missed a semicolon [02:08:26] should pass now [02:08:28] hopefully [02:10:07] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:10:53] wtf [02:27:43] (03PS4) 10Yuvipanda: Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) [02:28:23] the answer to the wtf was 'yuvi can not differentiate between perl and bash' [02:29:27] (03CR) 10jenkins-bot: [V: 04-1] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:30:30] well [02:30:33] this works fine on tools [02:30:45] (03CR) 10Yuvipanda: [C: 032] Start logging command invocations to EventLogging [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:33:53] (03CR) 10Yuvipanda: [V: 032] "Works for me when I build it on toollabs" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/264041 (https://phabricator.wikimedia.org/T123444) (owner: 10Yuvipanda) [02:39:19] 6Labs, 10Tool-Labs, 5Patch-For-Review: Instrument jsub/jstart/webservices usage - https://phabricator.wikimedia.org/T123444#1933788 (10yuvipanda) Okay, so now we've stats for jsub, jstart, job, webservice and jstop. Now to add them for qsub and qstat (which are harder, since they're deb package provided bina... [04:18:21] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933809 (10Negative24) Could this possibly use another Phabricator form? (Is there a way to remove the visibility of the form from the global drop down to just be use... [05:33:40] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933843 (10greg) Phabricator forms (editable and non-editable pre-fillable fields, only show relevant fields, something relatively new that upstream implemented, kind... [07:30:32] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1933878 (10Beetstra) @valhallasw - I have added 2 more parsers (total now 12) - the bot is creating a backlog, likely during the American daytime, which it does not munch away at night. [07:58:45] 10Tool-Labs-tools-Erwin's-tools: Kill huge query to avoid killing all erwin85 tools - https://phabricator.wikimedia.org/T123613#1933895 (10Nemo_bis) 3NEW [08:06:50] 10Tool-Labs-tools-Erwin's-tools: Kill huge query to avoid killing all erwin85 tools - https://phabricator.wikimedia.org/T123613#1933910 (10Nemo_bis) 5Open>3Resolved a:3Nemo_bis I guess the tool is https://tools.wmflabs.org/erwin85/contribs.php , currently marked red because we aren't even sure it works. I...
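For the replag checks discussed above, a tool does not have to scrape the replag web tool; lag can be read from the replicas directly. A sketch, assuming the heartbeat_p view exposed on the labs replicas (column names may differ):

    # Run from a tool account; replica.my.cnf holds the tool's credentials.
    mysql --defaults-file="$HOME/replica.my.cnf" -h dewiki.labsdb \
        -e 'SELECT shard, lag FROM heartbeat_p.heartbeat;'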
[09:53:20] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934049 (10WMDE-leszek) That is a bit odd then. Neither I (WMDE-leszek) nor any of my fellow WMDE colleagues (Jakob, WMDE-Fisch) can log in. What we're doing here is ssh login to bastion... [10:22:53] (03PS59) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 [10:29:32] (03CR) 10Ricordisamoa: "PS59 kills the hard-coded getGenericType() in favor of a generalType property on Section classes" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa) [11:04:41] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1934127 (10scfc) [11:04:44] 6Labs, 10Tool-Labs: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#1934128 (10scfc) [11:08:42] 6Labs: beta swift labs instances requirements - https://phabricator.wikimedia.org/T123512#1934132 (10fgiunchedi) thanks @hashar ! I'd like to have some wiggle room just in case anyways I don't seem to be able to add large/xlarge instances to deployment-prep ATM, quotas have been hit perhaps? [11:11:37] godog: hi! you should get the quota via https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=displayquotas&projectname=deployment-prep [11:11:53] godog: also andrew created specific instances for ci purposes with a different set of (cpu,mem,disk) [11:12:26] hashar: oohh thanks a lot! indeed instance limits are hit, 55/55 [11:12:34] :-( [11:12:47] hashar: out of curiosity how did you reach the quota page? [11:12:53] magic? :-} [11:13:05] the link is from the manage project page [11:13:50] i.e. https://wikitech.wikimedia.org/wiki/Special:NovaProject , select your project [11:14:08] on each table, the right-most column has a bunch of action links, one of them is 'Display quotas' [11:15:18] I don't think there is any instance we can delete [11:15:26] hah! thanks :D yeah I was thinking a large instance with 2x or 3x the disk would be enough [11:22:25] godog: and if you guys have plans to migrate Swift to Jessie, maybe beta can start straight with Jessie [11:23:08] yeah not so sure about that now but good point [11:36:31] 6Labs, 7Graphite: graphite.wmflabs.org API is unreliable - https://phabricator.wikimedia.org/T123566#1934141 (10fgiunchedi) graphite.wmflabs.org IIRC is backed by labmon1001 which is the default destination for metrics in labs. anyways I've upgraded labmon to the same graphite version as production (0.9.13) ma... [11:37:13] 6Labs, 7Graphite: graphite.wmflabs.org API is unreliable - https://phabricator.wikimedia.org/T123566#1934142 (10fgiunchedi) graphite.wmflabs.org IIRC is backed by labmon1001 which is the default destination for metrics in labs. anyways I've upgraded labmon to the same graphite version as production (0.9.13) ma... [11:38:18] 6Labs: beta swift labs instances requirements - https://phabricator.wikimedia.org/T123512#1934143 (10fgiunchedi) indeed instance limits (55) have been hit, https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=displayquotas&projectname=deployment-prep can we bump that to +5 ? [11:59:00] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934161 (10scfc) No, you should use the bastion `bastion.wmflabs.org` (and similar); cf.
https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_public_and_private_instances. Also note that... [12:12:18] 6Labs, 5Patch-For-Review: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1934166 (10MoritzMuehlenhoff) 5Open>3Resolved This has been enabled on our openldap servers [12:42:50] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934182 (10WMDE-leszek) Right, that has been a silly mistake of mine (although I am sure accessing the instance the way I described above used to work some weeks ago). Anyway, I tried getting th... [12:50:23] 6Labs, 10Tool-Labs, 10labs-sprint-119, 10Diffusion: Figure out a git hosting solution for tools/kubernetes - https://phabricator.wikimedia.org/T117071#1934186 (10Joe) [12:50:25] 6Labs, 10Tool-Labs: Initial Deployment of Kubernetes to Tool Labs (Tracking) - https://phabricator.wikimedia.org/T111885#1934185 (10Joe) [13:11:52] 6Labs, 7Graphite: graphite.wmflabs.org API is unreliable - https://phabricator.wikimedia.org/T123566#1934203 (10Tgr) 5Open>3Resolved a:3Tgr Seems fixed now, thanks! (Feels slightly faster, too.) [13:14:26] 6Labs, 10Tool-Labs: Install a docker registry to be used by kubernetes - https://phabricator.wikimedia.org/T123628#1934208 (10Joe) 3NEW [14:01:13] Can someone point me in the right direction where I can fetch the transclusion count of User:MiszaBot/config? [14:01:24] Or can someone run a quick DB query for me? [14:01:41] YuviPanda, ^ [14:03:05] legoktm, ^ [14:04:28] Guess not. [14:32:46] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934284 (10valhallasw) Try connecting with `ssh -vv `. This should report which keys are tried and whether they succeed or not. If agent forwarding is used, you should be able to do... [15:24:41] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1934370 (10WMDE-leszek) As suggested I've used agent forwarding and first logged into bastion. Then I am trying to log into `phragile.phragile.eqiad.wmflabs`. Looking at debug messages it trie... [15:25:17] 10PAWS, 6Revision-Scoring-As-A-Service: Install revscoring inside PAWS - https://phabricator.wikimedia.org/T120317#1934376 (10Halfak) yuvipanda, it won't pull in models. It will be up to the user to acquire those as necessary. They are pretty easy to pull from github with a wget. Do you think that is good e... [15:34:49] !log wikilabels deployed 4c643e8 with wikilabels:79b0cad [16:44:26] YuviPanda: re: qsub, we can just add a /usr/local/bin/qsub that calls the dpkg'ed qsub I think [16:49:04] valhallasw`cloud: could we use acct to get a historical picture, i.e. the last time certain tools were run etc?
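The /usr/local/bin/qsub idea YuviPanda floats above would look roughly like this: a shim earlier in PATH that records the invocation and then hands off to the packaged binary. A sketch only — the logging helper's interface is assumed here, and (as noted later in the log) scripts that hardcode /usr/bin/qsub bypass the shim entirely.

    #!/bin/bash
    # /usr/local/bin/qsub: log the call, then exec the dpkg-provided qsub.
    # log-command-invocation is assumed to accept the command name plus args;
    # never let logging failures break actual job submission.
    /usr/local/bin/log-command-invocation qsub "$@" || true
    exec /usr/bin/qsub "$@"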
[16:49:07] I have a few tools I'm curious about [16:49:12] mainly :) [16:49:15] chasemp: more or less [16:49:36] chasemp: you can search by job name, but not by command [16:49:45] and by user [16:53:01] 10PAWS, 6Revision-Scoring-As-A-Service: Install revscoring inside PAWS - https://phabricator.wikimedia.org/T120317#1934554 (10yuvipanda) I definitely think we should find an easy way to include them [16:53:30] valhallasw`cloud: when something shows up for continuous queue [16:53:31] like [16:53:31] continuous:tools-exec-01.eqiad.wmflabs:tools.whymbot:tools.whymbot:jawikiclaim3.sh:6702940:sge:10:1419304075:1419304077:1419335945:0:0:31868:76.724795:7.696481:44892.000000:0:0:0:0:37401:9:0:14960.000000:11440:0:0:0:115337:7558:NONE:defaultdepartment:NONE:1:0:84.550000:12.647128:0.055227:-u tools.whymbot -q continuous -l h_vmem=256M,release=precise:0.000000:NONE:206929920.000000:0:0 [16:53:36] is that...the last restart? [16:53:52] trying to understand how continuous jobs are viewed in the accounting file I guess [16:55:01] chasemp: accounting entries are written when the job finishes. The status will tell you why (for continuous jobs, this is typically a restart, which has error code 19 [16:55:10] chasemp: see /home/valhallasw/accountingtools/accounting.py [16:55:15] sorry, 25 is restart [16:57:36] chasemp: this one has status 0 'success', which means that the job finished by itself, I think [16:57:50] or it might have been qdel'ed [16:57:51] accounting.py is meant to be used as a lib not a cli tool yah? [16:57:56] ok [16:58:23] chasemp: right. "for entry in accounting.parse(open('filteredaccount')):" would be the typical usage [16:58:34] does a finite lifetime job that runs and completes in continuous just keep going in theory with restart after restart? [16:58:54] also thanks valhallasw`cloud :) [16:59:06] I think the rule for continuous jobs is 'restart on failure, stop on graceful exit', but let me check. It's defined in `jstart` somewhere [16:59:40] "while ! " . shell_quote($prog, @ARGV) . "; do\n" . " sleep 5\n" . "done\n"; [17:00:12] so the other way around -- if the program exits with an error, stop the job, otherwise restart [17:00:39] huh [17:01:06] I'm trying to think of a sane job that follows that mechanism and not like an internal sleep [17:01:39] chasemp: I think something like a bot that runs over all pages on a wiki [17:01:51] so the bot runtime is much longer than the restart time [17:02:05] I guess it makes sense as it would get rescheduled per resourcing on restart [17:02:17] maybe [17:02:22] although then on restart it would start at A again, so it doesn't really make sense [17:02:31] ok, I'm confused [17:02:34] the docs say 'Continuous jobs are not restarted if they end normally (with the exit status 0)' [17:02:35] ha me too [17:04:20] a long running job that keeps state and is restarted on failure only to run to some defined end and never run again [17:04:27] basically a supervised task until exit 0 [17:04:28] I guess [17:04:39] yes, that makes sense [17:04:50] I know why I'm confused -- bash uses 0 for true and 1 for false [17:04:50] which isn't what I think of when I think of a continuous job [17:05:09] no, I think it makes sense. Think of an IRC bot [17:05:24] if it crashes (disconnect not handled), it restarts [17:05:27] i.e.
it never finishes with exit 0 so it always gets restarted [17:05:32] but if you say !quit in irc it actually exits [17:05:36] but if you had like irc commands and chose to shut it down [17:05:37] yeah [17:05:48] we were going to the same place at the same time there :) [17:05:54] ok I'm with it [17:06:58] valhallasw`cloud: YuviPand.a was telling me you have other grid engine deployments on your radar, do they use Berkeley DB? just curious [17:07:20] I'm reading up on classic spooling and a lot of the old school reasoning seems to say unless you are doing 100's of jobs a second submission wise [17:07:33] etc don't bother but then again it's the default for like the ubuntu package [17:07:51] the alternative is flat files or something? [17:07:57] chasemp: lemme think. At my labs cluster, they decided SGE was too difficult for users and it's now a free-for-all ssh-in-and-do-what-you-want [17:08:19] mark: yeah there is a flat file "classic" spool scheme (ala mail etc) [17:08:27] right [17:08:30] would that live on nfs too? [17:08:39] it wouldn't have to [17:08:57] but the on nfs / not on nfs question is mostly separate from the mechanism itself [17:09:16] from what I gather the old school spooling method is easier to debug and troubleshoot (all things being equal) [17:09:23] yes, and probably more NFS-safe [17:09:28] is it easy to convert? [17:09:49] that I'm not sure of, I can't find a conversion guide or docs [17:09:57] other than someone said on a mailing list "reinstall" [17:10:08] but I'm guessing that's an allusion to wiping out the queue or not carrying over jobs [17:10:27] chasemp, yes, one of surfsara's clusters uses SGE [17:11:09] ...except not, it's just another cluster software that uses qsub >_< [17:11:17] :) [17:11:25] qsub being that loveable I find hard to believe [17:13:30] they run http://www.adaptivecomputing.com/products/open-source/torque/ + http://www.adaptivecomputing.com/products/open-source/maui [17:15:07] ah, I meant the /name/ qsub, not the exact same tool [17:15:58] oh ha [17:22:14] valhallasw`cloud: I emailed you about stuff, btw :) [17:29:55] chasemp: so most recent clusters seem to have converged on TORQUE, but there's one that still uses SGE, but very differently from how we are using it (max runtime 15 mins) [17:30:09] interesting [17:30:17] small odd jobs I guess [17:30:48] right, but on a high-power CPU/GPU cluster [17:40:59] btw [17:41:01] https://phabricator.wikimedia.org/P2474 [17:41:03] valhallasw`cloud: chasemp ^ [17:41:07] preliminary data [17:41:14] from command invocation stats [17:41:27] most of the usage is cron [17:41:30] that's a lot of webservice restarts by webservicechecker [17:41:44] yah [17:41:50] and then -services is bigbrother, right? [17:41:51] need to find the ones churning and kill them I guess [17:41:57] bigbrother is also submit [17:41:58] I think [17:42:18] wait [17:42:19] I didn't know jstop was a thing [17:42:21] services is both [17:42:23] webservicemonitor [17:42:27] and bigbrother [17:42:29] then what's on checker? [17:42:30] checker is just catchpoint [17:42:32] ohh [17:42:37] of course :D [17:42:41] there's a thing that submits a webservice and checks to see if it succeeds [17:42:43] over how much time is this? [17:42:54] you should divide by #seconds ;-) [17:43:13] 20160114004524 [17:43:17] is first event [17:43:36] max - min timestamp [17:43:36] so that's roughly 17 hours [17:43:38] is [17:43:40] 169787 [17:43:41] that is more jsubs than I would have thought [17:43:46] we think it's mostly the cron stuff?
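The accounting records being decoded above are plain colon-separated lines, so they can be inspected without accounting.py. A sketch keyed to the field layout in accounting(5) and the sample record earlier: field 5 is the job name, 11 the end time (epoch seconds), 12 the "failed" code and 13 the exit status; the path is the usual SGE default and may differ on tools.

    # Job history for one job name from the SGE accounting file.
    awk -F: '$5 == "jawikiclaim3.sh" { print $11, "failed=" $12, "exit=" $13 }' \
        /var/lib/gridengine/default/common/accounting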
[17:44:01] we know it's mostly cron [17:44:05] all of tools-submit is cron [17:44:13] right, gotcha [17:44:17] yeah, it's 50k in 17h for cron = slightly less than 1 per second [17:44:28] probably because some people are doing something every minute or every few minutes [17:44:39] yeah [17:44:59] I think our guideline says something like more often than every 5m is frowned upon but yes :) [17:45:09] really? [17:45:21] I didn't know that [17:45:24] YuviPanda: and catchpoint checks every minute? [17:45:30] valhallasw`cloud: 5min I think? [17:45:40] but the check might be doing two webservice calls and not one [17:45:48] "Scheduling a command more often than every five minutes (for example * * * * * command) is highly discouraged," [17:45:51] because 1000/17 is about 60 [17:45:56] chasemp: aah, that makes sense [17:46:05] yeah, that's what I thought, but then the numbers are a factor 2.5 off [17:46:42] ok, I can confirm it's doing two calls [17:46:44] so tools checker itself is a submit host and that canary check does the whole thing of submitting a webservice and looking for it [17:47:19] I really need to like, look at tools-checker sometime :) [17:47:28] YuviPanda: also, I think many of them are -once invocations that don't actually start a job [17:47:37] valhallasw`cloud: aaah [17:47:45] valhallasw`cloud: so toolschecker counts also include jsub [17:47:48] and job [17:47:49] and stuff like that [17:47:51] because with the outage we had a few hundred jobs queued after an hour [17:47:54] not just webservice [17:48:12] which suggests ~10 per minute rather than 60 [17:49:01] 244 webservice calls [17:49:04] can someone help me understand the -once not starting a job thing [17:49:04] 715 jsubs [17:49:11] I thought -once did start a job just not a continuous job [17:49:15] from checker [17:49:28] chasemp: -once will check if job is running and if it is not start it [17:49:28] chasemp: no, -once checks if there is a running job, and doesn't start it if there is [17:49:30] job with same name [17:49:30] ^ [17:49:42] ohh [17:49:56] I see so this could be longer running things that don't finish within $cron window [17:50:03] and make the invocation but are essentially a no-op [17:50:08] yeah [17:50:33] 21321 [17:50:35] of the invocations [17:50:37] of jsub [17:50:39] have -once [17:50:41] so that's like half [17:50:54] huh, any thoughts on how to track the no-ops? [17:50:55] we can add more instrumentation to jsub itself to see how many times it actually submits vs bails [17:51:01] heh was just saying it [17:51:02] that would be cool [17:51:06] :) [17:51:13] right [17:51:22] so we should collect all the ideas for instrumentation [17:51:26] and then set it up [17:51:53] so jsub no-op or not [17:51:56] what else? [17:52:05] so as an aside, a continuous not-once job will pile on in parallel with the same job name [17:52:17] yeah [17:52:25] hmm if they have the same name [17:52:27] is the -once mechanism a jsub thing or an sge thing [17:52:30] I wonder if gridengine will allow that [17:52:44] I think it's a jsub thing but to be fully sure I need to read the jsub perl [17:52:51] in theory unique ids make it possible but it would be sorta nuts [17:53:24] I want to say I've submitted jobs with the same name directly with jsub that piled on but will have to dig a bit to see how that relates to jsub [17:53:26] yeah :) [17:53:56] too early in the day to read perl [17:53:59] yup!
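What jsub's -once flag does, per the discussion above, amounts to a guard around qsub. A rough bash rendering of the logic — the real check lives in jsub's perl, and the qstat-by-name lookup here is an assumption about one way to implement it, not a copy of jsub's code:

    JOBNAME=mybot
    # Bail out as a no-op if a job with this name is already on the grid.
    if qstat -j "$JOBNAME" >/dev/null 2>&1; then
        echo "job $JOBNAME already running, not submitting" >&2
        exit 0
    fi
    qsub -N "$JOBNAME" -o "$HOME/$JOBNAME.out" -e "$HOME/$JOBNAME.err" mybot.sh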
[17:54:02] some brave soul has to rewrite jsub into python [17:54:07] I hope it doesn't have to be me [17:54:41] I'm not ready to commit to it yet but I think I have a general feel for the work [17:55:39] chasemp: so I'm looking at SGE spooling in a bit more detail. BDB is explicitly not safe for multi-user access, and one should use BDB RPC (where there's a BDB server running that's communicated with) instead. So it's indeed likely the two-masters-running situation was the cause of the corruption in December [17:55:57] well that's interesting [17:56:02] are you sure that's not just pre-nfsv4 [17:56:15] I saw a lot of chatter from nfsv2 and v3 that indicated that [17:56:32] but I /thought/ that nfsv4 was supposed to be the new light and way for local bdb [17:56:39] but there is a ton of noise on this so idk [17:56:52] from what I understand, you can mount bdb as 'single user' or as 'multi user', and SGE does the first -- but not 100% sure on that [17:57:15] interesting, let me know if you see any converting from bdb to classic stuff [17:57:22] it's all very sparse other than "reinstall thanks" [17:57:33] https://arc.liv.ac.uk/SGE/howto/backup.html [17:57:47] ^ this mentions inst_sge -upd for that [17:58:56] (still reading) [18:00:03] this is interesting though, I haven't seen this page [18:00:58] that's the son of grid engine page [18:01:32] ah [18:02:07] that utility doesn't seem to exist in the debian packages [18:02:15] but ...I doubt they changed the bdb format [18:02:20] I wonder if I could steal the good stuff [18:03:48] the best grid engine guide I see references microsoft services for unix [18:04:56] .. just to clarify .. I have a password based login at wikitech, but to use -labs I need to file ssh keys in the OpenStack tab of my prefs.. [18:05:14] .. but all of that has nothing to do with a login on wikipedia ? true ? [18:05:19] dbb: correct [18:14:47] chasemp: from what I can see, SGE's idea of 'upgrading' from one spooling system to another is 'delete spooling data and start over' [18:15:39] I'm trying to get a clone of the git repo pushed to github, but it's so slow :-p [18:15:45] I think if you populate /etc/gridengine/bootstrap with the right params and start up [18:15:48] it may start up "fresh" [18:15:54] but what those options are I'm not sure of yet [18:16:13] I'm also wavering on the idea of changing too many things at once but would like to poke at both methods for sure [18:18:11] it's not winning me over that the db-util package for bdb is not great [18:18:14] man: warning: /usr/share/man/man1/db_dump.1.gz is a dangling symlink [18:18:14] No manual entry for db_dump [18:18:15] See 'man 7 undocumented' for help when manual pages are not available.
[18:18:22] lrwxrwxrwx 1 root root 15 Nov 28 2013 /usr/share/man/man1/db_dump.1.gz -> db5.3_dump.1.gz [18:18:24] thanks guys [18:30:00] chasemp: https://github.com/valhallasw/son-of-gridengine [18:30:29] somewhat easier to browse :) [18:31:08] nice thank you [18:31:11] https://github.com/valhallasw/son-of-gridengine/blob/master/source/scripts/test_spooling_performance.sh [18:31:12] cool [18:44:48] hey chasemp [18:45:07] so I'm trying to figure out the mechanism that tells jobs to log to /path/to/file for error and stdout [18:45:22] so I think that's the -o and -e params for jsub [18:45:26] are just passed to qsub [18:45:31] which redirects stdout / stderr [18:46:09] hm when I try that directly w/ qsub it didn't work (which is why I started doubting myself) [18:46:19] that was what I gleaned from jsub as well [18:47:29] are you sure? [18:47:33] webservice does the same thing too [18:47:39] it takes a bit of time if the output file is on NFS [18:48:25] hm [18:49:15] valhallasw`cloud: and re: instrumenting qsub, just /usr/local/bin won't work because people often hardcode full paths into their scripts [18:50:08] it doesn't seem to work but the why of it I'm not sure [18:50:56] what's the command you're running? [18:53:30] something like qsub /data/project/cbench/cbench.py -o /data/project/cbench/out [18:53:38] I think maybe the jsub creates / touches the file first here [18:54:41] there is another possibility which is I'm dumb and my test has no native stdout :) [18:55:01] chasemp: ah [18:55:07] chasemp: needs to be qsub -o then file at end [18:55:08] let me see here [18:55:10] chasemp: otherwise [18:55:14] chasemp: it gets passed to your script [18:55:17] and ignored [18:55:28] ah [18:55:32] well then [18:57:23] well I got into different error territory now so progress :) [19:01:46] YuviPanda: orite [19:02:04] chasemp: qsub writes to ~/.o and .e [19:02:14] or -.o and .e, I think [19:03:35] chasemp: also, re: first NFS or first bdb->classic spooling: I think that depends on what you think is the origin of the corruption.
If it's a continuous process, killing NFS first would be most effective; if it's leftover from earlier corruption, I think moving to classic spooling is more effective [19:03:45] (03PS12) 10Ricordisamoa: Initial commit [labs/tools/faces] - 10https://gerrit.wikimedia.org/r/192096 [19:04:15] valhallasw`cloud: right yeah agreed [19:04:16] I'm not sure which one is the case -- the fact it started dec 30 and has been an issue ever since suggests it's a leftover, but aiui the entire database was removed and then the master restarted around that time [19:04:18] (03PS13) 10Ricordisamoa: Initial commit [labs/tools/faces] - 10https://gerrit.wikimedia.org/r/192096 [19:04:31] but I don't know the details of how that outage was solved in the end [19:04:41] yeah, I don't know if the entire db was removed [19:04:44] or was just purged in some form [19:05:16] chasemp: fwiw, if I read the docs correctly, there's a way to dump and re-load bdb files [19:05:29] so one thing we might actually do is stop master, do that, restart master and see how things are [19:05:43] if that solves the issue we can then even choose to let it be and focus on k8s [19:06:55] (03CR) 10Ricordisamoa: "PS12 adds and fixes JSHint" [labs/tools/faces] - 10https://gerrit.wikimedia.org/r/192096 (owner: 10Ricordisamoa) [19:07:07] (03CR) 10Ricordisamoa: "PS13 adds package.json" [labs/tools/faces] - 10https://gerrit.wikimedia.org/r/192096 (owner: 10Ricordisamoa) [19:11:22] (03PS14) 10Ricordisamoa: Initial commit [labs/tools/faces] - 10https://gerrit.wikimedia.org/r/192096 [19:12:38] (03CR) 10Ricordisamoa: "PS14 adds and fixes JSCS" [labs/tools/faces] - 10https://gerrit.wikimedia.org/r/192096 (owner: 10Ricordisamoa) [19:12:52] valhallasw`cloud: I like your ideas :D [19:18:59] YuviPanda: https://github.com/gawbul/docker-sge :P [19:19:44] no :P [19:19:58] valhallasw`cloud: can you add a (different from your normal) key to labs/private root? [19:21:54] YuviPanda: uuuh, probably [19:21:58] if I can figure out where in the repo [19:22:17] ah, https://github.com/wikimedia/labs-private/blob/master/modules/passwords/templates/root-authorized-keys.erb [19:22:51] yup [19:22:57] valhallasw`cloud: I'll email scfc asking him too [19:24:23] YuviPanda: is there a bug I should refer to? [19:24:32] valhallasw`cloud: no but let me file one now [19:24:47] (03CR) 10ArthurPSmith: [C: 031] "I was trying to figure out what the problem was, but I get it now. Yes, this change makes sense. I wonder if there's a better way to find " [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/263846 (owner: 10Ricordisamoa) [19:25:53] 6Labs: Add valhallasw and scfc to labs roots - https://phabricator.wikimedia.org/T123655#1935063 (10yuvipanda) 3NEW [19:25:58] valhallasw`cloud: that one [19:26:41] (03PS1) 10Merlijn van Deen: passwords: add root key for valhallasw [labs/private] - 10https://gerrit.wikimedia.org/r/264110 (https://phabricator.wikimedia.org/T123655) [19:28:14] valhallasw`cloud: :D I'll talk to the other folks and get this done hopefully this week and worst case next [19:29:54] chasemp: andrewbogott so LDAP is freaking out on ores-worker-01 [19:30:05] ‘freaking out’? [19:30:06] > sudo: unknown uid 2029: who are you? [19:30:53] I can get in as root [19:31:27] might not be ldap, other ldap queries seem to work [19:32:02] hmm [19:32:04] PAM? [19:32:23] huh so ldap seems ok there for me directly [19:32:31] you guys figured this out already tho :) [19:32:35] what did we?
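The qsub pitfall chasemp hit above is worth spelling out: everything after the script path is treated as the script's own arguments, so trailing qsub options are silently swallowed.

    # Wrong: -o ends up as an argument to cbench.py, not to qsub.
    qsub /data/project/cbench/cbench.py -o /data/project/cbench/out
    # Right: qsub options first, script (and its arguments) last.
    qsub -o /data/project/cbench/out -e /data/project/cbench/err \
        /data/project/cbench/cbench.py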
[19:32:37] ah [19:32:40] 'this' [19:32:42] ok [19:32:45] it had broken puppet for like a week [19:32:48] I fixed / ran it [19:32:50] and boom [19:33:05] it's possible some one-offy pam fixes via salt never hit it? [19:33:12] md5sum the /etc/pam.d ? [19:33:32] I don't know other than sounds like pam if ldap works directly (or maybe sudo-ldap) [19:33:47] moritzm: hey :) [19:34:07] +++ /tmp/puppet-file20160114-10190-1nlnomr 2016-01-14 19:21:03.481304734 +0000 [19:34:10] @@ -17,5 +17,5 @@ [19:34:12] rpc: db files [19:34:14] netgroup: ldap [19:34:16] +sudoers: files ldap [19:34:18] automount: files ldap [19:34:20] -sudoers: files ldap [19:34:22] I wonder if that's related [19:34:24] looks like an ordering + whitespace change? [19:34:49] that looks like an nsswitch.conf file? [19:35:01] yes [19:35:05] that's the diff I see in the puppet run [19:35:22] lol [19:35:25] sudo works fine now?! [19:35:26] wat [19:35:36] I restarted nslcd and nscd [19:35:45] just a while ago [19:35:48] it was [19:35:50] sudo: unknown uid 2029: who are you? [19:36:00] andrewbogott: hmm, I restarted nscd but not nslcd [19:36:02] can you log in w/out root key now too? [19:36:13] I can’t but maybe I’m not in the project [19:36:31] yup [19:36:34] I can [19:36:36] so it was just nslcd restart? [19:36:41] I guess [19:36:41] we should probably subscribe nsswitch to it [19:36:50] I thought it was [19:37:08] but yeah, should if it isn't [19:37:12] andrewbogott: it is [19:37:17] andrewbogott: puppet restarted them too [19:37:17] YuviPanda: can you give me a brief summary or ticket number, so what's the actual problem here? [19:37:24] welp [19:37:39] moritzm: sudo wasn't working on an instance, and now it is and we don't know why [19:37:45] (03CR) 10Ricordisamoa: "https://www.wikidata.org/wiki/Q6643508 appears to be the only instance of https://www.wikidata.org/wiki/Q11344 without a chemical symbol: " [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/263846 (owner: 10Ricordisamoa) [19:37:46] moritzm: any sudo would result in [19:37:59] sudo: unknown uid 2029: who are you? [19:38:07] and now it mysteriously works fine again [19:38:18] and 2029 is your actual uid? [19:39:18] the ordering in the change above is a nop, the only thing that matters is the order inside a service section, not the order of service sections itself [19:39:19] moritzm: yeah [19:39:24] moritzm: right. [19:39:26] (from nsswitch.conf I mean) [19:39:31] moritzm: halfak also experienced the same issue [19:39:40] and couldn't ssh in either I think after logging out [19:40:07] but andrewbogott restarted nslcd again and it seems to have fixed it, despite puppet claiming to have restarted it already after the change [19:40:46] can I ask, why did that file change at all? [19:40:59] no idea [19:41:06] that's very odd [19:41:36] hmm [19:41:41] last change to nsswitch.conf [19:41:43] is [19:41:47] from June 2013 [19:41:49] lol? [19:42:20] that's my thought yeah, wtf [19:42:27] something changed that file outside of puppet it seems [19:42:32] and puppet was correcting it [19:42:39] but the restart didn't happen or failed or ? [19:42:42] the deb package? [19:42:47] no puppet reported successful restart [19:43:34] who is running https://tools.wmflabs.org/cdnjs/ ? Is it an official WMF setup or somebody's private tool? [19:43:47] SMalyshev: that's me [19:43:53] SMalyshev: it's an official part of toollabs yes [19:43:56] why? [19:43:58] YuviPanda: great!
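The triage sequence that emerges from this nsswitch/nscd exchange (including the checks moritzm suggests just below) fits in three commands — shown here with upstart-style service names, on the assumption the instance was running Ubuntu at the time:

    getent passwd yuvipanda    # does NSS/LDAP resolve the user at all?
    nscd -i passwd             # invalidate nscd's cached passwd entries
    service nslcd restart && service nscd restart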
[19:44:04] where did this happen anyway, was that a specific labs instance or several? [19:44:14] moritzm: just one. ores-worker-01 [19:44:27] YuviPanda: I wonder if it's ok to use it for production stuff and if not, do we have something like that for production? [19:44:37] (not mediawiki) [19:44:54] SMalyshev: we don't have anything like that for production tho. and I think using it for prod stuff is in general 'not-OK' [19:45:24] so for production we would be just copying everything? [19:45:39] basically yeah [19:45:56] /etc/nsswitch.conf is created by base-files, but possibly modified by other packages upon updates/installation (on my laptop e.g. libnss-mdns), maybe some installed/updated packages? [19:46:11] hmm [19:46:18] not sure why that happened to this particular instance [19:46:20] that's a pity... it'd be nice to have some good repo instead of keeping copies of stuff around [19:46:36] what's the fqdn/project of that host, I can dig around [19:46:40] YuviPanda: another thing - the "latest version" entries there are largely out of date. [19:47:00] moritzm: thanks! it's ores-worker-01.ores.eqiad.wmflabs [19:47:09] SMalyshev: let me check if the cron that does it is doing ok [19:47:24] SMalyshev: yeah, I agree [19:47:44] SMalyshev: it's a massive git repo [19:47:47] like, 14G [19:47:49] i.e. bootstrap has at least 3.3.6 but latest is still 3.3.4 and for others the delta is even more [19:48:36] codemirror is 5.2.0 vs 5.10.0 [19:48:54] I'm running a manual update now [19:48:58] and then I'll figure out wtf happened to the cron [19:50:51] !log ores add moritzm to project as projectadmin [19:51:42] YuviPanda: thanks! [19:57:49] SMalyshev: updated now [19:57:57] cool, thanks [20:03:37] in addition to the nsswitch.conf it also reverted changes to grub.conf (cgroup_enable=memory and swapaccount=1), google suggests these are used by docker or lxc [20:04:13] moritzm: yeah, those are recent changes [20:04:13] but I think the nsswitch change is a red herring and the 2029 uid thing was a broken nscd cache entry, [20:04:15] grub.conf [20:04:22] moritzm: so I ran 'sudo puppet agent -tv' [20:04:25] and that ran puppet [20:04:27] and then i can't sudo [20:05:36] the next time this happens try running "nscd -i passwd" to see whether this fixes the problem (then we have at least narrowed it down) [20:05:58] ok [20:07:29] also, maybe try a manual "getent passwd yuvipanda", this does a standard name resolving as an application would do it (to rule out that it's unrelated to sudo in particular) [20:07:40] ah [20:07:42] ok [20:07:44] (03CR) 10ArthurPSmith: "I'd say that's wrong - it's an atom, not an element. And it has correct subclass statements. I'm going to remove the "instance of" in that" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/263846 (owner: 10Ricordisamoa) [20:07:56] so getent, then nscd -i passwd, and then I'll try restarting them [20:08:03] moritzm: what's the difference between nscd vs nslcd? [20:08:25] the 'l' in the latter is confusing [20:10:13] moritzm: thanks for looking into it! [20:10:28] nscd is the nameservice cache by glibc, with our config a resolved user or group is cached for 1 hour before it's queried from ldap again [20:11:26] 6Labs, 5Patch-For-Review: Add valhallasw and scfc to labs roots - https://phabricator.wikimedia.org/T123655#1935171 (10Andrew) No objection from me! [20:12:06] it's an often unreliable piece of code [20:13:11] moritzm: the cert/key in the ldap role are for replication? Or for client access?
(I haven’t provided a cert or key and am surprised to note that puppet doesn’t complain about missing files.) [20:13:47] Oh, actually... [20:13:56] moritzm: ah, I see. ok [20:13:59] moritzm: what's nslcd for? [20:14:24] moritzm: ok, presuming it’s for client access... Is there any security concern with my using the same cert/key in labtest? It will make managing VMs in the test cluster a bit easier. [20:15:47] andrewbogott: no, the replication is unrelated to these. what do you mean with "the same cert/key"? multiple instances of labtest with the same key? [20:18:46] 6Labs: Give madhuvishy and milimetric root on limn and wikimetrics instances - https://phabricator.wikimedia.org/T120900#1935197 (10yuvipanda) a:5yuvipanda>3madhuvishy [20:21:25] moritzm: I mean the same cert/key as is used for the production ldap servers [20:21:42] (‘labtest’ does not run in labs, it’s a test cluster on production hardware. Sorry, that probably wasn’t obvious.) [20:25:20] that should be fine for a test cluster, feel free to add me as reviewer for the final class [20:26:25] 6Labs: Give madhuvishy and milimetric root on limn and wikimetrics instances - https://phabricator.wikimedia.org/T120900#1935208 (10yuvipanda) p:5Triage>3Normal [20:26:57] 6Labs: Give madhuvishy and milimetric root on limn and wikimetrics instances - https://phabricator.wikimedia.org/T120900#1935209 (10yuvipanda) 5Open>3Resolved This was done, and an entry was manually made for limn1 which had broken puppet. [20:37:32] moritzm: it was incremental work, but the class as it stands now is puppet/modules/role/manifests/openldap/labtest.pp [20:37:34] no real surprises there [20:55:51] moritzm: where is that 1h caching set out of curiosity? [20:57:47] andrewbogott: chasemp soooooo [20:58:00] a commit removing a submodule in ops/puppet is being merged [20:58:02] except [20:58:09] this might break all self hosted puppetmasters [20:58:16] since git doesn't deal with removing submodules very well [20:58:42] we need to run 'rm -rf /var/lib/git/operations/puppet/modules/wikimetrics/*' on all self hosted puppetmasters [20:58:45] can we change the cron to do a submodule update after the rebase? [20:58:56] does ‘submodule update’ not do that? [20:59:19] no [20:59:44] cron already does a submodule update [20:59:45] maaaaaan [21:00:15] I guess we can salt it in as many places as salt can reach [21:01:38] andrewbogott: I guess so, yeah [21:03:37] submodules, meh [21:03:45] on the plus side [21:03:48] we're removing one [21:03:51] and this is for the wikimetrics module [21:03:58] where the old one broke every single time [21:04:00] we changed something [21:04:11] since it was a heavily customized non-auto-updated self-hosted puppetmaster [21:04:18] that ran a service at least some people cared about [21:05:22] chasemp: do you think we can use clustershell for this? [21:05:43] chasemp: we can actually just run it on all instances.
non-self-hosted puppetmasters won't have /var/lib/git/operations/puppet/modules/wikimetrics and hence won't fail [21:06:07] I had a talk w/ mark today I need to email about using it, but from home sure easy but would need a hostfile list of the nodes [21:06:12] so it's lame atm [21:06:34] chasemp: right, so we can just generate a list of all labs instances (I've a script to do it) and then target with that [21:07:01] sure that's pretty achievable, I did 3 in parallel previously and it took like 12 minutes or so to run some basic commands [21:07:36] chasemp: ok [21:07:59] chasemp: if you run https://github.com/yuvipanda/personal-wiki/blob/master/project-dsh-generator.py with param 'all-instances' [21:08:05] it'll generate a list of all instances for you [21:08:08] and then you have to run [21:08:12] time clush -f 3 -l root -o '-i ~/.ssh/labs_id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' --hostfile=tools.txt nfsstat -rc [21:08:17] rm -r /var/lib/git/operations/puppet/modules/wikimetrics [21:08:20] no -f so we see the errors [21:08:24] so I just installed clustershell with python setup.py locally [21:08:28] (you want the new version) [21:08:29] chasemp: want me to run it or do you want to? [21:08:31] and away I went [21:09:42] I'm not sure what exactly you are doing if you can clone https://github.com/cea-hpc/clustershell and python setup.py install [21:09:46] I think you would be there quick [21:09:51] if you want to give it a whirl [21:10:01] that would be cool [21:10:30] chasemp: sure [21:11:00] there is lots of magic to be had but from a static file the above command is pretty sane [21:11:16] you could ratchet up the -f 3 (in parallel) if you have better latency [21:11:45] and also substitute for the key you want depending on how your local things are arranged [21:14:43] (03CR) 10Rush: [C: 031] "I'm all for it" [labs/private] - 10https://gerrit.wikimedia.org/r/264110 (https://phabricator.wikimedia.org/T123655) (owner: 10Merlijn van Deen) [21:18:30] (03CR) 10ArthurPSmith: [C: 04-1] "I think I have a number of things which would be broken in python3 - I'll work on fixing them!"
[labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [21:37:17] chasemp: it just completed [21:37:19] I'm going to run it now [21:37:38] kk [21:42:43] chasemp: running now [21:43:01] (ran a 'hostname' check to test first) [21:46:37] I had to fiddle a bit w/ concurrency from home w/ latency [21:47:16] I've just set it to 5 [21:47:25] and then made tea [21:47:53] :) [21:55:15] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1935486 (10Jdforrester-WMF) [21:55:18] 6Labs, 10Tool-Labs: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#1935487 (10Jdforrester-WMF) [21:55:20] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1935485 (10Jdforrester-WMF) [21:55:57] 6Labs, 10Tool-Labs, 6Phabricator, 6Project-Creators: move tool labs user requests to phabricator - https://phabricator.wikimedia.org/T123601#1933529 (10Jdforrester-WMF) [21:56:00] 6Labs, 10Tool-Labs: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#751849 (10Jdforrester-WMF) [21:56:02] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1935491 (10Jdforrester-WMF) [21:59:22] 6Labs, 10Tool-Labs: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#1935506 (10Jdforrester-WMF) >>! In T72625#785419, @yuvipanda wrote: > Going to set this as declined. It'll move to Horizon when we move to Horizon from wikitech. Given T123601#1933843 is this sti... [22:00:10] 6Labs, 10wikitech.wikimedia.org: Decide on future of Semantic extensions on Wikitech - https://phabricator.wikimedia.org/T123599#1935510 (10yuvipanda) [22:00:12] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10MediaWiki-extensions-SemanticForms, 5Patch-For-Review: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request down - https://phabricator.wikimedia.org/T123583#1935511 (10yuvipanda) [22:00:14] 6Labs, 10Tool-Labs: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#1935508 (10yuvipanda) 5declined>3Open Given that horizon is a bit far away on the... horizon, and phabricator forms now exist, reopening. [22:01:49] chasemp: still running tho. does it exit with proper timeouts or do I just ctrl-c at some point?
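Stitched together, the fan-out used in this run looks like the following; the host list generator and clush flags are the ones quoted above, and the rm is harmless on hosts that are not self-hosted puppetmasters, since the path simply won't exist there.

    python project-dsh-generator.py all-instances > all.txt
    clush -f 5 -l root \
        -o '-i ~/.ssh/labs_id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' \
        --hostfile=all.txt \
        'rm -rf /var/lib/git/operations/puppet/modules/wikimetrics'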
[22:02:31] it will time out with hosts [22:02:41] ok [22:02:51] errors go to stderr and then (iirc) it globs a list of commands that failed at the end [22:02:58] and the list of hosts that are unresponsive is given in real time [22:03:39] I teed it into a file [22:03:50] you can also set the timeouts for connection etc [22:04:07] it's basically a fancy parallel driver around the ssh client (which is the good part) [22:04:18] and the output can be collated in several ways [22:04:33] * YuviPanda nods [22:04:39] somewhat similar to pssh I guess [22:04:45] 6Labs, 10Tool-Labs, 6Project-Creators: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#1935521 (10Aklapper) a:5coren>3None [22:04:59] it's been stuck on one host for a while now [22:06:37] they have a nice "compare me to pssh" on their site actually [22:07:06] I wonder if the tee is messing it up [22:08:13] I ctrl-c'd it [22:08:18] chasemp: it was stuck on tools-redis-01 [22:08:22] which is a 'stuck' instance [22:09:50] command_timeout and connect_timeout exist [22:09:53] I wonder if effective [22:10:11] I'm about to do a similar op :) [22:10:16] 6Labs, 10Tool-Labs, 6Project-Creators: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#1935526 (10yuvipanda) p:5Low>3Triage [22:10:39] :D [22:10:42] well [22:10:44] once this completes [22:10:50] I'm going to merge that patch [22:23:59] chasemp: stuck at clush: in progress(2): language-mleb-legacy.eqiad.wmflabs,tools-redis-01.eqiad.wmflabs [22:24:01] again [22:24:03] I'm going to merge now tho [22:24:24] interesting [22:30:08] !log wikimetrics Taking down wikimetrics-staging.wikimetrics to replace with non self hosted instance [22:30:42] chasemp: andrewbogott ok, I've merged it. all puppet self hosted things that were updating themselves would have already been fixed, and the ones that weren't were broken anyway [22:31:17] thank you! [22:31:55] andrewbogott: so this means that the two wikimetrics instances won't break again because they're running normal puppet :D [22:32:17] andrewbogott: and analytics is aware that their limn instances are also unrecoverable and they have root access to them and they'll prioritize fixing those accordingly :) [22:33:01] !log wikimetrics Recreated wikimetrics-staging, force puppet run [22:34:50] madhuvishy: if you delete and recreate an instance with same name sometimes it fucks up. so be careful :) [22:35:17] ya i expected that, but it's doing good so far :) [22:35:23] :D ok [22:35:25] !log wikimetrics Recreated wikimetrics-staging, force puppet run [22:36:58] !log wikimetrics Deployment step 1: Ran fab staging initialize_server - all good [22:36:59] PROBLEM - Puppet failure on tools-puppet-is-broken-here-on-purpose is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [22:38:36] !log wikimetrics Deployment step 2: Ran fab staging deploy - good again [22:43:23] for jsub [22:43:26] I should probably record [22:43:29] : [22:44:09] hostname, user, parent process commandline, full commandline, no-op or not (for -once) [22:44:14] anything else? [22:44:21] we can probably get all the info we want from these [22:47:15] !log wikimetrics Last step: fab staging restart_wikimetrics, setup proxy again - and something is wrong [22:53:58] chasemp: valhallasw`cloud https://meta.wikimedia.org/wiki/Schema:ToolsJobSubmission [22:54:01] for jsub [22:54:14] I've to augment the other schema with parent-process stuff too [22:56:06] nice [22:56:49] chasemp: anything missing?
[22:57:01] this also means I've to write an EventLogging client in perl... [22:57:06] I'll probably just shell to curl [22:57:09] but uuuhhhgh [22:57:19] command line is the jsub or the eventual qsub cli [22:57:34] jsub line [22:57:44] we can reconstruct the qsub cli from it if we need to but I don't think we do [22:58:17] augh [22:58:27] I need to probably learn how to format JSON and send it to a thing in perl [22:58:35] (03CR) 10Aklapper: "8ohit.dua: Can you please reply to Nemo, plus also set a descriptive patch summary which is not "coming soon"? :) Thanks!" [labs/tools/bub] - 10https://gerrit.wikimedia.org/r/129709 (owner: 108ohit.dua) [23:11:37] !log wikimetrics ran into some path issues - fixed in source - redeploying [23:14:12] !log wikimetrics-staging is back up, all good now [23:25:33] !log wikimetrics Make new prod instance wikimetrics-01.wikimetrics, force puppet run [23:26:31] !log wikimetrics Add role::wikimetrics(prod role), rerun puppet [23:29:44] YuviPanda: interesting problem - how can i do mysql -h from prod if i don't set up mysql-server there? should i install some package alone? [23:30:45] madhuvishy: yeah there's a mysql-client [23:30:48] package [23:30:50] you can install [23:31:01] let me add that to puppet [23:31:15] ok [23:35:42] !log wikimetrics Tried setting up prod server - figured mysql-client missing on prod - added to puppet [23:38:52] YuviPanda: new problem :/ I was doing mysql -u root from deployment - prod is a special case now [23:39:24] hmm [23:39:31] madhuvishy: is this for loading up the initial data? [23:39:39] creating db [23:39:42] right [23:39:53] so it just needs you to pass in username / password onto the command? [23:39:56] need to be root in staging to create [23:40:07] right [23:40:17] one option is to make staging too use labsdb [23:40:20] and just vary db-name [23:40:40] YuviPanda: that won't let me create the testdbs there [23:40:46] will be overkill [23:41:11] i think i can add a config param like DB_ROOT and set it to root vs labs user [23:41:16] yeah [23:41:18] or just [23:41:20] DB_PASS [23:41:20] DB_ROOTUSER [23:41:22] DB_USER [23:41:24] and stuff [23:41:26] that is there [23:41:34] but [23:41:41] DB_USER for staging is wikimetrics [23:41:45] aaah [23:41:47] I see [23:41:49] yeah [23:41:49] and root is root [23:41:51] makes sense [23:41:51] yess [23:41:54] that is why [23:46:01] madhuvishy: I might be afk for about 30mins. you all good for now? [23:46:42] YuviPanda: yup, might need my patch to the secret repo merged eventually [23:46:49] madhuvishy: you have merge rights [23:46:52] madhuvishy: self-merge away [23:46:56] oh cool [23:46:57] okay [23:47:12] chasemp: I just realized we don't need a special schema just for jsub [23:47:20] if we instrument qsub, we can compare rates for that vs jsub [23:47:23] and get our answer [23:48:00] not sure I understand the idea [23:48:22] chasemp: so jsub with -once [23:48:30] chasemp: either calls qsub (because it needs to submit a job) [23:48:41] or does not call qsub (because the job is already running, and -once was specified) [23:48:43] or it bails out [23:48:46] so if we instrument qsub [23:48:51] right agreed, I read up [23:49:02] we can do this without needing an extra schema [23:49:23] I see so jsub_calls - qsub_calls = jsub_with_overlapping_lives [23:49:33] basically [23:49:35] ya [23:49:36] doesn't factor in the ppl who qsub directly?
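The "shell out to curl" plan for the perl EventLogging client could look like the sketch below, assuming the standard EventLogging beacon format (an event capsule, URL-encoded into the query string). The endpoint host, schema revision and field values here are illustrative placeholders, not what the deployed script used.

    event='{"schema":"ToolsJobSubmission","revision":1,"wiki":"metawiki","event":{"command":"jsub","noop":false}}'
    curl -s -G 'https://meta.wikimedia.org/beacon/event' \
        --data-urlencode "$event" > /dev/null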
[23:49:40] but maybe small [23:49:43] we can account for that [23:49:46] because now we're collecting [23:49:48] parent-process-cmdline [23:49:52] ah [23:49:54] so for people using qsub directly [23:49:55] seems sound then [23:49:56] that'll not be jsub [23:50:16] chasemp: 'tis all in mysql, so let me know if you wanna poke around [23:50:19] with the raw data [23:51:19] I will, I'm knee deep in a very long sge doc atm and my head is spinning [23:51:37] have some things to show maybe tomorrow if you are willing to put on an optimism hat [23:52:36] :D [23:52:52] depends on how many things break between now and then I guess [23:52:57] yeah! [23:53:07] I'm noticing here that most sge things are ensure=latest [23:53:13] that's kind of a nightmare waiting to happen I think [23:53:37] not that there is new code to be released but if debian packaged son of sge or something [23:53:42] our lives could be turned upside down :) [23:53:48] nah, debian kicked gridengine out :) [23:53:56] there are no gridengine packages there [23:54:05] anyway [23:54:24] heh [23:54:42] I wouldn't be shocked by a SoGE fork popping up, but yeah the whole thing is crazy [23:54:51] (I mean packaged) [23:55:03] :D [23:55:12] I don't think anyone cares [23:55:26] that's what I got from reading the debian bug about removing OGE [23:55:48] anyway, I'm going afk for a bit. [23:56:02] can't remember the alt valhallasw`cloud mentioned, something like torque [23:56:09] that seems to have supplanted it in many places [23:56:49] later dude [23:56:52] on the morrow then [23:57:01] +1 [23:57:07] my sleep cycle has somewhat assumed normal tones [23:57:09] woke up at 9AM today [23:57:14] whoa [23:57:16] yeah [23:57:18] very un-yuvi [23:57:19] it's crazy [23:57:30] I blame otto [23:58:30] sound plan there usually ;)