[00:04:32] bd808: we might need to make sure ffmpeg is also on the image scalers to produce thumbnails
[00:04:35] lemme double-check the configs
[00:05:56] role::mediawiki::imagescaler uses the same ::mediawiki::multimedia class
[00:06:03] ah ok
[00:06:06] so we'd just need to force puppet to run there
[00:06:09] so it might need a push yeah
[00:06:58] ffs wikitech. why do you hate me so much today
[00:08:20] * bd808 just realized that beta cluster doesn't have dedicated image scalers
[00:09:46] brion: will it hurt anything if I don't purge libav on those boxes?
[00:10:21] bd808: should be harmless
[00:10:37] cool beans
[00:10:37] brion: The counters on http://en.wikipedia.beta.wmflabs.org/wiki/File:Snowdonia_by_drone.webm don't seem to increase any more.
[00:10:59] James_F: you may have to use the 'update' link (does an action=purge) depending on caching fun
[00:11:21] Oh, ha.
[00:11:21] 26 minutes seems like a long time though
[00:11:23] * brion hmms
[00:11:27] Now 27 minutes.
[00:11:36] * James_F suspects "broken".
[00:11:45] Optimist James to the rescue.
[00:11:53] bd808: can you check if the ffmpeg procs are still running?
[00:12:26] with luck they're just pushing it into swap and it's hella slow ;))
[00:13:00] yeah. looks like 5 are running. load average: 2.11, 3.33, 4.97
[00:13:15] and free mem is very low
[00:13:30] 487m into swap
[00:13:57] would just one transcode job kick off that much work?
[00:15:10] bd808: each source file gets transcoded into several different resolutions...
so one job for each, and however many job runners are configured will run at once
[00:15:26] it's also possible some of the ffmpeg bits use an extra thread for something which may add to it
[00:15:40] they do tend to eat memory especially at high resolutions
[00:15:49] I told it to only use 2 runner threads but I see a lot of hhvm procs active
[00:16:17] hmm if there's 2 runners it should only be running two base processes then
[00:16:42] I think there may be multiples running :(
[00:16:54] funnnn
[00:17:04] $ ps aux|grep job|grep -v grep|wc -l -- 12
[00:17:17] special:timedmediahandler shows 4 ffmpegs (webm) and 3 ffmpeg2theoras (ogv) in 'running' state
[00:17:34] i wonder why O_O
[00:18:01] It might not sighup right I guess for the config changes puppet made
[00:18:27] bd808: feel free to `killall ffmpeg2theora; killall ffmpeg` and we'll recheck the runner count...
[00:18:38] It's php/hhvm as a daemon so ... spooky
[00:18:41] heh
[00:21:18] !log stopping and starting jobrunner and jobchron on deployment-tmh01
[00:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[00:22:21] brion: ok. the website still says 7 running, but the jobrunner says it has no work to do
[00:22:39] heh
[00:22:44] ok lemme requeue em
[00:22:59] i should rewrite the queue management reporting to be a little more ... robust some time :D
[00:23:18] ok i'm just requeueing *one* for starters :D
[00:23:44] and if that works i'll do the rest
[00:27:08] http://upload.beta.wmflabs.org/wikipedia/en/transcoded/9/90/Snowdonia_by_drone.webm/Snowdonia_by_drone.webm.360p.webm
[00:28:12] I hit reset on the 480 one
[00:28:23] bd808: WOOOHOO
[00:28:25] :DDDD
[00:28:40] James_F: looks like it's starting to work in beta \o/
[00:28:44] Gosh.
[00:28:52] S'amazing.
[00:30:22] Though VP9 in VP8 out is a bit… eh.
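The fan-out described above (one transcode job per target resolution, with the configured number of runners bounding how many run at once) can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual MediaWiki jobrunner: the `TARGETS` list, `fan_out`, and `run_jobs` are hypothetical names, and a short sleep stands in for the real ffmpeg work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical target list: the log mentions WebM and Ogg (ogv) outputs at
# several resolutions, but the exact set used here is an assumption.
TARGETS = ["360p.webm", "480p.webm", "720p.webm", "360p.ogv", "480p.ogv"]


def fan_out(source, targets=TARGETS):
    """Create one transcode job per (source file, target) pair."""
    return [(source, target) for target in targets]


def run_jobs(jobs, runners):
    """Run jobs on a pool of `runners` workers and return the peak
    number of jobs that were in flight at the same time."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def transcode(job):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)  # stand-in for the real ffmpeg invocation
        with lock:
            running -= 1

    # The pool size is the only thing limiting concurrency in this sketch.
    with ThreadPoolExecutor(max_workers=runners) as pool:
        list(pool.map(transcode, jobs))
    return peak


jobs = fan_out("Snowdonia_by_drone.webm")
print(len(jobs), "jobs queued; peak concurrency:", run_jobs(jobs, runners=2))
```

In a single pool like this the peak can never exceed `runners`, so seeing more encoder processes than the configured limit suggests the work is being started by more than one pool, which would be consistent with the "multiples running" guess in the log.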
[00:31:00] all of them are in the queue again
[00:31:30] James_F: yeah we can do proper vp9 output later, all the code's ready it just needs the config
[00:31:36] * James_F nods.
[00:31:41] but first, just getting them to work at all is a good step ;)
[00:31:45] Yes. :-)
[00:31:46] and 3 are running? I wonder if the jobrunner has an off by one problem :/
[00:31:48] and then i don't want to overload the scalers ;)
[00:31:56] huh weird
[00:33:48] and now all 5 are running
[00:33:57] Which all?
[00:33:58] I think the jobrunner is being a bit wonky here
[00:34:05] http://en.wikipedia.beta.wmflabs.org/wiki/Special:TimedMediaHandler
[00:34:27] Is the '2' limit per type?
[00:34:43] I set up the jobrunner for this with 2 workers
[00:34:47] So 2 of "Web streamable WebM (480P)", 2 of "High quality downloadable WebM (720P)", etc.?
[00:34:57] but I see 5 running so ... not sure
[00:35:05] o_O
[00:35:25] James_F: should apply to all transcode jobs as a group afaik
[00:35:34] Well… it ain't.
[00:35:37] hehe
[00:35:39] I'm not really sure if anyone but Aaron knows how his runner is supposed to work
[00:35:53] And he's deep in talks with gwicke right now.
[00:36:01] Lots of hand movement involved.
[00:36:33] "transcode": { "runners": 2, "include": [ "webVideoTranscode" ] },
[00:36:43] I would guess that means 2 at a time
[00:36:50] but meh
[00:37:24] my vagrant's config has {groups:{basic:{include:['*'],runners:2}}}, doesn't distinguish transcodes from other types
[00:37:47] and that at least limits them as expected in my vagrant....
[00:38:04] this one is set up with 0 for all the other types
[00:38:22] which is what we do in prod to split queues up around the runners
[00:38:32] s/runners/servers/
[00:40:15] hmm
[00:40:34] bd808: can you point me to the repo/file with that config for beta?
i'm not finding it yet
[00:41:03] it's a little goofy to find; it's all puppet and hiera magic
[00:41:21] heh
[00:41:31] this is the patch where I set it up -- https://gerrit.wikimedia.org/r/#/c/234599/
[00:41:39] i see modules/mediawiki/templates/jobrunner/jobrunner.conf.erb in puppet, but nothing sets $runners_transcode to non-zero that i see
[00:41:40] ok
[00:42:24] ah mysterious :D
[00:42:28] and that layers on top of -- https://github.com/wikimedia/operations-puppet/blob/production/hieradata/labs/deployment-prep/common.yaml#L104-L113
[00:43:30] And would be in turn overwritten by https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep if it mentioned jobrunners :)
[00:43:38] sounds legit enough ... :D
[00:43:51] our hiera config is ... expansive
[00:44:01] o_O
[00:45:03] Continuous-Integration-Infrastructure, VisualEditor: Concurrent builds using local Chromium/Firefox browsers on Linux host fail - https://phabricator.wikimedia.org/T90673#1586303 (Krinkle) a:Krinkle>None
[00:45:05] So… as long as there's only one file on the wiki and one transcode process it works, but beyond that it fails?
[00:45:49] James_F: well the production video scalers should be much better able to handle 5 simultaneous processes :D
[00:46:01] * James_F crosses his fingers.
[00:46:01] so it's more just concerning that we can't figure out how to set the limit right ;)
[00:46:05] Yeah.
[00:47:19] bd808: should we pick this up again monday?
looks like we've got it mostly working but friday 6pm sounds like a bad time to roll out the updates to production, especially if the job config is still wonky ;))
[00:49:38] brion: I don't have the super powers to take it to prod anyway so :) yes
[00:49:49] great :D
[00:50:26] Joe is out on vacation next week I think so you'll have to find another root to work with
[00:50:30] ok
[00:50:34] but it feels like it's close
[00:50:47] yep :D
[00:52:08] Beta-Cluster, HHVM, Patch-For-Review: Upgrade Beta Cluster tmh* host(s) to HHVM and Trusty - https://phabricator.wikimedia.org/T110707#1586318 (brion) Ok this is now mostly-done but there's a problem with the job runner -- it's running more threads than expected, which overloads the scaler VM (which is...
[01:01:01] !log Deleted local mwdeploy user on deployment-tmh01 that was causing scap failures
[01:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[01:05:35] Yippee, build fixed!
[01:05:35] Project beta-scap-eqiad build #67679: FIXED in 1 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/67679/
[03:42:32] (CR) BearND: [C: 1] "Nice. I also like that you added the qq check." [integration/config] - https://gerrit.wikimedia.org/r/230260 (https://phabricator.wikimedia.org/T62720) (owner: Niedzielski)
[05:34:29] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce build #525: FAILURE in 32 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/525/
[06:42:04] Deployment-Systems, Scap3: Scap3 should have idempotent deploys - https://phabricator.wikimedia.org/T109513#1586575 (mmodell) True idempotent behavior is fairly difficult to pull off. We can either half-ass it or we can use something like puppet for the bulk of our deployment logic. I'm not the biggest...
[06:42:30] Deployment-Systems, Scap3: Scap3 should have idempotent deploys - https://phabricator.wikimedia.org/T109513#1586576 (mmodell) ^ bad metaphor overload.
[06:50:39] Deployment-Systems, Scap3: Scap3 should be able to deploy/rollback service config as part of deploy - https://phabricator.wikimedia.org/T109512#1586581 (mmodell) See my comment in T107532#1517098 for one idea that I proposed (using `puppet apply` for config changes) which I still believe to be fairly goo...
[07:00:25] Beta-Cluster, MediaWiki-extensions-OAuth, Pywikibot-OAuth, Pywikibot-network, Pywikibot-tests: Nonce already used regularly occurring on beta cluster - https://phabricator.wikimedia.org/T109173#1586584 (jayvdb) Another one on beta cluster: https://travis-ci.org/jayvdb/pywikibot-core/jobs/777883...
[07:58:58] Beta-Cluster, ContentTranslation-Deployments, Wikimedia-Site-Requests, Patch-For-Review: Put beta eswiki to read-only mode - https://phabricator.wikimedia.org/T109157#1586604 (Glaisher)
[09:34:23] Deployment-Systems, Release-Engineering: Don't continue scap if sync to all proxies failed - https://phabricator.wikimedia.org/T110791#1586661 (Reedy) NEW
[09:36:26] Deployment-Systems, Release-Engineering: scap shouldn't log completion (it should log fail!)
- https://phabricator.wikimedia.org/T110793#1586675 (Reedy) NEW
[09:43:14] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #594: FAILURE in 6 min 14 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/594/
[09:47:36] Deployment-Systems, Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586698 (jcrespo)
[09:50:07] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #542: FAILURE in 29 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/542/
[09:50:21] Deployment-Systems, Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586711 (jcrespo)
[12:59:03] Yippee, build fixed!
[12:59:04] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #768: FIXED in 27 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/768/
[14:02:40] Deployment-Systems, Release-Engineering, operations: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586779 (Krenair)
[14:03:15] Deployment-Systems, Release-Engineering, operations: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586782 (Krenair) Keyholder issue? ```krenair@tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mw2001 Permission denied (publickey).```
[14:04:08] Deployment-Systems, Release-Engineering, operations: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586784 (Krenair) (Works fine from mira.)
[14:15:28] Deployment-Systems, Release-Engineering, operations: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586793 (Krenair) Yeah, icinga has been showing this for tin's keyholder service for the past 14 hours: CRITICAL: Keyholder is not armed. Run 'keyholder arm' to...
[14:52:12] Beta-Cluster, Release-Engineering, Patch-For-Review, Regression: Beta cluster logo broken (/static/images/project-logos 404 Not Found) - https://phabricator.wikimedia.org/T105541#1586830 (Krenair) Looks like you only fixed the desktop site...
[19:59:53] (PS1) Legoktm: Move mediawiki-phpunit-zend to "zend" queue for mediawiki/vendor [integration/config] - https://gerrit.wikimedia.org/r/234798
[21:33:35] Beta-Cluster, MediaWiki-extensions-OAuth, Pywikibot-OAuth, Pywikibot-network, Pywikibot-tests: Nonce already used regularly occurring on beta cluster - https://phabricator.wikimedia.org/T109173#1587130 (jayvdb) Another, beta cluster: https://travis-ci.org/wikimedia/pywikibot-core/jobs/77787307#...
[23:09:11] (CR) Alex Monk: "In production on tin I found that there are some files you can't write to without being root. In particular:" [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis)