[00:47:52] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1339936 (10Dzahn) >>! In T82319#1335241, @hashar wrote: > Does #releng has anything to do there? Seems like some infrastructure tech debt. I added sinc... [01:08:05] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340016 (10Dzahn) >>! In T82319#1335241, @hashar wrote: > Does #releng has anything to do there? Seems like some infrastructure tech debt. Actually, yo... [01:10:22] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340017 (10Dzahn) 5Open>3Resolved a:3Dzahn we use "SSLCACertificatePath /etc/ssl/certs/" in the Gerrit config (meanwhile) and that is ok too [01:13:19] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340022 (10Dzahn) https://www.ssllabs.com/ssltest/analyze.html?d=gerrit.wikimedia.org the "-" in "A-" is because we are not supporting PFS which is beca... [01:52:50] 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure, 6operations: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1340120 (10scfc) 5Resolved>3declined (AFAIUI, the underlying issue has not been researched or r... 
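The SSLCertificateChainFile discussion above is about two ways of handing Apache's mod_ssl the intermediate CA certificates it should serve alongside the server certificate. A minimal sketch with illustrative hostnames and paths (not the actual Gerrit vhost):

```apache
<VirtualHost *:443>
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/gerrit.example.org.pem
    SSLCertificateKeyFile /etc/ssl/private/gerrit.example.org.key

    # Option named in the task: an explicit file containing the
    # intermediate certificate chain.
    SSLCertificateChainFile /etc/ssl/certs/intermediate-ca.pem

    # Option Dzahn settled on instead: point mod_ssl at a hashed
    # directory of CA certificates rather than a single chain file.
    # SSLCACertificatePath /etc/ssl/certs/
</VirtualHost>
```

Only one of the two approaches would normally be active at a time, hence the second being commented out.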
[01:52:53] 10Continuous-Integration-Infrastructure: Re-create ci slaves (March 2015) - https://phabricator.wikimedia.org/T91524#1340122 (10scfc) [02:35:43] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #593: FAILURE in 2 min 42 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/593/ [03:18:36] (03PS1) 10Dduvall: Navigation/filtering by project [integration/raita] - 10https://gerrit.wikimedia.org/r/216026 [03:28:21] PROBLEM - Free space - all mounts on deployment-eventlogging02 is CRITICAL deployment-prep.deployment-eventlogging02.diskspace._var.byte_percentfree (<30.00%) [05:10:25] (03CR) 10Pastakhov: "Hashar: please, fix it." [integration/config] - 10https://gerrit.wikimedia.org/r/207754 (owner: 10Pastakhov) [06:38:20] RECOVERY - Free space - all mounts on deployment-eventlogging02 is OK All targets OK [06:57:41] RECOVERY - Free space - all mounts on deployment-videoscaler01 is OK All targets OK [07:19:28] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce build #464: FAILURE in 54 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce/464/ [08:19:22] zeljkof: finally around [08:19:35] hashar: ready? [08:19:41] or do you need more time? [08:19:42] yeah if hangout let me in [08:19:48] ok, joining :) [08:20:47] I am in [08:22:59] hashar: I am in too! :) [08:25:46] lost ya [09:22:25] PROBLEM - Puppet failure on deployment-mx is CRITICAL 100.00% of data above the critical threshold [0.0] [09:40:01] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340615 (10hashar) Thanks @Dzahn :-) [09:43:53] Yippee, build fixed! 
[09:43:54] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #506: FIXED in 6 min 52 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/506/ [10:01:34] (03PS1) 10Amire80: Add a job for ContentTranslation screenshots [integration/config] - 10https://gerrit.wikimedia.org/r/216056 [10:05:44] Project browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox » ca,contintLabsSlave && UbuntuTrusty build #1: FAILURE in 1 min 39 sec: https://integration.wikimedia.org/ci/job/browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=ca,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [10:07:17] Project browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox » en,contintLabsSlave && UbuntuTrusty build #1: FAILURE in 3 min 13 sec: https://integration.wikimedia.org/ci/job/browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=en,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [10:16:36] (03PS2) 10Zfilipin: Add a job for ContentTranslation screenshots [integration/config] - 10https://gerrit.wikimedia.org/r/216056 (owner: 10Amire80) [10:16:53] (03CR) 10Zfilipin: [C: 032] "The job is deployed and seems to be working fine!" [integration/config] - 10https://gerrit.wikimedia.org/r/216056 (owner: 10Amire80) [10:18:47] (03Merged) 10jenkins-bot: Add a job for ContentTranslation screenshots [integration/config] - 10https://gerrit.wikimedia.org/r/216056 (owner: 10Amire80) [10:49:18] 10Browser-Tests, 5Patch-For-Review: mediawiki_selenium 1.2 breaks mw/core browser test when reporting to raita - https://phabricator.wikimedia.org/T100904#1340744 (10hashar) I have retriggered the job which fails logging to beta but apparently managed to report to raita. https://integration.wikimedia.org/ci/v... 
[10:50:36] 10Browser-Tests, 5Patch-For-Review: mediawiki_selenium 1.2 breaks mw/core browser test when reporting to raita - https://phabricator.wikimedia.org/T100904#1340745 (10hashar) 5Open>3Resolved Solved in mediawiki_selenium 1.2.1 [10:57:58] Yippee, build fixed! [10:57:59] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #638: FIXED in 10 min: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/638/ [11:03:36] 10Browser-Tests, 5MW-1.26-release, 5Patch-For-Review, 5WMF-deploy-2015-06-09_(1.26wmf9): mediawiki_selenium 1.2 breaks mw/core browser test when reporting to raita - https://phabricator.wikimedia.org/T100904#1340768 (10hashar) Next build is all green https://integration.wikimedia.org/ci/view/BrowserTests/v... [11:33:43] PROBLEM - Free space - all mounts on deployment-videoscaler01 is CRITICAL deployment-prep.deployment-videoscaler01.diskspace._var.byte_percentfree (<50.00%) [12:40:02] 6Release-Engineering, 10Continuous-Integration-Config: MobileFrontend qunit tests should run Gather tests as well - https://phabricator.wikimedia.org/T99877#1340966 (10hashar) [12:40:34] 6Release-Engineering, 10Continuous-Integration-Config: MobileFrontend qunit tests should run Gather tests as well - https://phabricator.wikimedia.org/T99877#1300859 (10hashar) a:5Jdlrobson>3None We need a new qunit job that is like mediawiki-testextensions but runs qunit instead of PHPUnit. 
[12:48:30] (03PS1) 10Hashar: integration-phpunit-mediawiki-REL1_25 [integration/config] - 10https://gerrit.wikimedia.org/r/216077 [12:48:42] (03CR) 10Hashar: [C: 032] integration-phpunit-mediawiki-REL1_25 [integration/config] - 10https://gerrit.wikimedia.org/r/216077 (owner: 10Hashar) [12:50:49] (03Merged) 10jenkins-bot: integration-phpunit-mediawiki-REL1_25 [integration/config] - 10https://gerrit.wikimedia.org/r/216077 (owner: 10Hashar) [12:50:56] (03PS1) 10Hashar: Drop references to REL1_19 (phased out) [integration/config] - 10https://gerrit.wikimedia.org/r/216078 [12:52:51] (03CR) 10Hashar: [C: 032] Drop references to REL1_19 (phased out) [integration/config] - 10https://gerrit.wikimedia.org/r/216078 (owner: 10Hashar) [12:54:26] (03Merged) 10jenkins-bot: Drop references to REL1_19 (phased out) [integration/config] - 10https://gerrit.wikimedia.org/r/216078 (owner: 10Hashar) [13:02:57] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #672: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/672/ [13:35:51] (03PS1) 10Hashar: Drop old comment in macro.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/216086 [13:36:01] (03CR) 10Hashar: [C: 032] Drop old comment in macro.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/216086 (owner: 10Hashar) [13:37:50] (03Merged) 10jenkins-bot: Drop old comment in macro.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/216086 (owner: 10Hashar) [13:57:37] (03PS1) 10Hashar: JJB: move single use macro in the job-template [integration/config] - 10https://gerrit.wikimedia.org/r/216090 [13:58:23] (03CR) 10Hashar: "What do you guys think about it? 
I Got tired of switching between job-templates.yaml and macro.yaml" [integration/config] - 10https://gerrit.wikimedia.org/r/216090 (owner: 10Hashar) [14:12:13] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL 33.33% of data above the critical threshold [0.0] [14:20:20] RECOVERY - Puppet staleness on deployment-restbase01 is OK Less than 1.00% above the threshold [3600.0] [14:29:17] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL 55.56% of data above the critical threshold [0.0] [14:29:20] (03PS1) 10Hashar: JJB: zuul-cloner-extdeps slightly more reusable [integration/config] - 10https://gerrit.wikimedia.org/r/216097 [14:32:11] RECOVERY - Puppet failure on deployment-restbase01 is OK Less than 1.00% above the threshold [0.0] [14:34:07] RECOVERY - Puppet staleness on deployment-restbase02 is OK Less than 1.00% above the threshold [3600.0] [14:34:21] Yippee, build fixed! [14:34:21] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #524: FIXED in 8 min 20 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/524/ [14:39:19] (03PS1) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [14:48:24] (03CR) 10Hashar: [C: 032] "Tested on VE and MobileFrontend extensions. Works!" 
[integration/config] - 10https://gerrit.wikimedia.org/r/216097 (owner: 10Hashar) [14:50:17] (03Merged) 10jenkins-bot: JJB: zuul-cloner-extdeps slightly more reusable [integration/config] - 10https://gerrit.wikimedia.org/r/216097 (owner: 10Hashar) [14:59:15] RECOVERY - Puppet failure on deployment-restbase02 is OK Less than 1.00% above the threshold [0.0] [15:15:45] (03PS1) 10Hashar: Factor VE submodule update in a macro [integration/config] - 10https://gerrit.wikimedia.org/r/216105 [15:15:58] (03CR) 10Hashar: [C: 032] Factor VE submodule update in a macro [integration/config] - 10https://gerrit.wikimedia.org/r/216105 (owner: 10Hashar) [15:17:50] (03Merged) 10jenkins-bot: Factor VE submodule update in a macro [integration/config] - 10https://gerrit.wikimedia.org/r/216105 (owner: 10Hashar) [15:20:19] gooood morning grrrit-wm [15:20:21] grr [15:20:23] greg-g: [15:20:26] good morning :} [15:20:39] challenge: will I leave work before you tonight? [15:20:42] hehe [15:20:59] I hope so [15:21:02] it's a Friday! 
[15:23:28] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL 100.00% of data above the critical threshold [0.0] [15:30:13] be back later tonight [15:44:45] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<10.00%) [15:59:45] RECOVERY - Free space - all mounts on deployment-bastion is OK All targets OK [16:06:48] bd808: just git pull-ed mw-vagrant (so gotten like a week's worth of updates), and i'm seeing this: (Cannot access the database: Can't connect to local MySQL server through socket '/dev/null' (111) ()) [16:07:08] heh no wonder it cannot access it via /dev/null [16:07:17] yeah that seems not so good [16:07:42] "have you tried turning it off and on again" :) [16:07:54] haha [16:07:59] yup, sir [16:08:04] even destroy && up [16:08:53] * bd808 does a pull and provision [16:27:24] bd808: as _joe_ just told me when I had this in translatewiki.net, he dropped support for connecting to mysql via socket [16:28:07] ah. so the newest hhvm build is probably the trick here [16:28:23] ah makes sense [16:28:46] so we need to find the config for that and probably set it to 127.0.0.1 or something [16:28:53] yup [16:29:02] yep [16:30:06] we don't set it at all and just take the DefaultSettings wgDBserver=localhost [16:35:09] Nikerabbit: thanks for pointing that out.
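The fix implied by the conversation above is to stop relying on MediaWiki's default `$wgDBserver = 'localhost'`, which makes PHP's MySQL client connect over the unix socket, and force a TCP connection instead. A hypothetical LocalSettings.php override, assuming MySQL listens on the loopback interface:

```php
<?php
// Sketch only: with the newer hhvm build no longer providing a usable
// mysql socket path (hence the "/dev/null" error in the log above),
// pointing the DB server at 127.0.0.1 makes the client use TCP.
$wgDBserver = '127.0.0.1';
```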
[16:38:48] yup, that fixed it [16:38:53] grazie bd808 Nikerabbit [16:39:01] (merging the patch now) [16:39:07] perfect [16:44:01] (03PS2) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [16:48:55] (03PS3) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [16:49:06] (03CR) 10Hashar: "rebased" [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [17:33:15] (03PS1) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [17:36:53] (03CR) 10Hashar: "Madness happening at https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/1/console" [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [18:04:17] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL 40.00% of data above the critical threshold [0.0] [18:21:32] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL 100.00% of data above the critical threshold [0.0] [18:34:17] RECOVERY - Puppet failure on integration-slave-precise-1014 is OK Less than 1.00% above the threshold [0.0] [18:34:56] Krinkle or thcipriani, shall we move ‘integration’ to the new dns service today? [18:35:41] * Krinkle doesn't know what that means [18:35:46] I'd rather has hashar handle it. [18:35:58] have [18:36:11] which means next week since he left for the day [18:36:53] andrewbogott: oh boy. Are you trying to get everything moved by Monday? [18:37:15] thcipriani: More like ‘I am going to move everything on Monday that people have not already moved themselves’ [18:38:27] got it. Sure, I can move it. I'm pretty sure we can get it done via a few salt/sed commands. [18:39:28] andrewbogott: you around for the next little while, in case I destroy everything? 
[18:39:41] andrewbogott: Does this involve dns resolution changes? Beware that dns stuff is broken in labs as far as I'm concerned and we're using a live hack in puppet to keep it working. [18:39:50] thcipriani: I’m about to eat lunch but I won’t travel far. [18:39:50] in integration. [18:39:53] Krinkle: um… ? [18:39:56] especially with regards to the 2-dot stuff [18:39:59] tell me more, please? [18:40:01] being forced into labs [18:40:50] this has been broken for months. mutante helped us draft a patch to restore functionality for foo.bar domain names like saucelabs.com. [18:41:02] By default domains shorter than 2 dots were forced into labs namespace. [18:41:16] Every time this changed, all our qunit browsertest jobs failed. [18:41:30] and every time we reverted it in our local puppet [18:41:34] 3 times last year. [18:41:34] :) [18:42:16] it's related to legacy (or current?) short names for labs db, but we don't use that. [18:43:14] Krinkle, does the new dns server (labs-recursor0) have the same issue? [18:43:18] I don't know. [18:43:36] Afaik it was intentionally done that way and a known issue by Coren pending a better solution. [18:43:55] thcipriani: I’m eating lunch but keyboard is close at hand. [18:44:19] https://phabricator.wikimedia.org/T92351 [18:44:38] https://gerrit.wikimedia.org/r/#/c/196731/ [18:45:40] andrewbogott: If you have a minute, try a labs instance with the old dns system (the one integration uses) and try resolving saucelabs.com from the command line to reproduce that it fails by default, that it works with ndots:2 removed in /etc/resolv.conf, and that it works in the new system? [18:45:45] That'd be good news :) [18:50:46] Krinkle: I can ping and dig saucelabs.com from everywhere that I try.
[18:51:28] (03Merged) 10jenkins-bot: Create dedicated jslint/phplint instances for apps/* repos [integration/config] - 10https://gerrit.wikimedia.org/r/216159 (owner: 10Krinkle) [18:52:07] andrewbogott: I'm looking at /etc/resolv.conf everywhere in integration and it looks like ndots: 2 has been removed [18:52:23] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/216159 [18:52:26] Logged the message, Master [18:52:31] thcipriani: yeah. I can’t tell if that’s needed with the new setup or not, since I can’t reproduce the problem it’s meant to fix. [18:52:39] however, host: saucelabs.com seems to come back correctly on deployment-prep [18:56:00] (03PS1) 10Krinkle: Extend file extension filter for phplint, to include *-phplint [integration/config] - 10https://gerrit.wikimedia.org/r/216182 [18:56:24] thcipriani: what if you add back dots: 2? [18:56:31] it'd be silly of course, but just to see if that fails [18:56:35] sure [18:56:37] (03CR) 10Krinkle: [C: 032] Extend file extension filter for phplint, to include *-phplint [integration/config] - 10https://gerrit.wikimedia.org/r/216182 (owner: 10Krinkle) [18:56:40] * thcipriani checks [18:56:51] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/216182 [18:56:54] Logged the message, Master [18:57:31] Osm [18:57:39] Isn’t ndots always there, in deployment? [18:57:45] yeah it is [18:57:53] in integration I just tried adding it back [18:57:58] on one of the machines [18:58:14] (03Merged) 10jenkins-bot: Extend file extension filter for phplint, to include *-phplint [integration/config] - 10https://gerrit.wikimedia.org/r/216182 (owner: 10Krinkle) [18:58:39] host saucelabs.com seems to resolve. Looking at integration-slave-precise-1014 [18:59:07] Krinkle: I don't remember, but is the android app still going to be in the MW queue due to 'tox-flake8'? [18:59:34] host and dig seem to come back with nothing unusual [19:02:06] thcipriani: but dots:2 is removed on those instances, right? 
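The `ndots:2` behavior being tested above can be illustrated: with `options ndots:2` in /etc/resolv.conf, a name with fewer than two dots (such as `saucelabs.com`) is qualified against the search domains before being tried as-is, which is how short public names ended up forced into the labs namespace. A rough sketch of the lookup-order rule, not glibc's actual implementation (the search domain below is illustrative):

```python
def lookup_order(name, ndots, search_domains):
    """Return the candidate FQDNs a stub resolver would try, in order.

    A loose imitation of the rule driven by "options ndots:N" in
    /etc/resolv.conf: if `name` contains fewer than `ndots` dots, the
    search-list suffixes are tried before the name itself.
    """
    if name.endswith("."):  # already fully qualified, search list skipped
        return [name]
    qualified = [name + "." + domain for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + qualified  # absolute lookup tried first
    return qualified + [name]      # search list tried first

# With ndots:2, the one-dot name "saucelabs.com" gets qualified into the
# labs search domain before being tried as an absolute name.
order = lookup_order("saucelabs.com", 2, ["eqiad.wmflabs"])
```

With `ndots:1` (the glibc default) the same name would be tried as an absolute lookup first, which matches why removing the `ndots:2` override fixed the browser-test jobs.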
[19:02:20] I added it back temporarily [19:02:27] on one host [19:03:05] tried it on a trusty and precise instance, can't recreate. [19:03:11] right [19:03:19] I guess we'll find out on monday :) [19:03:22] twentyafterfour: I poked a bit, it looks like Doorkeeper kind of gets us at what we want for tracking old repo names with new repo names [19:03:26] maybe dig works [19:03:26] oh boy [19:03:47] there's a lot of layers before hitting dns inside an end-to-end run like grunt-karma and saucelabs [19:04:34] thcipriani: can you add dots:2 on a trusty host? I can pin oojs-core to a specific host for you and trigger it in Jenkins to see [19:05:58] sure integration-slave-trusty-1016 [19:06:15] should be good to go now, until puppet runs at 22 after [19:07:12] and by "good to go" I mean "options timeout: 5 ndots: 2" is set [19:09:18] ostriches: hmm, doesn't seem very well documented ;) [19:09:27] Nope [19:09:39] But basically, it allows you to tie "external objects" to things in Phab [19:09:48] The working example is something Jira with Maniphest tasks [19:09:59] But I don't see why you couldn't do the same with Gerrit data & Diffusion [19:10:24] yeah they wrote it basically to capture jira ppl and an old facebook case [19:10:29] but I don't think it's going anywhere [19:16:20] it looks like it is mostly for publishing from phabricator into other systems [19:17:04] yeah [19:17:24] what do we specifically need, to make the diffusion and gerrit stuff work better? I mean, it should be easy to map all the gerrit project hierarchy into phabricator ... 
I can make a url mapper of some sort, I'm fairly sure that's nearly trivial [19:17:41] hey twentyafterfour could you respond to https://phabricator.wikimedia.org/T100519#1340242, I talked bblack's ear off about getting ssh going yesterday and I want to keep the narrative alive [19:18:14] long story short, I think his thought was to pass ssh through lvs and terminate at the phab box, which should work / conform to standards / not require phab to be on a public ip [19:19:02] chasemp: done [19:19:37] The ip host shouldn't be an issue afaik [19:19:47] agreed [19:19:51] in fact repositories can live on a separate machine entirely [19:19:51] but I thought better coming from you [19:19:58] right [19:20:03] and lvs honestly there gives us a few gains [19:20:06] for HA [19:20:10] thcipriani: Krinkle: So, what did we learn? Anything? [19:20:11] so seems not terrible [19:20:24] chasemp: we might want to think about setting this up on a separate piece of hardware from the start [19:20:43] because it's the one really well supported way of scaling phabricator - run git on one host, web on another [19:20:43] thcipriani: ping me next time :) I didn't see you already had it set :? [19:20:46] are you worried about load? [19:20:49] OK. I gotta go in 10 min. [19:20:50] let's do this [19:20:54] ah [19:20:57] trusty-1016 [19:21:01] you know considering the load jump just from diffusion [19:21:05] because of the sheer number of repos [19:21:09] Krinkle: 1 minute [19:21:10] chasemp: well, our git repos get a lot of traffic I assume (gerrit is already bottlenecking sometimes) [19:21:12] you may be right that it's prudent [19:21:16] that'll get reset in a second here [19:21:40] chasemp: yeah, let's do it, I'm pretty damn sure we will need it [19:21:44] thcipriani: right, I'll wait 1 minute and you re-apply?
[19:22:00] it's running now [19:22:02] I just commented out the puppet run in cron [19:22:15] chasemp: our phabricator is faster than upstream (somehow) but it's still not super snappy [19:22:27] it's a beefy box [19:22:31] and dedicated db [19:22:37] which is also beefy [19:22:49] I kind of overprovisioned as far as I could sneak in :) [19:22:51] Krinkle: which job did you pin? [19:22:59] and sean was cool about it (his idea really on m3) [19:23:19] I knew phab would basically do nothing but grow [19:23:24] npm [19:23:30] https://gerrit.wikimedia.org/r/#/c/216227/ [19:23:58] but I made a mistake, hold on [19:24:04] kk [19:24:46] OK. ready [19:24:50] thcipriani: applied? [19:24:51] chasemp: yeah it looks like we have headroom, so maybe not too important to start out but it will be a slightly painful migration later on down the road [19:24:55] Krinkle: yup [19:25:06] I'm totally down with doing it up from the get go [19:25:10] https://integration.wikimedia.org/ci/job/npm/7808/console [19:25:18] the return on time for all of eng to make it performant is a no-brainer to me [19:25:37] but we'll see how it goes I guess, log the task? [19:25:43] dunno how close you are to wanting it [19:25:50] chasemp: yeah even a few seconds here and there will add up. I'm already annoyed with gerrit slowness sometimes, it slows down my work for sure [19:26:21] oh agreed [19:26:32] chasemp: I don't know where we stand with gerrit migration, it seems like it's proceeding but I haven't seen it on an official team goal yet (and I'm not sure it's going to be on next quarter either) [19:26:40] kk [19:26:47] well maybe as we go we can work out the dual box use case to be ready [19:26:49] or hackathon? [19:26:54] wikimedia are you going?
[19:27:53] I still don't have a passport :-/ [19:28:01] thcipriani: seems fine [19:28:04] g2g [19:28:08] Krinkle: seems ok for whatever reason, yeah :\ [19:28:19] Krinkle: ok, thanks for the heads up on this [19:28:44] andrewbogott: Seems like it _should be_ ok to move forward. [19:28:58] great! [19:28:59] I'll see how integration puppetmaster looks [19:30:04] chasemp: I'll ask in #phabricator, see what they think about the need for separate diffusion hosting machine [19:30:14] cool [19:30:33] I know much of it was in theory worked out 6 months or so ago? [19:30:41] but no idea who is using it if anyone [19:31:13] (03PS2) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [19:31:35] (03CR) 10Hashar: "Filter out REL1_23 / REL1_24" [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [19:34:04] (03PS4) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [19:34:11] going to unleash some madness [19:34:32] hashar: andrewbogott and I were just about to unleash some madness, actually. [19:34:44] hashar: I poked you over email but any thing needed from me for nodepool new? I know the networking stuff is still pending [19:34:45] moving to new dns on integration [19:34:46] cause infra deploys on Friday :-} [19:34:49] I haven't lost hope there yet [19:35:06] chasemp: I failed to reach out to faidon/mark about lab host net [19:35:23] yeah I'm going to do a bit of study and make a recommendation I hope if I can find some time [19:35:29] chasemp: I tried building some base image to boot them in openstack, but end up in a dead end :-//// [19:35:34] best way to get feedback is to troll with a wrong solution :) [19:35:38] kk [19:35:42] anything I can do there to help? [19:35:43] chasemp: so in short nothing to worry on your side. 
Gotta write some summary eventually [19:35:49] sounds good man [19:36:03] I tried creating images using operations/puppet.git to reuse all the code there [19:36:21] but that is tied to labs/prod context and does not work well on a local machine hehe [19:36:52] so in short there are no images yet [19:37:02] twentyafterfour: I think if we just added an additional column to diffusion repos (like you can with profiles, tasks, etc), that stores the gerrit repo name [19:37:04] That's all we need [19:37:08] twentyafterfour: when you create a new wmf branch, the commit that adds the submodules and stuff (https://git.wikimedia.org/commitdiff/mediawiki%2Fcore/2bee3bb7008e15fb31214b27f4dc958519e0a488) skips gerrit right? is there a reason for that? [19:37:12] Then we can build a redirector that uses this data [19:37:22] hashar: Now that Krinkle|detached removed the submit button for the android app what can we do to retrigger a merge when the initial build failed? I added a "recheck" comment on https://gerrit.wikimedia.org/r/#/c/210122/, which reran the build, this time successfully, but it doesn't merge [19:37:54] twentyafterfour: Goal 1 is "shut down Gitblit and use Phab for all repo browsing and mirroring" [19:37:54] bearND: ah yeah recheck just triggers the test and does not get it merged [19:38:02] bearND: you want to remove your CR+2 and reapply it [19:38:07] legoktm: it's automated [19:38:16] hashar: ok, will do. thanks [19:38:17] ostriches: right [19:38:18] bearND: that will get the change to enter the gate-and-submit pipeline which will merge the change [19:38:35] bearND: one day we will make it so 'recheck' actually merges the change if there is a CR+2 applied [19:38:54] twentyafterfour: can we have it go through gerrit? specifically when we move to using composer to run tests, I'd like for the new wmf branch to run the jenkins tests to make sure the right dependencies made it into mediawiki/vendor [19:38:55] legoktm: is there a reason that it needs to go through review?
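hashar's explanation of why "recheck" re-runs tests but never merges follows from how Zuul pipelines are triggered: the test pipeline listens for comments, while only a fresh CR+2 vote enqueues a change into gate-and-submit. A simplified, from-memory sketch of a Zuul v2 layout (the regexes and manager names are illustrative, not the live layout.yaml):

```yaml
pipelines:
  - name: test
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: patchset-created
        # "recheck" comments only re-trigger the test pipeline; they
        # never cause a merge on their own.
        - event: comment-added
          comment: (?i)^\s*recheck\s*$

  - name: gate-and-submit
    manager: DependentPipelineManager
    trigger:
      gerrit:
        # Only a freshly applied CR+2 enqueues the change here, which
        # is why removing and re-applying the +2 makes it merge.
        - event: comment-added
          approval:
            - code-review: 2
```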
[19:39:12] hashar: btw: Looks like when we change the build.gradle file to use a newer version of the support library or the build-tools package the build fails the first time it runs per build slave [19:39:28] legoktm: that makes my deployment process a lot worse [19:39:41] how so? [19:40:07] !log refreshed Jenkins jobs mediawiki-extensions-hhvm and mediawiki-extensions-zend with https://gerrit.wikimedia.org/r/#/c/216100/3 (refactoring) [19:40:10] Logged the message, Master [19:40:22] (03CR) 10Hashar: "Deployed" [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [19:40:46] legoktm: maybe a lot worse is an exaggeration. It means that I will have to wait for gerrit and CI in between two steps that are currently automated into one step. [19:41:07] more CI is a good thing but more steps is a bad thing [19:41:19] I mean more testing is a good thing [19:42:01] ostriches: I think we can do that using a separate table to map the repositories that way we don't meddle with phabricator's sql schema other than adding one table of our own [19:42:09] hashar: cool. that worked this time. Thanks! [19:42:15] ostriches: it's the same as what we did with bugzilla ids [19:42:19] dbrant: ^^ [19:42:27] twentyafterfour: That sounds reasonable. [19:43:19] bearND: roger that; thanks [19:43:23] ostriches: then we can make a redirector that runs on the phab host but responds to the old urls (that's exactly what we did for bugzilla migration and it worked ok) [19:43:38] * ostriches nods [19:44:29] twentyafterfour: isn't the new branch created beforehand though? [19:44:51] ostriches: the only thing I don't know about is keeping it in sync if we continue to add things in gerrit after the migration ... 
with bugzilla we didn't add a UI to maintain the mappings because it was a one-time thing, not an ongoing situation [19:45:19] !log set use_dnsmasq: false on Hiera:Integration [19:45:22] legoktm: not really, I usually do it right before the deployment window because the branching takes a long time [19:45:22] Logged the message, Master [19:45:35] twentyafterfour: Well, the tool can learn based on the gerrit.wikimedia.org urls we have set as the upstream repo URL [19:45:46] hmm, I thought Reedy usually prepared everything beforehand [19:45:56] Starts with 0 data, inserts it as it finds a repo matching that name. [19:45:59] ostriches: oh good point [19:46:31] legoktm: "beforehand" yes before the deployment window but it's still a long process that will become longer [19:46:35] yeah :/ [19:46:41] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL 100.00% of data above the critical threshold [0.0] [19:46:50] legoktm: I'm not entirely opposed to it; if it's important then so be it if it takes me longer to do [19:47:30] legoktm: but honestly I really want to eliminate the weekly branching entirely and have long-lived release branches. The new model will be to merge into the release branch instead of creating a new one each week [19:48:13] twentyafterfour: really all we need is that the checkComposerLockUpToDate.php script is run after branch creation, which jenkins does. You could run it manually after branch creation? but the script requires MW to be installed... [19:48:35] legoktm: hmm that might work [19:49:04] does it require mediawiki to be functional or just needs to be on disk? [19:49:15] I could do it in vagrant maybe?
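The redirector idea discussed above (a side table mapping old Gerrit project names to Diffusion repos, kept separate from Phabricator's own schema, as was done for Bugzilla ids) can be sketched minimally. The callsigns and URL shape below are assumptions for illustration, not the actual migration code:

```python
# Hypothetical mapping from old Gerrit project names to Diffusion
# callsigns; in the conversation this would live in a dedicated SQL
# table, with a dict standing in for it here. The callsigns are made up.
GERRIT_TO_DIFFUSION = {
    "mediawiki/core": "MW",
    "integration/config": "CICF",
}

def redirect(gerrit_project, phab_base="https://phabricator.wikimedia.org"):
    """Map an old Gerrit project name to a Diffusion URL, or None.

    Returning None lets the caller fall through to a 404, or lazily
    learn the mapping from a repo's upstream URL as suggested above
    ("starts with 0 data, inserts it as it finds a repo").
    """
    callsign = GERRIT_TO_DIFFUSION.get(gerrit_project)
    if callsign is None:
        return None
    return "%s/diffusion/%s/" % (phab_base, callsign)
```

A small web handler on the phab host could then answer the old Gitblit/Gerrit URLs with 301 redirects built from this lookup.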
[19:49:16] for the record there are other unit tests that should be run, like the one that checks that $wgVersion is set to a sane value [19:49:33] it needs to be reasonably functional [19:49:36] you could run it in vagrant yeah [19:49:39] (03CR) 10Hashar: [C: 032] Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [19:50:22] if we get to having long-lived branches all this becomes moot. then we could rely on CI a lot more and spend a lot less time creating submodules over and over each week [19:50:49] sure [19:50:53] <+thcipriani> !log set use_dnsmasq: false on Hiera:Integration # definitely some madness :-} [19:51:22] hashar: already broke deployment-prep this week, more ready for this one :) [19:51:23] but I want to land my composer jenkins change sooner rather than later, and I think changing the entire wmf deployment process will fall under later ;-) [19:51:31] thcipriani: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/ that might be a good test [19:51:34] (03Merged) 10jenkins-bot: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [19:51:38] thcipriani: it hits beta from the integration project [19:52:20] andrewbogott: could you take a look at integration-puppetmaster?
[19:52:23] 00:00:34.012 FAILED (errors=1) [19:52:23] 00:00:34.194 Finished: SUCCESS [19:52:25] grrr [19:53:24] ah, dangit, nvmd [19:53:31] puppetmaster needed a kick [19:53:47] (03PS3) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [19:54:28] PROBLEM - Puppet failure on integration-puppetmaster is CRITICAL 40.00% of data above the critical threshold [0.0] [19:54:57] legoktm: hopefully not much later [19:55:11] legoktm: It's high priority for me at least ;) [19:55:22] thcipriani: yep, looks fine to me [19:55:24] legoktm: but go ahead and land your stuff and just let me know what I need to do [19:56:38] ok, I'll file a bug and assign it to you? [19:56:52] legoktm: the biggest problem is going to be the time it takes for jenkins to test this - it's gonna be slow, and really I'd prefer to run all the tests locally since I will already have a working copy that's in the right state - it'll take at least 15 minutes for jenkins to clone all those submodules and set up the right state [19:56:57] legoktm: ok [19:57:53] PROBLEM - Puppet failure on integration-slave-precise-1013 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:58:01] PROBLEM - Puppet failure on integration-slave-trusty-1013 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:58:34] PROBLEM - Puppet failure on integration-slave-precise-1012 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:58:38] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:58:42] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:59:00] PROBLEM - Puppet failure on integration-saltmaster is CRITICAL 10.00% of data above the critical threshold [0.0] [19:59:11] relevant https://www.youtube.com/watch?v=hwm9DvF24Ag [19:59:58] PROBLEM - Puppet 
failure on integration-slave-trusty-1015 is CRITICAL 50.00% of data above the critical threshold [0.0] [20:01:50] PROBLEM - Puppet failure on integration-zuul-server is CRITICAL 40.00% of data above the critical threshold [0.0] [20:02:12] PROBLEM - Puppet failure on integration-labsvagrant is CRITICAL 66.67% of data above the critical threshold [0.0] [20:02:32] PROBLEM - Puppet failure on integration-slave-trusty-1017 is CRITICAL 30.00% of data above the critical threshold [0.0] [20:02:58] PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL 60.00% of data above the critical threshold [0.0] [20:03:12] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL 66.67% of data above the critical threshold [0.0] [20:04:22] PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL 60.00% of data above the critical threshold [0.0] [20:04:30] RECOVERY - Puppet failure on integration-puppetmaster is OK Less than 1.00% above the threshold [0.0] [20:05:14] (03PS1) 10Hashar: Migrate UploadWizard-api* jobs to Trusty [integration/config] - 10https://gerrit.wikimedia.org/r/216297 (https://phabricator.wikimedia.org/T101550) [20:05:14] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL 50.00% of data above the critical threshold [0.0] [20:05:19] PROBLEM - Puppet failure on integration-vmbuilder-trusty is CRITICAL 20.00% of data above the critical threshold [0.0] [20:05:21] (03CR) 10Hashar: [C: 032] Migrate UploadWizard-api* jobs to Trusty [integration/config] - 10https://gerrit.wikimedia.org/r/216297 (https://phabricator.wikimedia.org/T101550) (owner: 10Hashar) [20:05:42] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL 20.00% of data above the critical threshold [0.0] [20:05:49] wikibugs? 
[20:05:55] twentyafterfour: filed https://phabricator.wikimedia.org/T101551?workflow=create [20:06:02] PROBLEM - Puppet failure on integration-publisher is CRITICAL 66.67% of data above the critical threshold [0.0] [20:07:11] legoktm: thanks [20:07:15] (03Merged) 10jenkins-bot: Migrate UploadWizard-api* jobs to Trusty [integration/config] - 10https://gerrit.wikimedia.org/r/216297 (https://phabricator.wikimedia.org/T101550) (owner: 10Hashar) [20:09:00] RECOVERY - Puppet failure on integration-saltmaster is OK Less than 1.00% above the threshold [0.0] [20:12:08] thcipriani: everything going ok? [20:12:21] andrewbogott: yeah, just salt wrestling :) [20:13:17] PROBLEM - Puppet failure on integration-raita is CRITICAL 40.00% of data above the critical threshold [0.0] [20:18:04] andrewbogott: everything that talks to integration-saltmaster should start recovering shortly, I'm trying to find outliers [20:23:38] andrewbogott: would you look at integration-raita? What is up with the notices? Seems to be making puppet's exit status non-0. [20:27:59] thcipriani: integration-raita looks fine to me.
There are notices about /etc/ssh/userkeys/ubuntu on pretty much every instance these days [20:28:11] I can remove those files and make puppet shut up, if you like :) [20:28:23] Project UploadWizard-api-commons.wikimedia.org build #1615: FAILURE in 38 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.org/1615/ [20:28:42] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #2048: FAILURE in 41 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/2048/ [20:28:48] well as long as shinken doesn't count that as a failure, it's fine [20:29:07] nah, looks happy to me [20:31:49] the UploadWizard-api jobs breaking is me [20:37:00] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [20:37:01] thcipriani: if all is peaceful, I’m going to step away for 15-20 [20:37:01] marxarelli: is deployment-mediawiki03 still in use? That was the host you set up for the pen testing stuff in December [20:37:31] bd808: not that i know of, but you might want to double check with csteipp [20:37:48] andrewbogott: everything seems fine. Just mostly waiting for instances to re-run, checking as I wait. [20:37:53] do we still have the custom varnish patch to get access to it? [20:38:04] bd808: I'm planning to use it! [20:38:47] csteipp: like between today and say next Friday? [20:39:08] bd808: Nope [20:39:37] bd808: Just in the future I'm planning to bang heavily on beta... and I don't want to take out stuff for everyone. [20:39:41] bd808: not sure if it's still cherry-picked on deployment-puppetmaster but the ps was never merged [20:39:44] https://gerrit.wikimedia.org/r/#/c/158016/ [20:39:47] :( [20:39:55] the varnish patch is gone from deployment-salt [20:40:06] blerg. [20:41:00] looks like the patch will need some refactoring [20:41:30] Here's why I ask.
I need to build a new logstash server using jessie + a new security group and there isn't enough quote to spin it up [20:41:38] so I'm hunting for old junk to kill [20:41:44] *quotta [20:42:05] updated puppet master on deployment-salt yesterday (day before). I don't think I smashed any patches. [20:42:33] bd808: Totally fine by me, as long as you put it back [20:42:55] thcipriani: naw I think yuvipanda killed the cherry-pick in the great "beta cluster == prod" purge [20:43:43] that does sound like him [20:44:22] I spent a long time with, "that does sound like Y" hitting tab, where is yuvipanda‽ [20:44:35] bd808: just increase our quota [20:44:49] greg-g: well that's just too easy ;) [20:45:03] take the gmail approach: archive never delete ;) [20:45:19] I don't know why I said that, /me is looking at beta logstash right now and confused by it [20:45:49] greg-g: my gmail was full. then i deleted 200.000 cron spam mails [20:45:55] thanks labs [20:45:56] I click on one of the little magnifying glasses to limit the results to only that type and the result is 0 [20:46:02] mutante: :) :) [20:46:24] greg-g: lame. which board and field? [20:47:12] I see the viedoscaler is back to whining about having bad jobs [20:47:21] PHP Fatal error: /srv/mediawiki/wikiversions-labs.cdb has no version entry for `mswiki`. [20:49:33] PROBLEM - Puppet failure on integration-dev is CRITICAL 100.00% of data above the critical threshold [0.0] [20:49:48] bd808: I was on fatalmonitor, and the clicked the magnifying glass for the error I just reported in https://phabricator.wikimedia.org/T101558 [20:50:04] wanted a temp saved search url to share [20:50:10] that's a known mod_fcgi bug [20:50:15] it's junk [20:50:22] I couldn't find it in phab... is there a ticket? 
[20:50:33] probably not [20:50:38] (sorry, I didn't mean to snipe you) [20:50:40] there are upstream bugs [20:51:00] meh I'm playing logstash today so not a distraction [20:51:34] the error message means that the client disconnected from apache before hhvm responded [20:51:47] thcipriani: integration-slave-trusty-1013.eqiad.wmflabs cant resolve integration-puppetmaster.integration.eqiad.wmflabs [20:51:56] thcipriani: I guess the DNS migration is still going on isn't it ? [20:52:05] hashar: looking, but yes it is [20:52:13] no worries so [20:53:28] those are the hosts I'm digging around trying to find. I updated all the hosts that'll talk to salt, now just trying to find failures. [20:54:00] bd808: but yeah, no matter which magglass I click on in fatalmonitor takes me to a zero result page [20:55:41] offtopic: I love software that has built in dark themes [20:57:26] greg-g: you're cursed? https://logstash-beta.wmflabs.org/#dashboard/temp/tZOncd0TTIuNSE1cyk7cSQ [20:57:54] * greg-g grumbles and tries again [20:58:39] opening up the filtering section can help see what the problem is occasionally. sometimes a weird filter gets added [20:59:10] So the bug here is that our apache mod_fcgi timeout is less than our hhvm timeout [20:59:18] but that may not really be a bug [20:59:21] oh, THAT issue [20:59:39] because we get into situations where we tell hhvm/php to run forever [20:59:41] the one that got gabriel and $someopsen to argue about services and timeouts and such [20:59:48] it's related [20:59:52] (or maybe I'm conflating) [21:00:13] Gabriel wants MW to guarantee a <3s response [21:00:16] which is madness [21:00:28] SLA man, SLA ;) [21:00:53] yeah. 
the thing about SLAs is you can't just yank a number out of your ear and hold others to it [21:01:12] we have a 3 second connect timeout to mysql :) [21:01:18] :) [21:01:42] so we can't have a sub-three second SLA on anything that talks to a db [21:01:50] which is everything [21:02:02] * greg-g is just joshn' you [21:02:06] * greg-g has never typed that before [21:02:15] oh I know I just like to rant [21:02:32] :) [21:03:53] meanwhile [21:04:02] Yippee, build fixed! [21:04:02] Project UploadWizard-api-commons.wikimedia.org build #1618: FIXED in 32 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.org/1618/ [21:04:09] ton of log errors have been dealt with and are no more in logstash [21:04:13] wmf-insecte: good guy [21:04:14] hashar you may not issue bot commands in this chat! [21:04:39] hashar: :) [21:04:44] and [21:05:04] hashar: you did a ton of triage of those while I was sleeping, you rock [21:05:06] Yippee, build fixed! [21:05:06] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #2049: FIXED in 35 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/2049/ [21:05:08] the UploadWizard-api* smoke jobs have been silently broken for ages :( [21:05:14] yeah, saw that bug roll by [21:05:16] :/ [21:06:12] yeah 1 hour 10 to fix it [21:07:57] thanks for catching that hashar [21:08:03] (03PS4) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [21:08:32] tgr: yeah that is the kind of issues I hate [21:08:51] tgr: it is unnoticed / under the carpet and the second you see it you know you must fix it on spot [21:09:28] tgr: thanks for your kind words :} [21:10:46] no [21:10:49] w [21:10:59] lets go the crazy 50 minutes till midnight deploy [21:12:26] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [21:13:22] https://youtu.be/N1KfJHFWlhQ
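[Editor's note] The timeout thread above describes a real layering invariant: each layer in front should wait at least as long as the layer behind it, otherwise the front layer disconnects (and logs a spurious error, like the mod_fcgi one discussed) while the backend is still working. A minimal sketch of that check — the layer list and all the numbers except the 3 s MySQL connect timeout are illustrative assumptions, not the production values:

```python
# Hypothetical request path, front (client) to back (database).
# Only the 3 s MySQL connect timeout comes from the discussion above;
# the other figures are made up for illustration.
CHAIN = [
    ("client SLA", 3.0),      # the proposed sub-3-second response guarantee
    ("apache mod_fcgi", 30.0),
    ("hhvm request", 60.0),
    ("mysql connect", 3.0),
]

def first_violation(chain):
    """Return the first adjacent pair where a front layer's timeout is
    smaller than the one behind it, i.e. the front layer would give up
    before the backend has necessarily finished. None if the chain is
    consistently ordered."""
    for (front_name, front_t), (back_name, back_t) in zip(chain, chain[1:]):
        if front_t < back_t:
            return (front_name, back_name)
    return None

print(first_violation(CHAIN))        # the <3 s SLA breaks first
print(first_violation(CHAIN[1:]))    # apache timeout < hhvm timeout
```

Under these assumed numbers the check flags exactly the two problems raised in the chat: a sub-3-second SLA cannot sit in front of a stack whose DB connect phase alone may take 3 seconds, and an Apache timeout shorter than HHVM's cuts clients off while HHVM is still running.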
[21:13:50] ahahhah [21:14:03] can't wait for my second kid to start that phase [21:14:30] (03CR) 10Hashar: [C: 032] "Lets qunit stuff https://youtu.be/N1KfJHFWlhQ" [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [21:15:17] (copyright nerds will also know of and appreciate that video) [21:15:47] oh I have a manager to deal with legal stuff [21:15:59] * hashar grins [21:16:16] (03Merged) 10jenkins-bot: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [21:16:38] :) [21:17:19] !log Pooled in mediawiki-extensions-qunit which runs qunit tests with karma with multiple extensions . https://gerrit.wikimedia.org/r/#/c/216132/ . https://phabricator.wikimedia.org/T99877 [21:17:23] Logged the message, Master [21:17:26] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL 100.00% of data above the critical threshold [43200.0] [21:17:29] * hashar grabs a beer [21:18:26] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:19:40] PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:21:59] !log restarted puppetmaster on deployment-salt ("Could not request certificate: Error 500 on SERVER: ") [21:22:02] Logged the message, Master [21:24:05] I don't drink often enough at home to have much in my fridge, so I'm drinking a Shock Top my mom brought back in.... january [21:24:20] it's not good [21:24:22] that is yucky on multiple levels [21:24:42] yeah, actually, I just took my first real swig.... 
I don't think I'm going to continue [21:25:21] irc is bad at beer delivery -- 🍺 [21:25:26] that's the best I can do [21:26:59] and the unicode didn't pan out over here either [21:27:37] * bd808 sends greg-g better fonts [21:28:20] * greg-g blames the standard debian install on digital ocean [21:29:43] andrewbogott: huh. so, the integration project has been migrated to new dns, all puppet runs look fine (logging in, looking at puppet.log/running puppet) shinken still not happy :( [21:35:12] thcipriani: is there any super secret thing I need to do to spin up a new deployment-prep instance these days? My new deployment-logstash2.eqiad.wmflabs instance is stuck in a loop where it is getting 500 errors from the puppetmaster. I think it is actually blowing up trying to do the first puppet run against the labs master [21:36:40] I haven't run into that [21:37:00] k. I shouted for root help in -labs [21:37:00] 10Deployment-Systems, 6Release-Engineering: Run checkComposerLockUpToDate.php after creating a new WMF deployment branch - https://phabricator.wikimedia.org/T101551#1342468 (10mmodell) Chat log: ``` [19:46:50] legoktm: I'm not entirely apposed to it if it's important then so be it if it take... [21:37:07] they will get to me at some point [21:38:00] I did have to restart the puppetmaster twice yesterday, had to kill the pid, just locked up, wasn't serving, netstat didn't show the port as active. [21:39:23] legoktm: If your bot is scouring repos, could you also report on MediaWiki core, and for jscs could you report on how many rules are defined in .jscsrc beyond 'preset: wikimedia'?
:-) [21:39:48] 10Beta-Cluster, 10Wikimedia-Logstash, 15User-Bd808-Test: Build jessie based elasticsearch/logstash/kibana (ELK) host for beta testing - https://phabricator.wikimedia.org/T101541#1342474 (10bd808) p:5Triage>3High [21:40:25] 10Beta-Cluster, 10Wikimedia-Logstash, 15User-Bd808-Test: Build jessie based elasticsearch/logstash/kibana (ELK) host for beta testing - https://phabricator.wikimedia.org/T101541#1342048 (10bd808) [21:42:29] bd808: it's weird, I can't even get to your instance, even though it's in the deployment-prep project [21:42:37] public key errors [21:42:43] James_F: probably. How should I fit that into the table? [21:43:03] thcipriani: me neither. I think the first puppet run after provision is failing [21:43:16] so it doesn't know how to be a labs host yet [21:43:20] legoktm: Maybe in the jscs column put "1.8.0" or "1.5.0; 4 over-rides" if there are over-rides? [21:43:22] ah, yeah, I think labs puppetmaster is having a bad time [21:43:31] ok [21:43:35] http://shinken.wmflabs.org/service/labs-puppetmaster/Labs%20Puppetmaster%20HTTPS [21:43:50] And can you tell what is "latest", so we can put green ticks in the columns if they're up-to-date? [21:44:01] (03PS1) 10Hashar: mediawiki-extensions-qunit-mobile [integration/config] - 10https://gerrit.wikimedia.org/r/216322 (https://phabricator.wikimedia.org/T99877) [21:44:07] legoktm: (I'm very demanding, sorry. :-)) [21:44:12] :P [21:44:19] PMs and their demands [21:44:24] does npm or whatever have an API? [21:44:30] I know the packagist API [21:44:39] legoktm: `npm outdated` [21:44:49] legoktm: I'm sure there's an API for it. 
[21:45:46] 10Browser-Tests: When beta labs is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342483 (10Jdlrobson) 3NEW [21:46:41] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342491 (10greg) [21:46:59] (03CR) 10Hashar: [C: 04-1] "For mobile we run the qunit tests passing &useformat=mobile . That is why we have the qunit-querystring macro so we can vary." [integration/config] - 10https://gerrit.wikimedia.org/r/216322 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [21:47:21] I AM DONE !!!!!!!!!!!!!!!!!!!!! [21:47:39] For your tirelessly work on Jenkins, CI, tests, Librarifizification, here is the official Continuous integration barnstar! [21:47:40] /\ [21:47:40] /**\ [21:47:40] _______/****\_______ [21:47:40] *.******/^^\******.* [21:47:40] *.***( () )***.* [21:47:40] *.**\,./**.* [21:47:41] /**.**.**\ [21:47:41] /*.* *.*\ [21:47:42] /.* *.\ [21:47:42] ' ` [21:48:28] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342483 (10greg) p:5Triage>3Low Setting low just due to complexity for now. I like this idea. It could even, if doing it in real time (canceling jenkins... [21:48:51] hashar: g'night! [21:49:46] :> [21:50:18] James_F: npm has managed to destroy any hopes of googling "npm api" by calling all of their documentation "API documentation" [21:50:29] legoktm: Yeah. 
[21:50:42] legoktm: Krinkle|detached is the only person I'd ask for this stuff… [21:51:45] `npm outdated` doesn't seem to be working for me [21:52:16] I have banana-checker 0.2.0 installed, run `npm outdated` and it outputs an empty newline instead of saying 0.2.1 is available [21:53:02] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342512 (10hashar) [21:53:04] 10Continuous-Integration-Infrastructure: Preview generated documentation in test pipeline for review - https://phabricator.wikimedia.org/T72945#1342511 (10hashar) [21:54:42] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342113 (10hashar) >>! In T100294#1322186, @hashar wrote: > ... > At first we would need a place to push temporary materials tool. A lot of disk (m... [21:55:02] hashar: you said you were done [21:55:04] you lied [21:55:09] that's going in your annual review [21:55:29] 10Continuous-Integration-Infrastructure, 7Epic: Provide (pre-merge) code coverage reports on patchsets - https://phabricator.wikimedia.org/T101544#1342521 (10hashar) [21:55:30] 6Release-Engineering, 10Gather, 6Mobile-Web, 10MobileFrontend, and 2 others: [EPIC] Encourage developers to increase code coverage - https://phabricator.wikimedia.org/T100294#1309866 (10hashar) [21:56:13] legoktm: `npm install && npm outdated`? [21:56:27] James_F: tried that too :( [21:56:33] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342522 (10hashar) p:5Triage>3Low @Jdforrester-WMF thanks a ton for filling those tasks. With the current team workload that is probably not g... [21:56:50] legoktm: Well, never mind. Can you merge https://github.com/wikimedia/grunt-banana-checker/pull/20 in the meanwhile? 
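[Editor's note] An aside on the `npm outdated` mystery and the earlier hunt for an npm API: the registry itself speaks plain JSON over HTTP, so screen scraping shouldn't strictly be necessary; and older npm versions compare the installed version against the semver range in package.json, so an exactly pinned dependency can look "up to date" and print nothing even when a newer release exists — a plausible, but unconfirmed, explanation for the empty output. A sketch of the version comparison such a check performs; the helper names are mine, and real semver handling (pre-release tags, ranges) is deliberately skipped:

```python
def parse(version):
    """Split a plain x.y.z version string into a comparable tuple.
    Deliberately ignores semver pre-release/build suffixes."""
    return tuple(int(part) for part in version.split("."))

def is_outdated(installed, latest):
    """True when the registry's latest version is newer than the
    installed one, compared field by field (so 0.10.0 > 0.9.9)."""
    return parse(latest) > parse(installed)

# The 0.2.0 -> 0.2.1 case from the discussion:
print(is_outdated("0.2.0", "0.2.1"))   # -> True

# Fetching "latest" from the registry would look something like this
# (left commented out here since it needs network access):
#   import json, urllib.request
#   url = "https://registry.npmjs.org/grunt-banana-checker/latest"
#   latest = json.load(urllib.request.urlopen(url))["version"]
```

Tuple comparison is why this is done numerically rather than as a string compare: `"0.10.0" < "0.9.9"` lexically, but `(0, 10, 0) > (0, 9, 9)`.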
[21:57:10] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342525 (10dduvall) So we already have a preliminary script in the job that queries the API for the current `git_branch`. ``` /srv/deployment/integration/sla... [21:57:40] done [21:58:19] legoktm: Ta. [21:58:21] legoktm: Now release 0.2.2 exists. :_) [21:58:23] 10Continuous-Integration-Infrastructure, 7Epic: Provide (pre-merge) performance reports on patchsets - https://phabricator.wikimedia.org/T101543#1342535 (10hashar) p:5Triage>3Low [21:58:24] 10Continuous-Integration-Infrastructure, 7Epic: Provide pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101542#1342537 (10hashar) p:5Triage>3Low [21:58:26] 10Continuous-Integration-Infrastructure, 7Epic: Provide (pre-merge) code coverage reports on patchsets - https://phabricator.wikimedia.org/T101544#1342539 (10hashar) p:5Triage>3Low [22:00:13] I'll just screen scrape for now [22:01:14] 10Continuous-Integration-Infrastructure: Jenkins: Fail on BOM in submitted files - https://phabricator.wikimedia.org/T40233#1342542 (10hashar) [22:02:39] legoktm: https://github.com/wikimedia/grunt-banana-checker/pull/21 [22:02:45] 10Continuous-Integration-Infrastructure: Jenkins: Fail on BOM in submitted files - https://phabricator.wikimedia.org/T40233#453795 (10hashar) [22:02:56] 10Continuous-Integration-Infrastructure: Jenkins: Fail on BOM in submitted files - https://phabricator.wikimedia.org/T40233#453795 (10hashar) [22:03:40] * legoktm waits for travis [22:03:40] RECOVERY - Puppet failure on integration-slave-precise-1011 is OK Less than 1.00% above the threshold [0.0] [22:03:42] legoktm: It's a Markdown file… :-P [22:03:44] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK Less than 1.00% above the threshold [0.0] [22:03:59] you also touched the license thingy [22:04:03] Meh. [22:04:04] OK. [22:04:10] * James_F glares at travis. 
[22:04:21] RECOVERY - Puppet failure on integration-slave-trusty-1012 is OK Less than 1.00% above the threshold [0.0] [22:04:40] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:59] RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK Less than 1.00% above the threshold [0.0] [22:05:19] RECOVERY - Puppet failure on integration-slave-precise-1014 is OK Less than 1.00% above the threshold [0.0] [22:05:19] RECOVERY - Puppet failure on integration-vmbuilder-trusty is OK Less than 1.00% above the threshold [0.0] [22:05:41] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK Less than 1.00% above the threshold [0.0] [22:05:42] legoktm: Go. [22:06:05] RECOVERY - Puppet failure on integration-publisher is OK Less than 1.00% above the threshold [0.0] [22:06:50] RECOVERY - Puppet failure on integration-zuul-server is OK Less than 1.00% above the threshold [0.0] [22:06:52] 10Browser-Tests, 10Continuous-Integration-Infrastructure, 7Epic, 7Tracking: [EPIC] trigger browser tests from Gerrit (tracking) - https://phabricator.wikimedia.org/T55697#1342572 (10hashar) p:5Normal>3Low [22:07:12] RECOVERY - Puppet failure on integration-labsvagrant is OK Less than 1.00% above the threshold [0.0] [22:07:15] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342576 (10dduvall) p:5Low>3Triage [22:07:20] done :P [22:07:37] RECOVERY - Puppet failure on integration-slave-trusty-1017 is OK Less than 1.00% above the threshold [0.0] [22:07:43] James_F: is there any specific green checkmark image you want to use? 
[22:07:53] RECOVERY - Puppet failure on integration-slave-precise-1013 is OK Less than 1.00% above the threshold [0.0] [22:07:57] RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK Less than 1.00% above the threshold [0.0] [22:08:01] RECOVERY - Puppet failure on integration-slave-trusty-1013 is OK Less than 1.00% above the threshold [0.0] [22:08:07] legoktm: https://commons.wikimedia.org/wiki/File:Light_green_check.svg is traditional. [22:08:14] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK Less than 1.00% above the threshold [0.0] [22:08:16] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342578 (10hashar) [22:08:18] 10Continuous-Integration-Infrastructure: Store Jenkins build output outside Jenkins (e.g. static storage) - https://phabricator.wikimedia.org/T53447#1342577 (10hashar) [22:08:20] RECOVERY - Puppet failure on integration-raita is OK Less than 1.00% above the threshold [0.0] [22:08:34] RECOVERY - Puppet failure on integration-slave-precise-1012 is OK Less than 1.00% above the threshold [0.0] [22:08:34] legoktm: Or just ✓ [22:08:50] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342113 (10hashar) >>! In T53447#1202474, @hashar wrote: > Potentially, we can already write a wrapper that would send the log to a central storage... [22:09:03] sleep [22:10:40] Bye hashar. [22:12:10] thcipriani: everything ok? I’m about to punch out for the day. [22:12:34] …just as soon as shinken starts reporting recoveries [22:12:54] yeah, it's weird, restarting the labs puppetmaster seems to have let the flood gates loose on shinken [22:13:09] the instances have been fine for a while, but shinken still reported them as having problems [22:13:28] Yippee, build fixed! 
[22:13:29] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #673: FIXED in 31 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/673/ [22:13:49] but yes, integration has been moved to the new dns, no issues seemingly [22:14:17] great, thanks! [22:14:21] I’ll break more stuff on Monday :) [22:14:29] oh good. [22:14:47] Well, your stuff I mostly won’t break until Thursday [22:15:45] greg-g: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=161966 – are we deploying anyway, or just not on those dates? [22:17:29] James_F: I figured since mukunda will be around (and myself), we'll be ok doing the train [22:17:49] does that effect VE badly? [22:20:08] greg-g: No no, I'm unworried. [22:20:14] kk, just making sure [22:20:21] greg-g: Just no edit summary -> me concerned it might be a mis-edit. :-) [22:21:07] yeah, also, it was a brain fart, I was, for a moment, thinking that was June and I was like "no no, we can't do that right after we switch to the new cadence!" [22:21:20] but now that I've done it.... [22:21:45] James_F: added the checkboxes, screenscraping npm isn't that bad :P [22:21:54] s/checkboxes/checks/ [22:22:10] James_F: plus, with the new cadence, it doesn't actually affect us too much [22:23:07] James_F: is there another tool that does jsonlint besides grunt-jsonlint? or do repos just not have jsonlint...? [22:23:10] * greg-g is just justifying his brain fart now, but he thinks successfully enough to convince himself [22:29:39] RECOVERY - Puppet failure on deployment-pdf02 is OK Less than 1.00% above the threshold [0.0] [22:29:53] legoktm: Whee. [22:29:55] greg-g: Yeah, seems good to me. [22:29:57] legoktm: Most repos don't use jsonlint, no. [22:29:59] legoktm: grunt-jsonlint sounds right. [22:31:17] legoktm: SyntaxHighlight_GeSHi uses it, though. [22:31:22] legoktm: As of this week. 
:-) [22:32:10] James_F: I have a script to bump versions, should I start uploading banana-checker commits? or should we combine it with other updates? [22:37:16] yuck yuck yuck. I have to port logstash to systemd start scripts to run on jessie [22:37:27] * bd808 looks upstream to see if this is solved [22:39:57] yeah the sysv script works [22:52:48] (03CR) 10Krinkle: "We don't need it with karma. Special:JavaScriptTest/qunit/plain doesn't support this because it loads modules directly. Jon has already ap" [integration/config] - 10https://gerrit.wikimedia.org/r/216322 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [23:23:37] James_F: James_F: I have a script to bump versions, should I start uploading banana-checker commits? or should we combine it with other updates? [23:24:07] legoktm: I have a three-hour slot tomorrow to mass-update all devDeps across dozens of repos. [23:24:15] alright :D [23:24:33] legoktm: But… maybe. :-) [23:25:44] my script is pretty fast :p [23:34:39] legoktm: But does it also fix jscs issues manually? ;-) [23:39:45] James_F: no, it just complains nicely that it needs human intervention. [23:40:05] legoktm: Will it fix all npm devDeps? [23:40:09] legoktm: If so… go for it. [23:40:21] I can have it update all to the latest [23:40:35] (if they all pass) [23:40:56] legoktm: Have it update to latest and I'll manually review and merge. [23:43:26] legoktm: Doing the composer bit too? [23:43:36] James_F: just npm [23:43:45] legoktm: It's a start, at least. [23:43:45] I already did all the MW-CS updates that were trivial [23:44:12] Also there's several repos not in your table. [23:44:24] VE-core, OOjs, OOjs UI, etc. [23:44:25] But eh. [23:47:15] * legoktm git pulls all extensions [23:47:44] legoktm: None of those are MW extensions… [23:47:56] yeah, unrelated :P [23:48:00] * James_F grins. [23:55:21] !log added deployment-logstash2 host and told cluster to move logstash all data there [23:55:24] Logged the message, Master
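[Editor's note] On the logstash-on-jessie point above: a sysv init script does keep working under systemd's compatibility layer, which is presumably why "the sysv script works" and a native port could be deferred. For reference, a native unit for a JVM service like logstash of that era is a small file along these lines — the paths, user, and command are illustrative assumptions, not taken from the actual puppet module:

```ini
# /etc/systemd/system/logstash.service -- hypothetical sketch; the
# paths, user, and arguments below are assumptions for illustration.
[Unit]
Description=Logstash log collector
After=network.target

[Service]
User=logstash
Group=logstash
ExecStart=/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Compared with a sysv script, systemd handles daemonising, pidfile tracking, and restarts itself, so the unit mostly just names the command to run.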