[00:47:52] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1339936 (10Dzahn) >>! In T82319#1335241, @hashar wrote: > Does #releng has anything to do there? Seems like some infrastructure tech debt. I added sinc... [01:08:05] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340016 (10Dzahn) >>! In T82319#1335241, @hashar wrote: > Does #releng has anything to do there? Seems like some infrastructure tech debt. Actually, yo... [01:10:22] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340017 (10Dzahn) 5Open>3Resolved a:3Dzahn we use "SSLCACertificatePath /etc/ssl/certs/" in the Gerrit config (meanwhile) and that is ok too [01:13:19] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340022 (10Dzahn) https://www.ssllabs.com/ssltest/analyze.html?d=gerrit.wikimedia.org the "-" in "A-" is because we are not supporting PFS which is beca... [01:52:50] 10Continuous-Integration-Infrastructure, 6Labs, 10Labs-Infrastructure, 6operations: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1340120 (10scfc) 5Resolved>3declined (AFAIUI, the underlying issue has not been researched or r... 
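The SSLCertificateChainFile discussion above is about two ways of handing Apache's mod_ssl the intermediate CA certificates it should serve alongside the server certificate. A minimal sketch with illustrative hostnames and paths (not the actual Gerrit vhost):

```apache
<VirtualHost *:443>
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/gerrit.example.org.pem
    SSLCertificateKeyFile /etc/ssl/private/gerrit.example.org.key

    # Option named in the task: an explicit file containing the
    # intermediate certificate chain.
    SSLCertificateChainFile /etc/ssl/certs/intermediate-ca.pem

    # Option Dzahn settled on instead: point mod_ssl at a hashed
    # directory of CA certificates rather than a single chain file.
    # SSLCACertificatePath /etc/ssl/certs/
</VirtualHost>
```

Only one of the two approaches would normally be active at a time, hence the second being commented out.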
[01:52:53] 10Continuous-Integration-Infrastructure: Re-create ci slaves (March 2015) - https://phabricator.wikimedia.org/T91524#1340122 (10scfc) [02:35:43] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #593: FAILURE in 2 min 42 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/593/ [03:18:36] (03PS1) 10Dduvall: Navigation/filtering by project [integration/raita] - 10https://gerrit.wikimedia.org/r/216026 [03:28:21] PROBLEM - Free space - all mounts on deployment-eventlogging02 is CRITICAL deployment-prep.deployment-eventlogging02.diskspace._var.byte_percentfree (<30.00%) [05:10:25] (03CR) 10Pastakhov: "Hashar: please, fix it." [integration/config] - 10https://gerrit.wikimedia.org/r/207754 (owner: 10Pastakhov) [06:38:20] RECOVERY - Free space - all mounts on deployment-eventlogging02 is OK All targets OK [06:57:41] RECOVERY - Free space - all mounts on deployment-videoscaler01 is OK All targets OK [07:19:28] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce build #464: FAILURE in 54 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce/464/ [08:19:22] zeljkof: finally around [08:19:35] hashar: ready? [08:19:41] or do you need more time? [08:19:42] yeah if hangout let me in [08:19:48] ok, joining :) [08:20:47] I am in [08:22:59] hashar: I am in too! :) [08:25:46] lost ya [09:22:25] PROBLEM - Puppet failure on deployment-mx is CRITICAL 100.00% of data above the critical threshold [0.0] [09:40:01] 6Release-Engineering, 6operations, 5Patch-For-Review: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1340615 (10hashar) Thanks @Dzahn :-) [09:43:53] Yippee, build fixed! 
[09:43:54] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #506: FIXED in 6 min 52 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/506/ [10:01:34] (03PS1) 10Amire80: Add a job for ContentTranslation screenshots [integration/config] - 10https://gerrit.wikimedia.org/r/216056 [10:05:44] Project browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox » ca,contintLabsSlave && UbuntuTrusty build #1: FAILURE in 1 min 39 sec: https://integration.wikimedia.org/ci/job/browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=ca,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [10:07:17] Project browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox » en,contintLabsSlave && UbuntuTrusty build #1: FAILURE in 3 min 13 sec: https://integration.wikimedia.org/ci/job/browsertests-ContentTranslation-language-screenshot-os_x_10.10-firefox/LANGUAGE_SCREENSHOT_CODE=en,label=contintLabsSlave%20&&%20UbuntuTrusty/1/ [10:16:36] (03PS2) 10Zfilipin: Add a job for ContentTranslation screenshots [integration/config] - 10https://gerrit.wikimedia.org/r/216056 (owner: 10Amire80) [10:16:53] (03CR) 10Zfilipin: [C: 032] "The job is deployed and seems to be working fine!" [integration/config] - 10https://gerrit.wikimedia.org/r/216056 (owner: 10Amire80) [10:18:47] (03Merged) 10jenkins-bot: Add a job for ContentTranslation screenshots [integration/config] - 10https://gerrit.wikimedia.org/r/216056 (owner: 10Amire80) [10:49:18] 10Browser-Tests, 5Patch-For-Review: mediawiki_selenium 1.2 breaks mw/core browser test when reporting to raita - https://phabricator.wikimedia.org/T100904#1340744 (10hashar) I have retriggered the job which fails logging to beta but apparently managed to report to raita. https://integration.wikimedia.org/ci/v... 
[10:50:36] 10Browser-Tests, 5Patch-For-Review: mediawiki_selenium 1.2 breaks mw/core browser test when reporting to raita - https://phabricator.wikimedia.org/T100904#1340745 (10hashar) 5Open>3Resolved Solved in mediawiki_selenium 1.2.1 [10:57:58] Yippee, build fixed! [10:57:59] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #638: FIXED in 10 min: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/638/ [11:03:36] 10Browser-Tests, 5MW-1.26-release, 5Patch-For-Review, 5WMF-deploy-2015-06-09_(1.26wmf9): mediawiki_selenium 1.2 breaks mw/core browser test when reporting to raita - https://phabricator.wikimedia.org/T100904#1340768 (10hashar) Next build is all green https://integration.wikimedia.org/ci/view/BrowserTests/v... [11:33:43] PROBLEM - Free space - all mounts on deployment-videoscaler01 is CRITICAL deployment-prep.deployment-videoscaler01.diskspace._var.byte_percentfree (<50.00%) [12:40:02] 6Release-Engineering, 10Continuous-Integration-Config: MobileFrontend qunit tests should run Gather tests as well - https://phabricator.wikimedia.org/T99877#1340966 (10hashar) [12:40:34] 6Release-Engineering, 10Continuous-Integration-Config: MobileFrontend qunit tests should run Gather tests as well - https://phabricator.wikimedia.org/T99877#1300859 (10hashar) a:5Jdlrobson>3None We need a new qunit job that is like mediawiki-testextensions but runs qunit instead of PHPUnit. 
[12:48:30] (03PS1) 10Hashar: integration-phpunit-mediawiki-REL1_25 [integration/config] - 10https://gerrit.wikimedia.org/r/216077 [12:48:42] (03CR) 10Hashar: [C: 032] integration-phpunit-mediawiki-REL1_25 [integration/config] - 10https://gerrit.wikimedia.org/r/216077 (owner: 10Hashar) [12:50:49] (03Merged) 10jenkins-bot: integration-phpunit-mediawiki-REL1_25 [integration/config] - 10https://gerrit.wikimedia.org/r/216077 (owner: 10Hashar) [12:50:56] (03PS1) 10Hashar: Drop references to REL1_19 (phased out) [integration/config] - 10https://gerrit.wikimedia.org/r/216078 [12:52:51] (03CR) 10Hashar: [C: 032] Drop references to REL1_19 (phased out) [integration/config] - 10https://gerrit.wikimedia.org/r/216078 (owner: 10Hashar) [12:54:26] (03Merged) 10jenkins-bot: Drop references to REL1_19 (phased out) [integration/config] - 10https://gerrit.wikimedia.org/r/216078 (owner: 10Hashar) [13:02:57] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #672: FAILURE in 30 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/672/ [13:35:51] (03PS1) 10Hashar: Drop old comment in macro.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/216086 [13:36:01] (03CR) 10Hashar: [C: 032] Drop old comment in macro.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/216086 (owner: 10Hashar) [13:37:50] (03Merged) 10jenkins-bot: Drop old comment in macro.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/216086 (owner: 10Hashar) [13:57:37] (03PS1) 10Hashar: JJB: move single use macro in the job-template [integration/config] - 10https://gerrit.wikimedia.org/r/216090 [13:58:23] (03CR) 10Hashar: "What do you guys think about it? 
I Got tired of switching between job-templates.yaml and macro.yaml" [integration/config] - 10https://gerrit.wikimedia.org/r/216090 (owner: 10Hashar) [14:12:13] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL 33.33% of data above the critical threshold [0.0] [14:20:20] RECOVERY - Puppet staleness on deployment-restbase01 is OK Less than 1.00% above the threshold [3600.0] [14:29:17] PROBLEM - Puppet failure on deployment-restbase02 is CRITICAL 55.56% of data above the critical threshold [0.0] [14:29:20] (03PS1) 10Hashar: JJB: zuul-cloner-extdeps slightly more reusable [integration/config] - 10https://gerrit.wikimedia.org/r/216097 [14:32:11] RECOVERY - Puppet failure on deployment-restbase01 is OK Less than 1.00% above the threshold [0.0] [14:34:07] RECOVERY - Puppet staleness on deployment-restbase02 is OK Less than 1.00% above the threshold [3600.0] [14:34:21] Yippee, build fixed! [14:34:21] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #524: FIXED in 8 min 20 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/524/ [14:39:19] (03PS1) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [14:48:24] (03CR) 10Hashar: [C: 032] "Tested on VE and MobileFrontend extensions. Works!" 
[integration/config] - 10https://gerrit.wikimedia.org/r/216097 (owner: 10Hashar) [14:50:17] (03Merged) 10jenkins-bot: JJB: zuul-cloner-extdeps slightly more reusable [integration/config] - 10https://gerrit.wikimedia.org/r/216097 (owner: 10Hashar) [14:59:15] RECOVERY - Puppet failure on deployment-restbase02 is OK Less than 1.00% above the threshold [0.0] [15:15:45] (03PS1) 10Hashar: Factor VE submodule update in a macro [integration/config] - 10https://gerrit.wikimedia.org/r/216105 [15:15:58] (03CR) 10Hashar: [C: 032] Factor VE submodule update in a macro [integration/config] - 10https://gerrit.wikimedia.org/r/216105 (owner: 10Hashar) [15:17:50] (03Merged) 10jenkins-bot: Factor VE submodule update in a macro [integration/config] - 10https://gerrit.wikimedia.org/r/216105 (owner: 10Hashar) [15:20:19] gooood morning grrrit-wm [15:20:21] grr [15:20:23] greg-g: [15:20:26] good morning :} [15:20:39] challenge: will I leave work before you tonight? [15:20:42] hehe [15:20:59] I hope so [15:21:02] it's a Friday! 
[15:23:28] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL 100.00% of data above the critical threshold [0.0] [15:30:13] be back later tonight [15:44:45] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<10.00%) [15:59:45] RECOVERY - Free space - all mounts on deployment-bastion is OK All targets OK [16:06:48] bd808: just git pull-ed mw-vagrant (so gotten like a week's worth of updates), and i'm seeing this: (Cannot access the database: Can't connect to local MySQL server through socket '/dev/null' (111) ()) [16:07:08] heh no wonder it cannot access it via /dev/null [16:07:17] yeah that seems not so good [16:07:42] "have you tried turning it off and on again" :) [16:07:54] haha [16:07:59] yup, sir [16:08:04] even destroy && up [16:08:53] * bd808 does a pull and provision [16:27:24] bd808: as _joe_ just told me when I had this in translatewiki.net, he dropped support for connecting to mysql via socket [16:28:07] ah. so the newest hhvm build is probably the trick here [16:28:23] ah makes sense [16:28:46] so we need to find the config for that and probably set it to 127.0.0.1 or something [16:28:53] yup [16:29:02] yep [16:30:06] we don't set it at all and just take the DefaultSettings wgDBserver=localhost [16:35:09] Nikerabbit: thanks for pointing that out.
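The fix implied by the conversation above is to stop relying on MediaWiki's default `$wgDBserver = 'localhost'`, which makes PHP's MySQL client connect over the unix socket, and force a TCP connection instead. A hypothetical LocalSettings.php override, assuming MySQL listens on the loopback interface:

```php
<?php
// Sketch only: with the newer hhvm build no longer providing a usable
// mysql socket path (hence the "/dev/null" error in the log above),
// pointing the DB server at 127.0.0.1 makes the client use TCP.
$wgDBserver = '127.0.0.1';
```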
[16:38:48] yup, that fixed it [16:38:53] grazie bd808 Nikerabbit [16:39:01] (merging the patch now) [16:39:07] perfect [16:44:01] (03PS2) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [16:48:55] (03PS3) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [16:49:06] (03CR) 10Hashar: "rebased" [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [17:33:15] (03PS1) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [17:36:53] (03CR) 10Hashar: "Madness happening at https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/1/console" [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [18:04:17] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL 40.00% of data above the critical threshold [0.0] [18:21:32] PROBLEM - Puppet failure on integration-zuul-packaged is CRITICAL 100.00% of data above the critical threshold [0.0] [18:34:17] RECOVERY - Puppet failure on integration-slave-precise-1014 is OK Less than 1.00% above the threshold [0.0] [18:34:56] Krinkle or thcipriani, shall we move ‘integration’ to the new dns service today? [18:35:41] * Krinkle doesn't know what that means [18:35:46] I'd rather has hashar handle it. [18:35:58] have [18:36:11] which means next week since he left for the day [18:36:53] andrewbogott: oh boy. Are you trying to get everything moved by Monday? [18:37:15] thcipriani: More like ‘I am going to move everything on Monday that people have not already moved themselves’ [18:38:27] got it. Sure, I can move it. I'm pretty sure we can get it done via a few salt/sed commands. [18:39:28] andrewbogott: you around for the next little while, in case I destroy everything? 
[18:39:41] andrewbogott: Does this involve dns resolution changes? Beware that dns stuff is broken in labs as far as I'm concerned and we're using a live hack in puppet to keep it working. [18:39:50] thcipriani: I’m about to eat lunch but I won’t travel far. [18:39:50] in integration. [18:39:53] Krinkle: um… ? [18:39:56] especially with regards to the 2-dot stuff [18:39:59] tell me more, please? [18:40:01] being forced into labs [18:40:50] this has been broken for months. mutante helped us draft a patch to restore functionality for foo.bar domain names like saucelabs.com. [18:41:02] By default domains shorter than 2 dots were forced into labs namespace. [18:41:16] Every time this changed, all our qunit browsertest jobs failed. [18:41:30] and every time we reverted it in our local puppet [18:41:34] 3 times last year. [18:41:34] :) [18:42:16] it's related to legacy (or current?) short names for labs db, but we don't use that. [18:43:14] Krinkle, does the new dns server (labs-recursor0) have the same issue? [18:43:18] I don't know. [18:43:36] Afaik it was intentionally done that way and a known issue by Coren pending a better solution. [18:43:55] thcipriani: I’m eating lunch but keyboard is close at hand. [18:44:19] https://phabricator.wikimedia.org/T92351 [18:44:38] https://gerrit.wikimedia.org/r/#/c/196731/ [18:45:40] andrewbogott: If you have a minute, try a labs instance with the old dns system (the one integration uses) and try resolving saucelabs.com from the command line to reproduce that it fails by default, that it works with ndots:2 removed in /etc/resolv.conf, and that it works in the new system? [18:45:45] That'd be good news :) [18:50:46] Krinkle: I can ping and dig saucelabs.com from everywhere that I try.
[18:51:28] (03Merged) 10jenkins-bot: Create dedicated jslint/phplint instances for apps/* repos [integration/config] - 10https://gerrit.wikimedia.org/r/216159 (owner: 10Krinkle) [18:52:07] andrewbogott: I'm looking at /etc/resolv.conf everywhere in integration and it looks like ndots: 2 has been removed [18:52:23] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/216159 [18:52:26] Logged the message, Master [18:52:31] thcipriani: yeah. I can’t tell if that’s needed with the new setup or not, since I can’t reproduce the problem it’s meant to fix. [18:52:39] however, host: saucelabs.com seems to come back correctly on deployment-prep [18:56:00] (03PS1) 10Krinkle: Extend file extension filter for phplint, to include *-phplint [integration/config] - 10https://gerrit.wikimedia.org/r/216182 [18:56:24] thcipriani: what if you add back dots: 2? [18:56:31] it'd be silly of course, but just to see if that fails [18:56:35] sure [18:56:37] (03CR) 10Krinkle: [C: 032] Extend file extension filter for phplint, to include *-phplint [integration/config] - 10https://gerrit.wikimedia.org/r/216182 (owner: 10Krinkle) [18:56:40] * thcipriani checks [18:56:51] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/216182 [18:56:54] Logged the message, Master [18:57:31] Osm [18:57:39] Isn’t ndots always there, in deployment? [18:57:45] yeah it is [18:57:53] in integration I just tried adding it back [18:57:58] on one of the machines [18:58:14] (03Merged) 10jenkins-bot: Extend file extension filter for phplint, to include *-phplint [integration/config] - 10https://gerrit.wikimedia.org/r/216182 (owner: 10Krinkle) [18:58:39] host saucelabs.com seems to resolve. Looking at integration-slave-precise-1014 [18:59:07] Krinkle: I don't remember, but is the android app still going to be in the MW queue due to 'tox-flake8'? [18:59:34] host and dig seem to come back with nothing unusual [19:02:06] thcipriani: but dots:2 is removed on those instances, right? 
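The `ndots:2` behavior being tested above can be illustrated: with `options ndots:2` in /etc/resolv.conf, a name with fewer than two dots (such as `saucelabs.com`) is qualified against the search domains before being tried as-is, which is how short public names ended up forced into the labs namespace. A rough sketch of the lookup-order rule, not glibc's actual implementation (the search domain below is illustrative):

```python
def lookup_order(name, ndots, search_domains):
    """Return the candidate FQDNs a stub resolver would try, in order.

    A loose imitation of the rule driven by "options ndots:N" in
    /etc/resolv.conf: if `name` contains fewer than `ndots` dots, the
    search-list suffixes are tried before the name itself.
    """
    if name.endswith("."):  # already fully qualified, search list skipped
        return [name]
    qualified = [name + "." + domain for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + qualified  # absolute lookup tried first
    return qualified + [name]      # search list tried first

# With ndots:2, the one-dot name "saucelabs.com" gets qualified into the
# labs search domain before being tried as an absolute name.
order = lookup_order("saucelabs.com", 2, ["eqiad.wmflabs"])
```

With `ndots:1` (the glibc default) the same name would be tried as an absolute lookup first, which matches why removing the `ndots:2` override fixed the browser-test jobs.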
[19:02:20] I added it back temporarily [19:02:27] on one host [19:03:05] tried it on a trusty and precise instance, can't recreate. [19:03:11] right [19:03:19] I guess we'll find out on monday :) [19:03:22] twentyafterfour: I poked a bit, it looks like Doorkeeper kind of gets us at what we want for tracking old repo names with new repo names [19:03:26] maybe dig works [19:03:26] oh boy [19:03:47] there's a lot of layers before hitting dns inside an end-to-end run like grunt-karma and saucelabs [19:04:34] thcipriani: can you add dots:2 on a trusty host? I can pin oojs-core to a specific host for you and trigger it in Jenkins to see [19:05:58] sure integration-slave-trusty-1016 [19:06:15] should be good to go now, until puppet runs at 22 after [19:07:12] and by "good to go" I mean "options timeout: 5 ndots: 2" is set [19:09:18] ostriches: hmm, doesn't seem very well documented ;) [19:09:27] Nope [19:09:39] But basically, it allows you to tie "external objects" to things in Phab [19:09:48] The working example is something Jira with Maniphest tasks [19:09:59] But I don't see why you couldn't do the same with Gerrit data & Diffusion [19:10:24] yeah they wrote it basically to capture jira ppl and an old facebook case [19:10:29] but I don't think it's going anywhere [19:16:20] it looks like it is mostly for publishing from phabricator into other systems [19:17:04] yeah [19:17:24] what do we specifically need, to make the diffusion and gerrit stuff work better? I mean, it should be easy to map all the gerrit project hierarchy into phabricator ... 
I can make a url mapper of some sort, I'm fairly sure that's nearly trivial [19:17:41] hey twentyafterfour could you respond to https://phabricator.wikimedia.org/T100519#1340242, I talked bblack's ear off about getting ssh going yesterday and I want to keep the narrative alive [19:18:14] long story short, I think his thought was to pass ssh through lvs and terminate at the phab box, which should work / conform to standards / not require phab to be on a public ip [19:19:02] chasemp: done [19:19:37] The ip host shouldn't be an issue afaik [19:19:47] agreed [19:19:51] in fact repositories can live on a separate machine entirely [19:19:51] but I thought better coming from you [19:19:58] right [19:20:03] and lvs honestly there gives us a few gains [19:20:06] for HA [19:20:10] thcipriani: Krinkle: So, what did we learn? Anything? [19:20:11] so seems not terrible [19:20:24] chasemp: we might want to think about setting this up on a separate piece of hardware from the start [19:20:43] because it's the one really well supported way of scaling phabricator - run git on one host, web on another [19:20:43] thcipriani: ping me next time :) I didn't see you already had it set :? [19:20:46] are you worried about load? [19:20:49] OK. I gotta go in 10 min. [19:20:50] let's do this [19:20:54] ah [19:20:57] trusty-1016 [19:21:01] you know considering the load jump just from diffusion [19:21:05] because of the sheer number of repos [19:21:09] Krinkle: 1 minute [19:21:10] chasemp: well, our git repos get a lot of traffic I assume (gerrit is already bottlenecking sometimes) [19:21:12] you may be right that it's prudent [19:21:16] that'll get reset in a second here [19:21:40] chasemp: yeah, let's do it, I'm pretty damn sure we will need it [19:21:44] thcipriani: right, I'll wait 1 minute and you re-apply?
[19:22:00] it's running now [19:22:02] I just commented out the puppet run in cron [19:22:15] chasemp: our phabricator is faster than upstream (somehow) but it's still not super snappy [19:22:27] it's a beefy box [19:22:31] and dedicated db [19:22:37] which is also beefy [19:22:49] I kind of overprovisioned as far as I could sneak in :) [19:22:51] Krinkle: which job did you pin? [19:22:59] and sean was cool about it (his idea really on m3) [19:23:19] I knew phab would basically do nothing but grow [19:23:24] npm [19:23:30] https://gerrit.wikimedia.org/r/#/c/216227/ [19:23:58] but I made a mistake, hold on [19:24:04] kk [19:24:46] OK. ready [19:24:50] thcipriani: applied? [19:24:51] chasemp: yeah it looks like we have headroom, so maybe not too important to start out but it will be a slightly painful migration later on down the road [19:24:55] Krinkle: yup [19:25:06] I'm totally down with doing it up from the get go [19:25:10] https://integration.wikimedia.org/ci/job/npm/7808/console [19:25:18] the return on time for all of eng to make it performant is a no-brainer to me [19:25:37] but we'll see how it goes I guess, log the task? [19:25:43] dunno how close you are to wanting it [19:25:50] chasemp: yeah even a few seconds here and there will add up. I'm already annoyed with gerrit slowness sometimes, it slows down my work for sure [19:26:21] oh agreed [19:26:32] chasemp: I don't know where we stand with gerrit migration, it seems like it's proceeding but I haven't seen it on an official team goal yet (and I'm not sure it's going to be on next quarter either) [19:26:40] kk [19:26:47] well maybe as we go we can work out the dual box use case to be ready [19:26:49] or hackathon? [19:26:54] wikimedia are you going?
[19:27:53] I still don't have a passport :-/ [19:28:01] thcipriani: seems fine [19:28:04] g2g [19:28:08] Krinkle: seems ok for whatever reason, yeah :\ [19:28:19] Krinkle: ok, thanks for the heads up on this [19:28:44] andrewbogott: Seems like it _should be_ ok to move forward. [19:28:58] great! [19:28:59] I'll see how integration puppetmaster looks [19:30:04] chasemp: I'll ask in #phabricator, see what they think about the need for separate diffusion hosting machine [19:30:14] cool [19:30:33] I know much of it was in theory worked out 6 months or so ago? [19:30:41] but no idea who is using it if anyone [19:31:13] (03PS2) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [19:31:35] (03CR) 10Hashar: "Filter out REL1_23 / REL1_24" [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [19:34:04] (03PS4) 10Hashar: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 [19:34:11] going to unleash some madness [19:34:32] hashar: andrewbogott and I were just about to unleash some madness, actually. [19:34:44] hashar: I poked you over email but any thing needed from me for nodepool new? I know the networking stuff is still pending [19:34:45] moving to new dns on integration [19:34:46] cause infra deploys on Friday :-} [19:34:49] I haven't lost hope there yet [19:35:06] chasemp: I failed to reach out to faidon/mark about lab host net [19:35:23] yeah I'm going to do a bit of study and make a recommendation I hope if I can find some time [19:35:29] chasemp: I tried building some base image to boot them in openstack, but end up in a dead end :-//// [19:35:34] best way to get feedback is to troll with a wrong solution :) [19:35:38] kk [19:35:42] anything I can do there to help? [19:35:43] chasemp: so in short nothing to worry on your side. 
Gotta write some summary eventually [19:35:49] sounds good man [19:36:03] I tried creating images using operations/puppet.git to reuse all the code there [19:36:21] but that is tied to labs/prod context and does not work well on a local machine hehe [19:36:52] so in short there are no images yet [19:37:02] twentyafterfour: I think if we just added an additional column to diffusion repos (like you can with profiles, tasks, etc), that stores the gerrit repo name [19:37:04] That's all we need [19:37:08] twentyafterfour: when you create a new wmf branch, the commit that adds the submodules and stuff (https://git.wikimedia.org/commitdiff/mediawiki%2Fcore/2bee3bb7008e15fb31214b27f4dc958519e0a488) skips gerrit right? is there a reason for that? [19:37:12] Then we can build a redirector that uses this data [19:37:22] hashar: Now that Krinkle|detached removed the submit button for the android app what can we do to retrigger a merge when the initial build failed? I added a "recheck" comment on https://gerrit.wikimedia.org/r/#/c/210122/, which reran the build, this time successfully, but it doesn't merge [19:37:54] twentyafterfour: Goal 1 is "shut down Gitblit and use Phab for all repo browsing and mirroring" [19:37:54] bearND: ah yeah recheck just triggers the test and does not get it merged [19:38:02] bearND: you want to remove your CR+2 and reapply it [19:38:07] legoktm: it's automated [19:38:16] hashar: ok, will do. thanks [19:38:17] ostriches: right [19:38:18] bearND: that will get the change to enter the gate-and-submit pipeline which will merge the change [19:38:35] bearND: one day we will make it so 'recheck' actually merges the change if there is a CR+2 applied [19:38:54] twentyafterfour: can we have it go through gerrit? specifically when we move to using composer to run tests, I'd like for the new wmf branch to run the jenkins tests to make sure the right dependencies made it into mediawiki/vendor [19:38:55] legoktm: is there a reason that it needs to go through review?
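hashar's explanation of why "recheck" re-runs tests but never merges follows from how Zuul pipelines are triggered: the test pipeline listens for comments, while only a fresh CR+2 vote enqueues a change into gate-and-submit. A simplified, from-memory sketch of a Zuul v2 layout (the regexes and manager names are illustrative, not the live layout.yaml):

```yaml
pipelines:
  - name: test
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: patchset-created
        # "recheck" comments only re-trigger the test pipeline; they
        # never cause a merge on their own.
        - event: comment-added
          comment: (?i)^\s*recheck\s*$

  - name: gate-and-submit
    manager: DependentPipelineManager
    trigger:
      gerrit:
        # Only a freshly applied CR+2 enqueues the change here, which
        # is why removing and re-applying the +2 makes it merge.
        - event: comment-added
          approval:
            - code-review: 2
```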
[19:39:12] hashar: btw: Looks like when we change the build.gradle file to use a newer version of the support library or the build-tools package the build fails the first time it runs per build slave [19:39:28] legoktm: that makes my deployment process a lot worse [19:39:41] how so? [19:40:07] !log refreshed Jenkins jobs mediawiki-extensions-hhvm and mediawiki-extensions-zend with https://gerrit.wikimedia.org/r/#/c/216100/3 (refactoring) [19:40:10] Logged the message, Master [19:40:22] (03CR) 10Hashar: "Deployed" [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [19:40:46] legoktm: maybe a lot worse is an exaggeration. It means that I will have to wait for gerrit and CI in between two steps that are currently automated into one step. [19:41:07] more CI is a good thing but more steps is a bad thing [19:41:19] I mean more testing is a good thing [19:42:01] ostriches: I think we can do that using a separate table to map the repositories that way we don't meddle with phabricator's sql schema other than adding one table of our own [19:42:09] hashar: cool. that worked this time. Thanks! [19:42:15] ostriches: it's the same as what we did with bugzilla ids [19:42:19] dbrant: ^^ [19:42:27] twentyafterfour: That sounds reasonable. [19:43:19] bearND: roger that; thanks [19:43:23] ostriches: then we can make a redirector that runs on the phab host but responds to the old urls (that's exactly what we did for bugzilla migration and it worked ok) [19:43:38] * ostriches nods [19:44:29] twentyafterfour: isn't the new branch created beforehand though? [19:44:51] ostriches: the only thing I don't know about is keeping it in sync if we continue to add things in gerrit after the migration ... 
with bugzilla we didn't add a UI to maintain the mappings because it was a one-time thing, not an ongoing situation [19:45:19] !log set use_dnsmasq: false on Hiera:Integration [19:45:22] legoktm: not really, I usually do it right before the deployment window because the branching takes a long time [19:45:22] Logged the message, Master [19:45:35] twentyafterfour: Well, the tool can learn based on the gerrit.wikimedia.org urls we have set as the upstream repo URL [19:45:46] hmm, I thought Reedy usually prepared everything beforehand [19:45:56] Starts with 0 data, inserts it as it finds a repo matching that name. [19:45:59] ostriches: oh good point [19:46:31] legoktm: "beforehand" yes before the deployment window but it's still a long process that will become longer [19:46:35] yeah :/ [19:46:41] PROBLEM - Puppet failure on deployment-zookeeper01 is CRITICAL 100.00% of data above the critical threshold [0.0] [19:46:50] legoktm: I'm not entirely opposed to it; if it's important then so be it if it takes me longer to do [19:47:30] legoktm: but honestly I really want to eliminate the weekly branching entirely and have long-lived release branches. The new model will be to merge into the release branch instead of creating a new one each week [19:48:13] twentyafterfour: really all we need is that the checkComposerLockUpToDate.php script is run after branch creation, which jenkins does. You could run it manually after branch creation? but the script requires MW to be installed... [19:48:35] legoktm: hmm that might work [19:49:04] does it require mediawiki to be functional or just needs to be on disk? [19:49:15] I could do it in vagrant maybe?
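The redirector idea discussed above (a side table mapping old Gerrit project names to Diffusion repos, kept separate from Phabricator's own schema, as was done for Bugzilla ids) can be sketched minimally. The callsigns and URL shape below are assumptions for illustration, not the actual migration code:

```python
# Hypothetical mapping from old Gerrit project names to Diffusion
# callsigns; in the conversation this would live in a dedicated SQL
# table, with a dict standing in for it here. The callsigns are made up.
GERRIT_TO_DIFFUSION = {
    "mediawiki/core": "MW",
    "integration/config": "CICF",
}

def redirect(gerrit_project, phab_base="https://phabricator.wikimedia.org"):
    """Map an old Gerrit project name to a Diffusion URL, or None.

    Returning None lets the caller fall through to a 404, or lazily
    learn the mapping from a repo's upstream URL as suggested above
    ("starts with 0 data, inserts it as it finds a repo").
    """
    callsign = GERRIT_TO_DIFFUSION.get(gerrit_project)
    if callsign is None:
        return None
    return "%s/diffusion/%s/" % (phab_base, callsign)
```

A small web handler on the phab host could then answer the old Gitblit/Gerrit URLs with 301 redirects built from this lookup.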
[19:49:16] for the record there are other unit tests that should be run, like the one that checks that $wgVersion is set to a sane value [19:49:33] it needs to be reasonably functional [19:49:36] you could run it in vagrant yeah [19:49:39] (03CR) 10Hashar: [C: 032] Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [19:50:22] if we get to having long-lived branches all this becomes moot. then we could rely on CI a lot more and spend a lot less time creating submodules over and over each week [19:50:49] sure [19:50:53] <+thcipriani> !log set use_dnsmasq: false on Hiera:Integration # definitely some madness :-} [19:51:22] hashar: already broke deployment-prep this week, more ready for this one :) [19:51:23] but I want to land my composer jenkins change sooner rather than later, and I think changing the entire wmf deployment process will fall under later ;-) [19:51:31] thcipriani: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/ that might be a good test [19:51:34] (03Merged) 10jenkins-bot: Refactor mediawiki-extensions-{phpflavor} [integration/config] - 10https://gerrit.wikimedia.org/r/216100 (owner: 10Hashar) [19:51:38] thcipriani: it hits beta from the integration project [19:52:20] andrewbogott: could you take a look at integration-puppetmaster?
[19:52:23] 00:00:34.012 FAILED (errors=1) [19:52:23] 00:00:34.194 Finished: SUCCESS [19:52:25] grrr [19:53:24] ah, dangit, nvmd [19:53:31] puppetmaster needed a kick [19:53:47] (03PS3) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [19:54:28] PROBLEM - Puppet failure on integration-puppetmaster is CRITICAL 40.00% of data above the critical threshold [0.0] [19:54:57] legoktm: hopefully not much later [19:55:11] legoktm: It's high priority for me at least ;) [19:55:22] thcipriani: yep, looks fine to me [19:55:24] legoktm: but go ahead and land your stuff and just let me know what I need to do [19:56:38] ok, I'll file a bug and assign it to you? [19:56:52] legoktm: the biggest problem is going to be the time it takes for jenkins to test this - it's gonna be slow, and really I'd prefer to run all the tests locally since I will already have a working copy that's in the right state - it'll take at least 15 minutes for jenkins to clone all those submodules and set up the right state [19:56:57] legoktm: ok [19:57:53] PROBLEM - Puppet failure on integration-slave-precise-1013 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:58:01] PROBLEM - Puppet failure on integration-slave-trusty-1013 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:58:34] PROBLEM - Puppet failure on integration-slave-precise-1012 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:58:38] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:58:42] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:59:00] PROBLEM - Puppet failure on integration-saltmaster is CRITICAL 10.00% of data above the critical threshold [0.0] [19:59:11] relevant https://www.youtube.com/watch?v=hwm9DvF24Ag [19:59:58] PROBLEM - Puppet 
failure on integration-slave-trusty-1015 is CRITICAL 50.00% of data above the critical threshold [0.0] [20:01:50] PROBLEM - Puppet failure on integration-zuul-server is CRITICAL 40.00% of data above the critical threshold [0.0] [20:02:12] PROBLEM - Puppet failure on integration-labsvagrant is CRITICAL 66.67% of data above the critical threshold [0.0] [20:02:32] PROBLEM - Puppet failure on integration-slave-trusty-1017 is CRITICAL 30.00% of data above the critical threshold [0.0] [20:02:58] PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL 60.00% of data above the critical threshold [0.0] [20:03:12] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL 66.67% of data above the critical threshold [0.0] [20:04:22] PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL 60.00% of data above the critical threshold [0.0] [20:04:30] RECOVERY - Puppet failure on integration-puppetmaster is OK Less than 1.00% above the threshold [0.0] [20:05:14] (03PS1) 10Hashar: Migrate UploadWizard-api* jobs to Trusty [integration/config] - 10https://gerrit.wikimedia.org/r/216297 (https://phabricator.wikimedia.org/T101550) [20:05:14] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL 50.00% of data above the critical threshold [0.0] [20:05:19] PROBLEM - Puppet failure on integration-vmbuilder-trusty is CRITICAL 20.00% of data above the critical threshold [0.0] [20:05:21] (03CR) 10Hashar: [C: 032] Migrate UploadWizard-api* jobs to Trusty [integration/config] - 10https://gerrit.wikimedia.org/r/216297 (https://phabricator.wikimedia.org/T101550) (owner: 10Hashar) [20:05:42] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL 20.00% of data above the critical threshold [0.0] [20:05:49] wikibugs? 
[20:05:55] twentyafterfour: filed https://phabricator.wikimedia.org/T101551?workflow=create [20:06:02] PROBLEM - Puppet failure on integration-publisher is CRITICAL 66.67% of data above the critical threshold [0.0] [20:07:11] legoktm: thanks [20:07:15] (03Merged) 10jenkins-bot: Migrate UploadWizard-api* jobs to Trusty [integration/config] - 10https://gerrit.wikimedia.org/r/216297 (https://phabricator.wikimedia.org/T101550) (owner: 10Hashar) [20:09:00] RECOVERY - Puppet failure on integration-saltmaster is OK Less than 1.00% above the threshold [0.0] [20:12:08] thcipriani: everything going ok? [20:12:21] andrewbogott: yeah, just salt wrestling :) [20:13:17] PROBLEM - Puppet failure on integration-raita is CRITICAL 40.00% of data above the critical threshold [0.0] [20:18:04] andrewbogott: everything that talks to integration-saltmaster should start recovering shortly, I'm trying to find outliers [20:23:38] andrewbogott: would you look at integration-raita? What is up with the notices? Seems to be making puppet's exit status non-0. [20:27:59] thcipriani: integration-raita looks fine to me.
There are notices about /etc/ssh/userkeys/ubuntu on pretty much every instance these days [20:28:11] I can remove those files and make puppet shut up, if you like :) [20:28:23] Project UploadWizard-api-commons.wikimedia.org build #1615: FAILURE in 38 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.org/1615/ [20:28:42] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #2048: FAILURE in 41 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/2048/ [20:28:48] well as long as shinken doesn't count that as a failure, it's fine [20:29:07] nah, looks happy to me [20:31:49] the UploadWizard-api jobs breaking is me [20:37:00] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [20:37:01] thcipriani: if all is peaceful, I’m going to step away for 15-20 [20:37:01] marxarelli: is deployment-mediawiki03 still in use? That was the host you set up for the pen testing stuff in December [20:37:31] bd808: not that i know of, but you might want to double check with csteipp [20:37:48] andrewbogott: everything seems fine. Just mostly waiting for instances to re-run, checking as I wait. [20:37:53] do we still have the custom varnish patch to get access to it? [20:38:04] bd808: I'm planning to use it! [20:38:47] csteipp: like between today and say next Friday? [20:39:08] bd808: Nope [20:39:37] bd808: Just in the future I'm planning to bang heavily on beta... and I don't want to take out stuff for everyone. [20:39:41] bd808: not sure if it's still cherry-picked on deployment-puppetmaster but the ps was never merged [20:39:44] https://gerrit.wikimedia.org/r/#/c/158016/ [20:39:47] :( [20:39:55] the varnish patch is gone from deployment-salt [20:40:06] blerg. [20:41:00] looks like the patch will need some refactoring [20:41:30] Here's why I ask.
I need to build a new logstash server using jessie + a new security group and there isn't enough quote to spin it up [20:41:38] so I'm hunting for old junk to kill [20:41:44] *quotta [20:42:05] updated puppet master on deployment-salt yesterday (day before). I don't think I smashed any patches. [20:42:33] bd808: Totally fine by me, as long as you put it back [20:42:55] thcipriani: naw I think yuvipanda killed the cherry-pick in the great "beta cluster == prod" purge [20:43:43] that does sound like him [20:44:22] I spent a long time with, "that does sound like Y" hitting tab, where is yuvipanda‽ [20:44:35] bd808: just increase our quota [20:44:49] greg-g: well that's just too easy ;) [20:45:03] take the gmail approach: archive never delete ;) [20:45:19] I don't know why I said that, /me is looking at beta logstash right now and confused by it [20:45:49] greg-g: my gmail was full. then i deleted 200.000 cron spam mails [20:45:55] thanks labs [20:45:56] I click on one of the little magnifying glasses to limit the results to only that type and the result is 0 [20:46:02] mutante: :) :) [20:46:24] greg-g: lame. which board and field? [20:47:12] I see the viedoscaler is back to whining about having bad jobs [20:47:21] PHP Fatal error: /srv/mediawiki/wikiversions-labs.cdb has no version entry for `mswiki`. [20:49:33] PROBLEM - Puppet failure on integration-dev is CRITICAL 100.00% of data above the critical threshold [0.0] [20:49:48] bd808: I was on fatalmonitor, and the clicked the magnifying glass for the error I just reported in https://phabricator.wikimedia.org/T101558 [20:50:04] wanted a temp saved search url to share [20:50:10] that's a known mod_fcgi bug [20:50:15] it's junk [20:50:22] I couldn't find it in phab... is there a ticket? 
[20:50:33] probably not [20:50:38] (sorry, I didn't mean to snipe you) [20:50:40] there are upstream bugs [20:51:00] meh I'm playing logstash today so not a distraction [20:51:34] the error message means that the client disconnected from apache before hhvm responded [20:51:47] thcipriani: integration-slave-trusty-1013.eqiad.wmflabs cant resolve integration-puppetmaster.integration.eqiad.wmflabs [20:51:56] thcipriani: I guess the DNS migration is still going on isn't it ? [20:52:05] hashar: looking, but yes it is [20:52:13] no worries so [20:53:28] those are the hosts I'm digging around trying to find. I updated all the hosts that'll talk to salt, now just trying to find failures. [20:54:00] bd808: but yeah, no matter which magglass I click on in fatalmonitor takes me to a zero result page [20:55:41] offtopic: I love software that has built in dark themes [20:57:26] greg-g: you're cursed? https://logstash-beta.wmflabs.org/#dashboard/temp/tZOncd0TTIuNSE1cyk7cSQ [20:57:54] * greg-g grumbles and tries again [20:58:39] opening up the filtering section can help see what the problem is occasionally. sometimes a weird filter gets added [20:59:10] So the bug here is that our apache mod_fcgi timeout is less than our hhvm timeout [20:59:18] but that may not really be a bug [20:59:21] oh, THAT issue [20:59:39] because we get into situations where we tell hhvm/php to run forever [20:59:41] the one that got gabriel and $someopsen to argue about services and timeouts and such [20:59:48] it's related [20:59:52] (or maybe I'm conflating) [21:00:13] Gabriel wants MW to guarantee a <3s response [21:00:16] which is madness [21:00:28] SLA man, SLA ;) [21:00:53] yeah. 
the thing about SLAs is you can't just yank a number out of your ear and hold others to it [21:01:12] we have a 3 second connect timeout to mysql :) [21:01:18] :) [21:01:42] so we can't have a sub-three second SLA on anything that talks to a db [21:01:50] which is everything [21:02:02] * greg-g is just joshn' you [21:02:06] * greg-g has never typed that before [21:02:15] oh I know I just like to rant [21:02:32] :) [21:03:53] meanwhile [21:04:02] Yippee, build fixed! [21:04:02] Project UploadWizard-api-commons.wikimedia.org build #1618: FIXED in 32 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.org/1618/ [21:04:09] ton of log errors have been dealt with and are no more in logstash [21:04:13] wmf-insecte: good guy [21:04:14] hashar you may not issue bot commands in this chat! [21:04:39] hashar: :) [21:04:44] and [21:05:04] hashar: you did a ton of triage of those while I was sleeping, you rock [21:05:06] Yippee, build fixed! [21:05:06] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #2049: FIXED in 35 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/2049/ [21:05:08] the UploadWizard-api* smoke jobs have been silently broken for ages :( [21:05:14] yeah, saw that bug roll by [21:05:16] :/ [21:06:12] yeah 1 hour 10 to fix it [21:07:57] thanks for catching that hashar [21:08:03] (03PS4) 10Hashar: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) [21:08:32] tgr: yeah that is the kind of issues I hate [21:08:51] tgr: it is unnoticed / under the carpet and the second you see it you know you must fix it on spot [21:09:28] tgr: thanks for your kind words :} [21:10:46] no [21:10:49] w [21:10:59] lets go the crazy 50 minutes till midnight deploy [21:12:26] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [21:13:22] https://youtu.be/N1KfJHFWlhQ
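[Editor's note] The timeout thread above describes a real layering invariant: each layer in front should wait at least as long as the layer behind it, otherwise the front layer disconnects (and logs a spurious error, like the mod_fcgi one discussed) while the backend is still working. A minimal sketch of that check — the layer list and all the numbers except the 3 s MySQL connect timeout are illustrative assumptions, not the production values:

```python
# Hypothetical request path, front (client) to back (database).
# Only the 3 s MySQL connect timeout comes from the discussion above;
# the other figures are made up for illustration.
CHAIN = [
    ("client SLA", 3.0),      # the proposed sub-3-second response guarantee
    ("apache mod_fcgi", 30.0),
    ("hhvm request", 60.0),
    ("mysql connect", 3.0),
]

def first_violation(chain):
    """Return the first adjacent pair where a front layer's timeout is
    smaller than the one behind it, i.e. the front layer would give up
    before the backend has necessarily finished. None if the chain is
    consistently ordered."""
    for (front_name, front_t), (back_name, back_t) in zip(chain, chain[1:]):
        if front_t < back_t:
            return (front_name, back_name)
    return None

print(first_violation(CHAIN))        # the <3 s SLA breaks first
print(first_violation(CHAIN[1:]))    # apache timeout < hhvm timeout
```

Under these assumed numbers the check flags exactly the two problems raised in the chat: a sub-3-second SLA cannot sit in front of a stack whose DB connect phase alone may take 3 seconds, and an Apache timeout shorter than HHVM's cuts clients off while HHVM is still running.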
[21:13:50] ahahhah [21:14:03] can't wait for my second kid to start that phase [21:14:30] (03CR) 10Hashar: [C: 032] "Lets qunit stuff https://youtu.be/N1KfJHFWlhQ" [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [21:15:17] (copyright nerds will also know of and appreciate that video) [21:15:47] oh I have a manager to deal with legal stuff [21:15:59] * hashar grins [21:16:16] (03Merged) 10jenkins-bot: Common job for extensions that runs qunit [integration/config] - 10https://gerrit.wikimedia.org/r/216132 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [21:16:38] :) [21:17:19] !log Pooled in mediawiki-extensions-qunit which runs qunit tests with karma with multiple extensions . https://gerrit.wikimedia.org/r/#/c/216132/ . https://phabricator.wikimedia.org/T99877 [21:17:23] Logged the message, Master [21:17:26] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL 100.00% of data above the critical threshold [43200.0] [21:17:29] * hashar grabs a beer [21:18:26] PROBLEM - Puppet failure on deployment-parsoidcache02 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:19:40] PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:21:59] !log restarted puppetmaster on deployment-salt ("Could not request certificate: Error 500 on SERVER: ") [21:22:02] Logged the message, Master [21:24:05] I don't drink often enough at home to have much in my fridge, so I'm drinking a Shock Top my mom brought back in.... january [21:24:20] it's not good [21:24:22] that is yucky on multiple levels [21:24:42] yeah, actually, I just took my first real swig.... 
I don't think I'm going to continue [21:25:21] irc is bad at beer delivery -- 🍺 [21:25:26] that's the best I can do [21:26:59] and the unicode didn't pan out over here either [21:27:37] * bd808 sends greg-g better fonts [21:28:20] * greg-g blames the standard debian install on digital ocean [21:29:43] andrewbogott: huh. so, the integration project has been migrated to new dns, all puppet runs look fine (logging in, looking at puppet.log/running puppet) shinken still not happy :( [21:35:12] thcipriani: is there any super secret thing I need to do to spin up a new deployment-prep instance these days? My new deployment-logstash2.eqiad.wmflabs instance is stuck in a loop where it is getting 500 errors from the puppetmaster. I think it is actually blowing up trying to do the first puppet run against the labs master [21:36:40] I haven't run into that [21:37:00] k. I shouted for root help in -labs [21:37:00] 10Deployment-Systems, 6Release-Engineering: Run checkComposerLockUpToDate.php after creating a new WMF deployment branch - https://phabricator.wikimedia.org/T101551#1342468 (10mmodell) Chat log: ``` [19:46:50] legoktm: I'm not entirely apposed to it if it's important then so be it if it take... [21:37:07] they will get to me at some point [21:38:00] I did have to restart the puppetmaster twice yesterday, had to kill the pid, just locked up, wasn't serving, netstat didn't show the port as active. [21:39:23] legoktm: If your bot is scouring repos, could you also report on MediaWiki core, and for jscs could you report on how many rules are defined in .jscsrc beyond 'preset: wikimedia'?
:-) [21:39:48] 10Beta-Cluster, 10Wikimedia-Logstash, 15User-Bd808-Test: Build jessie based elasticsearch/logstash/kibana (ELK) host for beta testing - https://phabricator.wikimedia.org/T101541#1342474 (10bd808) p:5Triage>3High [21:40:25] 10Beta-Cluster, 10Wikimedia-Logstash, 15User-Bd808-Test: Build jessie based elasticsearch/logstash/kibana (ELK) host for beta testing - https://phabricator.wikimedia.org/T101541#1342048 (10bd808) [21:42:29] bd808: it's weird, I can't even get to your instance, even though it's in the deployment-prep project [21:42:37] public key errors [21:42:43] James_F: probably. How should I fit that into the table? [21:43:03] thcipriani: me neither. I think the first puppet run after provision is failing [21:43:16] so it doesn't know how to be a labs host yet [21:43:20] legoktm: Maybe in the jscs column put "1.8.0" or "1.5.0; 4 over-rides" if there are over-rides? [21:43:22] ah, yeah, I think labs puppetmaster is having a bad time [21:43:31] ok [21:43:35] http://shinken.wmflabs.org/service/labs-puppetmaster/Labs%20Puppetmaster%20HTTPS [21:43:50] And can you tell what is "latest", so we can put green ticks in the columns if they're up-to-date? [21:44:01] (03PS1) 10Hashar: mediawiki-extensions-qunit-mobile [integration/config] - 10https://gerrit.wikimedia.org/r/216322 (https://phabricator.wikimedia.org/T99877) [21:44:07] legoktm: (I'm very demanding, sorry. :-)) [21:44:12] :P [21:44:19] PMs and their demands [21:44:24] does npm or whatever have an API? [21:44:30] I know the packagist API [21:44:39] legoktm: `npm outdated` [21:44:49] legoktm: I'm sure there's an API for it. 
[21:45:46] 10Browser-Tests: When beta labs is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342483 (10Jdlrobson) 3NEW [21:46:41] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342491 (10greg) [21:46:59] (03CR) 10Hashar: [C: 04-1] "For mobile we run the qunit tests passing &useformat=mobile . That is why we have the qunit-querystring macro so we can vary." [integration/config] - 10https://gerrit.wikimedia.org/r/216322 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [21:47:21] I AM DONE !!!!!!!!!!!!!!!!!!!!! [21:47:39] For your tirelessly work on Jenkins, CI, tests, Librarifizification, here is the official Continuous integration barnstar! [21:47:40] /\ [21:47:40] /**\ [21:47:40] _______/****\_______ [21:47:40] *.******/^^\******.* [21:47:40] *.***( () )***.* [21:47:40] *.**\,./**.* [21:47:41] /**.**.**\ [21:47:41] /*.* *.*\ [21:47:42] /.* *.\ [21:47:42] ' ` [21:48:28] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342483 (10greg) p:5Triage>3Low Setting low just due to complexity for now. I like this idea. It could even, if doing it in real time (canceling jenkins... [21:48:51] hashar: g'night! [21:49:46] :> [21:50:18] James_F: npm has managed to destroy any hopes of googling "npm api" by calling all of their documentation "API documentation" [21:50:29] legoktm: Yeah. 
[21:50:42] legoktm: Krinkle|detached is the only person I'd ask for this stuff… [21:51:45] `npm outdated` doesn't seem to be working for me [21:52:16] I have banana-checker 0.2.0 installed, run `npm outdated` and it outputs an empty newline instead of saying 0.2.1 is available [21:53:02] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342512 (10hashar) [21:53:04] 10Continuous-Integration-Infrastructure: Preview generated documentation in test pipeline for review - https://phabricator.wikimedia.org/T72945#1342511 (10hashar) [21:54:42] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342113 (10hashar) >>! In T100294#1322186, @hashar wrote: > ... > At first we would need a place to push temporary materials tool. A lot of disk (m... [21:55:02] hashar: you said you were done [21:55:04] you lied [21:55:09] that's going in your annual review [21:55:29] 10Continuous-Integration-Infrastructure, 7Epic: Provide (pre-merge) code coverage reports on patchsets - https://phabricator.wikimedia.org/T101544#1342521 (10hashar) [21:55:30] 6Release-Engineering, 10Gather, 6Mobile-Web, 10MobileFrontend, and 2 others: [EPIC] Encourage developers to increase code coverage - https://phabricator.wikimedia.org/T100294#1309866 (10hashar) [21:56:13] legoktm: `npm install && npm outdated`? [21:56:27] James_F: tried that too :( [21:56:33] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342522 (10hashar) p:5Triage>3Low @Jdforrester-WMF thanks a ton for filling those tasks. With the current team workload that is probably not g... [21:56:50] legoktm: Well, never mind. Can you merge https://github.com/wikimedia/grunt-banana-checker/pull/20 in the meanwhile? 
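[Editor's note] An aside on the `npm outdated` mystery and the earlier hunt for an npm API: the registry itself speaks plain JSON over HTTP, so screen scraping shouldn't strictly be necessary; and older npm versions compare the installed version against the semver range in package.json, so an exactly pinned dependency can look "up to date" and print nothing even when a newer release exists — a plausible, but unconfirmed, explanation for the empty output. A sketch of the version comparison such a check performs; the helper names are mine, and real semver handling (pre-release tags, ranges) is deliberately skipped:

```python
def parse(version):
    """Split a plain x.y.z version string into a comparable tuple.
    Deliberately ignores semver pre-release/build suffixes."""
    return tuple(int(part) for part in version.split("."))

def is_outdated(installed, latest):
    """True when the registry's latest version is newer than the
    installed one, compared field by field (so 0.10.0 > 0.9.9)."""
    return parse(latest) > parse(installed)

# The 0.2.0 -> 0.2.1 case from the discussion:
print(is_outdated("0.2.0", "0.2.1"))   # -> True

# Fetching "latest" from the registry would look something like this
# (left commented out here since it needs network access):
#   import json, urllib.request
#   url = "https://registry.npmjs.org/grunt-banana-checker/latest"
#   latest = json.load(urllib.request.urlopen(url))["version"]
```

Tuple comparison is why this is done numerically rather than as a string compare: `"0.10.0" < "0.9.9"` lexically, but `(0, 10, 0) > (0, 9, 9)`.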
[21:57:10] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342525 (10dduvall) So we already have a preliminary script in the job that queries the API for the current `git_branch`. ``` /srv/deployment/integration/sla... [21:57:40] done [21:58:19] legoktm: Ta. [21:58:21] legoktm: Now release 0.2.2 exists. :_) [21:58:23] 10Continuous-Integration-Infrastructure, 7Epic: Provide (pre-merge) performance reports on patchsets - https://phabricator.wikimedia.org/T101543#1342535 (10hashar) p:5Triage>3Low [21:58:24] 10Continuous-Integration-Infrastructure, 7Epic: Provide pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101542#1342537 (10hashar) p:5Triage>3Low [21:58:26] 10Continuous-Integration-Infrastructure, 7Epic: Provide (pre-merge) code coverage reports on patchsets - https://phabricator.wikimedia.org/T101544#1342539 (10hashar) p:5Triage>3Low [22:00:13] I'll just screen scrape for now [22:01:14] 10Continuous-Integration-Infrastructure: Jenkins: Fail on BOM in submitted files - https://phabricator.wikimedia.org/T40233#1342542 (10hashar) [22:02:39] legoktm: https://github.com/wikimedia/grunt-banana-checker/pull/21 [22:02:45] 10Continuous-Integration-Infrastructure: Jenkins: Fail on BOM in submitted files - https://phabricator.wikimedia.org/T40233#453795 (10hashar) [22:02:56] 10Continuous-Integration-Infrastructure: Jenkins: Fail on BOM in submitted files - https://phabricator.wikimedia.org/T40233#453795 (10hashar) [22:03:40] * legoktm waits for travis [22:03:40] RECOVERY - Puppet failure on integration-slave-precise-1011 is OK Less than 1.00% above the threshold [0.0] [22:03:42] legoktm: It's a Markdown file… :-P [22:03:44] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK Less than 1.00% above the threshold [0.0] [22:03:59] you also touched the license thingy [22:04:03] Meh. [22:04:04] OK. [22:04:10] * James_F glares at travis. 
[22:04:21] RECOVERY - Puppet failure on integration-slave-trusty-1012 is OK Less than 1.00% above the threshold [0.0] [22:04:40] PROBLEM - Puppet failure on deployment-pdf02 is CRITICAL 100.00% of data above the critical threshold [0.0] [22:04:59] RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK Less than 1.00% above the threshold [0.0] [22:05:19] RECOVERY - Puppet failure on integration-slave-precise-1014 is OK Less than 1.00% above the threshold [0.0] [22:05:19] RECOVERY - Puppet failure on integration-vmbuilder-trusty is OK Less than 1.00% above the threshold [0.0] [22:05:41] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK Less than 1.00% above the threshold [0.0] [22:05:42] legoktm: Go. [22:06:05] RECOVERY - Puppet failure on integration-publisher is OK Less than 1.00% above the threshold [0.0] [22:06:50] RECOVERY - Puppet failure on integration-zuul-server is OK Less than 1.00% above the threshold [0.0] [22:06:52] 10Browser-Tests, 10Continuous-Integration-Infrastructure, 7Epic, 7Tracking: [EPIC] trigger browser tests from Gerrit (tracking) - https://phabricator.wikimedia.org/T55697#1342572 (10hashar) p:5Normal>3Low [22:07:12] RECOVERY - Puppet failure on integration-labsvagrant is OK Less than 1.00% above the threshold [0.0] [22:07:15] 10Browser-Tests: When beta cluster is down Jenkins jobs should be aborted and not trigger e-mail notifications - https://phabricator.wikimedia.org/T101563#1342576 (10dduvall) p:5Low>3Triage [22:07:20] done :P [22:07:37] RECOVERY - Puppet failure on integration-slave-trusty-1017 is OK Less than 1.00% above the threshold [0.0] [22:07:43] James_F: is there any specific green checkmark image you want to use? 
[22:07:53] RECOVERY - Puppet failure on integration-slave-precise-1013 is OK Less than 1.00% above the threshold [0.0] [22:07:57] RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK Less than 1.00% above the threshold [0.0] [22:08:01] RECOVERY - Puppet failure on integration-slave-trusty-1013 is OK Less than 1.00% above the threshold [0.0] [22:08:07] legoktm: https://commons.wikimedia.org/wiki/File:Light_green_check.svg is traditional. [22:08:14] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK Less than 1.00% above the threshold [0.0] [22:08:16] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342578 (10hashar) [22:08:18] 10Continuous-Integration-Infrastructure: Store Jenkins build output outside Jenkins (e.g. static storage) - https://phabricator.wikimedia.org/T53447#1342577 (10hashar) [22:08:20] RECOVERY - Puppet failure on integration-raita is OK Less than 1.00% above the threshold [0.0] [22:08:34] RECOVERY - Puppet failure on integration-slave-precise-1012 is OK Less than 1.00% above the threshold [0.0] [22:08:34] legoktm: Or just ✓ [22:08:50] 10Continuous-Integration-Infrastructure, 7Epic: Provide infrastructure to store data for pre-merge reports on patchsets - https://phabricator.wikimedia.org/T101545#1342113 (10hashar) >>! In T53447#1202474, @hashar wrote: > Potentially, we can already write a wrapper that would send the log to a central storage... [22:09:03] sleep [22:10:40] Bye hashar. [22:12:10] thcipriani: everything ok? I’m about to punch out for the day. [22:12:34] …just as soon as shinken starts reporting recoveries [22:12:54] yeah, it's weird, restarting the labs puppetmaster seems to have let the flood gates loose on shinken [22:13:09] the instances have been fine for a while, but shinken still reported them as having problems [22:13:28] Yippee, build fixed! 
[22:13:29] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #673: FIXED in 31 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/673/ [22:13:49] but yes, integration has been moved to the new dns, no issues seemingly [22:14:17] great, thanks! [22:14:21] I’ll break more stuff on Monday :) [22:14:29] oh good. [22:14:47] Well, your stuff I mostly won’t break until Thursday [22:15:45] greg-g: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=161966 – are we deploying anyway, or just not on those dates? [22:17:29] James_F: I figured since mukunda will be around (and myself), we'll be ok doing the train [22:17:49] does that effect VE badly? [22:20:08] greg-g: No no, I'm unworried. [22:20:14] kk, just making sure [22:20:21] greg-g: Just no edit summary -> me concerned it might be a mis-edit. :-) [22:21:07] yeah, also, it was a brain fart, I was, for a moment, thinking that was June and I was like "no no, we can't do that right after we switch to the new cadence!" [22:21:20] but now that I've done it.... [22:21:45] James_F: added the checkboxes, screenscraping npm isn't that bad :P [22:21:54] s/checkboxes/checks/ [22:22:10] James_F: plus, with the new cadence, it doesn't actually affect us too much [22:23:07] James_F: is there another tool that does jsonlint besides grunt-jsonlint? or do repos just not have jsonlint...? [22:23:10] * greg-g is just justifying his brain fart now, but he thinks successfully enough to convince himself [22:29:39] RECOVERY - Puppet failure on deployment-pdf02 is OK Less than 1.00% above the threshold [0.0] [22:29:53] legoktm: Whee. [22:29:55] greg-g: Yeah, seems good to me. [22:29:57] legoktm: Most repos don't use jsonlint, no. [22:29:59] legoktm: grunt-jsonlint sounds right. [22:31:17] legoktm: SyntaxHighlight_GeSHi uses it, though. [22:31:22] legoktm: As of this week. 
:-) [22:32:10] James_F: I have a script to bump versions, should I start uploading banana-checker commits? or should we combine it with other updates? [22:37:16] yuck yuck yuck. I have to port logstash to systemd start scripts to run on jessie [22:37:27] * bd808 looks upstream to see if this is solved [22:39:57] yeah the sysv script works [22:52:48] (03CR) 10Krinkle: "We don't need it with karma. Special:JavaScriptTest/qunit/plain doesn't support this because it loads modules directly. Jon has already ap" [integration/config] - 10https://gerrit.wikimedia.org/r/216322 (https://phabricator.wikimedia.org/T99877) (owner: 10Hashar) [23:23:37] James_F: James_F: I have a script to bump versions, should I start uploading banana-checker commits? or should we combine it with other updates? [23:24:07] legoktm: I have a three-hour slot tomorrow to mass-update all devDeps across dozens of repos. [23:24:15] alright :D [23:24:33] legoktm: But… maybe. :-) [23:25:44] my script is pretty fast :p [23:34:39] legoktm: But does it also fix jscs issues manually? ;-) [23:39:45] James_F: no, it just complains nicely that it needs human intervention. [23:40:05] legoktm: Will it fix all npm devDeps? [23:40:09] legoktm: If so… go for it. [23:40:21] I can have it update all to the latest [23:40:35] (if they all pass) [23:40:56] legoktm: Have it update to latest and I'll manually review and merge. [23:43:26] legoktm: Doing the composer bit too? [23:43:36] James_F: just npm [23:43:45] legoktm: It's a start, at least. [23:43:45] I already did all the MW-CS updates that were trivial [23:44:12] Also there's several repos not in your table. [23:44:24] VE-core, OOjs, OOjs UI, etc. [23:44:25] But eh. [23:47:15] * legoktm git pulls all extensions [23:47:44] legoktm: None of those are MW extensions… [23:47:56] yeah, unrelated :P [23:48:00] * James_F grins. [23:55:21] !log added deployment-logstash2 host and told cluster to move logstash all data there [23:55:24] Logged the message, Master
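[Editor's note] On the logstash-on-jessie point above: a sysv init script does keep working under systemd's compatibility layer, which is presumably why "the sysv script works" and a native port could be deferred. For reference, a native unit for a JVM service like logstash of that era is a small file along these lines — the paths, user, and command are illustrative assumptions, not taken from the actual puppet module:

```ini
# /etc/systemd/system/logstash.service -- hypothetical sketch; the
# paths, user, and arguments below are assumptions for illustration.
[Unit]
Description=Logstash log collector
After=network.target

[Service]
User=logstash
Group=logstash
ExecStart=/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Compared with a sysv script, systemd handles daemonising, pidfile tracking, and restarts itself, so the unit mostly just names the command to run.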