[00:46:58] (03Abandoned) 10Krinkle: performance-webpagetest: Raise runs from 5 to 6 for desktop tests [integration/config] - 10https://gerrit.wikimedia.org/r/341461 (owner: 10Krinkle)
[01:10:35] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<40.00%)
[01:16:36] bd808: RainbowSprinkles: Would like to schedule https://phabricator.wikimedia.org/T158721 for some time this week.
[01:18:07] Krinkle: ok with me, but I have nothing at all to do with it ;)
[01:18:24] bd808: Yeah, but wanted to know if maybe there are certain concerns with extension disabling
[01:18:31] it's been a while since I've seen this being done
[01:18:35] what the current process is, do you know?
[01:18:55] Reedy is the guy you want to talk to I bet
[01:19:33] Concerns with disabling? Why?
[01:19:37] I haven't done any SWAT things for well over a year. I'm just a Labs nerd these days
[01:20:15] Krinkle: I'd just SWAT that tbh
[01:20:43] Reedy: Right, I guess I'm thinking more about wide uninstall as opposed to individual wikis
[01:20:49] localisation extlist etc.
[01:20:53] make branch
[01:20:54] Yeah
[01:20:55] but that's not involved here
[01:20:58] Okay
[01:21:46] You've not got any db tables to drop either
[01:22:19] indeed
[01:22:22] Okay, I've scheduled it
[01:22:23] thx
[04:18:24] Yippee, build fixed!
[04:18:24] Project selenium-MultimediaViewer » firefox,beta,Linux,BrowserTests build #329: 09FIXED in 22 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/329/
[06:28:53] Yippee, build fixed!
[06:28:54] Project selenium-Wikibase » chrome,test,Linux,BrowserTests build #299: 09FIXED in 1 hr 48 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=BrowserTests/299/
[07:05:35] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[08:44:56] PROBLEM - Puppet run on integration-saltmaster is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0]
[08:54:37] 10Continuous-Integration-Config, 05Continuous-Integration-Scaling, 06translatewiki.net: Replace job translatewiki-shelllint and have its functionality invoked from composer - https://phabricator.wikimedia.org/T160394#3097265 (10hashar)
[08:54:55] RECOVERY - Puppet run on integration-saltmaster is OK: OK: Less than 1.00% above the threshold [0.0]
[09:54:44] !log Jenkins: dropping Sniedzielski more specific permissions. Account is already in wmf ldap group
[09:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[10:02:48] (03PS1) 10Hashar: Remove translatewiki-shelllint [integration/config] - 10https://gerrit.wikimedia.org/r/342597 (https://phabricator.wikimedia.org/T160394)
[10:05:21] (03CR) 10Hashar: [C: 032] Remove translatewiki-shelllint [integration/config] - 10https://gerrit.wikimedia.org/r/342597 (https://phabricator.wikimedia.org/T160394) (owner: 10Hashar)
[10:06:20] (03Merged) 10jenkins-bot: Remove translatewiki-shelllint [integration/config] - 10https://gerrit.wikimedia.org/r/342597 (https://phabricator.wikimedia.org/T160394) (owner: 10Hashar)
[10:08:22] 10Continuous-Integration-Config, 05Continuous-Integration-Scaling, 06translatewiki.net, 13Patch-For-Review: Replace job translatewiki-shelllint and have its functionality invoked from composer - https://phabricator.wikimedia.org/T160394#3097387 (10hashar) The job is no more in CI. Pending https://gerrit.w...
[10:15:59] !log Added Niedzielski to integration.
[10:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[10:21:20] 10Continuous-Integration-Config, 05Continuous-Integration-Scaling, 10releng-201516-q3, 13Patch-For-Review, 07WorkType-NewFunctionality: [keyresult] Migrate as many misc CI jobs as possible to Nodepool - https://phabricator.wikimedia.org/T119140#3097449 (10hashar)
[11:32:02] 10Gerrit, 10MediaWiki-extensions-General-or-Unknown, 06Repository-Admins, 07Technical-Debt: Archive PageLanguageApi extension - https://phabricator.wikimedia.org/T160371#3097526 (10Aklapper)
[12:09:13] Project beta-scap-eqiad build #146302: 04FAILURE in 25 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/146302/
[12:17:14] Yippee, build fixed!
[12:17:15] Project beta-scap-eqiad build #146303: 09FIXED in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/146303/
[13:26:59] 10Browser-Tests-Infrastructure, 07Easy, 05MW-1.29-release (WMF-deploy-2017-03-14_(1.29.0-wmf.16)), 13Patch-For-Review: Remove lines from Gemfile that are used by RVM - https://phabricator.wikimedia.org/T1331#3097759 (10zeljkofilipin) 05Open>03Resolved TransparencyReport report is configured in a strang...
[13:31:01] 10Deployment-Systems, 06Release-Engineering-Team (Long-Lived-Branches), 10releng-201617-q1, 07Epic: Merge to deployed branches instead of cutting a new deployment branch every week. - https://phabricator.wikimedia.org/T89945#3097765 (10mmodell)
[13:31:04] 10Deployment-Systems, 06Release-Engineering-Team (Long-Lived-Branches): thoroughly document the new branch cutting plan / strategy - https://phabricator.wikimedia.org/T136015#3097764 (10mmodell) 05Open>03declined
[13:46:19] Project selenium-VisualEditor » firefox,beta,Linux,BrowserTests build #336: 04FAILURE in 2 min 18 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/336/
[13:49:10] PROBLEM - Puppet run on buildlog is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:17:09] (03PS5) 10Zfilipin: WIP Problem: Can not use --retry option to retry failed tests as part of the same run [selenium] - 10https://gerrit.wikimedia.org/r/341523 (https://phabricator.wikimedia.org/T160086)
[14:17:33] (03PS6) 10Zfilipin: Problem: Can not use --retry option to retry failed tests as part of the same run [selenium] - 10https://gerrit.wikimedia.org/r/341523 (https://phabricator.wikimedia.org/T160086)
[14:37:26] (03PS1) 10Hashar: Remove mediawiki/php/wikidiff (obsolete) [integration/config] - 10https://gerrit.wikimedia.org/r/342632 (https://phabricator.wikimedia.org/T134381)
[14:41:21] (03PS2) 10Hashar: Remove obsolete PHP extensions [integration/config] - 10https://gerrit.wikimedia.org/r/342632 (https://phabricator.wikimedia.org/T134381)
[15:14:40] hasharLunch: uh how is it possible that we are again seeing weird qunit failures: https://integration.wikimedia.org/ci/job/mwext-qunit-jessie/9013/console ? I thought the upgrade solved that
[15:22:06] (03CR) 10Hashar: [C: 032] Remove obsolete PHP extensions [integration/config] - 10https://gerrit.wikimedia.org/r/342632 (https://phabricator.wikimedia.org/T134381) (owner: 10Hashar)
[15:23:00] (03PS1) 10Hashar: (WIP) massage php-compile [integration/config] - 10https://gerrit.wikimedia.org/r/342638 (https://phabricator.wikimedia.org/T134381)
[15:23:03] (03Merged) 10jenkins-bot: Remove obsolete PHP extensions [integration/config] - 10https://gerrit.wikimedia.org/r/342632 (https://phabricator.wikimedia.org/T134381) (owner: 10Hashar)
[15:24:04] Nikerabbit: have you tried a recheck?
[15:24:19] or maybe that is the chrome version
[15:24:26] can't remember which chrome version caused the issue
[15:24:36] that build ran with Chrome 56
[15:24:52] then a previous patchset passed with the same version
[15:25:02] hitting "recheck"
[15:37:36] 10Continuous-Integration-Config, 05Continuous-Integration-Scaling, 10releng-201516-q3, 13Patch-For-Review, 07WorkType-NewFunctionality: [keyresult] Migrate as many misc CI jobs as possible to Nodepool - https://phabricator.wikimedia.org/T119140#3098204 (10hashar)
[15:37:39] 10Continuous-Integration-Config, 05Continuous-Integration-Scaling, 06translatewiki.net, 13Patch-For-Review: Replace job translatewiki-shelllint and have its functionality invoked from composer - https://phabricator.wikimedia.org/T160394#3098201 (10hashar) 05Open>03Resolved Merged by @Nikerabbit :-}
[15:48:10] hashar: yeah still fails
[15:53:25] RECOVERY - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is OK: OK: Less than 100.00% above the threshold [0.0]
[15:58:37] Nikerabbit: try to reproduce it on your local machine?
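A quick aside on reproducing that locally: the QUnit suite can be run against a local MediaWiki install. A minimal sketch (the wiki URL and port are assumptions about a typical dev setup; Manual:JavaScript_unit_testing has the full instructions, including the headless Karma runner that CI itself uses):

    # In LocalSettings.php, enable the QUnit test runner special page:
    echo '$wgEnableJavaScriptTest = true;' >> LocalSettings.php
    # then open the runner in a browser on your local wiki, e.g.:
    #   http://localhost:8080/wiki/Special:JavaScriptTest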
[15:59:04] Nikerabbit: iirc I left instructions on the previous task and there is also https://www.mediawiki.org/wiki/Manual:JavaScript_unit_testing
[16:02:58] Nikerabbit: will try tonight (if I remember) else poke me about it again tomorrow
[16:03:01] meetingtime
[16:03:05] Yippee, build fixed!
[16:03:06] Project selenium-MobileFrontend » firefox,beta,Linux,BrowserTests build #358: 09FIXED in 41 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/358/
[16:36:13] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T158997#3098495 (10mmodell) p:05Normal>03High
[16:59:28] RainbowSprinkles hi, upstream are liking the way my polygerrit fix is turning out :), so I have managed to make it so we won't need to use a lot of rewrites, though we will still need some rewrites :)
[17:08:22] As I said before: any rewrites mean the fix is incomplete imho :\
[17:09:00] Upstream wrote code with *broken assumptions* that breaks a perfectly acceptable setup
[17:09:12] (although your fix is moving things in the right direction)
[17:09:46] Yep
[17:10:05] RainbowSprinkles stopping using rewrites is what I'm trying to do in my part 3 fix
[17:10:13] though it fails the tests for unknown reasons
[17:10:18] :)
[17:10:35] See https://gerrit-review.googlesource.com/#/c/99673/
[17:10:47] which would allow me not to do rewrites for /c/
[17:11:25] I want to do it for /q/ too and all the other links, but I have to fix the tests, which fail without a clear explanation. Running the tests locally I can see the errors in the test.log file.
[17:11:56] Polygerrit will be no more anyway if they can't fix it for iOS 10.3
[17:19:06] RainbowSprinkles what they really want me to do instead is either edit the index.html file manually, which will never happen if users use the official release and don't build gerrit themselves, or to use Java's request filter in polygerrit, like this: https://github.com/gerrit-review/gerrit/blob/49df12cb7da00f9298d8a37e231d55ecc83fa0c5/gerrit-httpd/src/main/java/com/google/gerrit/httpd/raw/StaticModule.java#L439
[17:50:22] 10Scap (Scap3-Adoption-Phase1), 10scap2, 10Cassandra, 10RESTBase-Cassandra, 06Services: Deploy logstash logback encoder with scap3 - https://phabricator.wikimedia.org/T116340#3098803 (10demon) Is this still even a thing? `/srv/deployment/logstash-logback-encoder` doesn't exist. I see `/srv/deployment/log...
[17:51:42] 06Release-Engineering-Team, 10Scap (Scap3-MediaWiki-MVP): cannot delete non-empty directory: php-1.25wmf14/cache/l10n - https://phabricator.wikimedia.org/T90798#3098805 (10demon) 05Open>03Resolved a:03demon This should be fixed with `scap clean`. It'll occasionally happen on the deploy masters themselves...
[17:53:24] 10Continuous-Integration-Config, 06Release-Engineering-Team, 06Discovery, 06Discovery-Analysis: Add CI to all wikimedia/discovery repositories that are active - https://phabricator.wikimedia.org/T153856#3098816 (10mpopov) Linking to the [[ https://cran.r-project.org/package=lintr | lintr package ]] here fo...
[18:25:17] thcipriani: what's the status of things moving to nodepool? we are noticing increased load, have large job pools been moved over recently?
[18:26:58] chasemp: afaik nothing new has been moved over to nodepool recently. What is the value of "recently" here?
[18:27:17] thcipriani: we think yesterday<=>today
[18:27:21] hashar moved tests over to nodepool
[18:27:23] yesterday
[18:27:37] heh, well that's pretty recently
[18:27:41] I wasn't around yesterday
[18:27:42] thcipriani: atm it's not going very well
[18:27:43] mwext tests were moved over
[18:28:40] thcipriani: greg-g it's frustrating not to be in the loop on this tbh as we know it has the potential for breakage, I think any new jobs moved to nodepool should include a ping to us if not a prearranged time
[18:28:48] where "this" is jobs moved to nodepool
[18:28:58] agreed, I thought that was the case. has
[18:29:07] and most of the time it takes consideration of the existing pool and usage and someone to roll it back if needed
[18:29:45] !sal
[18:29:45] https://tools.wmflabs.org/sal/releng
[18:30:42] I don't see it there nor the prod SAL :/
[18:31:18] https://gerrit.wikimedia.org/r/#/q/status:merged+project:integration/config+branch:master+owner:%22Hashar+%253Chashar%2540free.fr%253E%22
[18:31:31] https://gerrit.wikimedia.org/r/#/c/342464/ https://gerrit.wikimedia.org/r/#/c/342466/ https://gerrit.wikimedia.org/r/#/c/342471/ https://gerrit.wikimedia.org/r/#/c/342478/
[18:31:41] https://gerrit.wikimedia.org/r/#/c/342508/
[18:31:45] yeah, but it should have been logged in the SAL
[18:32:44] I'm not sure if you mean SAL should have been in the loop in some automated way or hashar should have SAL'd it manually and didn't
[18:32:56] but pasting was mainly for sync up
[18:33:07] chasemp: should have manually put it, at least
[18:33:17] it's a big enough thing to log, imo
[18:33:23] but yeah, I'm feeling like it's inconsiderate to have done this
[18:33:26] esp on what is our day off
[18:33:38] * greg-g nods
[18:33:42] and I'm wondering how to keep this process under control
[18:43:02] I think it's settling but I'm not sure of the pattern of those jobs so we'll see how it goes https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen&from=now-3h&to=now
[18:43:33] I emailed Antoine.
[18:45:55] thanks greg-g, I'm not reverting anything or whatnot but it requires some babysitting to see if that's needed I think. Appreciate your responding really.
[18:46:46] greg-g: are you OK with starting https://phabricator.wikimedia.org/T125917#3039569 today?
[18:46:49] we had a breakdown of job groupings and flipover plans and timing from when things crashed and burned while antoine was away, I would think we know similarly what remains not-on-nodepool and could lay out remaining and make a broader plan?
[18:49:51] tgr: seems you're on top of it, so yeah
[18:51:05] chasemp: the primary issue from our side is simply: we're currently maintaining two systems and trying to migrate to only one. Especially since we will in the near future add a third (see "Streamlined delivery of service" program, it's completely re-doing all of CI and deployments). We definitely don't have the bandwidth to maintain 3 systems concurrently.
[18:51:36] yep, understood
[18:51:48] so, that's the overarching "plan" ;)
[18:51:49] I suggested halting nodepool progress previously as migration == unknown work
[18:52:12] sure but if we want to continue to add to the nodepool pile then we have to make a plan or continue to eat dirt every few times at random intervals
[18:52:39] I'm pretty sympathetic honestly to what you guys are managing but it's not an out on deciding to continue this migration
[18:53:39] I'm talking a plan as in, we have n job types on y old mechanism, and here is our timeline for moving them
[18:54:26] right now for me it seems random and stressful
[18:54:27] chasemp: clearly that's not been laid out (hence yesterday). I'd like that as well, as we've all stated at least once, I believe.
[18:54:47] (I'm 100% agreeing with you)
[18:55:00] can we say step 1 is to not move anything else until we've enumerated what remains to be moved to nodepool?
[18:55:10] Yippee, build fixed!
[18:55:11] Project selenium-QuickSurveys » chrome,beta,Linux,BrowserTests build #344: 09FIXED in 3 min 36 sec: https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/344/
[18:55:58] chasemp: sounds reasonable.
[18:56:25] ok cool that would be great
[18:57:05] I actually have no clue if we are at 50% or 80%
[18:59:08] RainbowSprinkles yahoo unblocked wikimedia now :) So I receive gerrit emails now
[18:59:15] 10Deployment-Systems, 10Scap (Scap3-MediaWiki-MVP), 13Patch-For-Review, 07Technical-Debt: `scap clean` should delete /var/lib/l10nupdate/caches/cache-$wmgVersionNumber - https://phabricator.wikimedia.org/T119747#3099222 (10demon) 05Open>03Resolved a:03demon
[18:59:52] huh, the irc logbot isn't liking the new wikibugs output: https://wm-bot.wmflabs.org/logs/%23wikimedia-releng/20170314.txt
[19:00:27] Someone created a task on that
[19:00:47] looks like it never has (as far back as the -releng logs go)
[19:01:26] Oh whoops, sorry, I thought you meant something else
[19:01:42] I don't think a task has been created for that
[19:01:43] legoktm ^^
[19:01:45] the unicode tofu
[19:02:00] or color escape codes
[19:02:10] it has never liked the color stuff
[19:02:25] someone just needs to teach it to strip those characters, the regex is pretty simple
[19:03:47] * greg-g nods
[19:12:43] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:14:22] Is that ^^ normal?
[19:14:27] chasemp ^^
[19:17:16] not really but could be a few things
[19:18:13] there is a holdover
[19:18:18] root@labnodepool1001:~# /usr/local/bin/check_nodepool_states
[19:18:18] 2017-03-14 19:18:02,062 WARNING ['| wmflabs-eqiad | ci-jessie-wikimedia-566503 | d4161747-1002-41eb-aa45-de236c4a84f2 | 10.68.16.30 |']
[19:18:18] 1 nodepool alien(s) present
[19:18:38] thcipriani: ^ I have to step out for a minute but it seems nodepool lost an instance and the alert fired appropriately
[19:18:50] 10Scap (Scap3-MediaWiki-MVP), 10scap2: updateWikiversions: Don't assume that all versions being operated on +/- of each other - https://phabricator.wikimedia.org/T125702#3099329 (10demon) updateWikiVersions does explicitly do the right thing and takes a version number. What **doesn't** is the `updateBranchPoin...
[19:19:25] chasemp: nice, glad that alert is working. ok, I'll take a look at it.
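The colour-stripping mentioned at 19:02 really is a one-liner; a minimal Python sketch that drops the usual mIRC formatting codes (the exact set of control characters wikibugs emits is an assumption):

    import re

    # \x03 starts a colour code, optionally followed by a foreground number and
    # an optional ",background" number; \x02 bold, \x0f reset, \x16 reverse,
    # \x1d italic, \x1f underline.
    IRC_FORMATTING = re.compile(r'\x03\d{0,2}(?:,\d{1,2})?|[\x02\x0f\x16\x1d\x1f]')

    def strip_formatting(line):
        return IRC_FORMATTING.sub('', line)

    # e.g. strip_formatting('\x0310Krinkle\x03') == 'Krinkle'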
[19:22:22] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.29.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T158997#3099367 (10mmodell) wmf.16 uses elastic search 5 (hosted in codfw). Will be keeping an eye out for any search-related issues in this branch. See {T157479}
[19:22:33] RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK
[19:22:59] !log removed alien nodepool instance via: openstack server delete ci-jessie-wikimedia-566503
[19:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:29:12] * bd808 sighs about the number of jobs on nodepool again
[19:29:54] looks like the last few patches rebased in ops/puppet have taken 15 minutes to run their jobs because of waiting for available jessie nodes
[19:30:22] it was at about 25 earlier, so going down
[19:30:36] oh
[19:30:43] it's completely horrible
[19:31:09] I believe there is an experiment going on with Docker
[19:31:18] what part of nodepool isn't scaling for us is hard to grasp?
[19:32:08] nodepool creates instances and deletes instances on the fly (I presume you know that, so sorry if I'm restating what you know). Which means labs has to have the capacity.
[19:32:58] it also means that when more jobs are sent to jessie/nodepool without expanding the pool then things get slower
[19:33:27] Yep, but it seems nodepool has a bug so it leaks
[19:33:40] and yet jobs keep getting moved to it
[19:33:46] which is the core problem
[19:33:56] these jobs aren't moving themselves
[19:34:50] nodepool is buggy crap designed to solve a completely different problem than the one we have (namely it's designed to bridge N different cloud providers to OpenStack's CI system)
[19:35:11] * bd808 is venting
[19:35:28] Though all this could be fixed now
[19:35:43] we are using an old version
[19:36:36] or it could be worse
[19:37:12] I think upstream don't test for speed; looking at their zuul system they have tests running for a while
[19:37:42] though they have a heap of money from what I hear and also have a ton more instances than us
[19:37:47] so they have more capacity than us.
[19:38:08] Docker may be more suitable as it can be safe whilst having multiple docker instances launched on one host.
[20:22:06] RainbowSprinkles I just learned they have deprecated ReviewDB support in gerrit. So support for any new db's is on hold.
[20:34:54] Idiots
[20:43:40] And if that's the case: it shouldn't be a plugin anymore and should be a core requirement.
[20:43:50] Otherwise users will be able to shoot themselves in the foot
[20:50:45] Although I guess it sidesteps some of our problems lol
[20:52:26] (03PS1) 10Gergő Tisza: Add PageViewInfo to make-wmf-branch/config.json [tools/release] - 10https://gerrit.wikimedia.org/r/342694 (https://phabricator.wikimedia.org/T125917)
[21:05:46] twentyafterfour: is it OK to merge ^ today and immediately deploy the extension afterwards? I'm unclear on how the branch creation stuff works
[21:08:45] tgr: yes it should be ok
[21:09:17] tgr: I mean, sorry I didn't read that right
[21:09:57] Changes to make-wmf-branch config only take effect for the next time we cut a new branch
[21:11:38] twentyafterfour: thx. it has just been explained to me on #mediawiki-core that it's OK but I need to set up the branch manually this way
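Since make-wmf-branch only picks a new extension up at the next branch cut, "set up the branch manually" amounts to branching the extension and adding it as a submodule of the already-cut core branch. A rough sketch of the shape of that (the extension name and wmf.16 branch come from the discussion above; the authoritative steps live on the How to deploy code page):

    # 1. Create the wmf branch of the extension:
    cd mediawiki/extensions/PageViewInfo
    git checkout -b wmf/1.29.0-wmf.16 origin/master
    git push origin wmf/1.29.0-wmf.16
    # 2. In a checkout of core's wmf/1.29.0-wmf.16 branch, add it as a submodule:
    git submodule add https://gerrit.wikimedia.org/r/mediawiki/extensions/PageViewInfo extensions/PageViewInfo
    git commit -m "Add PageViewInfo extension"
    # then push that commit for review (or directly, with the right access)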
[21:11:56] exavtly
[21:12:00] exactly
[21:26:14] (03CR) 10Chad: [C: 032] Add PageViewInfo to make-wmf-branch/config.json [tools/release] - 10https://gerrit.wikimedia.org/r/342694 (https://phabricator.wikimedia.org/T125917) (owner: 10Gergő Tisza)
[21:27:10] tgr: Maybe this needs to be documented better, but ideally we merge to make-wmf-branch the *week before* intended deployment
[21:27:20] That way it'll automatically end up in production where it should be :)
[21:28:01] (03Merged) 10jenkins-bot: Add PageViewInfo to make-wmf-branch/config.json [tools/release] - 10https://gerrit.wikimedia.org/r/342694 (https://phabricator.wikimedia.org/T125917) (owner: 10Gergő Tisza)
[21:28:10] RainbowSprinkles: yeah, I realized that, just a bit late
[21:28:58] a separate "how to deploy a new extension" page would be nice IMO, the how to deploy code page is a bit overloaded and hard to follow
[21:29:11] That whole page is a gigantic mess :(
[21:29:33] ^understatement of the week
[21:29:45] RainbowSprinkles: also, probably two weeks before? otherwise you are in for a surprise if the train needs to be rolled back for unrelated reasons
[21:30:10] That's true, I suppose :)
[21:30:37] twentyafterfour: Water is wet. Full story at 11 :)
[21:31:58] tgr: Also, I want to fix T125678 so we can stop having to futz with extension-list
[21:31:58] T125678: SCAP should not rely on extension-list, instead pass --extension-dir to mergeMessageFileList.php - https://phabricator.wikimedia.org/T125678
[21:32:11] Only reason I haven't is cuz of beta :\
[21:32:39] (rebuilding cdb would be impossibly long as we currently check out the whole extension meta-repo)
[21:54:27] RainbowSprinkles: can you sanity-check https://phabricator.wikimedia.org/T125917#3100089 ?
[21:54:47] lgtm
[22:12:34] RainbowSprinkles: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1c:_new_submodule_.28extension.2C_skin.2C_etc..29 mentions default.conf but the link is dead and there is no other mention - is that outdated?
[22:13:08] Wtf is default.conf?
[22:13:34] I'll take that as a yes :)
[22:18:04] rofl
[22:18:06] !bug 1
[22:31:19] RainbowSprinkles ReviewDB is what we currently use in gerrit. They are moving to NoteDB, the slowest db ever. (Still uses mysql or other db backends but stores a lot in a git repo)
[22:31:27] are we having a testpocalypse? I see some patches are waiting for 55 minutes. ping hashar
[22:31:54] https://integration.wikimedia.org/zuul/
[22:31:58] it seems to be slow
[22:32:03] chasemp thcipriani ^^
[22:32:09] paladox: please don't ping chasemp
[22:32:12] chasemp: please ignore
[22:32:18] Oh sorry
[22:32:47] if antoine is online, you don't need to step in and ping everyone just because their is a backlog, there is a known reason for it, no need to continually ping everyone
[22:32:57] s/their is/there is/
[22:33:06] sorry
[22:34:11] hashar: I'm pretty sure jenkins needs some help (deadlock?)
[22:34:43] Zppix most likely all the mwext tests as they are on nodepool now.
[22:34:55] Zppix: no
[22:35:04] It's not only exts
[22:35:10] It's all kinds of jobs
[22:35:13] paladox: do you have data to back up that claim?
[22:35:26] Everything from mw to ops
[22:35:56] I said most likely which could mean I'm wrong. But looking at https://integration.wikimedia.org/zuul/ you see all those extensions (mw)
[22:36:40] paladox: mwext wouldn't affect every job
[22:37:03] hmm, I see a bunch of DonationInterface commits repeated several times
[22:37:07] MaxSem: paladox Zppix CI is overloaded. Too many patches are being sent at it
[22:37:30] hashar: restart? And have it gradually re-add them?
[22:37:31] the patches receiving Code-Review +2 are prioritized over everything else
[22:37:35] so they get merged asap
[22:37:39] ok
[22:37:40] paladox: right, I understand your style of throwing out possibilities/guesses. Guesses aren't useful unless there is data to back them up, otherwise it is just noise.
[22:37:52] oh sorry.
[22:38:02] Zppix: no that is not how that system works. It is a queue that can grow indefinitely and the queue is processed by a set number of workers
[22:38:10] Zppix: the queue is too long, and we do not have enough workers
[22:38:26] hashar, but what's up with DonationInterface?
[22:38:27] hashar: the backlog appears to be affecting deployments
[22:38:34] MaxSem: CI is backlogged
[22:38:38] eg if you have 100 jobs to run and 20 workers, each job takes a minute. Guess how long it is going to take for the 101st job to complete
[22:38:49] I'm asking about multiplication
[22:39:02] Depends-on and stuff MaxSem
[22:39:08] MaxSem: that shows the dependencies a patch has
[22:39:19] ughhh
[22:39:37] MaxSem: so in gerrit if you have a chain of patchsets: 1 -> 2 -> 3 -> 4, that is represented in the Zuul status page with change 4 and its jobs
[22:39:51] and above it patches for 1 2 3 but those have no jobs running
[22:40:20] the ui is not very clear, it still shows an empty progress bar when we could show something like "this is just a dependency, no jobs"
[22:40:48] so basically someone has sent a chain of 6 or 7 patches in Gerrit
[22:41:06] * MaxSem eyes awight
[22:41:55] d'oh
[22:42:05] Did I kill Zuul by crossing the proton beams?
[22:42:10] No, you didn't
[22:42:11] Yes
[22:42:17] >.< lol
[22:42:33] * Zppix trouts awight
[22:42:37] eh
[22:42:59] I'll immediately stop writing patches that follow other patches :p
[22:43:02] * MaxSem bites Zppix
[22:43:08] yeah so that is 12 patches each consuming 6 instances
[22:43:11] or 72 instances
[22:43:28] hashar: I think it's time for more workers lol
[22:43:58] At least, that's what is assigned to all the awight patches
[22:45:19] Spring cleaning gone wrong
[22:46:35] awight: you are not the only one being very busy today
[22:47:18] Sorry though, if there's a workaround I'm happy to follow it
[22:47:38] I did notice we have some always-failing, non-voting jobs that should be removed.
[22:48:25] or fix them and make voting? :P
[22:48:52] awight: IIRC Donation interface targets REL1_27 and there is a non-voting job for REL1_28
[22:49:13] but maybe we can stop triggering that one on every patchset and make it a daily run that sends an email
[22:49:23] 10Continuous-Integration-Config, 10FR-Smashpig, 10Fundraising-Backlog, 10MediaWiki-extensions-DonationInterface, 10Wikimedia-Fundraising-CiviCRM: Disable fundraising CI jobs that are non-voting and always fail - https://phabricator.wikimedia.org/T160476#3100201 (10awight)
[22:49:35] reedy asked for something similar
[22:49:57] namely run phpunit tests on every single extension every day and mail on failure
[22:50:04] (which is like 800 more jobs per day to add)
[22:52:11] Wouldn't it be easier to dedicate an instance to phpunit and other similar jobs? Let the more "important" jobs run without waiting long times for other jobs to finish?
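hashar's 100-jobs example works out to a six-minute wait for the 101st job; a back-of-the-envelope sketch (assuming a simple FIFO queue and uniform one-minute jobs):

    # 20 workers process the queue in waves of 20, one minute per wave.
    # Jobs 1-100 fill five waves, so job 101 only starts at minute 5 and
    # finishes at minute 6 -- six times the runtime of the job itself.
    def completion_minutes(job_index, workers=20, minutes_per_job=1):
        waves_before = (job_index - 1) // workers    # full waves queued ahead of this job
        return (waves_before + 1) * minutes_per_job  # wait for those waves, then run

    assert completion_minutes(1) == 1
    assert completion_minutes(100) == 5
    assert completion_minutes(101) == 6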
[22:53:03] hashar: I wonder if we could cut off post-merge jobs for now.
[22:53:17] (they can always catch up $later / tomorrow)
[22:53:32] iirc postmerge has lower precedence
[22:53:47] It does, but it's still grabbing them sometimes
[22:53:50] but yeah probably most of them should just be converted to poll the repo
[22:53:53] Which means an executor slot is stolen until it's done
[22:53:55] :)
[22:54:10] 10Continuous-Integration-Config, 10FR-Smashpig, 10Fundraising-Backlog, 10MediaWiki-extensions-DonationInterface, 10Wikimedia-Fundraising-CiviCRM: Disable fundraising CI jobs that are non-voting and always fail - https://phabricator.wikimedia.org/T160476#3100236 (10awight)
[22:54:29] so when there is a spam of say 8 oojs jobs we don't end up regenerating the same doc 8 times (and thus consuming 8 instances)
[22:54:40] when a git poll would be able to throttle and only run it once in a while
[22:54:46] hashar: or perhaps post-merge should be done after all the gate-and-submit jobs are done but on the same instance if possible?
[22:55:13] speaking of oojs ui, our post-merge job is currently broken (we didn't investigate why). so you can totally disable that one if you want :P
[22:55:25] (we didn't investigate why **yet**)
[22:55:37] heh
[22:56:07] MatmaRex: I will rather leave it on to annoy you until it is fixed :-}}}
[22:56:33] MatmaRex: will eventually file a task for oojs-ui to rethink the jobs we trigger on it. There is a lot of overlap between npm test / npm run doc and npm run demo
[22:57:06] (j.ames knows how that works much better than me)
[22:57:13] yup I guess :}
[22:57:38] Zend55, isn't that deprecated?
[22:57:41] Would be nice if postmerge deduped like test does
[22:57:42] Timo also identified a potential way to cache the node_modules directory so we don't have to reinstall (recompile) modules every time
[22:59:12] RainbowSprinkles: Jenkins git poll has such a feature, when it finds a new commit you can ask it to wait x minutes in case some other changes get merged
[22:59:19] but zuul definitely does not have such a system
[23:00:16] RainbowSprinkles: https://phabricator.wikimedia.org/T94715
[23:01:01] This would only run one at a time. So if there are 5 commits, it will either only build the last one, or (worst case) the first and last one (to ensure it is eventually consistent with the current HEAD)
[23:01:43] RainbowSprinkles: the test pipeline dedupes by preferring the latest one, however that's scoped to the individual change set. post-merge is only 1 per change, so that wouldn't work the same way
[23:02:02] presumably you'd dedupe based on repo+branch, but that's tricky since you don't want it to keep postponing
[23:02:14] In the unlikely event there is never a long enough gap
[23:02:20] code-coverage for mwcore takes 3 hours
[23:02:26] it was removed from post-merge and is instead a bi-daily cron
[23:02:31] It'd be nice to move that back
[23:02:51] and I think I moved mw doxygen publish to a git polling mechanism as well
[23:02:59] Oh?
[23:03:35] How does that work?
[23:03:38] Without Zuul you mean?
[23:03:39] hmm apparently no
[23:03:44] it is still triggered by zuul
[23:05:10] Krinkle: so yeah that could be done without zuul
[23:05:19] have Jenkins poll the git repo
[23:05:39] there is an extra option to ask it to wait some additional time in case some other changes get merged
[23:05:47] then it will catch up and finally run
[23:06:09] Honestly, I don't think doc being delayed a few hours is a big deal at all
[23:06:37] hashar: How does that behave exactly? Does it keep cancelling existing builds, or will it just do first-and-last of a commit streak? (I hope for the latter)
[23:06:48] the only advantage of running them postmerge is to give feedback on each change as to how the doc build ran
[23:07:57] Krinkle: the job would be configured to poll the repo once per hour. If the branch got updated, jenkins basically idles for x minutes then polls again
[23:08:09] then it triggers the build (or maybe idles some more)
[23:08:13] Right
[23:08:28] so if you have a change at 01:00, wait 5 minutes and get 7 patches merged in between
[23:08:39] the job would end up running at 01:05 with all the 7 patches
[23:08:54] and if another patch is merged at that time, it'll be part of the next hour's build
[23:09:23] yup
[23:09:26] it'd be nice for less busy repos to keep the wait loop lazily triggered instead of on a fixed schedule.
[23:09:34] E.g. whenever a patch merges, wait 5 minutes and start the build.
[23:09:36] though maybe it has some additional heuristic to poll it earlier when there is activity
[23:10:01] And if another commit is found in the same time frame, just ensure one more build runs after it is finished. I believe that's how openstack designed theirs.
[23:10:10] So de-dupe with a maximum of 1 overlap
[23:10:37] There's a zuul option for this right? Or they were working on it
[23:10:44] I think they build their doc when a branch is updated (ref-updated gerrit event) which is our publish pipeline
[23:11:03] whereas we do the doc on post-merge, which is per patchset/change merged and lets us report back to Gerrit
[23:15:22] https://phabricator.wikimedia.org/T115755 for mediawiki doxygen
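For reference, the poll-then-wait behaviour hashar describes maps onto two stock Jenkins features, SCM polling and the quiet period, which Jenkins Job Builder exposes roughly like this (the job name and timings are illustrative, not a copy of anything in integration/config):

    - job:
        name: mediawiki-core-doxygen-publish   # hypothetical job name
        quiet-period: 300   # triggers arriving within these 5 minutes
                            # collapse into a single queued build
        scm:
          - git:
              url: https://gerrit.wikimedia.org/r/mediawiki/core
              branches:
                - master
        triggers:
          - pollscm:
              cron: 'H * * * *'   # poll roughly once an hour
        builders:
          - shell: 'echo "doc build would run here"'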