[00:52:23] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta puppetmaster cherry-pick process - https://phabricator.wikimedia.org/T135427#2581108 (10mmodell) Now we are down to these: |Petr Pchelko|Change-Prop: Enable file transclusion updates| |Alex Monk|Remove the hard-coded /a/mw-log references scattere... [00:58:10] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2581122 (10Legoktm) [00:59:00] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2502733 (10Legoktm) I added {T143840} as a blocker to further rolling out the train. Don't have time to investigate it right now, sorry. [02:55:23] Project mediawiki-core-code-coverage build #2220: 15ABORTED in 7.5 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2220/ [02:55:46] ^ that was me [03:35:28] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2581250 (10Tgr) [04:20:10] Yippee, build fixed! [04:20:10] Project selenium-MultimediaViewer » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #121: 09FIXED in 24 min: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/121/ [08:01:25] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T141551#2581554 (10hashar) @Legoktm thank you for noticing and definitely be bold in adding blockers even if you can investigate immediately. [08:12:24] 10Beta-Cluster-Infrastructure, 07Puppet, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2581568 (10hashar) [08:12:27] 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review, 07Puppet: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2581564 (10hashar) 05Open>03Resolved a:03mmodell Thanks @Dzahn... [08:21:11] Project beta-update-databases-eqiad build #10854: 04FAILURE in 1 min 10 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10854/ [08:24:05] bah [08:24:22] Your composer.lock file is up to date with current dependencies! [08:24:30] race condition I guess [08:25:42] will wait for scap to finish [08:25:44] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/117029/console [08:28:00] !log beta cluster update database failed due to: "Your composer.lock file is up to date with current dependencies!" Probably a race condition with ongoing scap. [08:28:03] Yippee, build fixed! [08:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [08:28:04] Project beta-update-databases-eqiad build #10855: 09FIXED in 1 min 2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10855/ [08:28:09] magic [08:28:14] !log beta update database fixed [08:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [08:28:21] that is all so magic :} [08:47:22] Tobi_WMDE_SW: do you want to pair on merging your browser tests commits this week? 
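The beta-update-databases failure above ("Your composer.lock file is up to date with current dependencies!") was chalked up to a race with the in-flight beta-scap-eqiad run, and the immediate fix was simply to wait for scap to finish. Below is a minimal sketch of how a job could guard against that race by polling for a running scap before starting the update; the pgrep pattern and the update.php invocation are illustrative assumptions, not what the Jenkins job actually does.

```python
import subprocess
import time

def scap_is_running(pattern="scap"):
    # pgrep exits 0 when at least one process matches; the pattern is an
    # assumption -- the real process name on the deployment host may differ.
    return subprocess.run(["pgrep", "-f", pattern],
                          stdout=subprocess.DEVNULL).returncode == 0

def wait_for_scap(timeout=1800, poll=30):
    """Block until no scap run is detected, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while scap_is_running():
        if time.time() > deadline:
            raise TimeoutError("scap still running after %d seconds" % timeout)
        time.sleep(poll)

if __name__ == "__main__":
    wait_for_scap()
    # Hypothetical follow-up: the actual database update step.
    subprocess.run(["php", "maintenance/update.php", "--quick"], check=True)
```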
[08:47:34] or should I just review and merge? no questions asked? ;) [08:57:02] zeljkof: I need to talk to the wikidata team a bit first [08:57:51] zeljkof: also for some reason the tests started to fail for the last patch in that chain.. without anything changing in the code [08:58:43] need to look into that... do you know if something with the infrastructure for the browsertests on integration has been happening yesterday zeljkof ? [08:59:46] (03PS1) 10Hashar: [operations/software] add experimental tox [integration/config] - 10https://gerrit.wikimedia.org/r/306641 (https://phabricator.wikimedia.org/T143559) [09:00:12] Tobi_WMDE_SW: nothing changed as far as I know [09:00:56] (03CR) 10Hashar: [C: 032] [operations/software] add experimental tox [integration/config] - 10https://gerrit.wikimedia.org/r/306641 (https://phabricator.wikimedia.org/T143559) (owner: 10Hashar) [09:02:03] (03Merged) 10jenkins-bot: [operations/software] add experimental tox [integration/config] - 10https://gerrit.wikimedia.org/r/306641 (https://phabricator.wikimedia.org/T143559) (owner: 10Hashar) [09:03:20] Tobi_WMDE_SW: we pushed wmf.16 to group1 yesterday [09:03:30] and obviously beta cluster runs tip of master branch [09:03:37] and we could have an ongoing issue on one part of beta :( [09:03:51] lot of moving parts [09:04:24] hashar: ok - think this is unrelated since it's the browsertests running directly on integration (on every commit) were failing.. [09:04:54] ohh [09:05:27] hashar: I've just triggered them again.. let's see https://gerrit.wikimedia.org/r/#/c/301764 [09:06:34] unknown error: Element is not clickable at point (810, 416). Other element would receive the click:
...
[09:06:52] in Scenario: Add reference with multiple snaks [09:06:55] when doing " And I add the following reference snaks: " [09:07:30] screen shot being https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer/4533/artifact/log/Adding%20references%20to%20statements%3A%20Add%20reference%20with%20multiple%20snaks.png [09:07:30] not sure how helpful it is though [09:07:39] hashar: yeah, saw that. but it makes no sense since it started failing after updating the commit msg.. and there was no code change in between [09:07:57] hashar: let's see what https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer/4552/console will tell [09:11:16] hashar: passed! \o/ [09:13:30] Tobi_WMDE_SW: so a one time error :/ [09:13:42] maybe that specific test is slightly racy / depends on a timer for an element to disappear [09:13:44] hard to know really [09:14:09] hashar: hm, possible.. we'll find that out over time [09:14:45] hashar: is it possible to capture screenshots during testing on integration? [09:15:35] Tobi_WMDE_SW: I think it only capture on failure [09:15:49] not sure we have a flag to enable capture of each intermediary step like sauce labs does [09:16:00] there is also a video generated apparently, but I can't play it for some reason [09:16:22] I mean the .mp4 https://integration.wikimedia.org/ci/job/mwext-mw-selenium-composer/4533/artifact/log/Adding%20references%20to%20statements%3A%20Add%20reference%20with%20multiple%20snaks.mp4 [09:16:48] ah my browser can't play it [09:16:53] but downloading it it works just fine [09:17:55] hashar: yeah, works fine for me with vlc [09:18:02] cool, thx.. didn't know that [09:29:44] Yippee, build fixed! [09:29:44] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #124: 09FIXED in 3 min 54 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/124/ [09:32:28] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 05MW-1.28-release-notes, 13Patch-For-Review, and 2 others: Various browser tests failing with MediawikiApi::LoginError - https://phabricator.wikimedia.org/T142600#2581738 (10zeljkofilipin) 05Open>03Resolved The problem seems to be fixed. I have jus... [09:33:04] Tobi_WMDE_SW: can you +1 or +2 this? 
https://gerrit.wikimedia.org/r/#/c/306449/ [09:33:23] I would like to merge it today, it is already deployed [09:38:46] (03CR) 10Tobias Gritschacher: [C: 031] Tobi is owner of selenium-Wikibase and selenium-Wikidata jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306449 (https://phabricator.wikimedia.org/T143309) (owner: 10Zfilipin) [09:39:08] zeljkof: would love to +2 but I can only +1 :) [09:39:35] Tobi_WMDE_SW: yes, that repo is strict about giving +2, I have forgot :) [09:39:43] for JJB changes [09:39:55] (03PS2) 10Zfilipin: Tobi is owner of selenium-Wikibase and selenium-Wikidata jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306449 (https://phabricator.wikimedia.org/T143309) [09:40:04] assuming Tobi got JJB installed on his machine and know about deploying Jenkins jobs changes [09:40:07] (03CR) 10Zfilipin: [C: 032] Tobi is owner of selenium-Wikibase and selenium-Wikidata jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306449 (https://phabricator.wikimedia.org/T143309) (owner: 10Zfilipin) [09:40:12] I am all fine in granting CR+2 right to him [09:40:32] https://www.mediawiki.org/wiki/CI/JJB <-- doc :D [09:40:41] might need a couple training session [09:41:00] hashar: have done it some times in the past - though not that often in the past months [09:41:05] hashar: depends if Tobi_WMDE_SW wants it, I do not think he does a lot of work there, but fine with me too [09:41:33] (03Merged) 10jenkins-bot: Tobi is owner of selenium-Wikibase and selenium-Wikidata jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306449 (https://phabricator.wikimedia.org/T143309) (owner: 10Zfilipin) [09:41:38] I'm fine with it - though not super important for me [09:41:49] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Ownership of Selenium tests - https://phabricator.wikimedia.org/T134492#2581768 (10zeljkofilipin) [09:41:53] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Config, 10Wikidata, 13Patch-For-Review, 15User-zeljkofilipin: Ownership for wikidata/wikibase selenium tests - https://phabricator.wikimedia.org/T143309#2581767 (10zeljkofilipin) 05Open>03Resolved [09:50:41] 05Gerrit-Migration, 06Release-Engineering-Team, 10releng-201516-q3, 10ArchCom-RfC, and 5 others: [RfC]: Migrate code review / management from Gerrit to Phabricator - https://phabricator.wikimedia.org/T119908#2581782 (10Qgil) [10:06:03] 10Deployment-Systems, 03Scap3: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#2581844 (10hashar) 05Open>03Resolved tin / mira and a few random mw servers have scap 3.2.3-1 so looks like it is fixed. [10:17:12] 10Browser-Tests-Infrastructure, 06Release-Engineering-Team, 07Documentation, 15User-zeljkofilipin: Document browser tests ownership (and what it means) on wiki - https://phabricator.wikimedia.org/T142409#2581871 (10zeljkofilipin) How does this look? [[ https://www.mediawiki.org/wiki/Continuous_integration... 
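Returning to the intermittent "Element is not clickable at point (810, 416)" failure above: this is a common Selenium race where the target exists in the DOM but is still being laid out, animated, or covered by another element when the click fires. The Wikibase suites are Ruby/Cucumber (mediawiki_selenium); the sketch below only illustrates the usual mitigation (wait for clickability, then scroll the element into view before clicking) in Python, against a made-up selector and URL.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://wikidata.beta.wmflabs.org/wiki/Q1")  # hypothetical test page

# Wait until the element is actually clickable rather than clicking as soon as
# it exists in the DOM; this absorbs late re-layouts, animations and overlays.
wait = WebDriverWait(driver, 10)
add_button = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".wikibase-statementgrouplistview .wikibase-toolbar-button-add a")))

# "Not clickable at point" usually means something else sits on top of the
# target, so centre it in the viewport before clicking.
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", add_button)
add_button.click()

driver.quit()
```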
[10:42:22] (03PS1) 10Zfilipin: Added Phabricator username for owners of Selenium jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306648 (https://phabricator.wikimedia.org/T142409) [10:46:05] (03PS2) 10Zfilipin: Added Phabricator username for owners of Selenium jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306648 (https://phabricator.wikimedia.org/T142409) [11:17:37] (03PS1) 10Hashar: [operations/software] enable tox [integration/config] - 10https://gerrit.wikimedia.org/r/306651 (https://phabricator.wikimedia.org/T143559) [11:18:56] (03CR) 10Hashar: [C: 032] [operations/software] enable tox [integration/config] - 10https://gerrit.wikimedia.org/r/306651 (https://phabricator.wikimedia.org/T143559) (owner: 10Hashar) [11:19:55] (03Merged) 10jenkins-bot: [operations/software] enable tox [integration/config] - 10https://gerrit.wikimedia.org/r/306651 (https://phabricator.wikimedia.org/T143559) (owner: 10Hashar) [11:22:24] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta puppetmaster cherry-pick process - https://phabricator.wikimedia.org/T135427#2581998 (10mobrovac) >>! In T135427#2581108, @mmodell wrote: > |@Pchelolo |Change-Prop: Enable file transclusion updates| Pre-prod testing, should be merged early next we... [12:35:03] 10Continuous-Integration-Config, 13Patch-For-Review: Add Python validation to operations/software repo - https://phabricator.wikimedia.org/T143559#2582111 (10hashar) CI now runs `tox`. Some patches are pending to polish up flake8 so it passes across all scripts. [12:35:25] 10Continuous-Integration-Config, 13Patch-For-Review: Add Python validation to operations/software repo - https://phabricator.wikimedia.org/T143559#2582113 (10hashar) p:05Triage>03Normal [12:45:55] 10Continuous-Integration-Config: Frequent "No space left on device" failures for debian-glue jobs on integration-slave-jessie-1001 - https://phabricator.wikimedia.org/T124746#2582131 (10hashar) 05Open>03Resolved a:03hashar That one got solved. Instance has been rebuild with more disk space and the debian-g... [12:57:41] 10Continuous-Integration-Infrastructure, 06Operations, 07Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2582143 (10hashar) Looks like I have been building it without dpkg-gen-changes -sa to force the inclusion of the original tarball in the `.chang... [13:02:14] 10MediaWiki-Codesniffer: Detect "and" and "or" tokens used in PHP code - https://phabricator.wikimedia.org/T143888#2582185 (10Dereckson) [13:55:42] 10Continuous-Integration-Config, 13Patch-For-Review: Add Python validation to operations/software repo - https://phabricator.wikimedia.org/T143559#2582365 (10Volans) tox enabled, existing subdirectories that have flake8 errors are excluded in `tox.ini`. Each existing sub-directory can //opt-in// removing the l... [14:04:30] 06Release-Engineering-Team: Feedback for European SWAT window - https://phabricator.wikimedia.org/T143894#2582390 (10hashar) [14:05:23] 06Release-Engineering-Team, 15User-greg, 15User-zeljkofilipin: Add a European mid-day SWAT window - https://phabricator.wikimedia.org/T137970#2582409 (10hashar) I have created a lame survey to get some feedback following the first week of European SWAT window at T143894. [14:05:47] zeljkof: I have crafted a lame survey for the European SWAT https://phabricator.wikimedia.org/T143894 :D [14:05:53] nothing serious merely trying to get some feedback [14:12:18] hashar: cool! 
:) [14:12:37] 10Beta-Cluster-Infrastructure, 03Scap3 (Scap3-Adoption-Phase1), 10scap, 10Analytics, and 3 others: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#2582429 (10elukey) Thanks for reporting, this is my bad since analytics_hadoop_hosts is not in hiera labs. Since this value should be removed soo... [14:33:07] 06Release-Engineering-Team: Feedback for European SWAT window - https://phabricator.wikimedia.org/T143894#2582390 (10KartikMistry) This worked well for Language team, not too late, not too early. [14:58:29] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [14:59:16] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:15:03] 06Release-Engineering-Team: Feedback for European SWAT window - https://phabricator.wikimedia.org/T143894#2582669 (10Ladsgroup) It was super helpful [15:27:49] (03CR) 10Greg Grossmeier: [C: 031] Added Phabricator username for owners of Selenium jobs [integration/config] - 10https://gerrit.wikimedia.org/r/306648 (https://phabricator.wikimedia.org/T142409) (owner: 10Zfilipin) [15:47:45] greg-g: hey, We want to book a dedicated deployment window for ORES the next week. [15:47:51] Is it possible? [15:48:06] as early as possible :) [15:48:35] Since this deployment takes lot of time and possibly we need to deploy some changes to the mediawiki-config repo too [15:57:57] PROBLEM - Puppet run on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:00:58] Amir1: sounds reasonable, pick a time and add to the calendar :) [16:01:25] greg-g: I think I don't have access to that calendar [16:01:54] also is it okay for us to do deployment for Wikipedia too (we want to change ores settings) [16:02:11] integration-slave-trusty-1013 fails puppet with Error: /File[/var/lib/puppet/lib]: Could not evaluate: Connection refused - connect(2) Could not retrieve file metadata for puppet://localhost/plugins: Connection refused - connect(2) [16:02:11] [16:02:34] !log integration restarted puppetmaster service [16:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:03:43] (03PS3) 10Florianschmidtwelzow: Disallow parenthesis around keywords like clone or require [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/301828 (https://phabricator.wikimedia.org/T116779) [16:04:06] Amir1: it's a wiki page :) [16:04:15] https://wikitech.wikimedia.org/wiki/Deployments [16:04:19] Oh, the deployments [16:04:22] okay [16:04:23] that's "the calendar" :) [16:04:34] the gcal is just a nicety that I don't keep 100% up to date :) [16:04:38] Thanks [16:04:41] !log fixing puppet.conf on integration-slave-trusty-1013 it mysteriously considered itself as the puppetmaster [16:04:41] np, ty [16:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [16:12:20] greg-g: Does this sound good? 
https://wikitech.wikimedia.org/wiki/Deployments#Week_of_August_29th [16:12:47] yup, https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160829T1400 looks good [16:13:17] Thanks [16:19:19] integration-slave-trusty-1013 is fixed [16:22:55] RECOVERY - Puppet run on integration-slave-trusty-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [16:38:43] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: OpenStack misreports number of instances per project - https://phabricator.wikimedia.org/T143018#2582880 (10hashar) For reference, `nova absolute-limits` seems to show the quota and their usage: ``` $ nova absolute-limits +---------------... [17:02:51] 06Release-Engineering-Team, 15User-greg: skill matrix updates - https://phabricator.wikimedia.org/T140507#2582993 (10greg) [17:10:54] 05Gerrit-Migration, 10releng-201516-q4, 05Goal: Phase 1 repository migrations to Differential (goal - end of June 2016) - https://phabricator.wikimedia.org/T130418#2583011 (10bd808) [17:10:57] 05Gerrit-Migration, 10releng-201516-q4, 10Wikimedia-IEG-grant-review: Migrate wikimedia-iegreview to Differential - https://phabricator.wikimedia.org/T132174#2583009 (10bd808) 05Resolved>03Open >>! In T132174#2424028, @mmodell wrote: > I added a `.arcconfig` in the repository so that arcanist will know w... [17:12:29] Can somebody who knows stuff about .arcconfig and harbormaster and all that look at https://phabricator.wikimedia.org/T132174 and help Niharika get a patch submitted. We've got a bug that needs to be fixed up soon. [17:19:56] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 05MW-1.28-release-notes, 13Patch-For-Review, and 2 others: Various browser tests failing with MediawikiApi::LoginError - https://phabricator.wikimedia.org/T142600#2583032 (10Jdlrobson) Thank you for fixing this <3. [17:20:01] Project beta-update-databases-eqiad build #10864: 04FAILURE in 0.84 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10864/ [17:49:19] 10Deployment-Systems, 03Scap3 (Scap3-MediaWiki-MVP), 10scap, 10MediaWiki-API, and 3 others: Create a script to run test requests for the MediaWiki service - https://phabricator.wikimedia.org/T136839#2583184 (10mobrovac) a:03mobrovac [17:56:50] (03PS1) 10Arlolra: Parsoid's tool and roundtrip tests should be on node v4 [integration/config] - 10https://gerrit.wikimedia.org/r/306710 [18:02:16] (03CR) 10Subramanya Sastry: [C: 031] Parsoid's tool and roundtrip tests should be on node v4 [integration/config] - 10https://gerrit.wikimedia.org/r/306710 (owner: 10Arlolra) [18:11:59] thcipriani: hasharAway so I'm currently watching things and hoping https://phabricator.wikimedia.org/rOPUP1aae21a7afc5de2b5e349325674af7b8569a6eb6 is an early christmas present [18:12:19] so far it seems to be within 1 or two instances count for quota as it is recalulated on expire [18:12:47] I do still sometimes see 11 instances via list even tho 10 is set as max but I think that's mostly normal churn [18:12:56] chasemp: ooh, nice find. 
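As noted in the task comment above, `nova absolute-limits` prints both the quota ceilings and the usage Nova currently believes in, which is exactly what drifts when the bookkeeping goes stale. A small sketch that shells out to the command and flags skew is below; the row names (maxTotalInstances, totalInstancesUsed) are the standard ones for this era of client, but the exact table layout varies between client versions, so the parsing is only approximate.

```python
import re
import subprocess

def absolute_limits():
    """Parse `nova absolute-limits` table output into a {name: value} dict."""
    out = subprocess.run(["nova", "absolute-limits"],
                         capture_output=True, text=True, check=True).stdout
    limits = {}
    for line in out.splitlines():
        # Rows look roughly like: | maxTotalInstances | 10 |
        m = re.match(r"\|\s*(\w+)\s*\|\s*(-?\d+)\s*\|", line)
        if m:
            limits[m.group(1)] = int(m.group(2))
    return limits

if __name__ == "__main__":
    limits = absolute_limits()
    used = limits.get("totalInstancesUsed", 0)
    quota = limits.get("maxTotalInstances", 0)
    print(f"instances: {used}/{quota}")
    if used > quota:
        print("usage above quota: the quota bookkeeping is probably skewed")
```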
[18:20:02] Project beta-update-databases-eqiad build #10865: 04STILL FAILING in 0.8 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10865/ [18:21:23] marxarelli: btw, I liked seeing the coverage metric for that scap change you pushed yesterday, so nice of Differential (and good test coverage, you) :) [18:23:51] 03Scap3 (Scap3-Adoption-Phase1), 10scap, 10Parsoid, 06Services, and 2 others: Deploy Parsoid with scap3 - https://phabricator.wikimedia.org/T120103#2583377 (10dduvall) [18:23:54] 03Scap3, 15User-mobrovac: Sequential execution should be per-deployment, not per-phase - https://phabricator.wikimedia.org/T142990#2583375 (10dduvall) 05Open>03Resolved [18:25:41] greg-g: thanks. i was happy to see all the refactoring and code cleanup in scap when i went back into it :) [18:34:30] * thcipriani wipes brow [18:34:39] we didn't make it too much worse :) [18:46:31] chasemp: hello :) [18:47:00] chasemp: has it recomputed the proper quota for contintcloud tenant ? I have also noticed that tools tenant is off by one [18:47:22] 123 instances listed on horizon but the pie chart says 124/150 [18:48:07] chasemp: that christmas present looks like it will refresh the quota on each reservation unless the last update is less than 30 secs old [18:48:11] sounds good [18:54:08] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:55:47] marxarelli: integration-raita.integration.eqiad.wmflabs went down earlier, not sure whether it recovered [18:56:43] 10Continuous-Integration-Infrastructure, 06Labs, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2583542 (10hashar) [18:56:54] hashar: within a cycle it had reduced the 4 node offset and has kept pretty close to actual usage so far [18:57:26] 10Continuous-Integration-Infrastructure, 06Labs, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2554147 (10hashar) @chasemp found a parameter that would cause Nova to refresh the quota on each reservation if the las... [18:57:36] chasemp: the quota has been reliable for a year+ [18:57:53] reliable or unreliable? [18:58:00] I guess whatever craziness occured probably related to Liberty upgrade, caused the quota to not be lowered on some instances deletions / crashes [18:58:08] it apparently has been constantly off by 4 [18:58:28] "reliable" [18:58:58] also tools is off by one :] at least on horizon, but I guess that will fix up on the next instance being added or deleted thanks to the max_age setting you have found [18:59:00] kudos really! [19:00:38] hashar: hmm... looks to be down still. maybe we should take this opportunity to kill raita [19:00:52] the instance is "SHUTOFF" and won't reboot [19:03:19] :( [19:03:34] marxarelli: I feel sad to say that, but I am not even sure raita is even used [19:03:56] though I Really like the approach of aggregating test results in ElasticSearch and have magic framework to generate the reports [19:04:12] maybe we can try resurecting it and find a way to point jenkins job notifications to raita? [19:04:17] or shout it :/ [19:04:19] hashar: well, the code will always be there if we want to revive it [19:04:35] but it's infrastructural debt at this point :) [19:04:48] destroy debt! 
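The "christmas present" above is, as chasemp describes it, a Nova setting that recomputes a project's quota usage on each reservation unless the cached figure is less than 30 seconds old. Stripped of Nova internals, the staleness guard reduces to the sketch below; the class and method names are illustrative, not Nova's.

```python
import time

class QuotaUsage:
    """Cache a tenant's instance count, recounting it whenever the cache is stale."""

    def __init__(self, recount, max_age=30):
        self.recount = recount      # authoritative count, e.g. len() of the real instance list
        self.max_age = max_age      # seconds before a cached count is distrusted
        self._count = None
        self._updated = 0.0

    def reserve(self, n=1, quota=10):
        # Refresh from the source of truth unless the cached value is recent;
        # this is what stops a stale, inflated count from blocking new instances.
        if self._count is None or time.time() - self._updated > self.max_age:
            self._count = self.recount()
            self._updated = time.time()
        if self._count + n > quota:
            raise RuntimeError("quota exceeded")
        self._count += n            # optimistic bookkeeping until the next refresh
        return self._count
```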
[19:04:54] er, forgive debt [19:05:04] I would said nice POC that went forgotten [19:05:16] * marxarelli thinks debt is a problematic analogy [19:05:21] !log hard rebooting integration-raita via Horizon [19:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [19:05:43] RECOVERY - Host integration-raita is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [19:05:59] \o/ [19:06:01] horizon is magic [19:06:24] hashar: ha! i didn't try horizon [19:06:34] well... it's up now... [19:06:39] chasemp: looks to me the quota issue is solved [19:07:27] hashar: i still think we should shut it down. looking at integration/config, the RAITA_URL is no longer defined anywhere, so nothing is reporting to it [19:08:43] I noticed that instance a few days ago as a big disk user and was curious about it [19:08:48] if we could get rid of it that woudl be cool [19:10:06] 05Gerrit-Migration, 10releng-201516-q4, 05Goal: Phase 1 repository migrations to Differential (goal - end of June 2016) - https://phabricator.wikimedia.org/T130418#2583574 (10mmodell) [19:10:08] 05Gerrit-Migration, 10releng-201516-q4, 10Wikimedia-IEG-grant-review: Migrate wikimedia-iegreview to Differential - https://phabricator.wikimedia.org/T132174#2583572 (10mmodell) 05Open>03Resolved [19:11:56] 10Continuous-Integration-Infrastructure, 06Labs, 13Patch-For-Review, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2583575 (10hashar) The quota issue has been solved. With 4 instances floating around, `nova absolute-limits` properly... [19:12:21] marxarelli: feel free to drop the instance so [19:13:07] chasemp: the last concern is the Horizon usage summary showing just two instances. But I think that is an oddity in the usage data [19:13:30] chasemp: so I guess we can close the task about quota https://phabricator.wikimedia.org/T143016 looks entirely solved to me [19:13:59] let's give a day or two and see if where we need to land assigned quota wise [19:14:11] I still see 11 instances sometimes I think due to async operations [19:14:14] PROBLEM - Puppet staleness on integration-raita is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0] [19:15:28] hashar, chasemp: deleted the instance [19:15:48] nice [19:18:21] well we could have nodepool set to max-server: N [19:18:35] and the tenant quota at N+x to accomodate for potential skew [19:19:11] I guess I will migrate bunch of jobs back [19:19:28] PROBLEM - Host integration-raita is DOWN: CRITICAL - Host Unreachable (10.68.16.53) [19:20:02] Project beta-update-databases-eqiad build #10866: 04STILL FAILING in 1 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10866/ [19:26:51] (03PS1) 10Hashar: Revert "Move npm-node-4 off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306722 [19:26:53] (03PS1) 10Hashar: Revert "Move `rake` jobs off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306723 [19:26:55] (03PS1) 10Hashar: Revert "rake: Fix bundle install path" [integration/config] - 10https://gerrit.wikimedia.org/r/306724 [19:26:57] (03PS1) 10Hashar: evert "Move tox-jessie & co. 
off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306725 [19:26:59] (03PS1) 10Hashar: Revert "Move mediawiki-core-phpcs off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306726 [19:27:01] (03PS1) 10Hashar: Revert "Temporarily move composer-hhvm/php5 jobs off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306727 [19:27:22] hashar: I don't think that's a good idea tbh. [19:27:41] legoktm: mind to elaborate ? [19:27:50] most of those jobs run for less than a minute, and I don't see how that's worth consuming an entire VM for [19:29:23] yeah I have read that [19:29:55] we had a few discussions about using LXC / Docker for that [19:31:07] so I don't think we should move them to nodepool until that happens [19:31:18] the permanent slaves have to go [19:31:37] why> [19:31:38] there is also a makefile floating around for differential jobs that would run all the entries points [19:31:40] ?* [19:32:02] so presumably we would get a single jobs running all the tasks with make running them in paralle [19:32:07] so less instances get consumed [19:32:17] that needs to be done *first* then [19:32:24] needs? [19:32:27] bbl 30, meeting time [19:32:36] spawning an instance does not seem to have too much overhead [19:32:50] and I want to get rid of the permanent slaves sooner rather than later [19:33:04] (or not, delayed by 10 min) [19:33:23] the problem is when there are a lot of changes, patches are sitting waiting for new vms to be created [19:33:51] yes, I think we should work on merging the jobs into one job first [19:33:56] that can be done on the permanent slaves [19:36:35] legoktm: the quota used to be 20 which was just enough to accomodate for the load [19:36:48] and I have freeze the migration of more jobs pending the quota to be bumped [19:37:12] eventually end of June the labs infra RAM usage reached its limit and in an emergency the quota of Nodepool went from 20 to 10 [19:37:25] has the quota been bumped back up yet? [19:37:28] been stall like that since July 4th, which is also explain the delay waiting for Vms [19:37:34] it is still at 10 [19:38:02] most of the memory exhaustion happened over June with more than 100GBytes of instances being added to tools labs [19:38:10] + a few 16Gbytes added by random projects [19:38:18] the low hanging fruit has been to lower Nodepool quota [19:38:46] anyway, raising the quota to more than the 20 we had, was pending new hardware to be added to the labs infra, which has been done beginning of July [19:38:53] a couple more labvirt machine got added [19:39:10] and quota is still at 10 instances for now [19:39:43] (03PS1) 10Legoktm: Build universal wheels [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/306731 [19:40:07] the idea is to scale the pool based on the time jobs are waiting for VM to run on. That is tracked via https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen [19:40:48] until we have a larger quota or less jobs to run, I don't think we should be adding more things to nodepool currently. [19:41:01] then the wait time is not changing [19:41:07] and I cant get the quota bumped [19:41:12] kind of a chicken and egg issue [19:41:32] so I am really going to migrate the jobs back. 
Maybe tomorrow morning if I find tonight that is straightforward, else on monday [19:41:47] and remove some of the permanent slaves [19:41:48] 05Gerrit-Migration, 10releng-201516-q4, 10Wikimedia-IEG-grant-review: Migrate wikimedia-iegreview to Differential - https://phabricator.wikimedia.org/T132174#2583616 (10mmodell) @bd808: fixed. Apparently I had forgotten to push the commit. [19:41:51] to free up resources on labs [19:42:15] my hope is to get Nodepool quota raised back to 20 soonish and ideally to 40 if that is at all possible [19:42:27] then migrate the rest of the jobs and drop a large chunks of the permanent slaves [19:42:48] eg reclaiming resources from the 'integration' project [19:44:56] 06Release-Engineering-Team: Feedback for European SWAT window - https://phabricator.wikimedia.org/T143894#2583620 (10MarcoAurelio) > What went well? Everything went well, at least on my side. > What went bad? I can't remember anything. > How confortable are you with the X-Wikimedia-Debug extension and testin... [19:52:07] thcipriani: I have added you to a calendar event for next week labs maintenance on wednesday [19:52:35] thcipriani: CI will be impacted since running on labs. Will probably just need to prevent jenkins from running jobs by putting the slave offline [19:53:47] I think we should migrate some chunk of work from permanent slaves and give it at least 3 days [19:53:49] and repeat [19:54:18] my understanding is it's per test or so? [19:54:34] (03CR) 10Hashar: "Halfak / Ladsgroup have built wheels to deploy the ORES service on Wikimedia infrastructure. They might not have needed universal ones bu" [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/306731 (owner: 10Legoktm) [19:55:11] chasemp: it is per job [19:55:30] sorry I meant per test type or job [19:55:34] chasemp: my understanding is that some compute node had an issue with KVM being slow / disrupted [19:55:46] ? [19:55:47] some random craziness caused Nova quota to be off [19:55:49] not sure when you mean [19:56:18] which has lead Nodepool to get 403 / and only able to have a few instances booted [19:56:26] no I'm pretty sure it's a somewhat expected outcome of long term node churn w/o a reclamation type thing such [19:56:28] with kvm working and the quota issue solved. Guess I can move the bulk of them [19:56:56] I don't know what kvm issue you mean, and the quota thing is not even a day into it, could still blow up [19:57:52] and we shouldn't raise the quota or the max servers unless there is some observed change in backlog / processing ability which I think is best figured out with a slow migration back to nodepool [19:58:14] for all we know, we throw everything back and it's fine where it's at now [19:58:21] or it's totally not [19:58:22] well the processing has already slowed down when we went from 20 to 10 back on July 4th [19:58:39] sure but you don't have actual baselines to know what's acceptable [19:58:50] so it's all difficult to reason about and without context [19:58:56] yeah [19:59:02] so right now is our baseline [19:59:06] and we move slowly from here [19:59:08] both in load and quota [19:59:36] :( [19:59:36] that's my take [19:59:48] why is that a sad face? I don't understand [19:59:53] what else can we do? 
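One concrete option raised earlier in this exchange is the floating Makefile idea: run all of a repository's test entry points inside a single job, in parallel, so that sub-minute lint checks stop consuming one Nodepool VM each. A rough Python equivalent of that combined job is sketched below; the entry points listed are typical examples rather than a definitive set, and a real job would discover them from the repository instead of hard-coding them.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Illustrative entry points; a real combined job would read these from the
# repository (composer.json / package.json scripts, tox.ini) instead.
ENTRY_POINTS = {
    "composer-test": ["composer", "test"],
    "npm-test": ["npm", "test"],
    "tox": ["tox"],
}

def run(name, cmd):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return name, proc.returncode, proc.stdout + proc.stderr

def main():
    failed = False
    with ThreadPoolExecutor(max_workers=len(ENTRY_POINTS)) as pool:
        results = pool.map(lambda item: run(*item), ENTRY_POINTS.items())
        for name, rc, output in results:
            print(f"=== {name} (exit {rc}) ===\n{output}")
            failed = failed or rc != 0
    raise SystemExit(1 if failed else 0)

if __name__ == "__main__":
    main()
```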
[20:00:20] I am sad cause I asked 3 months ago to double the quota from 20 to 40 , though knowing it might be a challenge [20:00:38] I understand that without any context on why [20:00:41] so I could complete the migration of jobs and finally get rid of the 20 or so instances in integration projects which are a burden to maintain [20:00:43] you may as well have asked for 11 or 100 [20:00:54] eventually labs exploded due to RAM exhaustion (largely caused by tools allocating 100+ gbytes ram) [20:01:07] and while on vacations most jobs are reverted back to permanent slaves [20:01:35] I appreciate that totally, it sucks [20:01:38] so I end up in a position with 4-5 months of work reverted and the last 2-3 months waiting on resources to complete [20:01:50] but from my vantage point the migration strategy has been sort of spray and pray with how to size things [20:01:59] but it has very real consequences our side of the fence [20:02:14] 10-20-40? [20:02:14] but why? [20:02:18] yeah hasn't migrated as fast as I wanted :( [20:02:47] basically based on the 'integration' / permanent slaves total usage and number of parallel executors [20:03:10] right but that's a totally nonsequitor number afaik [20:03:18] what matters is backlog times and job processing [20:03:42] 'integration' used to have 14 Trusty instances m1.large + 8 Precise isntances m1.large [20:04:20] (eek there is still 12 Trusty m1.large as peermanent slaves) [20:04:33] yes I think that's true [20:05:00] so we must had way more of them [20:05:08] anyway [20:05:13] the processing time, I could migrate jobs [20:05:21] get the wait time to raise [20:05:47] that would let me delete permanent slaves and then I guess get a justification to get the Nodepool quota raised ? [20:06:06] based on resources having been freed on 'integration' and wait time having increased [20:08:42] something like that, although it's not in practical terms critical we release pressure in integration as move over I think [20:08:44] but that's ideal [20:08:50] my main concern is https://phabricator.wikimedia.org/T139771 [20:09:05] and I understand your frustrated things are going slow but [20:09:10] why should we have to degrade our service to get more resources? [20:09:19] why can't we get resources then migrate? [20:09:31] I think that's what I said [20:09:35] or that it wasn't the real concern [20:09:40] to release old stuff before new [20:09:52] well, it's a catch 22, no? [20:10:05] the stuff currently running on the permanent hosts need to go somewhere before we can delete them [20:10:13] that somewhere being nodepool [20:10:23] !log Delete integration-slave-trusty-1023 with label AndroidEmulator. 
The Android job has been migrated to a new Jessie based instance via T138506 [20:10:25] agreed [20:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:10:44] I was saying we should move in small increments see if performance is impact on current pool [20:10:51] impacted even [20:10:57] we have overhead now I think/ [20:11:01] but really we don't know [20:11:08] because we don't know what our target numbers are [20:11:10] or I don't [20:12:14] the reasoning is to drop most (if not all) of the 14 Trusty m1.large [20:12:25] they are solely used to run the MediaWiki hhvm/php jobs [20:12:32] sure but things continually crash and burn as try to jump form one system to the other [20:12:40] and the sizing of nodepool no one knows [20:13:01] and it's all been guesswork thus far and guesswork without cycling back on any set list of critical metrics [20:13:44] previously we have talked about labs being on the hook so to speak for instance creation as a service to CI [20:13:48] and I think that's fair we are [20:14:05] but what if i slow it down to 1 instance every 5 minutes? who is to say that's not ok? [20:14:08] we think it isn't I imagine [20:14:20] but there is no shared reasoning on what it means that CI is functioning well [20:14:25] or kinda well and we could add a few nodes [20:14:30] we just know, it's hugely broken or not [20:14:36] that's difficult from my side [20:15:05] the whole scaling of CI has always been guess work for sure [20:15:24] PROBLEM - Host integration-slave-trusty-1023 is DOWN: CRITICAL - Host Unreachable (10.68.18.10) [20:15:33] that doesn't make any sense to me though, why can we not define metrics for what it means that CI is working well? [20:15:34] but at least since May we have Zuul reports timing metrics and we have grafana boards showing the time to wait, builds per hours etc [20:15:54] grafana dashboards are not worth the paper they are printed on if no one knows what they mean [20:16:02] what is a good time to wait? [20:16:19] https://grafana-admin.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen&edit [20:16:33] change to 90d, looks like we should aim for keeping launch wait time to <2 minutes [20:17:03] Hey, I have a question re: step (1) on https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Deploy_to_beta_cluster_on_Labs [20:17:24] "Add the new extension to the MW core repo. " -- which branch? and should this be done in the checkout on beta, or via gerrit? [20:17:26] what we have settled on for MediaWiki/core is to have the time to merge to be under 10 minutes [20:17:52] that is one of the two KPIs release engineering has defined and showing at https://grafana.wikimedia.org/dashboard/db/releng-kpis [20:17:54] beside that [20:18:15] I only relay on complains of developers waiting for patches to land / CI to report back [20:18:33] and guess work say that ten-fifteen minutes is acceptable [20:18:45] anyway [20:19:04] we have a good idea of the needed capacity via the permanent slaves [20:19:07] 10-15 is max launch wait time? [20:19:16] moving that to Nodepool would leave us in the same situation [20:19:19] time to merge [20:19:58] from the time a developer does an action to the time it get report [20:20:02] Project beta-update-databases-eqiad build #10867: 04STILL FAILING in 0.8 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10867/ [20:20:15] the KPI for mediawiki is time between CodeReview + 2 till patch being merged [20:20:21] anyone? 
I'm trying to guide phedenskog on using beta appropriately. [20:20:42] which is mostly impacted by the PHPUnit runtime, but also get delays whenever CI is overloaded / backloaded [20:21:30] last time I looked at the time to boot an instances, it is 30 - 40 seconds from the OpenStack reservation request till the instance is ready to process in Jenkins [20:24:00] ori: that is a good question. We run master (as php-master) in beta. I'm pretty sure all our extensions come from the extensions submodule which is autogenerated. [20:24:10] greg-g: hashar (I know your busy) sub 2m for max launch wait, I think maybe daily averages of sub 2 or even 3m launch time and daily max or 90th percentile? there seems to be something on the 23rd that would blow out those numbers and I don't know what or if it was an issue [20:24:39] since the 19th the 22nd and 23rd seem to have had the only serious surges [20:24:59] I'm not sure what kind of workload that is or why so disparate [20:25:54] not sure what happened on 23rd [20:26:00] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Overview#New_extension says it's the extension.git repository [20:26:07] https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Deploy_to_beta_cluster_on_Labs says it's mediawiki/core.git [20:26:09] maybe that was the day the security release mass amount of patches got send [20:26:30] chasemp: should be roughly 600 jobs being allocated in just a couple minutes, that would explain the backlog [20:26:55] is it a problem that becomes way more async usually? [20:27:33] ori: hrm, extension.git repo is correct, that is what is currently running on beta. [20:28:02] I'm trying to find where extensions are added to that repository (likely by https://github.com/wikimedia/mediawiki-extensions/blob/master/update-extensions.sh) [20:28:32] chasemp: security releases are an edge case really :( lot of patches being send that the infra can't really handle [20:28:38] (as mediawiki/core master has no submodules) [20:28:45] we would need way more capacity, then it is only every two months or so [20:29:12] thcipriani: yeah there is a script at the root of mediawiki/extensions.git to sync up the extensions submodules with gerrit ls-project [20:29:26] hashar: what about l10n updates? [20:29:44] thcipriani: once extension is registered, Gerrit takes care of the submodule bump automagically as the 'master' branch of each extension is updated. Same we do for wmf branche on mw/core [20:29:47] are those as bad, resource wise? [20:29:54] for CI? [20:29:54] sync-with-gerrit.py or update-extensions.sh or quick-update ? [20:30:00] no they are entirely skipped [20:30:05] hashar: /me nods [20:30:06] the l10n-bot is essentially ignored iirc [20:30:15] hashar: indeed, I was trying to find out what runs that (digging in integration) if something auto-runs it [20:30:17] I think it is roughly 300 patches over 2 hours [20:30:31] what does entirely skipped mean? [20:30:49] ori: Sync-with-gerrit I think. [20:30:54] Definitely not quick-update. [20:31:05] (Also: that extension meta repo sucks and I hates it) [20:31:29] greg-g: at least l10n-bot is ignored for CR+2 / gate and submit https://github.com/wikimedia/integration-config/blob/master/zuul/layout.yaml#L552-L554 [20:31:40] I just want to know how to add a new extension to beta, outside the regular auto-update process. 
It looks like it's already in extension.git [20:31:54] Then you just enable it in wmf-config like usualy [20:31:55] so I guess the auto update process picked it up [20:31:55] *usual [20:31:59] got it, thanks [20:32:06] Otherwise, the code is already *there* [20:32:15] Along with all the other crappy extensions we'll never ever deploy ;-) [20:32:28] chasemp: CI ignore changes send by l10n-bot [20:32:42] chasemp: that is a bot that is solely to update localization translations and that is all automatic [20:33:03] from time to time an oddity is introduced that CI might have caught, but that is dealt with when testing with beta cluster [20:33:14] and they are pretty rare [20:34:17] it is actually sending patches right now https://gerrit.wikimedia.org/r/#/q/owner:l10n-bot [20:35:07] heh [20:35:29] chasemp: so in the end what would we need to provide to get the quota back to 20 and then further raised ? [20:36:01] hrm. https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Deploy_to_beta_cluster_on_Labs code on this page says "Add the new extension to the {{git repo|mediawiki/extensions}} repo" but is rendering "Add the new extension to the MW core repo" [20:36:11] * thcipriani digs in templates [20:36:34] yeah, I think there might be some oddities with the merge with the wikitech.w.o page? [20:36:48] from a past discussion we had a task to identify user-perceived service of CI https://phabricator.wikimedia.org/T139771 But I have really no clue how to represent that :( [20:37:37] hashar: ahh, l10n-bot does trigger post-merge jobs, obviously(?) https://gerrit.wikimedia.org/r/#/c/306734/ [20:37:54] 10Beta-Cluster-Infrastructure, 03Discovery-Search-Sprint, 13Patch-For-Review, 07Verified: Switching the prod cluster to query from codfw as part of the DC switchover broke beta cluster - https://phabricator.wikimedia.org/T132408#2583847 (10ksmith) [20:38:05] greg-g: yeah. We only have a few postmerge jobs [20:38:30] some should actually be phased out in favor of polling git (such as generating mediawiki or puppet documentation) [20:42:53] so if I'm looking at https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle(zuul.pipeline.gate-and-submit.label.ci*.wait_time.mean)&hideLegend=false&lineMode=connected&from=-1d [20:43:17] I'm trying to understand these numbers, based on label are those both nodepool queues? [20:43:21] and it's in ms? [20:44:45] one more question : do I need to schedule a deployment for the InitialiseSettings-labs.php / CommonSettings-labs.php change, or can I deploy that whenever? [20:44:55] chasemp: should be in miliseconds yes [20:45:44] chasemp: the Trusty slaves on Nodepool are tagged with ci-trusty-wikimedia, so we can then get the job directed to them [20:45:52] same for Jessie which uses ci-jessie-wikimedia label [20:46:21] thcipriani: that bit of the instructions should probably be dropped or rewritten. AIUI, extension developers don't need to add their extension to extension.git, because it will be done automatically by CI [20:46:24] ori: you can just deploy whenever as long as you pull down on tin/mira so we don't get any alerts/freak out deployers [20:46:29] * ori nods [20:46:56] the timing there represent the time a change spend in Zuul 'gate-and-submit' pipeline. So that has the Zuul overhead, time to get an executor/instance available, time to run the test [20:46:58] indeed. Will give these pages some scrutiny. 
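The Graphite URLs being traded above render PNGs, but the same render endpoint returns JSON when asked, which makes it easy to reduce the zuul wait_time series to a single number per Nodepool label when judging it against the sort of launch-wait targets floated earlier (sub two or three minutes). A short sketch follows; the target expression mirrors the ones pasted in the log, and per the discussion the metric is reported in milliseconds.

```python
import requests

GRAPHITE = "https://graphite.wikimedia.org/render/"
TARGET = "zuul.pipeline.gate-and-submit.label.ci-jessie-wikimedia.wait_time.mean"

def daily_mean_wait(target=TARGET):
    resp = requests.get(GRAPHITE, params={
        "target": target,
        "from": "-1d",
        "format": "json",   # same endpoint the dashboards use, JSON instead of a PNG
    })
    resp.raise_for_status()
    datapoints = resp.json()[0]["datapoints"]    # list of [value, timestamp] pairs
    values = [v for v, _ in datapoints if v is not None]
    return sum(values) / len(values) / 1000.0 if values else None  # ms -> seconds

if __name__ == "__main__":
    mean = daily_mean_wait()
    if mean is None:
        print("no builds in the window, so no wait_time samples were reported")
    else:
        print(f"mean gate-and-submit wait over the last day: {mean:.1f}s")
```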
[20:47:23] chasemp: should be as close as how users perceives the time it takes between a CR+2 and the actual merge [20:47:56] that's kind of what I was thinking, or at least the nodepool impacting perspective [20:48:37] (03CR) 10BryanDavis: [C: 032] Build universal wheels [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/306731 (owner: 10Legoktm) [20:48:51] that is then used to draw the two 'MediaWiki changes resident time' on https://grafana.wikimedia.org/dashboard/db/releng-kpis [20:49:09] (03Merged) 10jenkins-bot: Build universal wheels [integration/commit-message-validator] - 10https://gerrit.wikimedia.org/r/306731 (owner: 10Legoktm) [20:49:57] [13:46:21] thcipriani: that bit of the instructions should probably be dropped or rewritten. AIUI, extension developers don't need to add their extension to extension.git, because it will be done automatically by CI <-- CI doesn't do that... [20:50:38] but it does seem like *something* does it? [20:51:07] I've never been clear on how things are added there. I've done it manually once before. [20:51:41] chasemp: and the other interesting metric is the time between a job is triggered by Zuul until an instance start running it. That is https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen [20:51:50] might want to make that a KPIs [20:52:28] these are the same numbers? [20:52:43] teh backed of https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen is https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle(scale(zuul.pipeline.gate-and-submit.label.ci*.wait_time.mean,1))&hideLegend=false&lineMode=connected&from=-1d [20:52:45] no? [20:53:03] thcipriani: people do it manually [20:53:12] chasemp: oh yeah sorry [20:53:18] chasemp: missed the wait_time on your graph :( [20:53:33] chasemp: so wait_time is basically how long Zuul waits for the job to be started on a slave [20:53:38] thcipriani: https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions+-owner:Jenkins-mwext-sync [20:54:20] chasemp: whereas on https://grafana.wikimedia.org/dashboard/db/releng-kpis we use the resident_time [20:55:01] doc being at bottom of http://docs.openstack.org/infra/zuul/statsd.html under zuul.pipeline. [20:55:44] legoktm: ah, explains why I could never find where it was happening. [20:56:02] why use resident_time vs wait_time [20:56:02] what's teh difference? [20:56:10] resident_time timing representing how long the Change has been known by Zuul (which includes build time and Zuul overhead). [20:56:11] vs [20:56:25] time spent waiting to start running a test? (i.e. more true to whatever backend nodepool included) [20:58:45] the resident_time takes everything in account [20:59:03] from Zuul receiving the change, to getting an executor, running the tests, reporting back [20:59:11] it is really how long a change is known to Zuul [20:59:27] wait_time is how long job waited for an executor to start processing it [21:00:31] the more slaves/executors we have the lower the wait_time will be [21:00:47] Project selenium-Wikidata » firefox,test,Linux,contintLabsSlave && UbuntuTrusty build #97: 04FAILURE in 2 hr 10 min: https://integration.wikimedia.org/ci/job/selenium-Wikidata/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/97/ [21:01:44] hashar: fyi I don't think we are getting zuul stats for a bit? 
[21:01:45] https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle(zuul.pipeline.gate-and-submit.label.ci-trusty*.wait_time.mean)&hideLegend=false&lineMode=connected&from=-1h [21:01:52] https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle(zuul.pipeline.gate-and-submit.label.ci-trusty*.wait_time.mean)&hideLegend=false&lineMode=connected&from=-3h [21:02:08] stopped 40 minutes or so ago [21:02:27] hm or just trusty [21:02:29] https://graphite.wikimedia.org/render/?width=966&height=489&_salt=1471549160.573&target=cactiStyle(zuul.pipeline.gate-and-submit.label.ci-jessie*.wait_time.mean)&hideLegend=false&lineMode=connected&from=-3h [21:03:06] if there are no change, no metric got reported [21:03:12] so graphite flag it as NULL I guess [21:04:52] if you have a single job in the queue waiting for three hours [21:05:05] the wait_time will be 3 hours, 3 hours after the job entered the queue [21:05:22] and the metric is reported when the job start [21:06:02] so to detect an issue we rely more on the number of changes known to zuul https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=25&fullscreen [21:06:07] thcipriani: this is the first time I have had to run scap in a while -- it was very cool to see the canary and log error checks [21:06:28] :D [21:07:13] yeah, only hooked to mediawiki errors for the time being. Have a patch to actually follow fatalmonitor kicking around some place. [21:07:13] thcipriani: does scap has an option to scap pull solely on mw1099 then wait for user confirmation? [21:07:52] hashar: no, but I did have the thought that it seems like something scap could do easily when we introduced that policy. [21:08:29] it would save some time and an extra tmux pane. twentyafterfour is cooking something fancy up, I do believe. [21:08:36] I guess [21:08:47] zeljkof and I just had another terminal on mw1099 [21:09:26] yeah, it's not a huge hassle, but it's extra mental overhead in a process that already has too much :) [21:12:08] chasemp: I am heading to bed. I will move a few jobs to the pool tomorrow morning [21:12:25] hashar: can we make a task w/ a plan on what and when? [21:12:32] chasemp: then monitor/watch what is going on and migrate another chunk on Monday [21:12:45] a task for ? [21:12:51] what I jhust said? 
sure [21:13:07] yeah, I'm not sure how much there is to migrate and what divisible chunks you are thinking [21:13:44] well it just about reverting to previous situation [21:13:53] with high confidence the quota is solved [21:14:01] so I am not too worried :] but will be careful [21:14:04] right but I don't know what that is [21:14:09] and I"m way more worried then [21:17:35] https://phabricator.wikimedia.org/T143938 [21:17:35] 10Continuous-Integration-Infrastructure, 07Nodepool: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938#2584030 (10hashar) [21:18:21] chasemp: I am pretty sure it the quota that just went off due to some instances errors then Nodepool not gracefully handling OpenStack being out of quota / reporting incorrect quota [21:18:44] from Nodepool point of view, it knows about like 6 instances, has confirmation from openstack there is 6 instances and room for 10 [21:18:49] 10Continuous-Integration-Infrastructure, 07Nodepool: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938#2584046 (10Paladox) :) (not this is not a spam comment just that I like this) [21:18:54] so it think it can actually spawn them [21:19:14] with the quota value fixed, it is no more an issue [21:19:20] hashar: what I'm suggesting is more like a grid w/ all the job types one side and a schedule for moving them on the other that spans all the way from where we are now to all things on nodepool? [21:19:24] the four instances nodepool think it can spawn would match the quota usage [21:20:02] I don't know what this means 'I will migrate a few low traffic jobs on Friday, and the rest in bulk on Monday/Tuesday.' [21:20:02] Project beta-update-databases-eqiad build #10868: 04STILL FAILING in 1.1 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10868/ [21:20:05] other than in bulk is scary :) [21:20:24] I havent reviewed the six or seven revert patches I have sent [21:20:34] so I have no idea which jobs each revert migrate [21:20:40] nor how many builds each of them are doing [21:20:44] work for tomorrow :] [21:21:00] right me niether [21:21:01] ok cool [21:21:03] one sure thing, I will need the nodepool rate to be lowered from 10 back to 1 [21:21:09] why? [21:21:18] or more to the point, what will we watch that proves that? [21:21:21] though gotta investigate exactly how many tasks are processed on each tick [21:21:27] but pretty sure it is only one task per tick [21:21:35] from looking at tasks completed we were doing approx 4x more tasks at 1s than at 10s [21:21:36] a task could be spawn an instance or delete an instance [21:21:42] > Fatal error: Uncaught exception 'MediaWiki\\Services\\NoSuchServiceException' with message 'No such service: MobileFrontend.Config [21:21:50] there is a definte point of dimishing returns after which it's just api abuse and spinning wheels [21:21:52] but I am not entirely sure if a tick can trigger several actions [21:22:24] so I have gotta dig in the code a bit [21:23:15] (03PS2) 10Hashar: Revert "Move tox-jessie & co. 
off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306725 [21:23:17] (03PS2) 10Hashar: Revert "Move mediawiki-core-phpcs off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306726 [21:23:19] (03PS2) 10Hashar: Revert "Temporarily move composer-hhvm/php5 jobs off of nodepool" [integration/config] - 10https://gerrit.wikimedia.org/r/306727 [21:23:25] I don't think we should change anything without seeing the numbers move in a way we intend to effect [21:23:32] otherwise we end up back where we were in guess land [21:23:42] and eventually thigns explode and we don't know what is sane and what isn't [21:23:56] our only choice is for our current settings and rates to be our baseline [21:26:39] 10Continuous-Integration-Infrastructure, 07Nodepool: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938#2584071 (10hashar) [21:28:11] chasemp: as I said the whole spam of the API is entirely due to Nodepool rightfully thinking it can spawn instances while OpenStack had a glitch in the quota preventing to do so [21:28:23] though Nodepool does not handle that edge case [21:31:48] PROBLEM - Puppet run on deployment-aqs01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:31:53] Yippee, build fixed! [21:31:53] Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #133: 09FIXED in 16 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/133/ [21:33:35] hashar: sure I get that, but I'm not hung up on the changes necessarily during that fiasco [21:33:37] prior to that we had no shared understanding of performance [21:33:57] and why 10 or 20 instances for nodepool was a topic of debate [21:34:26] I'm thinking in terms of working together to move forward in it's entirety and that involves a systematic metrics based view of nodepool workload [21:34:38] PROBLEM - Puppet run on integration-slave-precise-1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:35:13] that's why https://phabricator.wikimedia.org/T139771 was a point of contention I thought [21:35:33] you were proposing we increase nodepool workers and we wanted finite metric based answers for why [21:35:43] PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:35:55] and that task is still open which is frustating, but I imagine we can use the wait_times for zuul as indicated as some proxy marker to get going [21:36:33] it seems really difficult for you to move things around and over without working closely with us and so I'm trying to suggest we do that [21:38:00] my intention is to help out not gum you up, but I can only see it working one way, anyhow, sorry you have been on so late [21:38:08] we can work things out tomorrow? [21:38:41] 10Continuous-Integration-Infrastructure, 07Nodepool: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938#2584093 (10hashar) [21:39:01] chasemp: catching up :D [21:39:51] chasemp: you did mention identifying metrics and I fully back up that [21:39:54] Yippee, build fixed! 
[21:39:54] Project selenium-MobileFrontend » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #133: 09FIXED in 24 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/133/
[21:40:17] chasemp: from the chat we had back in June, yeah that totally made sense, and we have some boards to help determine metrics now :]
[21:40:59] most of my frustration is really that we had 20 nodes and are down to 10, which was supposed to be just an emergency fix
[21:41:10] anyway
[21:41:48] will look at the impact of each of the revert patches. List the jobs and try to get an estimate of how many builds per day will be moved to nodepool for each of them
[21:42:32] + dig in and get more explanation about the rate request. I have already referenced the commit that changed the rate to 1: https://phabricator.wikimedia.org/rOPUP7bcff1d06a00ac0311ec0eb1b625b0fb08bfb315
[21:43:12] in the drop from 20 to 10 nodes, what changed? what got worse?
[21:43:18] why would we bump back up to 20 again?
[21:43:26] that's been the friction
[21:43:38] no one on our side understands those questions and I feel like we keep asking
[21:43:57] so when we are talking 20 or 40
[21:44:04] now it seems really hard to reason about
[21:45:27] that's the catalyst for https://phabricator.wikimedia.org/T139771
[21:47:06] if it's been offset for a year by 4, that means we really dropped from 16 to 6
[21:47:10] and are now back up to a real 10
[21:48:04] chasemp: the friction is that we had 20, which is just enough to keep the migration continuing, and 10 entirely stops it
[21:48:13] with another task filed to even ask for 40 (subject to debate)
[21:48:37] then I can migrate more jobs, which will raise the wait/resident time and in turn justify the bump of instances
[21:49:25] and https://phabricator.wikimedia.org/T139771 (which again I fully agree with) is to get us a clear representation of the delay / service performance
[21:49:57] shall we remove "user-perceived" and just focus on "identify metrics that indicate saturation/overload of CI infra"?
[21:50:00] (meanwhile I found a potential optimization to reduce the number of API queries)
[21:50:06] would it be helpful if there was another task for identifying performance markers for nodepool itself between CI and openstack?
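One way to start on "identify metrics that indicate saturation/overload of CI infra": Zuul v2 emits statsd metrics such as per-pipeline resident time, and Graphite's render API can return them as JSON. The endpoint and metric path below are assumptions for illustration, not confirmed names.

```python
import requests

GRAPHITE = "https://graphite.wikimedia.org/render"         # assumed endpoint
TARGET = "zuul.pipeline.gate-and-submit.resident_time"     # assumed metric path


def recent_datapoints(target=TARGET, hours=24):
    """Fetch recent datapoints for a Zuul pipeline metric from Graphite."""
    params = {"target": target, "from": "-{}h".format(hours), "format": "json"}
    resp = requests.get(GRAPHITE, params=params, timeout=10)
    resp.raise_for_status()
    series = resp.json()[0]["datapoints"]                   # [[value, timestamp], ...]
    return [(ts, value) for value, ts in series if value is not None]
```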
[21:50:09] in all seriousness
[21:50:49] I didn't see your comment, greg-g, before I made mine; that's nearly an equivalent idea
[21:50:52] (I think)
[21:51:22] * greg-g nods
[21:57:06] 10Continuous-Integration-Infrastructure, 07Nodepool: Investigate use of Nodepool ListFloatingIPsTask - https://phabricator.wikimedia.org/T143943#2584195 (10hashar)
[21:57:20] ^ potentially one less source of OpenStack API hammering :]
[21:58:01] chasemp: there are some metrics on the dashboard I created a few days ago at https://grafana.wikimedia.org/dashboard/db/nodepool
[21:58:11] thanks to the activation of statsd in Nodepool
[21:58:29] it shows a bunch of "task" which are really API queries to openstack
[21:58:36] we get the count and the median time to serve
[21:58:49] so that might be an indication as to how well the wmflabs api replies
[21:58:55] in theory that should align w/ zuul wait_times
[21:59:30] example for the median time to create a server: https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=6&fullscreen&from=1470866352522&to=1472162352523&var-provider=All&var-task=All
[21:59:47] chasemp: that is just the time for the REST api to serve a response
[21:59:53] so should typically be very short
[22:00:07] sure makes sense
[22:00:25] mobrovac, thcipriani: added a little something to help with the many many prompts that can result from a low group_size https://phabricator.wikimedia.org/D323
[22:00:33] heck maybe that can be used to gauge wmflabs performance. Time will tell
[22:01:06] the link I have pasted shows that the response time to create a server went from 200 ms to 800 ms .. then I have no idea if that is just an oddity in the metric or an actual slowdown
[22:01:08] hard to tell
[22:01:13] if that graph explodes surely things are wrong
[22:01:40] marxarelli: nice :) I did set my group_size to 2 at some point and was unhappy about it immediately.
[22:02:45] chasemp: so yeah, performance markers between Nodepool and openstack, I guess we kind of have them. That is pretty fresh though
[22:03:14] yes agreed, which means we are forced to use our current data as our baseline :D
[22:03:19] yes I know I had to say that
[22:03:27] thcipriani: should be a decent cli ux now: deploy, "looks good", <enter>, "looks good", <enter>, "all good", c
[22:04:27] ('enter' is one of those words you shouldn't look at for very long; looks odd)
[22:04:49] marxarelli: nice. enter :)
[22:05:08] quit messing with my brain, you two
[22:05:21] hashar: I have to go pretty soon :) which means it's INSANE late for you
[22:05:24] thanks for talking this out
[22:08:47] 10Continuous-Integration-Infrastructure, 07Nodepool: Investigate use of Nodepool ListFloatingIPsTask - https://phabricator.wikimedia.org/T143943#2584239 (10hashar) The bit: ``` IPS_LIST_AGE = 5 # How long to keep a cached copy of the ip list ``` Got removed when `python-shade` was introduced with: `...
[22:10:21] chasemp: it is just past midnight :] Thanks for the exchange
[22:10:22] !
[22:11:17] "just past midnight" is me in a stupor from tiredness, usually
[22:11:57] yeah
[22:12:01] will have a nap tomorrow
[22:12:12] I wake up in a little less than 7 hours from now
[22:12:39] go sleep
[22:12:49] :)
[22:13:11] yes sir!
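The `IPS_LIST_AGE = 5` snippet quoted on T143943 above was an age-based cache of the floating-IP list; without it, every lookup becomes a fresh API query. A minimal sketch of that kind of cache, where `list_floating_ips` stands in for the real OpenStack client call:

```python
import time

IPS_LIST_AGE = 5  # seconds to keep a cached copy of the ip list, as in the removed code


class FloatingIPCache(object):
    """Reuse one floating-IP list response for a short window instead of
    issuing a new API query on every lookup."""

    def __init__(self, list_floating_ips, max_age=IPS_LIST_AGE):
        self._list = list_floating_ips   # stand-in for the real client call
        self._max_age = max_age
        self._cached = None
        self._fetched_at = 0.0

    def get(self):
        if self._cached is None or time.time() - self._fetched_at > self._max_age:
            self._cached = self._list()          # one API call refreshes the cache
            self._fetched_at = time.time()
        return self._cached
```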
[22:14:40] Also 11:14pm here :)
[22:20:02] Project beta-update-databases-eqiad build #10869: 04STILL FAILING in 0.93 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10869/
[22:33:17] 22:20:02 Exception: ('command: ', '/usr/local/bin/mwscript update.php --wiki=aawiki --quick', 'output: ', "#!/usr/bin/env php\nFatal error: Uncaught exception 'MediaWiki\\Services\\NoSuchServiceException' with message 'No such service: MobileFrontend.Config' in /mnt/srv/mediawiki-staging/php-master/includes/Services/ServiceContainer.php:185\nStack trace:\n#0 /mnt/srv/mediawiki-staging/php-master/includes/MediaWikiServices.php(201):
[22:33:17] MediaWiki\\Services\\ServiceContainer->peekService('MobileFrontend....')\n#1 /mnt/srv/mediawiki-staging/php-master/includes/MediaWikiServices.php(185): MediaWiki\\MediaWikiServices->salvage(Object(MediaWiki\\MediaWikiServices))\n#2 /mnt/srv/mediawiki-staging/php-master/includes/Setup.php(506): MediaWiki\\MediaWikiServices::resetGlobalInstance(Object(GlobalVarConfig), 'quick')\n#3 /mnt/srv/mediawiki-staging/php-master/maintenance/doMaintenance.
[22:33:17] php(97): require_once('/mnt/srv/mediaw...')\n#4 /mnt/srv/mediawiki-staging/php-master/maintenance/update.php(216): require_once('/mnt/srv/mediaw...')\n#5 /mnt/srv/mediawiki-staging/multiversion/MWScript.php(97): require_once('/mnt/srv/mediaw...')\n#6 {main}\n thrown in /mnt/srv/mediawiki-staging/php-master/includes/Services/ServiceContainer.php on line 185\n")
[22:33:48] ...what is mobilefrontend doing....
[22:34:18] also, zuul is stuck
[22:35:39] so it is.
[22:37:25] the MF change is https://gerrit.wikimedia.org/r/#/c/305211/4
[22:39:34] thcipriani: did you do anything to kick zuul? or did it just start working?
[22:39:50] just started working. I was just looking through logs
[22:39:55] nothing looked strange
[22:42:13] huh
[22:44:51] well that's probably a looming bad thing.
[23:15:03] !log cherry-picked 306839/1 into puppetmaster
[23:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[23:17:13] legoktm: is it simple to move a gerrit repo to a different gate-and-submit queue in the zuul config?
[23:17:36] * bd808 is tired of getting stuck in the mw queue for completely unrelated things
[23:18:43] when I run puppet agent in deployment-sca03 it returns this:
[23:18:44] Error: Could not retrieve catalog from remote server: Error 400
[23:19:13] I think there was maintenance going on
[23:19:48] not related
[23:20:01] Project beta-update-databases-eqiad build #10870: 04STILL FAILING in 0.85 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10870/
[23:20:47] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:21:12] yes!
[23:25:35] bd808: it depends. what jobs are running?
[23:26:19] bd808: if you file a task, it's pretty simple for you :P
[23:26:41] legoktm: mostly "noop" :)
[23:26:59] but also tox
[23:27:28] which repo?
[23:29:08] I un-cherry-picked the 306839/1 but still can't connect to puppetmaster
[23:45:19] 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service: deployment-sca03 can't call puppetmaster - https://phabricator.wikimedia.org/T143958#2584624 (10Ladsgroup)
[23:47:59] legoktm: labs-striker*
[23:48:41] there are 4 repos I think. striker, striker-deploy, striker-static, and striker-... something
[23:49:52] legoktm: all of these https://gerrit.wikimedia.org/r/#/c/303218/4/zuul/layout.yaml
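On bd808's question about moving a repo to a different gate-and-submit queue: in Zuul v2, projects that have any job in common in a dependent pipeline end up sharing a change queue, so giving a repo its own job names (rather than the generic shared ones) is what moves it out of the big queue. Below is a simplified sketch of that grouping rule with hypothetical project and job names; it ignores job templates and other layout features.

```python
def shared_queues(project_jobs):
    """Group projects into shared change queues: two projects share a
    queue if they have any job in common, directly or transitively."""
    queues = []  # each entry: (set_of_projects, set_of_jobs)
    for project, jobs in project_jobs.items():
        jobs = set(jobs)
        merged_projects, merged_jobs = {project}, set(jobs)
        remaining = []
        for q_projects, q_jobs in queues:
            if q_jobs & jobs:                # overlapping jobs -> same queue
                merged_projects |= q_projects
                merged_jobs |= q_jobs
            else:
                remaining.append((q_projects, q_jobs))
        remaining.append((merged_projects, merged_jobs))
        queues = remaining
    return [sorted(p) for p, _ in queues]


# Hypothetical examples: sharing the generic tox-jessie job keeps the
# striker repo in the same queue as mediawiki/core; a repo-specific
# job name would give it a queue of its own.
print(shared_queues({
    "mediawiki/core": ["mediawiki-core-phpcs", "tox-jessie"],
    "labs/striker": ["tox-jessie"],
}))
print(shared_queues({
    "mediawiki/core": ["mediawiki-core-phpcs", "tox-jessie"],
    "labs/striker": ["striker-tox-jessie"],
}))
```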