[02:12:16] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[04:22:04] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[04:23:08] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:27:59] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38948 bytes in 0.519 second response time
[05:08:43] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[05:49:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:53:57] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38934 bytes in 0.888 second response time
[06:05:08] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:10:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38941 bytes in 0.537 second response time
[06:12:21] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[06:32:40] PROBLEM - Puppet failure on integration-publisher is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[07:07:43] RECOVERY - Puppet failure on integration-publisher is OK: OK: Less than 1.00% above the threshold [0.0]
[07:31:46] does anybody know what's going on with git deploy / git fat on tin? I try to deploy and it takes a very long time, and the deployment does not work at the end
[08:22:35] looks like trebuchet/salt is not working on wdqs1001... could anybody take a look at it?
[08:34:01] Yippee, build fixed!
[08:34:01] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #777: FIXED in 23 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/777/
[08:36:42] Release-Engineering-Team, operations: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1792656 (Smalyshev) NEW
[09:15:14] Continuous-Integration-Config: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1792695 (zeljkofilipin)
[09:17:59] Continuous-Integration-Config: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1792697 (zeljkofilipin)
[09:18:15] Continuous-Integration-Config: Cucumber linter should run for all repositories that contain Cucumber code - https://phabricator.wikimedia.org/T58251#1792699 (zeljkofilipin)
[09:19:47] Continuous-Integration-Config: Move Bundler Jenkins jobs to Nodepool instances - https://phabricator.wikimedia.org/T114860#1792707 (zeljkofilipin)
[09:19:50] Continuous-Integration-Config: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1788669 (zeljkofilipin)
[09:57:25] PROBLEM - Host deployment-cache-parsoid04 is DOWN: CRITICAL - Host Unreachable (10.68.19.197)
[09:57:50] Beta-Cluster-Infrastructure, Blocked-on-RelEng, operations, HHVM, Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1792734 (Joe)
[11:08:00] Browser-Tests, Wikidata, Patch-For-Review: [Task] Adjust browsertests for references - https://phabricator.wikimedia.org/T92249#1792884 (Tobi_WMDE_SW)
[11:08:16] Browser-Tests, Wikidata, Patch-For-Review: [Task] Adjust browsertests for references - https://phabricator.wikimedia.org/T92249#1103887 (Tobi_WMDE_SW)
[11:08:57] Browser-Tests, Wikidata, Patch-For-Review: [Task] Adjust browsertests for references - https://phabricator.wikimedia.org/T92249#1103887 (Tobi_WMDE_SW)
[11:09:22] !log Upgrading Jenkins from LTS 1.609.3 to LTS 1.625.1 https://phabricator.wikimedia.org/T118157
[11:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:16:23] Continuous-Integration-Infrastructure, Jenkins, WorkType-Maintenance: Upgrade Jenkins to LTS 1.625.1 - https://phabricator.wikimedia.org/T118157#1792897 (hashar)
[11:18:07] Continuous-Integration-Infrastructure, operations, Jenkins, WorkType-Maintenance: Please refresh Jenkins package on apt.wikimedia.org to 1.625.1 - https://phabricator.wikimedia.org/T118158#1792900 (hashar) NEW
[11:19:34] Continuous-Integration-Infrastructure, Jenkins, WorkType-Maintenance: Upgrade Jenkins to LTS 1.625.1 - https://phabricator.wikimedia.org/T118157#1792909 (hashar) Open→stalled I have upgraded Jenkins on gallium. This task is now pending refresh on apt.wikimedia.org which is {T118158}
[11:20:27] Browser-Tests, Wikidata, Patch-For-Review, Wikidata-Sprint-2015-11-03: [Task] Adjust browsertests for references - https://phabricator.wikimedia.org/T92249#1792915 (Jonas)
[11:39:22] Deployment-Systems, Release-Engineering-Team, operations: deployment broken on wdqs1001 - https://phabricator.wikimedia.org/T118148#1793022 (hashar)
[12:56:39] !log restarting Jenkins to refresh the cli-shutdown.groovy script -- https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064)
[12:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[13:01:27] Beta-Cluster-Infrastructure, operations: Can't apply ::role::logging::mediawiki on a trusty host - https://phabricator.wikimedia.org/T98627#1793203 (hashar) Open→Invalid a: hashar deployment-fluorine has been rebuild as a Precise host. No point in keeping this task around, whenever one migrates it...
[13:05:15] Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#1793208 (hashar)
[13:05:58] Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#771462 (hashar) https://gerrit.wikimedia.org/r/#/c/148112/ is still cherry picked on deployment-bastion in `/srv/deployment/scap/scap`
[13:07:59] Beta-Cluster-Infrastructure, Deployment-Systems, Patch-For-Review, WorkType-Maintenance: beta-scap-eqiad mira / deployment-bastion permissions problem - https://phabricator.wikimedia.org/T117016#1793218 (hashar) p: Unbreak!→High
[13:08:05] Beta-Cluster-Infrastructure, MediaWiki-extensions-OAuth, Pywikibot-OAuth, Patch-For-Review, and 2 others: "Nonce already used" regularly occurring on beta cluster - https://phabricator.wikimedia.org/T109173#1793220 (hashar) p: Unbreak!→High
[13:08:11] Beta-Cluster-Infrastructure: Creating wiki at beta cluster for the Dutch Wikipedia - https://phabricator.wikimedia.org/T118005#1793221 (hashar) p: Triage→Normal
[13:10:17] Beta-Cluster-Infrastructure: Setup a Swift cluster on beta-cluster to match production - https://phabricator.wikimedia.org/T64835#1793223 (hashar) a: Andrew→None Unassigning, since this task is stalled pending blocking task T114998.
[13:11:02] Beta-Cluster-Infrastructure, Reading Web Planning, Easy, Mobile, Patch-For-Review: MobileFrontEnd on Beta Cluster should display a different logo so that it is clearly not a production site - https://phabricator.wikimedia.org/T115078#1793228 (hashar) p: Normal→Low
[13:11:32] Beta-Cluster-Infrastructure, operations, Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#1793233 (hashar) Open→stalled a: Nikerabbit→None
[13:17:23] Continuous-Integration-Infrastructure, Upstream, Zuul: zuul status page has double underline in Firefox due to abbr styles - https://phabricator.wikimedia.org/T109747#1793268 (hashar) Open→stalled
[13:21:56] Continuous-Integration-Infrastructure: Store Jenkins build output outside Jenkins (e.g. static storage) - https://phabricator.wikimedia.org/T53447#1793289 (hashar) Open→stalled
[13:41:21] Continuous-Integration-Scaling, operations, Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1793341 (hashar) a: hashar
[14:29:15] Continuous-Integration-Scaling, operations, Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1793414 (hashar) I tried but I eventually give up. The toolchain is just too complicated for me to figure out. So at first the /debian/ source...
[14:42:14] Browser-Tests, MediaWiki-extensions-UniversalLanguageSelector, Ruby: Update UniversalLanguageSelector mediawiki_selenium Ruby gem to version 1.x - https://phabricator.wikimedia.org/T117976#1793432 (zeljkofilipin)
[14:42:22] Browser-Tests, MediaWiki-extensions-TwnMainPage, Ruby: Update TwnMainPage mediawiki_selenium Ruby gem to version 1.x - https://phabricator.wikimedia.org/T117977#1793434 (zeljkofilipin)
[14:42:33] Browser-Tests, MediaWiki-extensions-Translate, Patch-For-Review, Ruby: Update Translate mediawiki_selenium Ruby gem to version 1.x - https://phabricator.wikimedia.org/T117978#1793435 (zeljkofilipin)
[14:42:42] Browser-Tests, Collaboration-Team-Backlog, MediaWiki-extensions-PageCuration, Ruby: Update PageTriage mediawiki_selenium Ruby gem to version 1.x - https://phabricator.wikimedia.org/T117979#1793436 (zeljkofilipin)
[14:42:54] Browser-Tests, MediaWiki-extensions-PdfHandler: Update PdfHandler mediawiki_selenium Ruby gem to version 1.x - https://phabricator.wikimedia.org/T117980#1793437 (zeljkofilipin)
[14:43:06] Browser-Tests, MediaWiki-extensions-GettingStarted: Delete or fix failed GettingStarted browsertests Jenkins job - https://phabricator.wikimedia.org/T94154#1793439 (zeljkofilipin)
[14:43:15] Browser-Tests, CirrusSearch, Discovery: Upgrade CirrusSearch browser tests to use mediawiki_selenium 1.x - https://phabricator.wikimedia.org/T99653#1793440 (zeljkofilipin)
[14:43:24] Browser-Tests, VisualEditor: Delete or fix failed VisualEditor browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94162#1793441 (zeljkofilipin)
[14:43:32] Browser-Tests, MediaWiki-extensions-MultimediaViewer: Upgrade MultimediaViewer browser tests to use mediawiki_selenium 1.x - https://phabricator.wikimedia.org/T99659#1793442 (zeljkofilipin)
[14:43:43] Browser-Tests, Collaboration-Team-Backlog, Flow: Fix or delete failing Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1793444 (zeljkofilipin)
[14:43:50] Browser-Tests, Collaboration-Team-Backlog, Echo: Fix or delete failing Echo browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94152#1793445 (zeljkofilipin)
[14:43:57] Browser-Tests, MediaWiki-extensions-MultimediaViewer: Fix failed MultimediaViewer browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94157#1793446 (zeljkofilipin)
[14:44:07] Browser-Tests, Multimedia, UploadWizard: Fix failed UploadWizard browsertests Jenkins job - https://phabricator.wikimedia.org/T94161#1793447 (zeljkofilipin)
[14:47:08] Release-Engineering-Team, Tracking: Fix easy problems reported by RuboCop - https://phabricator.wikimedia.org/T91485#1793451 (zeljkofilipin)
[14:56:55] Beta-Cluster-Infrastructure: Setup a Swift cluster on beta-cluster to match production - https://phabricator.wikimedia.org/T64835#1793455 (yuvipanda) It isn't blocked on T114998 since that's for labs generally and would be setup totally different from production, while this should be set up *on VMs* and conf...
[15:08:30] Continuous-Integration-Config: Move Bundler Jenkins jobs to Nodepool instances - https://phabricator.wikimedia.org/T114860#1793512 (zeljkofilipin) a: zeljkofilipin
[15:08:39] Continuous-Integration-Infrastructure: Delete ruby2.0lint job and only run bundle-rubocop job for repositories with Ruby code - https://phabricator.wikimedia.org/T114262#1793513 (zeljkofilipin) a: zeljkofilipin
[15:09:56] Browser-Tests, MediaWiki-extensions-GettingStarted, Patch-For-Review: Upgrade GettingStarted browser tests to use mediawiki_selenium 1.x - https://phabricator.wikimedia.org/T99655#1793515 (zeljkofilipin)
[15:11:24] Browser-Tests, MediaWiki-extensions-GettingStarted, Patch-For-Review, Ruby: Upgrade GettingStarted browser tests to use mediawiki_selenium 1.x - https://phabricator.wikimedia.org/T99655#1295863 (zeljkofilipin)
[15:13:00] Release-Engineering-Team, Ruby, Tracking: Fix easy problems reported by RuboCop - https://phabricator.wikimedia.org/T91485#1793528 (zeljkofilipin)
[15:13:08] Release-Engineering-Team, Browser-Tests-Infrastructure, Ruby, Tracking: Update repositories that use mediawiki_selenium Ruby gem to version 1.x - https://phabricator.wikimedia.org/T94083#1793529 (zeljkofilipin)
[15:14:16] Continuous-Integration-Infrastructure: Delete ruby2.0lint job and only run bundle-rubocop job for repositories with Ruby code - https://phabricator.wikimedia.org/T114262#1793542 (zeljkofilipin)
[15:14:17] Continuous-Integration-Config: Move Bundler Jenkins jobs to Nodepool instances - https://phabricator.wikimedia.org/T114860#1793541 (zeljkofilipin)
[15:16:43] Continuous-Integration-Config: Delete ruby2.0lint job and only run bundle-rubocop job for repositories with Ruby code - https://phabricator.wikimedia.org/T114262#1793557 (zeljkofilipin)
[15:17:34] Browser-Tests, Continuous-Integration-Config, Ruby: Cucumber linter should run for all repositories that contain Cucumber code - https://phabricator.wikimedia.org/T58251#1793564 (zeljkofilipin)
[15:17:50] Browser-Tests, Continuous-Integration-Config, Ruby: Delete ruby2.0lint job and only run bundle-rubocop job for repositories with Ruby code - https://phabricator.wikimedia.org/T114262#1689876 (zeljkofilipin)
[15:18:03] Continuous-Integration-Config, Puppet, Ruby: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1793567 (zeljkofilipin)
[15:18:19] Browser-Tests, Continuous-Integration-Config, Ruby: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1793568 (zeljkofilipin)
[15:18:31] Continuous-Integration-Config, Ruby: Move Bundler Jenkins jobs to Nodepool instances - https://phabricator.wikimedia.org/T114860#1793569 (zeljkofilipin)
[15:19:24] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #235: FAILURE in 1 min 23 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/235/
[15:30:44] Browser-Tests, Continuous-Integration-Config, Ruby: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1793585 (zeljkofilipin)
[15:32:20] Browser-Tests, Continuous-Integration-Config, Ruby: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1788669 (zeljkofilipin)
[15:34:35] greg-g: I am, but it's auto-archived. However I've got an e-mail from zeljkof so I'll go look.
[15:36:26] James_F: hi there! :)
[15:36:38] ping me if you have questions/comments, I should be around for a few more hours
[15:43:55] zeljkof: I will have time to look at it in 1–2 days, hopefully.
[15:44:21] James_F: great, let me know when you have the time
[15:45:03] zeljkof: Sorry for slowness.
[15:45:58] James_F: no problem, not that I was really fast, everybody is busy
[15:47:02] :-(
[15:54:42] Continuous-Integration-Config, Differential, Patch-For-Review: Allow `doc-publish` to be run without zuul dependency - https://phabricator.wikimedia.org/T117770#1793722 (thcipriani) @hashar implemented your suggestion from T117770#1785791 @dduvall and I tested this by creating two jobs, one of which...
[16:15:18] Browser-Tests, Continuous-Integration-Config, Patch-For-Review, Ruby: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1793799 (zeljkofilipin)
[16:28:25] (PS1) Zfilipin: Run Ruby jobs using Rake [integration/config] - https://gerrit.wikimedia.org/r/251979 (https://phabricator.wikimedia.org/T114262)
[16:29:51] (CR) Zfilipin: "Looks like it works fine for Translate repo:" [integration/config] - https://gerrit.wikimedia.org/r/251979 (https://phabricator.wikimedia.org/T114262) (owner: Zfilipin)
[16:30:31] (PS3) Thcipriani: Use $JOB_NAME-$BUILD_NUMBER in place of $ZUUL_UUID [integration/config] - https://gerrit.wikimedia.org/r/251442 (https://phabricator.wikimedia.org/T117770)
[17:16:51] Deployment-Systems, Scap3: Document Scap3 post-stage checks - https://phabricator.wikimedia.org/T116636#1793993 (mmodell)
[17:17:07] hashar: Hi, it seems integration Zuul is taking a long time to start Jenkins jobs. I uploaded the patch to https://gerrit.wikimedia.org/r/#/c/212981/ and tests are not running yet.
[17:17:32] hashar: The test is now running, but it was slow to start.
[17:17:46] Probably because it's busy
[17:17:49] paladox: that is running apparently https://integration.wikimedia.org/zuul/
[17:18:14] hashar: Yes, and Reedy: no, there was only one running.
[17:18:44] You've uploaded 4 or 5 patchsets to that change in half an hour
[17:19:25] Well I was looking at zuul instead of waiting for the results to be published on jenkins. So I fixed errors as the tests ran.
[17:19:26] Deployment-Systems, Scap3: Document Scap3 post-stage checks - https://phabricator.wikimedia.org/T116636#1793999 (mmodell) http://home.buffers.us:8001/checks.html
[17:19:37] paladox: you should really run them locally
[17:19:45] hashar: Ok.
[17:19:54] in most cases it is about running: composer install; npm test; composer test;
[17:20:09] then once happy with the changes locally, push them for review :-}
[17:21:30] paladox: and as you mentioned in a bug this weekend, there is indeed a slight delay before starting the jobs. A few seconds on each job whenever the event is received from Gerrit by Zuul.
[17:21:47] paladox: haven't looked at it though. Does not seem to cause too many issues
[17:21:52] hashar: Ok.
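For reference, hashar's "run them locally" advice above usually amounts to the conventional entry points below. A minimal sketch from the root of an extension's working copy; it assumes the repository defines `test` scripts in its composer.json and package.json, which varies per repository:

```
composer install   # install PHP dev dependencies (phpcs, phplint, ...)
composer test      # run the repository's PHP lint/style checks
npm install        # install Node dev dependencies (grunt, jshint/eslint, ...)
npm test           # run the repository's JS/CSS lint checks
```

Running these before each push catches most of the failures CI would report, without burning a Zuul slot on every patchset.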
[17:24:53] Browser-Tests, Continuous-Integration-Config, Patch-For-Review, Ruby: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1794026 (zeljkofilipin)
[17:32:18] Browser-Tests, Continuous-Integration-Config, Patch-For-Review, Ruby: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1794045 (zeljkofilipin)
[17:34:26] Browser-Tests, Continuous-Integration-Config, Patch-For-Review, Ruby: Add Rakefile to repositories with Ruby code - https://phabricator.wikimedia.org/T117993#1794048 (zeljkofilipin)
[17:42:20] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[17:50:58] Beta-Cluster-Infrastructure, Blocked-on-RelEng, operations, HHVM, Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1794121 (mmodell) Is this really blocked on #blocked-on-releng?
[17:51:59] Deployment-Systems, Scap3: Document Scap3 post-stage checks - https://phabricator.wikimedia.org/T116636#1794124 (mmodell) a: mmodell
[17:59:25] Release-Engineering-Team, Staging, releng-201415-Q3: [Quarterly Success Metric] Green nightly builds on the staging cluster (tracking) - https://phabricator.wikimedia.org/T88701#1794163 (thcipriani) We had a discussion about this during the RelEng team meeting today. A couple items of note: 1. The s...
[18:23:20] Scap3: Remove apache dependency from scap3 deployment host - https://phabricator.wikimedia.org/T116630#1794258 (mmodell) p: Normal→Low
[18:26:03] Deployment-Systems: [scap] New command to sync all of the files touched in a given commit - https://phabricator.wikimedia.org/T108132#1794263 (mmodell) p: Triage→Normal
[18:28:38] Deployment-Systems, Scap3: default lock file for scap3 should be repo-dependent - https://phabricator.wikimedia.org/T116208#1794266 (mmodell) I recently added a sync.flag (in D36), I think that could take the place of a lock file in /var/lock/scap. @thcipriani: Does that sound reasonable to you? The lock...
[18:29:02] Deployment-Systems, Scap3: default lock file for scap3 should be repo-dependent - https://phabricator.wikimedia.org/T116208#1794278 (mmodell) a: mmodell
[18:58:20] Deployment-Systems, Scap3: Need a way to see config diffs in Scap - https://phabricator.wikimedia.org/T118206#1794339 (thcipriani) NEW a: mmodell
[19:25:56] Deployment-Systems, Scap3: default lock file for scap3 should be repo-dependent - https://phabricator.wikimedia.org/T116208#1794402 (thcipriani) @mmodell, sounds like merging the two lock files is a reasonable approach. While both files have different semantics, both have the same effect of locking deploy...
[19:55:16] Project beta-code-update-eqiad build #80119: FAILURE in 2 min 15 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/80119/
[20:06:56] twentyafterfour: Yet another ping to ensure that you are aware of https://gerrit.wikimedia.org/r/#/c/251130/ for the wmf.6 train tomorrow. It sets up a new extension for l10n on prod
[20:09:12] bd808: thanks, I am aware of it now
[20:09:19] awesome
[20:09:33] (I was sorta aware before, but now I'm reminded)
[20:09:51] and if we messed it up please poke me or jhobs for fixes
[20:27:26] Browser-Tests, Wikidata, Patch-For-Review, Wikidata-Sprint-2015-11-03: [Task] Adjust browsertests for references - https://phabricator.wikimedia.org/T92249#1794534 (Jonas) a: Jonas
[21:20:44] Project browsertests-QuickSurveys-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #65: FAILURE in 4 min 43 sec: https://integration.wikimedia.org/ci/job/browsertests-QuickSurveys-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/65/
[22:24:56] twentyafterfour: did bd808 ping you around having this patch (https://gerrit.wikimedia.org/r/#/c/251130/) ride the train tomorrow?
[22:25:14] twentyafterfour: and if so, does everything look good for that to happen?
[22:25:43] I did ping, yes "[13:09] bd808: thanks, I am aware of it now"
[22:25:43] jhobs: it looks ok but I'm not 100% sure what the potential problems would be
[22:26:18] for that first patch, just that we did something incorrectly for adding a new extension
[22:26:19] thanks bryan (I wasn't in this channel earlier, hence why I asked)
[22:26:37] jhobs: *nod*
[22:26:41] and I guess more of what I meant is, can someone +2 it?
[22:26:57] not until the branch is cut and deployed
[22:27:11] config changes have to be made live immediately
[22:27:12] oh, ok, I thought it worked the other way
[22:27:18] gotcha
[22:27:40] the other change to add the extension to the branch process already landed
[22:27:43] bd808: I am not terribly familiar with the procedure for adding new extensions so I can only give a sanity check
[22:28:06] alright, I'll be sure to be around for the train deployment tomorrow then in case there are any issues for some reason
[22:28:16] twentyafterfour: ah good point.
[22:28:17] even if there is a guard
[22:28:22] there is still extension-list
[22:28:39] ostriches: can you sanity check https://gerrit.wikimedia.org/r/#/c/251130/
[22:29:46] thanks for the help on this btw bd808, I'm completely unfamiliar with the process for extension deployment, so I'm just trying to stay on top of what needs to be done :)
[22:30:05] bd808: Can you give me a nutshell about beta vs alpha, whether there will be such a thing, what the status/purpose of it is.
[22:30:21] Krinkle: ? context?
[22:30:36] bd808: Looking.
[22:30:55] I am wondering whether we still need the extension-list file
[22:31:24] hashar: I *think* that l10nupdate uses it somehow
[22:31:28] but I may be wrong
[22:31:29] bd808: I'm writing a task about cutting the branch earlier (e.g. Friday), but that means we may wanna change beta to use that branch earlier instead of master for a few days, thus creating a potential need for an alpha that runs master. But regardless, this is a more general contextless enquiry. I saw some code a few months ago about a second beta cluster of some kind and was just curious whether you knew more about it
[22:31:51] bd808: for php maintenance/mergeMessageFileList.php
[22:32:20] though in core in 2012 I added a --extensions-dir parameter to get it to crawl a given path for l10n files
[22:32:21] ad1609059f1d4e5a1e4a748fd2779837316d1f96
[22:32:24] (in core)
[22:32:45] Krinkle: ah. Sounds like a question for the releng team rather than me. But the "other" testing cluster is actually more of a step between beta cluster and prod in my understanding
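As a sketch of the maintenance script hashar refers to above: mergeMessageFileList.php collects every deployed extension's message-file registration into one PHP file for l10n, which is why extension-list still matters. The paths below are illustrative, not the exact production layout:

```
# Read an explicit list of extension setup files and write the merged
# l10n file list; --extensions-dir (the option hashar added in core
# commit ad1609059f1d) can instead crawl a directory of extensions.
php maintenance/mergeMessageFileList.php \
    --list-file /srv/mediawiki/wmf-config/extension-list \
    --output /tmp/ExtensionMessages.php
```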
[22:32:57] forget me anyway
[22:33:02] that is unrelated
[22:33:07] * bd808 sends hashar to bed
[22:33:19] bd808: Right, you are Reading now
[22:33:33] So, there was a discussion about staging/beta/prod in our team mtg this morning.
[22:33:35] twentyafterfour: :)
[22:33:41] (mainly how it relates to Q3 planning, but ya)
[22:33:57] * hashar bd808 ACK
[22:34:31] bd808: lgtm.
[22:34:36] cool
[22:36:37] Krinkle: eventually I want to have the deployment branch become long-lived; instead of cutting a new branch we would just continuously merge into the existing branch(es)
[22:36:53] I keep saying jfdi....
[22:37:17] goooooo
[22:37:40] so one wmf branch with tags rather than a zillion branches? Sounds good to me
[22:37:41] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[22:37:45] ostriches: I would, but it isn't totally simple; we would have to stop the practice of cherry picking all the time
[22:37:51] and merge instead of cherry pick
[22:37:53] and get rid of the wmf branches from all other repositories
[22:37:58] twentyafterfour: Go forth, and announce.
[22:38:19] I am not sure how the batch deployment among different wikis will be handled though
[22:38:27] Deployment-Systems, Release-Engineering-Team: Take heat off day before the weekly branch-cut? - https://phabricator.wikimedia.org/T118212#1794712 (Krinkle) NEW
[22:38:33] bd808: ostriches: ^
[22:39:08] This is a quarterly blocker on my continued sanity.
[22:39:09] twentyafterfour: does that also involve collapsing to a single release on the cluster at any given time? Or would you merge and then re-tag somehow?
[22:39:42] https://phabricator.wikimedia.org/T89945
[22:39:47] there is pretty much always going to be a use case for updating the wikipedia branch while testing the group0/1 branch
[22:40:01] Krinkle: Is "Krinkle's sanity" a KPI?
[22:40:01] bd808: I would have two branches
[22:40:10] greg-g: mind if I grab a deploy window for 3 to 4 to test out elasticsearch writes to the labs replica?
[22:40:24] twentyafterfour: greg-g suggested to use multiversion on beta cluster to play with that new branch strategy
[22:40:24] see T89945 (though that shows 3 branches, the new deployment schedule would only call for 2)
[22:41:08] hashar: I've been experimenting with it already locally, but the big problem I keep running into is cherry picked patches
[22:41:11] or a special scap config to sync from and to different folders. would be nice to exercise the workflow
[22:41:21] twentyafterfour: ah. got it. That is similar to the never really realized plan for deploying MW with trebuchet
[22:41:24] ostriches: I think it is entirely reasonable to aim for weekly deployment. However that is a Lie (TM). This is weekly on 1 day a week only. The rest of the week it's a 7-N deployment cadence, reducing to 2, 1 and even <1 day cadence effectively on Monday and Tuesday. That is entirely unreasonable given our current infrastructure and overall review competence and strictness.
[22:42:05] every time swat cherry picks to a deployed branch it screws with git's merge capabilities. We couldn't continue cherry picking with a long lived branch... at least I haven't found a convenient way to resolve merge conflicts with cherry picks conflicting with the original patch
[22:42:32] Then... tell people the format for deploying is going to change on (X) date and cherry-picks are no more as of then
[22:42:43] Make sure [[How to deploy code]] is updated, and you're done.
[22:42:45] Krinkle: I agree with you there should be more time for the code to simmer
[22:42:56] luckily diffusion will get rid of the shiny "cherry-pick" button
[22:43:10] we could attempt merging the cherry pick continuously before deployment and report failure asking for the patches to be rebased
[22:43:21] bd808: indeed
[22:43:22] I am not sure how they are managed though
[22:44:38] Anyway, I don't like the idea of having a branch on the canonical repos anyway. If we had a dedicated "MW Deployment" repo that contained the needed branches/etc, we could jettison 99% of the history.
[22:44:48] Then A) MW wouldn't need to know/care about silly WMF deployments
[22:44:59] and B) The git repo for deploying MW will actually be a manageable size.
[22:45:08] Krinkle: that task has a lot in common with the thinking when the staging cluster was started (pre-reorg I think)
[22:45:26] on paper we could generate the core + skins + vendor + extensions set of commits automatically
[22:45:41] We can also use containers to solve this problem.
[22:45:49] Or node.
[22:45:54] Node is the solution to everything.
[22:46:00] scap npm install mediawiki-express
[22:46:05] ok, with no complaints I'll be taking the 3-4pm deployment window
[22:46:08] We should containerize node, and then use node to make mediawiki containers
[22:46:23] Hehe
[22:46:31] containers don't solve the branching problem at all
[22:46:37] * Krinkle was just trolling
[22:46:42] ostriches: I whole-heartedly agree on the separate repo fork
[22:46:43] But they would because containers, like node, solve everything
[22:46:46] Although only slightly.
[22:46:56] Containers would allow us to build a snapshot and bring that all the way through production
[22:47:05] instead of having beta and production try to mimic each other
[22:47:07] that troll is used so much it is very heard to tell from genuine misunderstanding
[22:47:15] *hard
[22:47:40] ebernhardson: go for it I guess ;)
[22:47:41] So that whatever method is used to compose the different repos, once built, it can be pushed through beta and into various stages, canaries and further as-is
[22:47:51] Krinkle: Beta mimicking production isn't hard at all when people don't do dumb stuff like if $::realm == 'production' ... else ...
[22:48:09] it was a nightmare till we got hiera
[22:48:10] Krinkle: repo-authoritative mode in hhvm would be similar to containers in that regard
[22:48:28] hashar: Which is why pre-hiera we had to write dumb if/else blocks
[22:48:32] nowadays it is pretty straightforward. The only issue we have is much of the new stuff is done on production first and eventually fixed on beta later, if people care about it
[22:48:35] But in a post-hiera world it's nbd.
[22:48:44] ostriches: yup, my point exactly :-}
[22:48:53] ostriches: it is a much better world
[22:48:57] I meant slightly wider. E.g. mediawiki config, core extensions and skins and perhaps more would be frozen in a container, and cherry-picking or changing branches or whatever results in a new snapshot that can be tested in beta, canary servers, and then the rest as a single unit.
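hashar's suggestion above, continuously trial-applying the pending cherry picks before the deploy and reporting failures early, could be sketched roughly like this; the branch name matches the wmf branches discussed below, and $PICK is a hypothetical SHA-1 of a pending SWAT change:

```
# Rebuild a throwaway copy of the candidate branch, replay one pending
# cherry-pick onto it, and fail early if it no longer applies cleanly.
PICK=deadbeef1234   # hypothetical SHA-1 of a pending cherry-pick
git fetch origin
git checkout -B trial-merge origin/wmf/1.27.0-wmf.5
if ! git cherry-pick --no-commit "$PICK"; then
    echo "ERROR: $PICK no longer applies cleanly; please rebase it" >&2
    git cherry-pick --abort
    exit 1
fi
git reset --hard    # this was only a dry run; discard the trial pick
```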
[22:49:00] facebook deploys their code in a squashfs image
[22:49:10] I said quite a bit about this this morning.
[22:49:11] ostriches: I'd like to see a system that builds the equivalent of /srv/mediawiki-staging into a single git repo that can be prepared off of the prod deploy server(s). I think that's probably similar to what Krinkle is saying containers could help with
[22:49:13] And yeah, that requires a lot of those realm things to go away, and find another home
[22:49:42] bd808: Yeah, absolutely.
[22:49:47] Krinkle: containers are noontide for that
[22:49:52] * not needed
[22:50:05] I agree it's not difficult in theory to align beta with prod. But I've been hearing that for 2+ years, and it's not there yet.
[22:50:23] we can well flatten the various repos in a single repo like wikimedia/mediawiki-deploy.git
[22:50:44] Krinkle: It's not difficult currently either.
[22:50:48] that would contain a flattened version of core / vendor / skins / extensions etc
[22:50:50] It does mean, of course, that backporting a change requires a 1-2 hour pipeline.
[22:50:57] flattening the repos is something that thcipriani and I experimented with on staging, it works well
[22:51:06] the problem twentyafterfour mentioned is handling the cherry picked patches on tin
[22:51:15] and figuring out a way to incorporate them in the flattened repo
[22:51:22] Anyway my energy level is about nil, I'm out for awhile.
[22:51:35] before I "moved on" from deploy tooling, a combined repo was the next step I was interested in taking. I was going to do it as a shadow repository on tin first.
[22:51:36] and make sure we block deployment until the cherry picked patches in merge conflicts have been solved.
[22:51:46] one way to deal with it is to just for merge with --theirs or whatever
[22:52:16] I would first work on creating the flattened repo
[22:52:19] *force-merge
[22:52:23] from there figure out how to handle the cherry pick conflicts
[22:52:30] the flattened repo is almost trivial to create
[22:52:49] I mean there are several strategies. One nice one is git-subrepo
[22:52:51] if we were to build the flattened repo on each patchset being merged, we could replay the cherry pick on top of that and raise a notification / ask for rebase ahead of time
[22:53:04] as soon as a patch ends up in a merge conflict, and way before we actually do the deploy
[22:53:27] hashar: I think we should just ban cherry picking and do merges instead. it solves the problem nicely
[22:53:33] but it's not a convenient button in gerrit
[22:54:09] if I can make a phabricator workflow that does the equivalent of cherry-picks but using a feature branch + merge...
[22:54:15] that would be ideal
[22:54:43] the trouble being a cherry pick from a repo to the flattened repo, isn't it?
[22:55:10] the trouble is when you merge the cherry pick and the branch containing the original copy of the same commit, it always conflicts
[22:55:34] so cherry picks are like landmines that will constantly interfere with any automated merges
[22:56:15] that is because the cherry pick turns out to be an empty commit, right?
[22:56:17] with a fresh branch every week we avoid it because we always start from master and begin a new cycle of cherry picking. but long lived branches are a different situation
[22:56:53] hashar: no, the cherry pick turns out to conflict... git can't resolve cherry picks with their original; as far as git is concerned they are two separate commits that totally conflict with each other
[22:57:13] it can't see that they contain the same change and just ignore one of them...
[22:57:24] Hey all. mwext-VisualEditor-qunit is failing every time in gate but passing every time in test. Any ideas?
[22:58:21] twentyafterfour: isn't that fixed with git cherry-pick --allow-empty ?
[22:58:32] James_F: I'm not sure. Hashar might know?
[22:58:43] alternatively one can preprocess them with git cherry, which only compares the patch
[22:58:56] quite useful to figure out whether some patch is already included
[22:58:58] hashar: I don't think --allow-empty does the trick
[22:59:07] * twentyafterfour didn't know about git cherry
[22:59:20] if you get a statement of the problem in a state and a way to reproduce, I don't mind looking at it
[22:59:26] * twentyafterfour reads up on it
[23:00:01] hashar: it's easy to reproduce, try merging wmf/1.27.0-wmf.4 into wmf.5
[23:00:33] or any wmf* branch into any adjacent wmf* branch
[23:01:49] James_F: a change landed via gate-and-submit though
[23:02:15] James_F: started failing fairly recently, from https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit/buildTimeTrend
[23:02:22] hashar: No, it was force-pushed.
[23:02:27] Yeah.
[23:02:55] https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit/17414/ that one passed ( gate-and-submit , master branch , https://gerrit.wikimedia.org/r/#/c/252014/ )
[23:02:59] though yeah, got force merged
[23:07:16] twentyafterfour: that is rather messy
[23:08:50] hashar: yeah
[23:10:12] James_F: I have put a slave offline ( https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ ), maybe it is corrupted somehow
[23:10:25] hashar: Awesome, thanks.
[23:10:33] James_F: the jenkins console complains about chromium not being reachable after some time, no clue how to debug that :-/
[23:10:48] hashar: "Normally" it happens 10% of the time, when CI is over-loaded.
[23:11:14] hashar: But now it was happening even when there were no other CI jobs running, or browser tests, and the slave was unused except for this job.
[23:11:16] !log Made https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ offline because Chromium/XVFB is unreachable somehow, causing issues for https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit/
[23:11:18] Something new. Eh.
[23:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[23:11:29] James_F: I would think of a race condition of some sort
[23:11:38] Project browsertests-Gather-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #320: FAILURE in 14 min: https://integration.wikimedia.org/ci/job/browsertests-Gather-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/320/
[23:11:42] James_F: but then there is barely any log for the X server / chromium :-(
[23:13:06] James_F: same deal on another slave :-/
[23:13:19] hashar: Yeah.
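As a sketch of the git cherry preprocessing hashar suggests above, applied to the branch merge twentyafterfour describes; the full name of the second branch is an assumption based on his wmf/1.27.0-wmf.4 example:

```
# List commits on wmf.5 that are absent from wmf.4; a leading "-" marks
# commits whose patch (by patch-id) already exists on wmf.4, i.e. old
# cherry-picks that a naive merge of the two branches would trip over.
git cherry -v origin/wmf/1.27.0-wmf.4 origin/wmf/1.27.0-wmf.5

# The same comparison via git log: "=" marks patch-equivalent commits
# on either side of the symmetric difference.
git log --oneline --left-right --cherry-mark \
    origin/wmf/1.27.0-wmf.4...origin/wmf/1.27.0-wmf.5
```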
[23:13:32] * James_F needs to find a cloning vat and print a few spare K.rinkles
[23:19:06] !log put https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ back on
[23:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[23:23:19] James_F: and maybe we can entirely drop the job https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit/
[23:23:53] hashar: I'd love to kill it given it never tells us anything new, AFAICT.
[23:23:54] James_F: VisualEditor is already part of mediawiki-extensions-qunit, which clones and is triggered by several repositories (including VE). Example run https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/19837/consoleFull
[23:24:52] Yeah.
[23:29:44] (PS1) Hashar: Phase out mwext-VisualEditor-qunit [integration/config] - https://gerrit.wikimedia.org/r/252130
[23:30:22] (CR) Jforrester: [C: 1] Phase out mwext-VisualEditor-qunit [integration/config] - https://gerrit.wikimedia.org/r/252130 (owner: Hashar)
[23:32:13] (CR) Hashar: "https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit/ has been flappy for the last hour or so and seems redundant with the me" [integration/config] - https://gerrit.wikimedia.org/r/252130 (owner: Hashar)
[23:32:36] James_F: if you can seek your spare K.rinkles' approval...
[23:32:59] * James_F grins.
[23:33:01] the difference between the jobs is that mwext-VisualEditor-qunit only clones VE
[23:33:06] * James_F nods.
[23:34:10] as for the chromium failure, I have no clue honestly :(
[23:54:52] PROBLEM - Puppet failure on pmcache is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
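A job-removal change like hashar's https://gerrit.wikimedia.org/r/252130 would typically be sanity-checked locally before review; a sketch, assuming a checkout of integration/config with Jenkins Job Builder installed and the job definitions under jjb/ (the repository's actual layout may differ):

```
# Render all Jenkins job definitions to XML without touching the live
# Jenkins master, then confirm the phased-out job is gone.
jenkins-jobs test jjb/ -o /tmp/jjb-output
ls /tmp/jjb-output | grep mwext-VisualEditor-qunit \
    && echo "job still defined" || echo "job phased out"
```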