[00:00:19] Because if 3 requests take 9s each, QUnit will declare a timeout on the test as a whole while the 4th request is pending [00:01:22] RoanKattouw: Yeah. [00:01:42] RoanKattouw: I'll track requests in testrunner, abort on teardown. And change the jqueryMsg test to not continue on fail [00:01:47] but stop after first failure [00:02:04] OK [00:02:22] RoanKattouw: That doesn't address the timeout itself though. That's still an unknown. [00:02:41] Something regressed causing it to time out all of a sudden as of 1-2 days ago. [00:02:46] And quite frequently, too. [00:05:27] Yeah I guess the test would still fail, just less confusingly [00:34:58] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [00:47:16] RoanKattouw: So, while merging my patch, we happened to trigger the error [00:47:16] https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/40426/consoleFull [00:47:31] Because of another bug, it's actually trying it again, so don't worry about re +2 [00:47:45] > Language "jp" failed to load. [00:47:51] That's the first one [00:49:21] Only "ml" and "jp" fail to load [00:53:40] Why is Language "jp" failed to load duplicated a bunch of times? [00:53:59] 10 times [00:54:52] Krinkle: Also, 00:31:45 WARN: 'Pending requests does not match jQuery.active count' [01:03:56] RoanKattouw: It is only requested once, but the error is cached (the promise rather) intentionally [01:04:02] because they're separate tests [01:04:17] RoanKattouw: The count is mismatched because the requests were fired after the test finished. [01:04:23] I put that fix in a separate patch after it [01:05:17] Ooh of course [01:06:20] https://gerrit.wikimedia.org/r/#/c/204426/ [01:07:02] We've got another fresh one https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/40432 [01:07:16] From before that merged though. [01:09:47] Krinkle: Ahmmm [01:09:52] .then( next, abort ).done( ... ) [01:10:12] :D [01:10:21] That means that *first* it'll run the next test *then* this one [01:10:26] Consider a case where the cache is hot [01:10:31] Oh, right. [01:10:32] Now the tests run in reverse order? [01:10:44] Abort running first is fine though. [01:10:54] * Krinkle changes [01:11:45] Yeah that is fine [01:11:53] Well, it still reorders things a bit [01:12:28] abort merely makes it stop processing the queue and invoke QUnit.start, so that when the callstack finishes, it will continue to the next test. [01:12:45] Oh it doesn't cause any assertions to happen? [01:12:51] when written out, it's stop(), ajax... start().. assert... / end of test / [01:13:38] Fixed [01:14:09] What you've now written probably works in practice but isn't conceptually correct [01:14:38] A conceptually correct version would be .then( function() { current done handler } , function () { current fail handler } ).done( next ).fail( abort ); [01:17:01] RoanKattouw: Why?
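For reference, the two Deferred chaining shapes being weighed in the exchange above and below, written out as a sketch. The handler names (assertExpectations, recordFailure, runNextTest, abortQueue) are placeholders rather than the actual mediawiki.jqueryMsg test code, and the $.ajax call merely stands in for the language-load request.

```javascript
// Placeholder handlers; in the real test these are the per-test assertions
// and the test-queue bookkeeping.
function assertExpectations() { /* current test's done handler */ }
function recordFailure() { /* current test's fail handler */ }
function runNextTest() { /* hand off to the next queued language test */ }
function abortQueue() { /* stop the queue and release QUnit's semaphore */ }

var request = $.ajax( { url: mw.util.wikiScript( 'api' ) } );

// Shape 1 (as merged): the handoff to the next test is registered before
// this test's own handler.
request.then( runNextTest, abortQueue ).done( assertExpectations );

// Shape 2 ("conceptually correct"): this test's handlers are attached first,
// and only the chained promise drives the next-test / abort handoff.
// (In practice you would pick one shape or the other, not both.)
request
    .then( assertExpectations, recordFailure )
    .done( runNextTest )
    .fail( abortQueue );
```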
[01:17:18] Because there is no guarantee of order of execution otherwise [01:17:38] The fact that jQuery runs handlers in order of registration and that done, always and then all go into the same pool is an implementation detial [01:17:49] Right [01:17:56] For a.done(b).then(c) a valid implementation could execute c before b [01:18:05] jQuery probably does b first [01:18:18] But as a matter of principle I don't like writing code that relies on such things [01:18:22] Yes, because .then() is just another .done/.fail handler [01:18:29] But I see what you mean [01:18:41] .then() is a fork [01:18:46] from the same promise [01:19:05] which conceptually is resolved at the same time, not before or after per se [01:19:22] I guess we can use .then() for both? [01:19:23] Well yeah but chaining then gives you an actual order guarantee [01:19:31] It has to, because a then handler can change the resolution value or status [01:19:35] or delay resolution [01:19:43] Yeah [01:20:18] You don't really need .then() at the end because the promise isn't being returned to anywhere, but .then( foo, bar ) is shorter than .done( foo ).fail( bar ) [01:22:33] :) [01:29:30] RoanKattouw: ha, it's still not solid. [01:29:35] https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/40440/console [01:29:41] Called start() while already started (test's semaphore was 0 already) [01:29:47] So both done/fail are happening [01:29:52] wat [01:30:03] 10Staging, 5Patch-For-Review: Setup staging-tin as deployment host - https://phabricator.wikimedia.org/T88442#1211256 (10thcipriani) This all has to be puppetized/whatevered, but here are the steps, roughly, to a working mediawiki (sans database, guessing that will come from addWiki.php, still to be investigat... [01:30:32] RoanKattouw: Not on the same promise I guess. [01:30:45] But the end result is that the complete() callback which is QUnit.start, runs twice [01:31:39] Oh [01:31:45] Because it's called by abort() and then again by run()? [01:32:16] Hm.. [01:32:18] I would propose the following [01:32:28] After abort() empties tasks, it should call run() [01:32:33] which will take care of calling complete() [01:32:51] OK. [01:32:53] Then when run() encounters an empty task array and calls complete(), it should unset (set to null) the tasks variable [01:33:07] Which is a flag to run() that it should do nothing [01:33:10] thcipriani, heh, "a working mediawiki (sans database)" [01:33:23] Because they're nothing preventing run() from being called many times in tasks.length === 0 mode [01:33:26] *there's [01:33:38] RoanKattouw: So we want to tolerate it? [01:33:49] I think we should [01:33:52] Seems more robust [01:33:53] Krenair: heh, well, it's not missing tons of config files anymore :) [01:33:57] I particularly enjoyed cp /srv/mediawiki-staging/private/PrivateSettings.php.example /srv/mediawiki-staging/private/PrivateSettings.php [01:34:10] Krenair: wat [01:34:17] RoanKattouw, :) [01:34:19] Or, .example [01:34:21] *oh [01:34:44] https://gerrit.wikimedia.org/r/#/c/197389/2/private/PrivateSettings.php.example [01:34:46] Fun [01:35:09] this is just placeholdery-type-stuffs, I'll get it cleaned up, just wanted to get my notes some place [01:35:26] RoanKattouw: Hm.. Not sure. The test times out and aborts the request, that makes the language request fail and call abort, which calls it to stop. 
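A minimal sketch of the queue-runner fix proposed above: abort() empties the task list and defers to run(); run() calls complete() exactly once when the list is empty and then nulls out tasks, so any later run() calls (for example from a timeout-triggered abort) are tolerated as no-ops. The names follow the conversation, not the actual patch.

```javascript
// Each task is a function that receives "next" and calls it when its async
// work (e.g. loading one language) has finished.
var tasks = [];

function complete() {
    // Must only ever run once per test; this is the QUnit.start() call that
    // was firing twice in the failing build above.
    QUnit.start();
}

function run() {
    if ( tasks === null ) {
        // complete() already ran; ignore any further calls.
        return;
    }
    if ( tasks.length === 0 ) {
        tasks = null;
        complete();
        return;
    }
    tasks.shift()( run );
}

function abort() {
    if ( tasks ) {
        // Drop whatever is still queued and let run() invoke complete().
        tasks = [];
    }
    run();
}
```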
[01:35:39] So yeah, same deal, but different scenario than you thought [01:36:00] When the test times out, it effectively calls start (QUnit internally) [01:36:44] This reminds me of the new fetch() specification. Abort vs cancel. [01:37:03] Meaning, basically what we want is for the request to be removed, without notifying its handlers. [01:37:11] Does one of them invoke the fail handler and the other doesn't? [01:37:14] Cause that would be awesome [01:37:17] Hence quite often people have to write code that does this in the fail() handler: if abort; return [01:37:25] It's so annoying having to deal with fail handlers being run synchronously from .abort() [01:37:30] Yeah exactly [01:37:37] RoanKattouw: Not quite, there's only gonna be one of them (abort or cancel), not both. [01:37:41] I think they'll go with cancel. [01:38:06] Or rather, the broader concept of CancellablePromises, aka Task. [01:38:31] Which also cascades after then() [01:38:52] Oooooh [01:38:54] YES PLEASE [01:39:13] Manually propagating .abort methods is such a pain [01:39:28] In certain cases (when the XHR is not the first promise in the .then chain) it's close to rocket science [01:40:52] RoanKattouw: Yup [01:43:44] RoanKattouw: When you have a few months of time with nothing to do: https://github.com/whatwg/fetch/issues/27 [01:44:49] Jake Archibald and Anne van Kesteren (the two names I recognise) are both involved, so I've got peace of mind. [01:44:53] But it's still early days [01:55:04] RoanKattouw: So, any ideas where the delay might originate or how to find out? [01:55:15] In the backend that is [01:55:16] You mean the root of it timing out? [01:55:20] Yes [01:55:31] apache, mysql, fs, php, .. [02:03:47] 10Continuous-Integration, 7Upstream: Fails npm build failure "File exists: ../esprima/bin/esparse.js" - https://phabricator.wikimedia.org/T90816#1211279 (10Krinkle) 5Open>3stalled a:5Krinkle>3None [02:04:22] Krinkle: So like I said earlier, I think having timestamps in those MW logs could be helpful [02:05:15] 10Continuous-Integration, 7Upstream: Fails npm build failure "File exists: ../esprima/bin/esparse.js" - https://phabricator.wikimedia.org/T90816#1068235 (10Krinkle) Upstream JSCS (markelog) is working on solving it per [my comment on node-jscs issue #883](https://github.com/jscs-dev/node-jscs/issues/883#issuec... [02:06:56] (BTW, why do VE/VE and mw/ext/VE runs depend on each other? They don't seem to have a job in common) [02:07:11] Krinkle: I'm hoping that'll tell us where the server is spending all that time [02:07:36] We may want to do some logging from the client side too, like how long each request takes [02:07:50] RoanKattouw: Hm.. OK. I'll hotpatch wfDebug/LegacyLogger to include an actual timestamp, like we do for error logs like fatal/exception [02:08:02] OK [02:08:17] RoanKattouw: Check https://github.com/wikimedia/integration-config/blob/master/zuul/layout.yaml for common jobs [02:09:21] Krinkle: I'm observing that the jobs that show up on the Zuul dashboard don't seem to have any in common, are you telling me there are more? [02:09:52] Oh Christ [02:09:55] extension-rubylint [02:12:25] RoanKattouw: Ah, yeah :) [02:12:39] RoanKattouw: And maybe jsonlint/jshint, depending on whether it correlates with check pipeline. not sure. [02:12:41] RoanKattouw: actually it's any of npm, jshint, jsonlint, jsduck which are all somewhere in the mediawiki queue [02:13:02] er, not jshint/jsonlint because they're in "check" [02:13:11] legoktm: ve and ve-mw don't have any of those in common.
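Picking up the .abort() discussion from earlier in this log, a hedged illustration of the two pain points mentioned there: the "if abort; return" guard inside a fail handler (because jQuery runs fail handlers synchronously from .abort()), and the manual re-attachment of .abort() once a .then() chain has replaced the original XHR promise. All names besides the jQuery and MediaWiki APIs are invented for the example.

```javascript
var aborted = false;

// The original jqXHR exposes .abort(); anything derived from it via .then()
// does not.
var xhr = $.ajax( {
    url: mw.util.wikiScript( 'api' ),
    data: { action: 'query', format: 'json' }
} );

var promise = xhr.then( function ( data ) {
    return data.query;
}, function ( jqXHR, textStatus ) {
    if ( aborted || textStatus === 'abort' ) {
        // Deliberate aborts should not be reported as failures; without this
        // guard the fail handler fires synchronously from xhr.abort().
        return;
    }
    mw.log.warn( 'Request failed: ' + textStatus );
} );

// Manual propagation: callers holding only "promise" still need a way to
// cancel, so .abort has to be copied across by hand.
promise.abort = function () {
    aborted = true;
    xhr.abort();
};
```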
[02:13:39] it just has to have something in common with anything in the mediawiki queue [02:13:42] it's recursive [02:13:50] Right [02:24:05] 10Continuous-Integration: /var/lib/mysql/ filling up on old Precise slaves due to mysql usage - https://phabricator.wikimedia.org/T94138#1211306 (10Krinkle) Exactly what in `/var/lib/mysql/` was so big? I assume MatchSearch/dvips does not put its core dumps in there. Right now it seems all quite moderately size... [02:27:51] RoanKattouw: It is interesting we're hitting the race condition so often in the related pathces [02:27:53] Almost nice :) [02:28:37] 10Continuous-Integration: /var/lib/mysql/ filling up on old Precise slaves due to mysql usage - https://phabricator.wikimedia.org/T94138#1211320 (10Legoktm) /var/lib/mysql itself wasn't big, just that core dumps were filling up /var so mysql would run out of room. [02:34:56] legoktm: btw, during the ci-triage this week we skipped a few tasks assigned to you. Sorry for the bad timing. But wanna catch up for a minute? [02:35:10] https://phabricator.wikimedia.org/project/sprint/board/401/query/assigned/ [02:35:32] Krinkle: uh, this is probably just a little worse timing, I *just* start SULF :P [02:35:35] started* [02:36:12] OK. no worries [02:51:02] (03PS1) 10Krinkle: qunit: Use MW_SCRIPT_DIR instead of hardcoding "/$BUILD_TAG" [integration/config] - 10https://gerrit.wikimedia.org/r/204440 [02:54:58] (03PS1) 10Krinkle: Remove outdated mention of JOB_NAME in $wgWikimediaJenkinsCI comment [integration/jenkins] - 10https://gerrit.wikimedia.org/r/204441 [02:55:17] (03CR) 10Krinkle: [C: 032] Remove outdated mention of JOB_NAME in $wgWikimediaJenkinsCI comment [integration/jenkins] - 10https://gerrit.wikimedia.org/r/204441 (owner: 10Krinkle) [02:56:27] (03Merged) 10jenkins-bot: Remove outdated mention of JOB_NAME in $wgWikimediaJenkinsCI comment [integration/jenkins] - 10https://gerrit.wikimedia.org/r/204441 (owner: 10Krinkle) [03:11:52] 10Continuous-Integration: Upgrade Zuul server to latest upstream - https://phabricator.wikimedia.org/T94409#1211352 (10Krinkle) [03:11:53] 10Continuous-Integration, 6Release-Engineering, 5Patch-For-Review: Zuul-cloner forgets to clear workspace - https://phabricator.wikimedia.org/T76304#1211351 (10Krinkle) [03:12:07] 10Continuous-Integration, 6Release-Engineering: Zuul-cloner forgets to clear workspace - https://phabricator.wikimedia.org/T76304#1211353 (10Krinkle) 5Open>3stalled [03:12:38] 10Continuous-Integration: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#1211356 (10Krinkle) 5Open>3stalled [05:12:57] 10Continuous-Integration: Creating MySQL tables for MediaWiki sometimes stalled on I/O for several minutes - https://phabricator.wikimedia.org/T96229#1211529 (10Krinkle) 3NEW [05:14:03] 10Continuous-Integration, 10MediaWiki-Database, 10MediaWiki-Installer: Creating MySQL tables for MediaWiki sometimes stalled on I/O for several minutes - https://phabricator.wikimedia.org/T96229#1211541 (10Krinkle) [05:16:42] 10Continuous-Integration: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1211547 (10Krinkle) 3NEW [05:26:08] 10Continuous-Integration, 10MediaWiki-Database, 10MediaWiki-Installer: Creating MySQL tables for MediaWiki sometimes stalled on I/O for several minutes - https://phabricator.wikimedia.org/T96229#1211567 (10Krinkle) [05:46:59] (03CR) 10Krinkle: [C: 032] qunit: Use MW_SCRIPT_DIR instead of hardcoding "/$BUILD_TAG" [integration/config] - 
10https://gerrit.wikimedia.org/r/204440 (owner: 10Krinkle) [05:48:57] (03Merged) 10jenkins-bot: qunit: Use MW_SCRIPT_DIR instead of hardcoding "/$BUILD_TAG" [integration/config] - 10https://gerrit.wikimedia.org/r/204440 (owner: 10Krinkle) [06:41:32] PROBLEM - Puppet failure on deployment-db2 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [07:06:35] RECOVERY - Puppet failure on deployment-db2 is OK: OK: Less than 1.00% above the threshold [0.0] [07:51:57] (03PS3) 10Hashar: E-mail fr-tech for CentralNotice browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/204295 (owner: 10AndyRussG) [07:53:20] 10Browser-Tests, 10Continuous-Integration, 10MediaWiki-extensions-CentralNotice: Fix failing CentralNotice browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94151#1211703 (10hashar) @AndyRussG made the CentralNotice jobs to notify the fundraising tech team! Will definitely help them notice failu... [07:55:27] (03CR) 10Hashar: [C: 032] "I have refreshed all the jobs! Thank you Andy and good luck fixing the broken tests!" [integration/config] - 10https://gerrit.wikimedia.org/r/204295 (owner: 10AndyRussG) [07:57:21] (03Merged) 10jenkins-bot: E-mail fr-tech for CentralNotice browser tests [integration/config] - 10https://gerrit.wikimedia.org/r/204295 (owner: 10AndyRussG) [07:58:31] 10Continuous-Integration: Investigate integration-jjb-config-diff job slowness - https://phabricator.wikimedia.org/T78532#1211705 (10hashar) 5Open>3Resolved a:3hashar Thanks to the reduction in the number of jobs (I think), the build time went down to less than 2 minutes! https://integration.wikimedia.org... [08:09:32] !sal [08:09:32] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [08:14:54] 10Browser-Tests, 6Release-Engineering: Browser tests running against beta all failing because of mw-api-siteinfo.py - https://phabricator.wikimedia.org/T95163#1211720 (10hashar) The root cause is that the CI instances have been migrated to a new DNS resolver which for *.beta.wmflabs.org replied with the public... [08:15:06] 10Continuous-Integration, 6Labs: integration labs project DNS resolver improperly switched to openstack-designate - https://phabricator.wikimedia.org/T95273#1185067 (10hashar) [08:15:08] 10Browser-Tests, 6Release-Engineering: Browser tests running against beta all failing because of mw-api-siteinfo.py - https://phabricator.wikimedia.org/T95163#1211722 (10hashar) [08:15:16] 10Browser-Tests, 6Release-Engineering: Browser tests running against beta all failing because of mw-api-siteinfo.py - https://phabricator.wikimedia.org/T95163#1211724 (10hashar) 5Open>3Resolved a:3hashar [08:23:42] (03PS2) 10Hashar: Remove Wikidata jslint job [integration/config] - 10https://gerrit.wikimedia.org/r/204287 (owner: 10JanZerebecki) [08:23:49] (03CR) 10Hashar: [C: 032] ":)" [integration/config] - 10https://gerrit.wikimedia.org/r/204287 (owner: 10JanZerebecki) [08:25:46] (03Merged) 10jenkins-bot: Remove Wikidata jslint job [integration/config] - 10https://gerrit.wikimedia.org/r/204287 (owner: 10JanZerebecki) [08:58:09] 10Continuous-Integration: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1211751 (10hashar) The disk I/O on labs are not that nice on labs and I think Precise instances have slightly lower I/O capabilities than Trusty ones. Instances runs on different compute nodes which might have diff... 
[09:15:07] (03PS19) 10Hashar: Package python deps with dh-virtualenv [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) [09:15:42] (03CR) 10Hashar: "python-six 1.9.0 is now available on precise-wikimedia so add the package to build-deps and depends so we don't have to get it from pip." [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [09:25:24] (03PS7) 10Hashar: Forward port precise dh-virtualenv to trusty [integration/zuul] (debian/trusty-wikimedia) - 10https://gerrit.wikimedia.org/r/197329 (https://phabricator.wikimedia.org/T48552) [09:25:56] (03CR) 10Hashar: "Rebased on latest version of Precise patch." [integration/zuul] (debian/trusty-wikimedia) - 10https://gerrit.wikimedia.org/r/197329 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [09:32:23] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: Connection refused [09:37:21] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.022 second response time [09:37:51] 6Release-Engineering, 10Wikimedia-Hackathon-2015: Release/QA tasks at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92565#1211781 (10Qgil) @greg, would it be fair to assign this task to you? @rfarrand and @alexcella will have an easier time if each main area has identified owners. [09:38:26] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1211784 (10hashar) @fgiunchedi and I have just finished a 1/1. I have further tweaked the debian files for Precise and... [10:07:11] 6Release-Engineering, 3releng-201415-Q4: RelEng Roadmap April - June 2015 (Q4 2014/2015) - https://phabricator.wikimedia.org/T93955#1211864 (10Qgil) [12:32:06] (03CR) 10Hashar: [C: 032] ignore local.conf [tools/release] - 10https://gerrit.wikimedia.org/r/204289 (owner: 1020after4) [12:32:13] (03Merged) 10jenkins-bot: ignore local.conf [tools/release] - 10https://gerrit.wikimedia.org/r/204289 (owner: 1020after4) [12:33:18] PROBLEM - Puppet staleness on deployment-bastion is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [12:35:29] (03CR) 10Hashar: "So we have some incremental timestamp in the log, though messages using wfDebugLog() do not have the timestamp applied." [integration/jenkins] - 10https://gerrit.wikimedia.org/r/203806 (owner: 10Hashar) [12:37:36] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused [12:48:58] (03CR) 10JanZerebecki: "I deleted the mwext-Wikidata-jslint job from Jenkins." [integration/config] - 10https://gerrit.wikimedia.org/r/204287 (owner: 10JanZerebecki) [13:18:38] (03CR) 10Hashar: [C: 04-1] "That blacklist the people from the 'check' pipeline, they further need to be whitelisted in the 'test' pipeline :) Look at email_whitelis" [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:24:27] (03PS2) 10Aude: Add wikidata people [integration/config] - 10https://gerrit.wikimedia.org/r/204029 [13:24:56] (03CR) 10jenkins-bot: [V: 04-1] Add wikidata people [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:24:58] (03CR) 10Aude: "been a long time since I added anyone here... fixed." 
[integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:32:55] gah [13:34:46] (03CR) 10Hashar: [C: 04-1] Add wikidata people (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:35:21] (03PS3) 10Aude: Add wikidata people [integration/config] - 10https://gerrit.wikimedia.org/r/204029 [13:37:46] (03CR) 10Hashar: [C: 032] "\o/" [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:39:45] (03PS4) 10Hashar: Add wikidata people [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:39:53] (03CR) 10Hashar: [C: 032] Add wikidata people [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:41:49] (03Merged) 10jenkins-bot: Add wikidata people [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:46:28] (03CR) 10Hashar: "Deployed! The volunteers should now have tests triggered for them :-)" [integration/config] - 10https://gerrit.wikimedia.org/r/204029 (owner: 10Aude) [13:49:27] :) [13:53:06] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1212105 (10hashar) Package has been build out of https://gerrit.wikimedia.org/r/#/c/195272/ patchset 19 ``` $ apt-cach... [13:59:17] (03CR) 10Hashar: [C: 032 V: 032] "The Precise package has been uploaded to precise-wikimedia!" [integration/zuul] (debian/precise-wikimedia) - 10https://gerrit.wikimedia.org/r/195272 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [14:04:04] (03PS1) 10Krinkle: Remove mwext-Wikidata-jslint [integration/config] - 10https://gerrit.wikimedia.org/r/204512 [14:04:31] (03CR) 10Krinkle: "Please remove the job from jjb as well. Otherwise we're still re-compiling and deploying it every time. As well as preventing workspaces a" [integration/config] - 10https://gerrit.wikimedia.org/r/204287 (owner: 10JanZerebecki) [14:04:43] (03CR) 10Krinkle: "Done in I17e8b565dee1." [integration/config] - 10https://gerrit.wikimedia.org/r/204287 (owner: 10JanZerebecki) [14:04:59] (03CR) 10Krinkle: [C: 032] Remove mwext-Wikidata-jslint [integration/config] - 10https://gerrit.wikimedia.org/r/204512 (owner: 10Krinkle) [14:05:02] (03PS3) 10Hashar: Package python deps with dh-virtualenv [integration/zuul] (debian/trusty-wikimedia) - 10https://gerrit.wikimedia.org/r/197328 (https://phabricator.wikimedia.org/T48552) [14:06:00] jzerebecki: Removing it from Jenkins manually doesn't do much. It will just be re-created next time jjb is deployed. [14:06:12] Fixed :0 [14:06:13] :) [14:06:15] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1212127 (10hashar) The Trusty version has been build out of https://gerrit.wikimedia.org/r/#/c/197329/ patchset 7 ``` a... [14:06:47] hashar: I'm working on mounting tmpfs at /var/lib/mysql. 
https://phabricator.wikimedia.org/T96230 [14:06:50] (03CR) 10Hashar: [C: 032 V: 032] "Uploaded to apt.wikimedia.org by Filippo" [integration/zuul] (debian/trusty-wikimedia) - 10https://gerrit.wikimedia.org/r/197329 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [14:06:59] Krinkle: great [14:06:59] Not really sure yet, but gonna be fun :) [14:07:12] Krinkle: it might be worth contacting Sean Pringle to figure out whether we can tweak some mysql settings to make it faster [14:07:14] hashar: Any advice with regards to preserving what is in there and making mysql re-create that stuff? [14:07:25] things that come to mind is to have it not sync changes to disk after transactions [14:07:30] since we can afford data lost [14:07:38] Yeah [14:07:48] hashar: I was talking with ori and tim yesterday. [14:07:55] hashar: Ori also suggested libeatmydata [14:07:59] Krinkle: thx. i swear i searched for Wikidata in the repo... [14:08:03] which overwrites the C calls that sync to disk. [14:08:08] +1 :) [14:08:14] But that's quite intrusive [14:08:20] and only does part of the job [14:08:25] might as well just do it in tmpfs entirely, right? [14:08:46] I am sure Sean can find some great settings that would make it faster [14:08:59] (albeit less reliable and potentially subject to data loss but we dont care for testing purposes) [14:09:03] Yeah, but that'll risk changing behaviour from prod. [14:09:20] With tmpfs we know everything is stable, just not safe during a crash (since its in memory) [14:09:21] not necessalry [14:09:23] which is exactly what we want [14:15:02] (03Merged) 10jenkins-bot: Remove mwext-Wikidata-jslint [integration/config] - 10https://gerrit.wikimedia.org/r/204512 (owner: 10Krinkle) [14:28:02] Krinkle: what is the task for the mysql / tmpfs thing? [14:28:17] https://phabricator.wikimedia.org/T96230 [14:28:34] You replied this morning ;-) [14:28:57] you will find out that getting older your short memory tends to garbage collect quite often :D [14:29:37] what is up with puppet on deployment bastion? The log is just full of "E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to correct the problem." 
[14:30:41] 10Continuous-Integration, 7Tracking: MySQL tunning on CI slaves (tracking) - https://phabricator.wikimedia.org/T96249#1212156 (10hashar) 3NEW [14:30:55] 10Continuous-Integration: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1212164 (10hashar) [14:30:56] 10Continuous-Integration, 7Tracking: MySQL tunning on CI slaves (tracking) - https://phabricator.wikimedia.org/T96249#1212163 (10hashar) [14:31:09] created a tracking bug https://phabricator.wikimedia.org/T96249 [14:31:36] 10Continuous-Integration, 7Tracking: MySQL tunning on CI slaves (tracking) - https://phabricator.wikimedia.org/T96249#1212156 (10hashar) [14:31:37] 10Continuous-Integration, 10MediaWiki-Database, 10MediaWiki-Installer: Creating MySQL tables for MediaWiki sometimes stalled on I/O for several minutes - https://phabricator.wikimedia.org/T96229#1212165 (10hashar) [14:31:42] 10Continuous-Integration, 7Tracking: MySQL tunning on CI slaves (tracking) - https://phabricator.wikimedia.org/T96249#1212156 (10hashar) [14:31:49] 10Continuous-Integration, 7Tracking: MySQL tunning on CI slaves (tracking) - https://phabricator.wikimedia.org/T96249#1212156 (10hashar) [14:31:50] 10Continuous-Integration, 10MediaWiki-Unit-tests, 7JavaScript: Apache on Jenkins slave takes over 30s to respond (QUnit/AJAX "Test timed out") - https://phabricator.wikimedia.org/T95971#1212169 (10hashar) [14:32:22] 10Continuous-Integration: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1211547 (10hashar) I have created T96249 as a tracking task. We should further tune the innodb settings as well. [14:32:42] looks like deployment-bastion started having problems with "The following signatures were invalid: BADSIG 40976EAF437D05B5" [14:33:49] looks like this bug: https://phabricator.wikimedia.org/T95541 [14:35:35] !log running dpkg --configure -a on deployment-bastion to correct puppet failures [14:35:40] Logged the message, Master [14:36:41] 10Continuous-Integration, 7Tracking: MySQL tunning on CI slaves (tracking) - https://phabricator.wikimedia.org/T96249#1212179 (10Krinkle) [14:37:16] hashar: https://phabricator.wikimedia.org/T93556 is about tests itself in mediawiki core. They use too much database. Code quality issue. Not CI or mysql perf [14:38:32] 10Browser-Tests, 6Mobile-Web: Issue with Chrome driver with resizing window - https://phabricator.wikimedia.org/T88288#1212180 (10zeljkofilipin) @jdlrobson: I do not know. You are the one that reported the bug, I did not have the time to investigate. [14:39:34] 10Continuous-Integration, 7Tracking: Tune MySQL innodb settings on CI slaves - https://phabricator.wikimedia.org/T96250#1212181 (10hashar) 3NEW [14:40:12] thcipriani: yeah some Precise mirror had bad signatures apparently [14:40:29] hashar: seems to have fixed with the dpkg configure [14:40:34] \o/ [14:40:43] Krinkle: innodb tweaking is https://phabricator.wikimedia.org/T96250 [14:40:59] I am off park time [14:41:05] might show up later in the evening [14:53:18] RECOVERY - Puppet staleness on deployment-bastion is OK: OK: Less than 1.00% above the threshold [3600.0] [14:54:27] Krinkle: what is the difference between gerrit.wikimedia.org/r/ vs gerrit.wikimedia.org/r/p/? [14:54:40] thcipriani: afaik r/ doesn't exist. [14:54:54] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [14:55:25] thcipriani: Hm.. it seems it does. 
[14:55:28] it's weird, I checked out both for mediawiki/vendor and there don't seem to be any differences aside from remote origin [14:55:36] https://git.wikimedia.org/summary/cdb.git advertises r/p/ in the dropdown [14:55:41] https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/core advertises r/ [14:56:47] ah, yeah, I've been looking at gerrit.wikimedia.org, I copied the other two repos from other scripts and all seemed to work fine so I left it. I'll change it for consistency's sake. [14:59:14] <^d> It's a weird redirecting thing [14:59:18] <^d> It's stupid [14:59:20] * ^d stabs gerrit [14:59:27] <^d> /r/p/foo is correct [14:59:35] and the other is a redirect? [14:59:42] <^d> Yeah [14:59:55] got it, thanks! [15:00:07] Interesting that the redirect is for r/ not /p. Since /p is within the gerrit app and /r/ where it is mounted. [15:00:41] <^d> Stop trying to make sense of gerrit :p [15:17:34] 10Continuous-Integration, 6Release-Engineering: Make qunit test failures contain useful and readable information about where does it come from, how did you get there, etc - https://phabricator.wikimedia.org/T96072#1212335 (10greg) [15:18:29] 10Browser-Tests: Should be possible in browser tests to use images with meta data or without meta data - https://phabricator.wikimedia.org/T67274#1212344 (10greg) 5Open>3declined a:3greg [15:18:58] 10Continuous-Integration, 7Tracking: MySQL tunning on CI slaves (tracking) - https://phabricator.wikimedia.org/T96249#1212348 (10Krinkle) In addition to using tmpfs for the mysqld.datadir (T96230), there seems to be quite a lot of information online about using tmpfs for mysqld.tmpdir. * http://www.fromdual.c... [15:20:21] 6Release-Engineering, 10Wikimedia-Hackathon-2015: Release/QA tasks at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92565#1212351 (10greg) a:3greg sure. [15:38:08] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:47:20] 10Browser-Tests, 6Mobile-Web: Issue with Chrome driver with resizing window - https://phabricator.wikimedia.org/T88288#1212451 (10Jdlrobson) 5Open>3Resolved a:3Jdlrobson @chrismcmahon was working on this - he said that this was the reason MobileFrontend's browser tests for overlays were failing. I can't... [15:47:49] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:51:27] 10Continuous-Integration, 10Wikidata: Add Wikidata to Jenkins job mediawiki-extensions-hhvm - https://phabricator.wikimedia.org/T96264#1212470 (10JanZerebecki) 3NEW [15:54:11] 10Continuous-Integration, 10Wikidata: Add Wikidata to Jenkins job mediawiki-extensions-hhvm - https://phabricator.wikimedia.org/T96264#1212470 (10JanZerebecki) Make sure to avoid problems like in T95897. 
[15:58:11] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:59] 10Continuous-Integration, 10Wikidata, 10Wikidata-Sprint-2015-04-07: the changed job configuration extension-unittests -> extension-unittests-generic for Wikidata.git makes it not run all tests and fail - https://phabricator.wikimedia.org/T95897#1212599 (10JanZerebecki) p:5High>3Normal [16:12:50] RECOVERY - Puppet failure on integration-slave-precise-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [16:19:53] 6Release-Engineering, 3releng-201415-Q4: RelEng Roadmap April - June 2015 (Q4 2014/2015) - https://phabricator.wikimedia.org/T93955#1212634 (10greg) 5Open>3Resolved [16:23:01] 10Deployment-Systems, 5Patch-For-Review: [l10n] Use Scap in Localisation Update - https://phabricator.wikimedia.org/T72443#1212644 (10greg) p:5Normal>3Unbreak! [16:23:03] 10Beta-Cluster, 10Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1212647 (10greg) p:5High>3Unbreak! [16:27:53] 10Deployment-Systems, 5Patch-For-Review: [l10n] Use Scap in Localisation Update - https://phabricator.wikimedia.org/T72443#1212663 (10greg) p:5Unbreak!>3Normal [16:27:53] 10Beta-Cluster, 10Deployment-Systems: l10nupdate on Beta Cluster filled /var - https://phabricator.wikimedia.org/T95868#1212664 (10greg) p:5Unbreak!>3Normal [16:38:48] gerrit is being weird for me. It keeps saying I'm logged out [16:39:34] it did so for everybody today, did you get logged out more than once? [16:42:30] it was restarted [16:43:26] sun spots then maybe. I got about 7 "you are logged out" messages in a row but it seems to be working now [16:57:30] separate tabs maybe? all with old sessions? [16:57:30] whatever, moving on :) [17:13:30] 10Continuous-Integration, 5Patch-For-Review: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1212822 (10Krinkle) a:3Krinkle [17:22:33] !log Gracefully depool integration slaves to deploy https://gerrit.wikimedia.org/r/#/c/204528/ (T96230) [17:22:36] Logged the message, Master [17:26:48] <^d> thcipriani: I should just stop saving staging-tin in my known_hosts until we're done blowing it up :p [17:27:34] heh, sorry, I like to verify my notes by blowing up servers :P [17:29:17] <^d> lol [17:35:38] PROBLEM - Puppet failure on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:35:40] PROBLEM - Puppet failure on integration-slave-trusty-1013 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:36:10] <^d> thcipriani: The only thing I'm stuck on is how we'd want puppet to call checkoutMW on an initial setup [17:36:15] <^d> How does it know which branch to pick? [17:36:27] <^d> hiera? We'd have to bump it each time we branch prod... [17:37:19] PROBLEM - Puppet failure on integration-slave-precise-1013 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:37:48] PROBLEM - Puppet failure on integration-slave-precise-1011 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:38:04] <^d> (technically. in practice only anytime we'd need a new tin) [17:38:19] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [17:38:24] PROBLEM - Puppet failure on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:39:22] readlink -f php on master? 
dunno, hiera seems the least sloppy in thinking about for a few minutes. [17:39:34] PROBLEM - Puppet failure on integration-slave-precise-1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:39:58] PROBLEM - Puppet failure on integration-slave-trusty-1015 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:47:42] <^d> thcipriani: Although if we decouple the clone from the checkout (cc twentyafterfour) then it's easier to do the clone bits at checkout [17:47:49] <^d> And actually, we could... [17:47:52] <^d> Hmmm [17:48:31] <^d> You could just clone an initial copy of MW + extensions into /srv. Stuff like checkoutMW could use those as an initial reference [17:48:48] PROBLEM - Puppet failure on integration-slave-precise-1014 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:50:57] PROBLEM - Puppet failure on integration-slave-trusty-1016 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:52:09] PROBLEM - Puppet failure on integration-dev is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [17:55:41] ^d: my guess is you'd probably want to do this in php, use wikiversions.json, create a wrapper for checkoutMW. Even though I feel like it's terrible to add more scripts in wmf-config. [17:55:56] * ^d sighs [17:57:01] !log Repooled instances. Converstion of mysql.datadir to tmpfs worked, but puppet run has errors. Coren and Krinkle working on it. https://gerrit.wikimedia.org/r/#/c/204528/ (T96230) [17:57:05] Logged the message, Master [18:02:22] ^d: thcipriani: doesn't staging-tin need to run master? [18:02:44] <^d> It does, but I'm curious what you'd run checkoutMW with on a fresh production tin [18:02:55] <^d> (assuming we want to puppetize that setup more) [18:07:26] well you can run activeMWVersions to get the two current branches [18:16:08] 6Release-Engineering, 10MediaWiki-Vagrant: Vagrant command for running browser tests - https://phabricator.wikimedia.org/T96283#1213237 (10dduvall) 3NEW [18:16:29] 6Release-Engineering, 10MediaWiki-Vagrant: Vagrant command for running browser tests - https://phabricator.wikimedia.org/T96283#1213246 (10dduvall) p:5Triage>3Low [18:16:35] <^d> twentyafterfour: I need to clean up how we load wikiversions.json [18:16:45] <^d> We'll need a different one for staging too [18:16:59] <^d> (also, getRealmStuff() is only labs/prod/dc) [18:19:02] I don't particularly like anything about the way multiversion works. it's really hacky [18:21:51] but I spent quite a bit of time going through the code on vagrant and I couldn't come up with a cleaner way of doing it really... the whole mediawiki initialization process is very messy [18:22:47] twentyafterfour: hey, when's the next scap deployment? [18:23:27] i got caught up in playing with js + elasticsearch and forgot i was supposed to sit in :) [18:24:04] marxarelli: my deployments are done for the week (tuesday and wednesday are the train deployments) [18:24:41] doh. my homework is incomplete then. just like the old days [18:24:54] The wednesday deployment is the messy one because that's when a new branch is cut and the old branch's patches have to be ported over [18:25:35] <^d> twentyafterfour: Nobody is married to multiversion. [18:25:38] RECOVERY - Puppet failure on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [18:25:42] <^d> :) [18:26:18] twentyafterfour: and trebuchet? 
[18:29:01] marxarelli: I've only used trebuchet once and it didn't really work [18:29:34] ^d: right, but I can't complain much since I can't think of a better way to do it really [18:29:53] <^d> Unless we just create all the wikis in staging and beta too ;-) [18:30:04] <^d> Who cares if they're all empty :p [18:30:50] create all the wikis? how does that change multiversion? [18:31:26] !log Rebooting integration-slave-precise-1012 and integration-slave-trusty-1012 [18:31:29] Logged the message, Master [18:33:30] marxarelli: I can show you a fake-ish trebuchet deploy on staging if you're curious. ^d were you going to demo a trebuchet deploy live-ish? [18:33:50] <^d> It would basically have the same effect. [18:33:55] <^d> I was going to no-op something [18:34:10] i'm down for either at any time today [18:35:10] sure, but it'd be good to see like the .deploy file and the tag that is created, etc. Also, I've realized reporting from trebuchet ain't so rosy, so it's good to see that. [18:36:37] marxarelli: if you want to do a hangout, I can give you a demo/brain dump if you're interested in my limited understandings. [18:37:10] thcipriani: yeah, that'd be rad [18:37:23] thcipriani: you available now-ish (after i make coffee that is) [18:37:25] ? [18:37:42] I'd probably like to sit in on this as well ... [18:39:00] sure: 11:45 work? (coffee seems like a good idea) [18:41:47] works for me [18:44:23] thcipriani, marxarelli: there should be a "test/test" trebuchet target that you can play with all you want (both in prod and beta cluster) [18:45:04] bd808: but I'm less scared of breaking staging :P [18:45:17] fair [18:46:23] marxarelli: twentyafterfour getting hangout setup if ff would stahp [18:49:06] thcipriani: can you re-send that invite? [18:49:11] yup [18:56:15] 10Continuous-Integration, 5Patch-For-Review: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1213380 (10Krinkle) Running `slave-scripts/bin/mw-install-mysql.sh` and `slave-scripts/bin/mw-teardown-mysql.sh` alternatingly on a slave with `/var/lib/mysql` as tmpfs and on another slave wit... [18:58:25] RECOVERY - Puppet failure on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [18:59:33] RECOVERY - Puppet failure on integration-slave-precise-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [19:03:20] hashar: https://phabricator.wikimedia.org/T96230#1213380 [19:03:48] I hope that with the entire test run (not just PrefixSeachTest) it'll make a bigger impact. [19:04:46] it's too early to be sure, but from a sample run of 5. It seems to save 50%! 
[19:04:49] https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/buildTimeTrend [19:04:53] RECOVERY - Puppet failure on integration-slave-trusty-1015 is OK: OK: Less than 1.00% above the threshold [0.0] [19:04:58] 14min and 17min ->>>> 7min and 9min [19:05:39] RECOVERY - Puppet failure on integration-slave-trusty-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [19:07:17] RECOVERY - Puppet failure on integration-slave-precise-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [19:07:47] RECOVERY - Puppet failure on integration-slave-precise-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [19:08:19] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [19:13:48] RECOVERY - Puppet failure on integration-slave-precise-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [19:15:56] RECOVERY - Puppet failure on integration-slave-trusty-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [19:17:08] RECOVERY - Puppet failure on integration-dev is OK: OK: Less than 1.00% above the threshold [0.0] [19:36:28] Krinkle: well done! just by switching to tmpfs ? [19:36:41] hashar: Yep [19:36:43] Krinkle: beware the tmpfs is only 512MB, no clue how big the mysql data is per job [19:36:59] hashar: This has a separate tmpfs mount [19:37:01] hashar: 256M [19:37:02] see https://gerrit.wikimedia.org/r/#/c/204528/ [19:37:12] oh [19:37:14] (I wrote the commit message) [19:37:49] yeah so the disk is 256MB , each run is 74-90M and we allow 4 concurrent runs [19:38:03] so 4 * 90 = 360 MB > 256 MB :D [19:38:04] 74-90M is when there are 3 jibs running [19:38:08] oh [19:38:22] This was captured with three [19:38:22] concurrent mediawiki-core builds running. [19:38:25] :) [19:39:24] So I expect 120M for 4 concurrent builds (the actual total will be less because 90M is not 3x30, some amount is for MySQL internal tables) [19:39:44] that leaves us 2.1X increase room for larger builds with extensions and sample data [19:40:40] hashar: It was quite tricky to get it deployed. [19:40:46] Because of chicken-egg problem. [19:41:04] mysql server wants to start at boot [19:41:10] Coren fixed it [19:47:42] I commented on the change [19:47:55] might just want to change the mysql datadir and mount at /srv/mysql/datadir or something like that [19:48:27] good to see the perf has improveed [20:08:24] thcipriani, twentyafterfour: was just chatting with a friend who uses ansible over at microsoft. "each step in ansible blocks until all servers have completed. so if one of your steps says "stop the process" it has to wait until every single server finishing stopping until moving on to the next step" [20:08:51] they're deploying to ~ 3k servers [20:09:38] hmm [20:10:07] ^d, thcipriani, twentyafterfour: their deployments have different requirements of course, but do you think it would be beneficial to invite him to a working group session? [20:11:00] sounds like they do tarballs -> s3 <- ec2 instances mostly [20:14:27] <^d> See, we need both behavior really [20:14:36] <^d> We need "keep going and depool X" for like apaches [20:14:43] <^d> Or "oh crap, we're failing, abort!" [20:16:00] how is depooling done now? [20:17:00] edit varnish backends, reload? 
[20:20:21] <^d> Remove from pybal config, profit [20:20:56] <^d> https://config-master.wikimedia.org/pybal/ [20:21:26] <^d> See also: epics for asking for programmatic (de)pooling of nodes [20:22:18] seems like you should be able to batch out various steps to servers, so you don't have to wait for all servers to do step one: http://docs.ansible.com/playbooks_delegation.html#rolling-update-batch-size also seems like there's some configuration for failure percentages (see #maximum-failure-percentage) [20:24:09] oh, huh, there's also some async stuff, too: http://docs.ansible.com/playbooks_async.html [20:24:30] ish, polling [20:27:03] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1213570 (10hashar) I have build the package based on https://gerrit.wikimedia.org/r/#/c/203961/ patchset... [20:29:56] 5Continuous-Integration-Isolation, 6operations: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1213581 (10hashar) I have created a basic Debian package for Nodepool (T89142) and installed it on `labnodepool1001.eqiad.wmnet`. For testing purposes I have created a basic configuration... [20:31:29] RECOVERY - Puppet failure on deployment-parsoidcache02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:51:04] 10Continuous-Integration: Jenkins: Assert no PHP errors (notices, warnings) were raised or exceptions were thrown - https://phabricator.wikimedia.org/T50002#1213644 (10EBernhardson) [21:27:51] 10Continuous-Integration, 6Release-Engineering: Make qunit test failures contain useful and readable information about where does it come from, how did you get there, etc - https://phabricator.wikimedia.org/T96072#1213826 (10hashar) Surely a single line is not useful! Adding a newline before each @ is slightly... [21:38:25] 10Browser-Tests, 10VisualEditor: Delete or fix failed VisualEditor browsertests Jenkins job - https://phabricator.wikimedia.org/T94162#1213850 (10Jdforrester-WMF) a:3Ryasmeen [21:42:18] ^d: so pybal doesn't currently have an api to depool a server? [21:46:10] Nope [21:46:16] Just text files [21:46:24] Meh why isn't ori in this channel [21:47:07] and if it only reloads the text files every minute or two it's less than ideal for rolling deployments [21:47:28] Yeah it sucks [21:48:42] etcd and what not, in our glorious future. [21:51:15] ideally that shouldn't be a central configuration file it should be a dynamic status that the target server can override - e.g. return 503 when the balancer polls the server status [21:52:38] that way the process that does the update can set a local flag on the target machine, do the update, then remove the flag, no complicated interactions with config files, no waiting for things to sync up, and no worry about race conditions. [21:53:37] twentyafterfour: maybe we could tweak it to reload upon sighup or something [21:54:20] marxarelli|lunch: that means sighuping a bunch of times - twice for each target server.. [21:54:47] yeah, still not great [21:55:00] it shouldn't require modifying a file in one location, sighuping in another location, then applying an update in a 3rd location... 
too much cross network synchronization [21:55:18] it seriously needs to be like what I said above, all localized to the target machine [21:55:35] even if that means major changes to pybal it's worth it [21:57:12] 10Continuous-Integration, 5Patch-For-Review: Switch MySQL storage to tmpfs - https://phabricator.wikimedia.org/T96230#1213920 (10Krinkle) [21:57:42] well, if the depooling is temporary, perhaps it's not too much overhead to simply let varnish probe it for a time [21:59:04] there's a bug for that somewhere [21:59:07] so, do we not have backend polling already? [22:00:06] varnish is just being used for the cache layer, not the load balancing, at least that's what I think I gathered from the wiki and ^d's statements [22:01:05] https://wikitech.wikimedia.org/wiki/LVS [22:02:00] ori joins, flames, then leaves [22:02:02] :) [22:07:57] !log Rebooted integration-slave-trusty-1015 (experimenting with libeatmydata) [22:08:00] Logged the message, Master [22:08:18] !log Rebooting integration-slave-precise-1013 (depooled; experimenting with libeatmydata) [22:08:21] Logged the message, Master [22:11:14] <^demon|away> What twentyafterfour said [22:11:23] <^demon|away> varnish is caching layer, LVS is load balancing [22:13:01] right. now that you mention it, i remember watching a talk on WP ops and LVS right after coming on last year [22:13:06] trying to find it... [22:13:11] so i'm going to look into what it'll take to retrofit pybal because this is pretty much critical no matter which deployment tools which choose [22:15:15] 10Continuous-Integration: Evaluate using libeatmydata for mysqld - https://phabricator.wikimedia.org/T96308#1214014 (10Krinkle) 3NEW [22:16:10] <^demon|away> marxarelli: LVS serves as a general-purpose load balancer. We also stuff things like Elasticsearch, DNS and other random services behind it. [22:16:18] so, if it goes lvs -> varnish -> apache, it seems like we could still effectively depool the app server at the caching layer. am i still missing something? [22:16:20] 10Continuous-Integration: Evaluate using libeatmydata for mysqld - https://phabricator.wikimedia.org/T96308#1214014 (10Krinkle) These two instances currently have libeatmydata installed and configured for mysql: * integration-slave-trusty-1015 * integration-slave-precise-1013 Observing their behaviour on Jenkin... [22:16:22] <^demon|away> Well, part of DNS maybe? [22:16:38] <^demon|away> marxarelli: lvs -> varnish -> lvs -> apache [22:16:43] <^demon|away> You want to depool at the 2nd lvs call [22:16:46] <^demon|away> (from the apache lvs pool) [22:17:01] durrr... got it :) [22:18:36] <^demon|away> twentyafterfour: The stupid easy way I thought of a couple of weeks ago to fix pybal here would be to write some dead-simple rest API around pybal and have that just write out the existing config file format [22:18:56] <^demon|away> Then pybal itself doesn't change, you just have a service for editing files but that seems stupid in retrospect [22:24:06] demon: I want to make it happen in real time without reloading entire config files [22:26:52] static files seem so wrong [22:29:26] <^demon|away> Yeah I don't disagree [22:34:03] geez guys, just `nc -kl 666 | while read cmd node; do sed ... && kill -HUP` already :P [22:46:28] marxarelli: you scare me sometimes [22:46:31] :) [22:47:26] and you haven't even seen my creepy face [22:47:56] (it's different from my usual face. 
creepier) [22:48:17] * greg-g shudders [23:18:35] PROBLEM - Puppet failure on deployment-memc03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:19:13] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:33:34] RECOVERY - Puppet failure on deployment-memc03 is OK: OK: Less than 1.00% above the threshold [0.0] [23:39:16] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0]
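Returning to the pybal depooling thread from earlier in the day: a sketch of the "local flag on the target machine" idea, where the deploy process touches a flag file on the host and the load balancer's health probe gets a 503 until the flag is removed. This is purely illustrative of the proposal; pybal/LVS do not work this way today, and the flag path and port are made up.

```javascript
// Minimal Node.js health-check endpoint for the "depool via local flag" idea.
var fs = require( 'fs' );
var http = require( 'http' );

var FLAG_FILE = '/var/run/depooled.flag'; // touched by the deploy tool, removed afterwards

http.createServer( function ( req, res ) {
    if ( req.url === '/healthz' ) {
        if ( fs.existsSync( FLAG_FILE ) ) {
            // Balancer probe sees 503 and drops the host from rotation.
            res.writeHead( 503, { 'Content-Type': 'text/plain' } );
            res.end( 'depooled for deployment\n' );
        } else {
            res.writeHead( 200, { 'Content-Type': 'text/plain' } );
            res.end( 'pooled\n' );
        }
        return;
    }
    res.writeHead( 404 );
    res.end();
} ).listen( 9090 );
```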