[02:11:15] HTTPS, Traffic, Operations, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2607883 (AlexMonk-WMF)
[02:13:36] HTTPS, Traffic, Operations, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (AlexMonk-WMF) >>! In T133548#2242401, @BBlack wrote: > According to [[ https://letsencrypt.org/upcoming-features/ | h...
[12:33:33] Traffic, Varnish, Operations, Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2608658 (ema) We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnish 3 and Varnish 4 through multiple layers of...
[12:44:07] ema: no more v4 in ulsfo?
[12:44:15] (upload cluster I mean)
[12:44:42] elukey: yep, we're planning to upgrade codfw first instead https://phabricator.wikimedia.org/T131502#2608658
[12:46:01] ah ok thanks!
[12:46:56] wow a lot of work :)
[13:24:35] <_joe_> indeed
[14:17:45] Traffic, Varnish, Operations, Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2180463 (fgiunchedi) >>! In T131502#2608658, @ema wrote: > We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnis...
[15:21:08] ema: for the ulsfo->eqiad part, can change routing in hieradata/role/common/cache/upload.yaml
[15:21:42] ema: I don't think we can actually do split active/active routing yet, with what we have today. best we can do is set both codfw and eqiad to 'direct' there, but they'll still go to a single side at the applayer.
[15:22:29] (which is fine for testing this, IMHO)
[15:24:50] bblack: right, because for the v4 upgrade what we really care about is having cache_upload codfw going straight to swift, regardless of where the appservers are
[15:26:44] yeah, it's a PII leak for codfw->appserver miss, but we've tolerated that temporarily before for e.g. codfw-switchover testing
[15:28:34] the patch series starting at https://gerrit.wikimedia.org/r/#/c/300574/ (which hasn't been rebased in a while, should really finish that set off...) is ~95% ready to go, and re-unifies backend routing stuff for all the clusters to be declarative with path splitting and force-pass support, etc...
[15:30:26] the idea is to get through that transition first, and then build on that plus the already-merged loop-protection... and make a patch that makes cache::route_table and cache::text::apps work a bit differently, where cache::route_table entries never contain 'direct' - they always point one DC at some other valid/logical backend DC. and the cache::foo::apps table defines backends available at one or more DCs.
[15:30:52] and then the routing logic automatically switches from 'best available next-DC' to 'direct' if an applayer backend is defined at the current DC.
[15:32:04] so in route_table, eqiad->codfw and codfw->eqiad (loop!), but so long as one or both have a direct backend defined, it will go where the applayer settings tell it to go (all one side or the other, or active/active split).
[15:33:02] that setup is slightly more dangerous in terms of bad sequences of puppet commits (or even single puppet commits) being able to create real loops, but again we already have anti-loop protection in VCL (which 503s any race-condition (or worse) requests that try to loop).
[15:35:20] bblack: https://gerrit.wikimedia.org/r/308582
[15:42:19] after that's fully rolled out, next step would be 'direct' for codfw, and then roll through upgrades there. it's the lightest DC in end-user terms, so less impact but probably also slower refill times without ulsfo helping to drive traffic through its backends.
[15:43:11] you may have to roll through wiping codfw backend storage, then separately codfw frontends, after the final node converts to v4, if there's any chance objects stored during mixed v3/v4 reqs pollute the caches with buggy objects.
[15:43:36] I'm out, holiday day here, good luck, call if something goes horrible :)
[15:44:00] bblack: oh, right! Enjoy your holiday day :)
[18:30:28] bblack: https://jve.linuxwall.info/blog/index.php?post/2016/08/04/TLS-stats-from-1.6-billion-connections-to-mozilla.org
[20:15:48] that's an interesting post, ori
[20:36:49] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2609746 (Nuria) > The problem is, however, that your "observations", the impressions, are not independent, because subsets of them are generated by the same users, and so >yo...
[21:32:22] the codfw upgrade is going fine, no 503 spikes
[21:36:57] CPU usage is also much better than it was in ulsfo
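The routing scheme bblack outlines between 15:30 and 15:33 can be modelled roughly as below. This is a hypothetical Python sketch, not the actual puppet hieradata or VCL: the table contents, the `next_hop`/`route` function names, the `misconfigured` service, and the visited-DC list used as a loop guard are all illustrative assumptions. It shows why a route_table that contains eqiad->codfw and codfw->eqiad is harmless as long as at least one of them has a direct applayer backend, since the local "direct" check wins before route_table is consulted.

```python
from typing import Dict, List, Set, Union

# Hypothetical route table: each cache DC points at another cache DC, never "direct".
ROUTE_TABLE: Dict[str, str] = {
    "ulsfo": "codfw",
    "codfw": "eqiad",
    "eqiad": "codfw",  # eqiad <-> codfw is a loop on paper
    "esams": "eqiad",
}

# Hypothetical apps table: DCs where an applayer backend is defined per service.
APPS: Dict[str, Set[str]] = {
    "swift": {"eqiad", "codfw"},  # active/active split
    "appservers": {"eqiad"},      # single side
    "misconfigured": set(),       # no backend anywhere, e.g. a bad puppet commit
}


def next_hop(dc: str, app: str) -> str:
    """Go 'direct' if the applayer is defined at this DC, else follow route_table."""
    if dc in APPS[app]:
        return "direct"
    return ROUTE_TABLE[dc]


def route(start_dc: str, app: str) -> Union[List[str], int]:
    """Return the list of cache DCs a miss traverses, or 503 if it would loop."""
    path = [start_dc]
    dc = start_dc
    while True:
        hop = next_hop(dc, app)
        if hop == "direct":
            return path
        if hop in path:
            # Stand-in for the VCL anti-loop protection: refuse to revisit a DC.
            return 503
        path.append(hop)
        dc = hop
```

For example, `route("ulsfo", "swift")` stops at codfw because codfw has a local swift backend, while `route("eqiad", "misconfigured")` bounces eqiad->codfw->eqiad and is cut off with a 503, mirroring the race-condition case mentioned at 15:33:02.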