[06:10:12] <wikibugs>	 10Traffic, 06Operations, 10Pybal, 13Patch-For-Review: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#3142801 (10ema)
[06:11:05] <wikibugs__>	 10Traffic, 06Operations, 10Pybal: Make PyBal respect advertised BGP capabilities - https://phabricator.wikimedia.org/T81305#3142806 (10ema)
[06:11:27] <wikibugs>	 10Traffic, 06Operations, 10Pybal: Add pybal check to ensure service IP is bound - https://phabricator.wikimedia.org/T79730#3142807 (10ema)
[07:24:47] <wikibugs>	 10Traffic, 06Operations, 10Pybal: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#3142889 (10ema) This problem should be [[https://github.com/twisted/twisted/commit/942b63cc04fba83dabf1958b3ed24af860778681|solved upstream]]. I've just finished upgradin...
[07:40:31] <ema>	 interesting, we've had a pretty steep 50x spike in esams text: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&var-site=esams&var-cache_type=text&var-status_type=5&from=now-1h&to=now 
[07:41:08] <ema>	 looking at 5xx.json, most of the errors came from cp3040's varnish-be
[07:41:48] <ema>	 between 07:06ish and 07:08 
[07:42:31] <ema>	 which is exactly when expiry mailbox lag spiked on cp3040: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops
[07:43:15] <ema>	 varnish-be is close to restart there (running since 6 days)
[07:44:42] <ema>	 the interesting part here is that cp3040 is a text node, we've usually seen these kind of issues happen more frequently in cache_upload
[07:47:03] <ema>	 oh, better to link the graphs with fixed timeframes 
[07:47:07] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&var-site=esams&var-cache_type=text&var-status_type=5&from=1490857016731&to=1490858103299
[07:47:29] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1490854951582&to=1490859915707
[07:48:57] <wikibugs>	 10Traffic, 06Operations, 10Pybal: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433#3142982 (10ema) 05Open>03Resolved
[08:00:52] <wikibugs__>	 10Traffic, 06Commons, 06Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3142987 (10Nemo_bis) > I don't think we've been aware of the uselang hack or its mechanics before  The documentatio...
[09:03:56] <ema>	 so, back to the pybal release process topic. We've got a fix in master (the IPv6 one in this case) that needs to be cherry-picked into 1.13 and release 1.13.6 needs to be cut out of it. I'd cherry-pick locally and then push to gerrit refs/heads/1.13 (instead of refs/for/1.13 as the change as been reviewed already and merged into master) 
[09:04:17] <ema>	 alternatively, I could push to refs/for/1.13 and create another CR
[09:04:46] <ema>	 at any rate, at a certain point we'll end up with the fix cherry-picked into gerrit/1.13
[09:05:09] <ema>	 how about the changelog update for the new release? Should that go to refs/for/master and then cherry-picked into 1.13 too?
[09:05:58] <ema>	 s/as been reviewed/has been reviewed/ a few lines above :)
[13:34:40] <bblack>	 _joe_: I'm still trying to think ahead about the next steps for re-structuring the varnish switching stuff (to get it sane + etcd-able).  We could just about use the same data as the internal service endpoints (the hieradata config + etcd data that dns-discovery uses), but my hangup is this:
[13:36:17] <bblack>	 _joe_: for the actual dns-disc stuff, we've decided (because it works) that services can flip from one side to the other without any kind of barrier between.  e.g. I could flip cxserver from eqiad-only to codfw-only by executing two serial confctl commands from automation, and it would happen faster than we can have any assurance that all consumers consumed the first update.
[13:36:58] <bblack>	 _joe_: for the cache case, I haven't found a solution where that doesn't cause a race condition and try to loop some requests.  I'm still think of ways to avoid that natively, but for now assume it's a constant.
[13:37:42] <_joe_>	 bblack: the only way would be to have two levers and pull them at the right time to avoid loops, I guess
[13:37:45] <bblack>	 _joe_: for the varnish-level stuff to do a smooth switch, there has to be an intermediate state of active+active in the middle when switching from only-A to only-B
[13:38:02] <_joe_>	 bblack: and that's ok I think
[13:38:17] <bblack>	 well we have those levers even now in dns-disc, but preventing the loops would be a matter of policy "do not ever do 'blah' (which is totally easy to do)"
[13:38:57] <_joe_>	 bblack: that's why I was suggesting to add separate levers for traffic :)
[13:39:12] <_joe_>	 what is specifically that we "should never do"?
[13:39:23] <bblack>	 I thought that was me suggesting that before and you asking why we couldn't reuse dns-disc :)
[13:40:16] <bblack>	 well I should qualify "never" - if we do it, all it does is create a temporary spike of 508 errors to the user.  it doesn't melt everything.
[13:41:03] <bblack>	 if serviceX is currently set to eqiad-only, doing a single commit that flips it to codfw-only causes the loop behavior
[13:42:09] <_joe_>	 well, given etcd only does atomic modifications of values in v 2
[13:42:18] <bblack>	 or in the future where this is remapped to etcd levers: flipping the two levers without waiting it out (technically, without ensuring consumption, but waiting is a decent proxy for that for now I guess)
[13:42:18] <_joe_>	 we can't just flip with one transaction
[13:42:24] <_joe_>	 it's always going to be two
[13:42:38] <bblack>	 yes
[13:43:01] <bblack>	 but two back-to-back before the effect of the first is in full effect, it's all the same from this pov
[13:43:02] <_joe_>	 bblack: so for active/active things, the point is "don't flip a dc to inactive too fast"
[13:43:03] <volans>	 and have a TTL sleep between the two flips
[13:43:15] <_joe_>	 after you brought up the other
[13:43:28] <_joe_>	 for active/passive, how will it work?
[13:43:53] <bblack>	 for active/passive they have to move through an active/active intermediate state for a smooth switchover
[13:44:05] <bblack>	 which is the whole multi-commit thing with 1 in the middle for MW-RO
[13:44:07] <_joe_>	 which is what we do in the case of mediawiki
[13:44:29] <_joe_>	 ok we can reason from this base
[13:44:46] <_joe_>	 what we were doing now was:
[13:45:24] <_joe_>	 1) read-only 2) set codfw mw to active 3) set eqiad mw to disabled 4) $stuff 5) read-only removed from codfw
[13:45:46] <bblack>	 ^ that's what you're doing for dns-disc level stuff for internal discovery right?
[13:45:50] <_joe_>	 the only thing we need to change is we wait ~ 1 minute or so between 2 and 3, If I get it correctly
[13:45:53] <_joe_>	 bblack: yes
[13:46:06] <_joe_>	 at 2) both dcs are active in etcd
[13:46:20] <_joe_>	 but dns will refuse to update and continue to point to eqiad
[13:46:26] <bblack>	 yeah if we wanted to share the same etcd data for MW's internal dns-disc and varnish traffic routing
[13:46:44] <_joe_>	 bblack: exactly, I'm trying to understand what's better
[13:46:49] <bblack>	 technically all that's needed between 2 and 3 is a gaurantee that all the varnish nodes consumed the update from 2
[13:46:53] <_joe_>	 if using the same data or separate them
[13:47:14] <_joe_>	 separation will give us larger operational freedom, but logically is... so-so
[13:47:19] <bblack>	 but I don't think we have a good way to know that, except maybe executing another cumin command across them that greps for the effect of the update or something
[13:48:30] <bblack>	 at least with puppet, even though it's slower, we know when (2) is applied
[13:50:41] <volans>	 if we use the "serial" like it was proposed for etcd we could check that each host got the new serial and loop/wait until all have it
[13:51:07] <_joe_>	 volans: or we md5sum the generated file :P
[13:51:13] <volans>	 not sure if it's an overhead for the discovery part, in that case we could check the generated config
[13:51:18] <bblack>	 then we just know that it changed, not why
[13:52:49] <bblack>	 mostly the indirect underlying reason for all of this is lack of varnish outbound TLS
[13:52:59] <bblack>	 once we're past that problem, there's other ways to make this go away
[13:54:07] <bblack>	 and of course all of this is about the problem of "how do we do smooth outage-free transitions for testing when both DCs are fine".  in a real scenario with a dead DC (or even a single service dead on one side) we wouldn't care and it wouldn't matter.
[13:56:25] <bblack>	 maybe there's a way to encode the data that makes the undesirable state impossible?
[14:04:21] <_joe_>	 sorry I'm doing some mw releases right now
[14:05:05] <bblack>	 np, this is all just future-looking stuff, not urget
[14:05:07] <bblack>	 *urgent
[14:05:28] <bblack>	 also I fixed up last night: https://wikitech.wikimedia.org/wiki/Global_traffic_routing
[14:05:49] <bblack>	 so that it describes how the new stuff works, along with the annoying "don't do this" bits, which are unfortunate
[14:06:52] <paravoid>	 oh that's nice!
[14:07:21] <bblack>	 yes it even has a new graphic.  graphics make up for the insanity of what they're describing :)
[14:07:51] <paravoid>	 that's kinda awesome
[14:08:07] <_joe_>	 bblack: do you happen to have the time for writing a page on dns discovery too? :P
[14:10:25] <bblack>	 well that one's pretty easy, because it's kinda sane :)
[14:16:22] <ema>	 oh wow that's cool :)
[14:21:46] <bblack>	 how the data is modeled for all of this (in hieradata for now) is still kinda "wrong".  it's mapped out in a way that makes more sense to the code than to the administrator
[14:22:14] <bblack>	 fixing that kinda happens automagically on the way to moving the switches to etcd though
[14:22:33] <bblack>	 (because the etcd part would really be switches, not commenting-out or altering data keys/values)
[14:26:49] <bblack>	 as a result of that, it's probably easier to understand everything on that page by explaining it in varnish decision-making pseudocode.  maybe I should add that there
[14:27:28] <bblack>	 basically when a request enters the (cache) box anywhere (from a user, or forwarded from another site), the pseudocode it follows is something like:
[14:28:08] <bblack>	 $which_app = parse_request($req, hiera('cache::req_handling'))
[14:29:04] <bblack>	 if (backends_list_of(hierata('cache::app_directors')).has_key($::site)) { send traffic to applayer at hostname specified for my $::site }
[14:29:23] <bblack>	 else { forward to hieradata('cache::route_table[$::site]') }
[14:29:48] <bblack>	 or something like that
[14:39:58] <bblack>	 https://wikitech.wikimedia.org/wiki/Global_traffic_routing#A_code-level_view_of_inter-cache_and_cache-.3Eapp_routing
[14:40:03] <bblack>	 ^ there, clearer
[14:40:17] <volans>	 :)
[14:45:08] <bblack>	 I'm so glad I wrote that down, the pseudo-code simplification, because it made me realize how to fix some logistical issues here
[14:45:42] <bblack>	 the essence of the loop-race is that each cache makes its own part of the decision independently while requests are flying around between them
[14:46:28] <bblack>	 so the way to avoid that is to make the decision only once for a given request.  the only place we know for sure a request passes through exactly once is the front edge when it first enters...
[14:46:57] <bblack>	 so clearly, move the entire routing logic to the front edge, have it iterate the data for the whole global route, and put that in a header the other caches consume rather than making their own decisions
[14:47:34] <bblack>	 so the front edge at, say, esams does the req_handling + app_directors + route_table logic iteratively, and emits something like:
[14:48:08] <bblack>	 X-Internal-Routing: cache-eqiad, applayer-eqiad
[14:48:19] <bblack>	 or X-Internal_ruting: cache-eqiad, cache-codfw, applayer-codfw
[14:48:23] <bblack>	 or whatever the case may be
[14:48:43] <bblack>	 the route is set in stone and reaches a valid destination when first evaluated
[14:49:03] <bblack>	 the async update of all the front edge thus doesn't create loops, because the actual decision process is not distributed in nature
[14:51:15] <bblack>	 I'll have to think on that and come up with a data model and header-encoding and such that actually works.  but now I think that's the right step between where it's at now and moving to etcd (and carefully splitting what's config data in hieradata and what's dynamic switches from etcd)
[14:53:20] <bblack>	 of course, the whole thing where any change is necessarily async, and therefore all services will get temporary active/active traffic for a brief window when switching from one to the other is unavoidable by its nature
[14:53:22] <volans>	 yes, the only con I can see is that if something happens during the flight time from the external edge and the applayer that force a topology change that request will not take advantage of the change and will probably end un failing
[14:53:56] <bblack>	 it just makes the window briefer, and we don't have to worry about staging out multi-stage changes to avoid loops
[14:54:04] <volans>	 but I guess that this flight time is in nature smaller than the propagation of the topology change
[14:54:46] <bblack>	 volans: yes.  but also when it's a voluntary change (testing, maintenance), that toplogy-race won't actually fail reqs, it just causes the brief active/active window from the app's POV
[14:55:09] <bblack>	 and when it's an involuntary change, things are already breaking and the etcd update is moving in the direction of unbreaking them
[14:55:35] <_joe_>	 that's pretty awesome :)
[14:55:50] <_joe_>	 we did some awesome work this quarter, I have to say
[14:56:11] <volans>	 yeah, and that should also allow to easily have the case of failoid in case we need it, just treating it as possible applayer I guess
[14:56:15] <_joe_>	 we have an automation framework idea and implementation that will make it possible to automate complex tasks
[14:56:21] <_joe_>	 we have a discovery system
[14:56:40] <_joe_>	 we have conftool that is way more flexible and can use arbitrary schemas
[14:56:54] <_joe_>	 and we have most of this integrated in puppet/mediawiki
[14:57:15] <_joe_>	 (and the traffic stuff we were discussing, of course)
[14:57:21] <bblack>	 (to be clear, with the pre-routing at the edge above, I expect all of that to be etcd-controllable in the end-game.  etcd can re-route inter-cache routing as well, etc)
[14:58:24] <bblack>	 somewhat analogous to my constant iterative remodeling of the data for varnish, I think at some point we're going to want to remodel how we do the etcd switches for this stuff, too.  We've talked before about having multiple hierarchical levels of switches.
[14:59:02] <bblack>	 as in, somewhere there should be a single global switch that says "disable all things in eqiad, because it's borked", which is logically combined with the finer-grained switches for smaller-scope voluntary maintenance/testing, etc
[14:59:47] <bblack>	 and that global switch can flip everything at once: it shuts off geodns user routing to eqiad, shuts off inter-cache routing through eqiad, shuts off active/active services in eqiad, etc
[15:02:33] <bblack>	 randomly rambling into some of the future stuff about handling a real core site outage:
[15:03:07] <bblack>	 the biggest problem is once we declare a site dead (because its power completely failed, or it seems tragically disfunctional but still kinda reachable, etc)
[15:03:23] <bblack>	 we have to pursue two things in parallel:
[15:03:37] <bblack>	 1) Flipping switches to disable it everywhere in etcd
[15:04:12] <bblack>	 2) Isolating it, because we can't assume it got the etcd updates properly or whatever, and it might bring itself back online (or worse, flap on and off) and wreak havoc / cause split brain.
[15:04:41] <bblack>	 the isolating part is rather difficult
[15:04:45] <volans>	 yeah, probably at routers level?
[15:04:54] <volans>	 assuming we can reach them ofc
[15:05:06] <bblack>	 so at the routers level, we could certainly go around to every still-alive site's routers and disable our transport wan links to it
[15:05:30] <bblack>	 that cuts it away from the rest of our internal infra
[15:06:00] <volans>	 but they could still announce IPs if flapping/half up
[15:06:06] <bblack>	 but if it flaps back online and still on old config, its own routers will advertise public IPs to the world, including its own DNS still advertising those IPs (it never got the etcd update to disable them)
[15:06:13] <bblack>	 it will suck users into its broken self
[15:06:19] <ema>	 at the cache level we could flip the equivalent of cache::traffic_shutdown
[15:06:28] <bblack>	 if we can even reach it
[15:06:39] <volans>	 the solution for this is the old task to make DNS to be anycast
[15:06:51] <volans>	 partial solution
[15:06:55] <bblack>	 even anycast DNS doesn't fix it though
[15:07:02] <volans>	 yeah if they announce the IP
[15:07:07] <volans>	 will be back sucking traffic
[15:07:23] <bblack>	 (right, including DNS traffic which still answers with its own IPs that are disabled elsewhere)
[15:07:52] <volans>	 our NS have 1d TTL now
[15:07:58] <bblack>	 so that's the really thorny issue I don't know a great answer for
[15:08:32] <bblack>	 (and no, you can't turn down NS TTLs, it's bad practice.  also eventually those NS address will be multicast anyways, making it not a way to shut out a nameserver)
[15:08:47] <volans>	 I was not suggesting that ;)
[15:08:48] <volans>	 I know
[15:08:50] <bblack>	 ok
[15:09:19] <bblack>	 well anyways, my best guess is the answer lies at the BGP routing level, but I don't know the details well enough
[15:09:34] <volans>	 I was thinking if we could setup a secondary anycast IP with just the other DCs and change the NS record, but make sense only in a real DC-is-gone scenario
[15:10:11] <volans>	 where that DC is really gone, but in that case if it's really gone the problem is auto-resolved
[15:10:23] <bblack>	 basically we need the logical equivalent of "all our sites advertise to public BGP peers with metric X in normal config.  and when we decide we need to isolate a site, we bump the metric priority our routers have at all the other sites, and advertise the dead site's space from there"
[15:10:42] <bblack>	 but I think there's no metrics like that in public routing, except artificially inflating the AS-path
[15:11:06] <bblack>	 (e.g. listing your AS twice, so when you want to take over you can later advertise a shorter path without the duplicate?)
[15:11:26] <bblack>	 these are good things for our new network engineer to think about maybe :)
[15:11:33] <volans>	 eheheh
[15:11:39] <volans>	 indeed
[15:12:30] <bblack>	 my beef with the DNS-y solutions is "change the NS record" isn't something that can be done in reasonable time
[15:12:42] <bblack>	 (in response to an outage)
[15:12:45] <volans>	 yep
[15:13:14] <bblack>	 I guess there are technically out-of-band solutions to the problem too
[15:13:41] <bblack>	 we could call up the network service providers that give us transit in eqiad and ask them to disable us there or stop routing our space from there or whatever
[15:13:53] <volans>	 but I could reply that the quickest solution to a DC half dead that we want  really dead is to have physically disconnect the uplink cable(s) ;)
[15:13:58] <bblack>	 but then there's exchange peering too :)
[15:14:24] <bblack>	 yeah that's true.  if we assume our core site DC ops can still get in there, they can unplug cables
[15:14:41] <bblack>	 you'd think either they can, or the site is so destroyed that nothing's going to flap back up, but I donno
[15:14:52] <bblack>	 there's probably inbetween scenarios too
[15:15:47] <bblack>	 I can totally imagine a site power failure where the front desk isn't letting anyone in until they restore their generator power or whatever
[15:16:06] <bblack>	 and our dc ops is standing there outside the front door going "but I need to disconnect that cable before that power comes back on :P"
[15:16:38] <volans>	 sounds surely a possible scenario
[15:18:44] <bblack>	 the ideal is some way for our living, connected sites to tell the internet "don't trust our dead site, even if it comes back online later"
[15:18:52] <bblack>	 there must be some answer to that problem in the BGP world
[15:20:36] <ema>	 I'm sure the new netops engineer won't mind answering that as a distraction from the onboarding tasks :)
[15:21:32] <bblack>	 gehel: that reminds me, maps is another potential active/active candidate.  is it currently capable in the long-term?
[15:30:49] <gehel>	 bblack: in meeting, back to you in a few... but yes, maps should be active/active ready
[16:01:40] <gehel>	 bblack: so back to maps... the 2 clusters (eqiad/codfw) are serving identical traffic, completely independent, ... should be all good for an active / active scenario.
[16:02:36] <gehel>	 There is probably something that will break, but I can't think of what ...
[16:04:37] <bblack>	 gehel: well we can turn it on now, or we can turn it on later :)
[16:04:57] <gehel>	 are we ready in term of traffic?
[16:04:59] <bblack>	 gehel: really depends on your comfort level and risk-aversion or whatever
[16:05:08] <bblack>	 gehel: yes
[16:05:34] <gehel>	 bblack: I'll want to check with the rest of the team first, at least give them a head's up, but yeah, that sound like a good plan!
[16:05:59] <gehel>	 And do some testing on codfw. As the passive cluster, it does not see as much attention as it should...
[16:06:29] <bblack>	 ok, I have the change prepped at: https://gerrit.wikimedia.org/r/#/c/345591/ .  Just +1 it when we reach the point in time where it's sane to merge it.
[16:07:12] <gehel>	 bblack: kool! Thanks a lot! I've been waiting for that one!
[16:07:21] <gehel>	 should we plan the same for wdqs?
[16:08:17] <bblack>	 gehel: that's been on the back of my mind.  there's 30x different backend appservices in cache_misc, and I know at least a few of them have indicated they're ready in the past when the traffic infra wasn't
[16:08:56] <bblack>	 but I don't even have a good list of who asked before, and of course it's expected that saying "Yeah my app is active/active ready when you are" is a little different than saying yes to "I'm going to turn that on now, welcome to your new world"
[16:09:24] <bblack>	 so I do kinda have to go back through them all on a case-by-case basis and ask whoever would know about that particular app
[16:09:32] <gehel>	 at least a final check before sending real users is a must!
[16:09:49] <gehel>	 I'll check with maps and wdqs teams and get back to you.
[16:09:52] <gehel>	 Great work!
[16:09:56] <bblack>	 thanks!
[16:12:46] <bblack>	 I guess a good first pass at candidates would be to see which cache_misc services seem to use standard cluster naming and have a similar cluster in codfw
[16:13:43] <bblack>	 e.g. contint1001 in current cache_misc backends has a matching contint2001 host in DNS, so it's definitely one to ask someone about.  they may come back and say that's active/passive failover only, though.
[16:19:38] <ema>	 bblack: all backends in misc.yaml that don't end with 'ium' basically? :)
[16:20:08] <bblack>	 :)
[16:21:03] <bblack>	 svc in the hostname is a good indicator too
[16:22:16] <bblack>	 but like, "bromine" that runs a bunch of static sites like 15.wp or annualreport, I have no idea offhand if there's a codfw equivalent to bromine
[16:23:52] <ema>	 it seems to be the only host in site.pp with role 'webserver_misc_static' so I guess there's no equivalent?
[16:27:38] <bblack>	 yeah
[16:28:08] <bblack>	 I bet that one's easy to fix though, I think it's just a ganeti node that's static-configured from puppet and git repos
[16:59:19] <wikibugs__>	 10Traffic, 06Operations: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144203 (10EBernhardson)
[17:00:38] <wikibugs>	 10Traffic, 06Operations: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144218 (10EBernhardson)
[17:01:59] <wikibugs>	 10Traffic, 06Operations: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144203 (10EBernhardson)
[17:03:11] <wikibugs>	 10Traffic, 06Operations, 10Wikimedia-Logstash: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144248 (10EBernhardson)
[17:24:31] <wikibugs>	 10Traffic, 06Commons, 06Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3144389 (10matmarex) I think that's a different uselang hack.
[17:32:34] <wikibugs__>	 10Traffic, 06Commons, 06Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3144415 (10Steinsplitter) >>! In T161517#3144389, @matmarex wrote: > I think that's a different uselang hack.  Yepp...
[18:16:45] <volans>	 bblack: a bunch of random thoughts about the "tracking" header for the edge decision making process...
[18:17:05] <volans>	 1) It's probably important to keep track of it during the whole path, hence I would avoid to have each layer do a pop(0) of themselves
[18:17:16] <volans>	 2) I thought of having 2 headers one with the next hop and one with the whole path, but I don't like it too much
[18:17:36] <volans>	 3) I think I'd prefer instead to have a "marker" (like *) in the header for the next hop. Assuming cp3010 is the first and make the path decision something like (comma/space/whatever separated):
[18:17:56] <volans>	 cp3010 *cp3020 cp1050 api-appservers
[18:18:49] <volans>	 4) to be decided if a host gets the request and the hop don't match itself if it should reply with a 5xx, probably yes at this point given that it doesn't have the logic to decide.
[18:19:05] <volans>	 Even if somehow it has the object in cache? Probably it will not even look for it...
[18:19:19] <volans>	 5) goes without saying that the edge layer must overwrite the custom header in case it is already present
[18:23:59] * volans bbl
[18:24:09] <bblack>	 volans: if you mean tracking it during the whole path for later debugging or whatever, we already have X-Cache and X-DCPath tracking that in two different ways, so we have plenty of analysis on where things went even if we pop items from the routing decision
[18:27:30] <bblack>	 re: 4 - we can check the cache either before or after the routing decision (currently it's after for hysterical raisins, but it would be more efficient to put it before, but that's just a separate optimization)
[18:28:44] <bblack>	 re: the other part of 4, I don't think the receiver needs to look for itself.  what it does need, though, is the ability to sanity-check the decision its being asked to make
[18:29:44] <bblack>	 because we can't actually pass off an arbitrary direct instruction like "connect to applayer at foo.svc.eqiad.wmnet:9330" or "connect to a cache you've never heard of in datacenter cosin"
[18:30:39] <bblack>	 whatever the header instructs the cache to do as its next hop, that next hop needs to already be defined (in the relatively-static puppet-driven config) as a named backend of one kind or another, and we're just passing the name of the pre-defined backend.
[18:31:54] <bblack>	 (and thus, it also goes without saying that one has to be careful about timing interactions between configuration updates and state updates.  don't merge the definition of a new applayer backend through puppet and then immediately try to instruct cache routing via etcd to use it.  that's not gonna fly until you're sure puppet ran everywhere first)
[18:38:40] <bblack>	 also I was kind of mentally cheating earlier (in the wiki page and the diagram and in talking here) and mixing up the concepts of the frontend and backend cache within each site
[18:38:48] <bblack>	 just to make explanation simpler
[18:39:31] <bblack>	 but the actual "frontend" edge cache wouldn't be making this decision, at least not at this phase in our overall evolution.  it always forwards misses and passes the local backend cache in the same DC
[18:40:09] <bblack>	 the local backend cache at the DC we entered through is the one making the decision (or acting on received routing information from a remote backend cache)
[18:41:47] * volans mostly still afk, but yeah, make sense. For the header I was thinking in case we would consolidate them
[18:42:13] <bblack>	 and I think honestly the transmitted (via a header over the wire) route, even with asia in play and using ulsfo as a backing cache for it, is fairly short
[18:42:30] <bblack>	 in many cases there will be no reason to set the header at all (when the request arrives at the same DC as the applayer it would be routed to)
[18:43:20] <bblack>	 for traffic entering ulsfo or esams, they'll forward directly to either eqiad or codfw and only send it instructions to use the local applayer
[18:43:44] <bblack>	 for traffic entering in asia, it might forward to ulsfo with instructions to contact either eqiad or codfw and then the local applayer
[18:44:12] <bblack>	 so we're looking at a header that for the foreseeable future contains only 1-2 items in it
[18:45:45] <bblack>	 (well, perhaps is constructed with up to 3 items when it's generated, but then the first is popped off locally before transmission)
[18:46:21] <bblack>	 the worst part is translating some of these things to VCL is difficult because it's so limited as a language
[18:47:18] <bblack>	 (it has no scoping to speak of, no local variables, no loops or iterators, no data structures, no subroutines that take proper arguments, and no subroutines that actually return to their callsite)
[18:47:58] <bblack>	 (or return values obviously, but there's no returning upon which to miss the lack of a return value!)
[18:49:36] <bblack>	 so translating sane pseudocode for anything complicated into runtime VCL code is a nightmare.  it's like doing a code translation from python to a series of soduko puzzles whose concatenated solutions numerically encode the answer :P
[18:50:31] <RainbowSprinkles>	 "it's like doing a code translation from python to a series of soduko puzzles whose concatenated solutions numerically encode the answer" Hah!
[18:50:37] <volans>	 rotfl
[18:53:48] <bblack>	 for all of that, it is a powerful way to do common minor manipulations of HTTP traffic at lightning fast speeds (as it compiles directly to C code fragments)
[18:54:08] <bblack>	 I just don't think at the start of the design process anyone expected it to be used in such crazy general-purpose ways
[18:59:55] <bblack>	 hmm always I was wrong earlier that we could defer the backend decision until after checking the cache
[19:00:04] <bblack>	 there are edge-cases that make that not possible in the general case
[19:00:38] <bblack>	 (because pipe-traffic and asynchronous stale-while-revalidate refreshes take odd codepaths where our last chance to set the backend is earlier-on)
[19:04:14] <volans>	 mmmh I guess things we discussed with geh.el some time ago for example
[19:05:13] <volans>	 what if instead we add a loop detection mechanism to avoid the issue? keeping the current decision in each place logic, but keeping track of the path already done, we forbid some looping paths
[19:05:49] <bblack>	 we already have that, that's what X-DCPath is used for (loop detection)
[19:06:21] <bblack>	 because of that, instead of the async rollout of the current config the "wrong" way causing a loop-storm, it just causes a bunch of user-facing "508 Loop Detected" errors
[19:06:34] <volans>	 ah ok
[19:06:47] <bblack>	 the whole thing about stepping through active/active instead of an immediate A/B switch is to avoid the 508s now
[19:07:49] * volans bbiab sorry
[19:08:34] <bblack>	 https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb#L88
[19:10:20] <bblack>	 basically X-DCPath records each cache (backend) DC we hit on the request side as we traverse the request inwards, and X-Cache records the specific hosts on the way back out on the response side
[19:12:22] <bblack>	 if it weren't for X-DCPath loop-checking, the infinite loop would happen purely on the request side.  We'd just keep bouncing the request back and forth between the core cache DCs without ever generating a response, until something became horribly broken enough to break everything heh
[19:14:54] <volans>	 and I guess having the possibility to manage 2 configurations with a serial is not an option right?
[19:23:14] <bblack>	 varnish does keep recent configs and assigns them uuids or hashes or whatever
[19:23:43] <bblack>	 but that sounds like a nightmare to manage too (using it to control the rollout)
[19:24:11] <bblack>	 you can't escape that the update on N cache hosts in different DCs will always be async from the POV of thousands of in-flight requests at all times
[19:24:19] <bblack>	 you can only minimize the asyncness :)
[19:27:55] <bblack>	 508 isn't ideal as an error code as it comes from WebDAV, but it seems less confusing than yet another random source of 503s that can't be told apart
[20:03:28] <volans>	 yeah, I was thinking that if we had access to the previous config with an ID and we add the ID of the config with wich it was generated we could apply the previous config or the current one based on this value. Of course assuming we could always update the config in a clear bottom-up way
[20:04:00] <volans>	 otherwise requests with the new ID might arrive to backends that don't have yet that config
[20:21:49] <gwicke>	 we are preparing a blog post announcing the REST API 1.0 release, and are mentioning our global caching infra & geodns
[20:22:17] <gwicke>	 currently, we show a modified version of em.a's geodns heat map from the varnish talk, and link to https://wikitech.wikimedia.org/wiki/Global_traffic_routing
[20:22:18] <bblack>	 volans: and also, there is no bottom :)
[20:22:52] <gwicke>	 is there another, better entry point where API users and devs could learn about our caching infrastructure at a high level?
[20:23:31] <bblack>	 not really
[20:23:37] <volans>	 bblack: yeah, in particular for active-active :D
[20:23:46] <bblack>	 there's the various other side-links there on wikitech
[20:24:15] <gwicke>	 yeah, the sidebar has good info for those interested
[20:24:55] <gwicke>	 Cool, we'll stick with that page then. Thanks!
[20:25:06] <bblack>	 for the most part, hopefully most people don't have to know about our caching infrastructure very much.  we're trying to keep that level of detail abstracted away as much as possible.
[20:25:28] <bblack>	 but service developers certainly have things they need to know, most of which aren't in wikitech anywhere good :)
[20:25:43] <gwicke>	 yeah, definitely not users, but it's always interesting to developers to learn about how a really large site like Wikipedia is scaled behind the curtains
[20:26:15] <gwicke>	 and we talk quite a bit about the benefits of caching integration in that post
[20:26:18] <bblack>	 for service developers, we've talked before about writing what I've been calling a "Traffic Contract"
[20:26:33] <bblack>	 which is an attempt to document everything they need to know about the traffic<->application boundary
[20:26:49] <bblack>	 what standards it supports, what exceptions and gotchas exist, what headers they should set to control what things, etc, etc.
[20:26:57] <gwicke>	 https://docs.google.com/document/d/18pPv4VheJN4sGtu3VBp7cN1eLQ_sndCBlnVK58oRqZw/edit#  <- draft
[20:27:14] <bblack>	 we talked about it a bit at the last ops onsite, but haven't found time to honestly work on it
[20:28:39] <gwicke>	 yeah, that sounds useful
[20:29:00] <gwicke>	 to some degree we in services abstract / consult on some of those issues for backend service developers, but right now we don't have something we could point them to for background
[20:29:31] <bblack>	 yeah it's tempting to just say "it's HTTP!" and be done, but there are so many little details...
[20:29:56] <gwicke>	 for most end points we control cache headers and -invalidation
[20:29:59] <bblack>	 and just writing it helps to find the weak-points too
[20:30:12] <bblack>	 like right now, one of our main pain-points is cookies
[20:30:43] <gwicke>	 yeah
[20:30:52] <gwicke>	 those are tricky, especially with gadgets in the mix
[20:30:53] <volans>	 bblack: I'm adding the discovery entry for rendering.svc.{eqiad,codfw}.wmnet for swift, should I consider it active/passive?
[20:30:56] <bblack>	 (we don't have a good way to vary on just certain cookies, because we don't have the X-Vary-Options we used to under squid)
[20:31:39] * gwicke got to run to a meeting
[20:31:57] <bblack>	 volans: don't we already have that?
[20:32:19] <volans>	 oh right, in the active/active block
[20:32:26] <bblack>	 no I mean...
[20:32:44] <bblack>	 there's some historical naming confusion, but "rendering.svc" == imagescales
[20:32:47] <bblack>	 *imagescalers
[20:33:02] <bblack>	 the discovery hieradata already has:
[20:33:05] <bblack>	   imagescaler-rw:
[20:33:05] <bblack>	     lvs: rendering
[20:33:05] <bblack>	     active_active: false
[20:33:05] <bblack>	   imagescaler-ro:
[20:33:07] <bblack>	     lvs: rendering
[20:33:10] <bblack>	     active_active: true
[20:33:15] <volans>	 ahhh ok
[20:33:24] <volans>	 I was looking at the dns part first
[20:33:33] <volans>	 didn't know the historical equivalence :)
[20:33:39] <bblack>	 so yeah that's a bit confusing, since the .discovery hostname is not the same as .svc
[20:34:07] <bblack>	 I don't know which name you want to call authoritative at this point, but it would be nice to clear up that confusion
[20:34:27] <volans>	 that would be nice yeah
[20:34:30] <bblack>	 or maintain the status quo and change the left hand side of the dns records to be rendering-rw and rendering-ro :)
[20:34:37] <volans>	 eheheh
[20:34:48] <volans>	 in swift::proxy::rewrite_thumb_server I guess I should use the -rw right?
[20:35:01] <volans>	 (hieradata)
[20:35:14] <bblack>	 I guess so?
[20:35:35] <bblack>	 I'm honestly not sure, it's all on the other side of a barrier I usually try not to stare across too hard :)
[20:35:51] <volans>	 lol
[20:36:12] <bblack>	 (not that things aren't saner there than where I live, but my brain can only contain so much complexity!)
[20:36:15] <volans>	 I'll double check with filippo tomorrow then, but being a record that needs to be changed during the RO period of the switchover I guess is active/passive
[20:40:05] <volans>	 and ofc thanks!
[21:16:04] <wikibugs_>	 10Traffic, 06Operations, 10Wikimedia-Logstash: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3145357 (10BBlack) Ok, I was wrong in my initial thinking.  Even though we configure `proxy_buffering off;`, `proxy_buffer_size` is still a factor.  Technicall...