[09:43:21] 10Traffic, 06Operations, 13Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2552754 (10ema) 05Open>03Resolved The high rate of failed incoming TFO connections in esams seems to have stopped since [[ https://gerrit.wikimedia.org/r/297418| we switched th... [09:45:33] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2552771 (10ema) p:05Triage>03Normal [11:52:54] paravoid: \o/ [11:57:28] ema: re: the range-pass patch: _miss is the same on both now, so move it to -common maybe? also, you'd want to get rid of the hash_ignore_busy stuff in the backend _recv. [12:05:51] ema: probably get rid of the std.log() in there too, there's little point in spamming a log about it. [12:27:21] bblack: good point! [12:29:40] bblack: failed inbound TFO connections in esams stopped since the switch to sh. Interestingly, it looks like they were only happening Mon-Fri, not during the weekend... [12:47:27] ema: ok cool. the TFO graphs still make little sense to me. I added another graph at the bottom that doesn't distinguish DCs just to play with how to look at those stats, but it's still not right [12:48:43] but it still ends up with strange spiky-looking data, and I haven't really tried to dig through the layers on how that data goes from kernel->grafana and find out why/where. [12:48:53] bblack: yeah something must be going wrong at some point with diamond perhaps? There are a few spikes that make little sense [13:22:40] ema: there's one more related bit in the range-pass stuff that I don't understand, but maybe not for this patch [13:23:07] ema: in upload-backend vcl_backend_response, the bit with: [13:23:10] if (beresp.http.Content-Range ~ "\/[0-9]{8,}$") { [13:23:21] ... stream+hfp [13:23:56] sorry that's in upload-frontend, not backend [13:23:59] but still [13:24:21] the stream is fine for v3 for now, but why do we hit-for-pass there when the frontend already passes all range reqs? [13:27:01] bblack: have we always passed range requests on the frontends? Perhaps the hfp part comes from a past where we didn't necessarily always pass? [13:27:32] yeah that's my best guess too, but I haven't dug through the history [13:27:46] in any case, it seems functionally redundant today [13:28:36] oh the history is simple, and it's not that, because they come from the same commit heh [13:29:01] mmh [13:29:15] https://gerrit.wikimedia.org/r/#/c/29293/5/templates/varnish/upload-frontend.inc.vcl.erb [13:30:02] oh, nevermind. this is all in my head [13:30:27] the range-pass stuff is only on miss [13:30:51] so it does make sense that there are range responses which are not pass-traffic, at least not for the same reason [13:31:42] oh, meaning that in case of a range request for an object in cache we don't pass? [13:32:48] yeah but it still makes no sense, as those wouldn't be invoking vcl_fetch [13:33:11] well vcl_backend_response either [13:33:17] hmmmm [13:33:34] right, if it's a hit vcl_backend_response/fetch doesn't get called anyways [13:33:48] even in the context of the original patch, I don't see how you'd ever hit that hfp without already being "pass" from the vcl_miss code, unless the backend was sending Content-Range headers without being asked [13:34:27] bblack: note the "Varnish itself doesn't ask for ranges" comment, perhaps we're missing something there? [13:35:22] I think it's just an implication that varnish would never issue a Range: request to a backend ever on its own.
That only happens from our own VCL hacks. [13:35:30] which is true, both for v3 and v4 [13:36:00] and since content-range is only a response to range, therefore if we see a backend response with content-range, it was the result of us hacking a Range: header through to the backend [13:36:07] which we only do while also doing (pass) [13:36:46] it's also size-based too, which is curious [13:37:01] it's not like it's hit-for-pass on all range responses, just ones with large total size [13:38:36] assuming there's no "good" reason for that hfp, there's definitely a "bad" fallout: there might be, say, a 20MB object that the frontend would normally cache (for non-range requests), and hit range responses on. but because of that block, if a range request on it misses, we'll hfp it for future full requests that could bring it into cache, too. [13:39:48] that might explain the reason for it, really, because we still have the problem that our range-passes are only on "miss", and sometimes we may "hit" an object that's not fully loaded yet from a full req, and stall [13:40:30] it still seems like it's in the wrong place or something, though [13:42:04] if we were doing a pass-on-Range in vcl_recv instead, none of this would matter. it's the fact that it's only on miss that makes some of the other subtleties still matter. [13:42:42] bblack: should we do that in _recv then? [13:43:28] s/do that/pass/ [13:43:38] I donno. it makes VCL much simpler, but it's kind of awful to pass those when we had the full file in cache to answer from, too [13:44:09] yeah [13:44:50] maybe the right answer (as universal behavior in both layers) is to hash_ignore_busy all Range reqs in vcl_recv, and then pass them in vcl_miss (and not do any other funny stuff with hfp objects on them) [13:45:01] it's slightly better than just passing them all in recv, and avoids stalls [13:47:07] hash_ignore_busy will let range reqs hit on fully-loaded cache objects, but they won't stall on a partially-loaded one, which turns them into miss->pass [13:47:30] and we can still kill the ugly hfp and its possible downsides for the next 10 minutes of related traffic [13:48:18] (well, I said 10 minutes out of habit because that applies to most text hfp. in this case it's default_ttl or whatever + capping) [13:51:06] we already have a high-range hash_ignore_busy in backend's vcl recv, too [13:51:58] maybe move that to common_recv with no bytes checking [13:52:04] and then kill the hfp [13:54:55] could fix this layer of the problem in a separate commit, though [13:55:08] either way, the first commit that's already up is a correct start in the right direction [13:56:36] oh wait, the current patch kills the existing high-range hash_ignore_busy already too, so maybe do it all in one go [13:59:12] so basically on top of what the patch currently does we also need to hash_ignore_busy range requests in vcl_recv and kill the hfp? [14:00:48] yeah. hash_ignore_busy all range in the common recv that's already there, and get rid of the frontend backend_response hfp that's inside of if (beresp.http.Content-Range ~ "\/[0-9]{8,}$") { [14:00:59] (but I guess keep the do_stream for now, it can die with v3) [14:03:17] we'll get to kill a number of those do_stream conditionals once we're past rolling back to v3 [14:06:04] and then we can push your patch for v3 VCL for now and see if it has any real/notable impact on cache stats, and know that nothing fundamentally changes about this on the v4 switch.
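To make the agreed plan concrete, a minimal sketch of the shape being discussed — the sub names are illustrative only and the do_stream handling is omitted; this is not the actual puppet-templated VCL:

    # called from both frontend and backend vcl_recv: let Range requests
    # hit fully-loaded cache objects, but never queue behind a busy
    # (still-streaming) object
    sub cluster_common_recv {
        if (req.http.Range) {
            set req.hash_ignore_busy = true;
        }
    }

    # called from vcl_miss: a Range request that didn't hit goes straight
    # to pass, with no hit-for-pass object created for it
    sub cluster_common_miss {
        if (req.http.Range) {
            return (pass);
        }
    }

With something of that shape in place, the Content-Range-based hfp block in the frontend's vcl_backend_response can go away entirely, as discussed above.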
[14:06:19] (except that things will get more efficient from universally stall-free do_stream) [14:08:02] I'm guessing we'll see some small changes to miss/pass/hit percentages, but it's hard to say exactly how it will end up in the net. Some of these changes may trade existing passes for hits or true misses. some may trade passes for hits. [14:08:24] depending on context/timing/requests. but it shouldn't be huge in any case [14:09:15] some of it might turn existing hits into passes, too, I meant to say at the end there two lines above :) [14:09:57] but either way range requests are small-potatoes statistically, and if anything we're removing some possible ways they can stall a little, in exchange for sending more of them straight through to swift. [14:11:07] yep. It will be interesting to see if/how the graphs change with this [14:12:04] probably not that much, we're talking < 0.3% of all requests [14:12:33] in some cases the old VCL has some overlapping corner cases with non-range requests, though. [14:12:50] I'm hoping the avoidance of creating those hfp objects improves hitrate on non-range requests, basically. [14:13:05] let's see! [14:13:24] (but with only a little under 3% of requests making it to swift anyways, the shift there would be hard to see as well) [14:13:41] it might shift the fe/be hit percentage around too, that's a bigger thing [14:14:08] it may take days for all the related old hfp objects to expire out, though. [14:15:07] merged [14:16:29] so, assuming this looks fine, we're pretty much done with any VCL-mucking we need to do before taking steps towards cluster v4 conversion, right? [14:17:13] bblack: yes, VCL-wise we should be good to go [14:17:29] well and of course there's the s/persistent/file/ bit [14:17:50] we could do that transition pre-v4 as well, though. [14:19:02] only cache_upload makes such significant/constant use of local-backend storage. on any other cluster I would just wipe it quickly while the frontends stay up. but here we probably want to spend a little time rolling through them on wipe->convert. [14:19:52] (and maybe give a few more days for any further discussion/counter-argument on T142848 ) [14:19:53] T142848: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848 [14:23:34] paravoid: the India shift looks like ~3.5% of total global reqs moved from eqiad->esams, at least at this time of day [14:23:51] nice [14:24:03] I was looking at the esams traffic graphs, it didn't look like it made a big difference [14:25:21] how did you calculate that? [14:25:24] grafana or oxygen logs? [14:25:42] you can see it better in the total req counts (top graph) of https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes, if you switch between eqiad and esams views in the dc dropdown. [14:26:25] but you have to make some minor adjustments in looking at the ~15 minute move window, because there are existing trends significant to that level at this time of day too (eqiad on rise) [14:26:44] nod [14:26:48] but ~3.5% or so seems ballpark right for what happened "at the moment" over those 15 minutes. [14:26:54] it's clearly visible though [14:27:05] awesome :) [14:28:28] what's the reasoning behind I1a1e19c1d0f0666969fdddd435348d721a0c3796 ? 
[14:28:35] ("config-geo: list all DCs in failover lists for completeness") [14:29:07] well, completeness :) [14:29:39] the fact that eqiad/codfw are "primary" and we can't live without them if they're truly-dead, doesn't mean we wouldn't ever want to shift edge traffic away from them while the applayer is still up [14:29:53] this gives us the option to mark any combination DOWN and still have all traffic go *somewhere* intentionally. [14:30:06] I suppose [14:30:16] that probably wouldn't work due to other constraints [14:30:35] it's kind of relevant to the persistence ticket stuff [14:30:38] but yeah, I suppose [14:31:32] e.g. if eqiad as a whole was fine, but we lost all cache contents on cp1xxx (both layers), planned or not. Requests through other DCs would be covered by local caches, but we could move the users away from eqiad's edge to avoid huge miss-rates there, until the small misses from other DCs had helped refill eqiad's backends. [14:32:05] that scenario was already covered before that commit, because all things mapped directly to codfw or eqiad had the other as a fallback [14:32:13] yeah [14:32:25] if we lost both eqiad and codfw, we could probably modify config-geo on the spot too :P [14:32:34] but still, it's better to be complete and know the traffic has a destination if both are marked down at the edge for some unforeseen reason [14:32:50] modifying config-geo directly is complicated, marking things down is easy [14:34:29] but probably if we had to mark both down, things get ugly, because maybe ulsfo can't handle all the US (either due to networking, or due to smaller total count of cache boxes in text/upload) [14:34:52] yup, that was my point [14:34:53] I don't know that we've ever tried it [14:35:03] it'd be a complicated decision anyway [14:35:20] we've talked before about how operationally we should assume any one of the 4 cache edge sites should be able to at least barely handle total global user load [14:35:24] but with ulsfo I don't think it's the case. [14:35:45] the total *global* user load? [14:35:51] probably the closest real assumption to that we can make, is that on the current map we can always handle 1/4 being marked down. [14:36:15] we could probably sustain 2/4 down [14:36:15] well, where would you define the cutoff? any 2/4, so long as the 2 aren't codfw+eqiad? [14:36:45] == "at least one primary DC must be up" [14:37:14] well, do you think eqiad could handle total global load on its own? [14:37:18] it's possible ulsfo+esams could handle the load, but I'm not sure, nor to what extent [14:37:41] eqiad and esams are similar in capacity at the edge. they all are except ulsfo. [14:37:41] I dunno, it's possible [14:37:52] (well, cache machine capabilities, not network maybe) [14:38:03] esams is pretty well connected too [14:38:08] probably better than eqiad, actually :) [14:38:15] on the next ulsfo hw refresh I want to bring it up to speed with the rest, but it doesn't seem worth bothering with until then. [14:38:38] if asia gets real dates and such, that might change that, too [14:38:41] I wonder if hit ratios matter [14:39:08] do you mean "I wonder if total size of local backend cache matters"? [14:39:15] global load means that one cache cluster would get both e.g. enwiki, jawiki and dewiki [14:39:34] in the case of text, caches are oversized, I suspect that wouldn't be an issue. [14:39:35] s/both/all of/ [14:39:51] differentials in cache_upload patterns might be relevant on that front, though.
[14:40:11] nod [14:40:33] is there any realistic scenario where setting 3/4 frontend sites as down makes sense? [14:40:37] I can't think of any [14:41:04] probably not, but we've never really written down any kind of policy about it [14:41:15] yeah [14:41:25] if we consider 2-sites-down our limit, it might change risk calculations when we take one offline [14:41:36] which we already do usually, but informally. [14:43:15] for the 4 we have today, probably the simplest way to state the limit for edge stuff is "up to two sites can be down, so long as at least one core site remains available" [14:44:00] which means if we're taking one edge site out for maintenance, we only have capacity to suffer one other natural loss due to unforeseen failure. [14:44:15] but if we take a core site out, we can't really handle another site going down at the same time in all cases. [14:44:45] even then that doesn't seem right, the wording there [14:44:52] eqiad+esams both down probably wouldn't work? [14:45:12] from a networking perspective? [14:45:14] it might, but we've never tried [14:45:16] possibly [14:45:30] well from edge networking and edge caches perspective [14:45:36] we've never tried nor planned for it [14:45:40] we have transits at chicago as well [14:45:44] that are underused right now [14:46:11] so it's possible [14:46:24] if you want to have a policy, you have to take satellite sites into account too :) [14:46:29] eqord/eqdfw/knams [14:46:36] or "network PoPs" [14:46:39] yeah.... [14:47:11] knams has 4x10G transits [14:47:17] so that raises an interesting-to-me question: [14:47:18] that's quite significant [14:47:24] (it's more than all of eqiad has!) [14:47:38] chicago is probably the only place where we have important decisions to make on hot/cold routing [14:47:54] e.g. do we take in user traffic there that could've reached out directly to eqiad or codfw [14:48:21] due to yet unresolved network design technicalities we don't take inbound user traffic there at all [14:48:36] i.e. we don't announce our routes from there (IIRC) [14:48:40] ok [14:48:47] but we use eqord for outbound traffic, which is what matters more [14:49:06] so in those cases, the routing is asymmetric? [14:49:12] yes [14:49:26] not only in those cases, routing is asymmetric in most cases I'd say [14:49:58] because we'll often tend to make different decisions on transits/peers for outbound than the traffic made on the inbound side? [14:50:09] that's one reason [14:50:21] another is hot potato routing [14:50:30] right [14:50:49] and those "different decisions" may be either due to explicit policy decisions [14:50:52] or pure luck [14:50:58] so we might have a user's IP rough-geo-mapped to eqiad, but in BGP terms the networks in front of them chose to send it into our codfw site [14:51:10] but we always hand off straight to transit/peer from eqiad on return [14:51:16] in the sense that if between us and AS 12345 there are 3 equal-distance paths [14:51:40] which of the three we pick and 12345 picks gets down to criteria that are often "random" [14:51:51] quotes because they're deterministic but meaningless [14:51:56] yeah [14:52:27] it could be whether the local subnet with the transit is arithmetically lower than the other one, for instance [14:52:51] in our case, and that's very typical in our setup, it could be whether the VRRP master is cr1 or cr2 [14:52:55] so sure, I can understand some "random" variation between, say, 2 different transit links into eqiad, with traffic coming in one and out the other to same user. 
[14:53:16] right now we flip the VRRP masters for all subnets at the same time, but I've considered varying that per subnet [14:53:23] to load-balance traffic a little more [14:53:44] if that happens, you may get an entirely different path to an end-user from a cp* server in row A and from a cp* server in row B [14:53:55] which is a little mind-boggling :) [14:53:56] the more-interesting case though is the one I mentioned earlier: there could be significant mismatches between what remote BGP considers our closest point to the user and where our geodns is sending them. [14:54:03] resulting in "eqiad" traffic coming in through ulsfo or codfw [14:54:30] it's possible [14:54:34] and I'm sure it happens to some extent [14:55:00] on the rows thing: in codfw all the cache clusters are evenly split between rows as well as is possible [14:55:03] in eqiad, not so much [14:55:22] we do have clusters split between rows for redundancy, but not evenly across all 4 [14:58:02] nod [15:00:17] on an unrelated matter, I've scheduled the cr1-eqiad upgrade on Thursday [15:00:25] I set up a gcal event and invited you to it for the FYI [15:00:36] if that's ok with everyone I'll send an email to ops@ too [15:00:46] ema: so far the shifts for upload seem sane/unimportant. You can kinda see a maybe ~0.2% reduction in misses, and maybe something closer to 0.1% shift of traffic from local-backend to frontend termination. which may be from the -hfp. but it might take days to see the full effect of that anyways, and it's hard to separate those from noise on this short timescale. [15:00:55] paravoid: seems ok to me [15:00:59] cool [15:01:43] also I'd like to upgrade ulsfo & esams routers to newer JunOS at some point [15:02:08] they're at 12.3, everything else is (will be with cr1-eqiad) at 13.3R [15:02:08] yay, we can upgrade knams now [15:02:16] without depooling it [15:02:25] yeah, although I think I'd still depool [15:02:33] and upgrade all of them in the same window :) [15:02:42] all three of them [15:03:15] so is this our final esams link config for the present, what we have now with a new wave + existing mpls? [15:03:15] well, two I suppose, cr2-esams' junos isn't that old [15:03:29] well yes for the "present" :) [15:03:51] we'd still like to renegotiate and move the MPLS link to chicago [15:04:00] (or find another vendor) [15:04:56] so, the wave goes straight to esams. I forget, is the mpls also to esams (other router) or to knams? [15:05:01] knams [15:05:03] ok [15:05:25] cr1-eqiad<->cr2-knams + cr2-eqiad<->cr2-esams [15:06:14] bblack: nice, ~0.2% is in line with the range request sampling we did [15:06:24] right, so the cr2-eqiad<->cr2-esams is our primary wave, and cr2-knams<->(cr1-eqiad, maybe eqord in the future) is our MPLS that's basically a backup at this point. [15:06:31] correct! [15:07:10] having that on eqord is nice. in case eqiad gets nuked, we'll still have a way to have codfw<->esams talking. [15:07:17] yup [15:07:23] could be to codfw directly too [15:07:36] but eqord is nicer because if we lose the wave, we still get to eqiad with reasonable latency [15:07:44] right [15:07:45] and we don't need to reconfigure the varnish layer [15:08:37] so, assuming all links are running fine and we switch to codfw as primary in cache tiering, esams->codfw would actually flow through eqiad for now (and even in the mpls->eqord case, too)? [15:10:36] yes on the first [15:10:44] not sure what you mean about the mpls->eqord case?
[15:11:09] if the mpls moves to eqord on the US side, would we still prefer to send esams<->codfw traffic through eqiad and the wave, if all links are up? [15:11:09] if the mpls terminated at eqord, we'd have an esams->knams->eqord->codfw path [15:11:14] oh [15:11:15] yes :) [15:12:07] I didn't know if maybe other metrics came into play on eqord<->codfw vs eqiad<->codfw that could override the decision or something [15:12:35] no [15:12:42] using the mpls link costs a lot [15:13:02] ok I see in the ticket now, it's basically got a virtual 1s RTT penalty on its metric [15:13:10] oh wait, 100ms [15:13:20] so unless the path was actually 100ms+ faster, we wouldn't take it [15:13:20] no [15:13:21] 1000ms [15:13:28] but that's arbitrary [15:13:32] I just added a "1" in front :) [15:13:45] 820 == 82ms, +1000 == +100ms ? [15:13:50] we generally don't want to use it, it costs a lot [15:14:52] but even if something were gauging all those combined metrics to make a decision and the 100ms is correct above, it would only get picked if codfw<->eqord was metric'd as 100ms faster than codfw<->eqiad, which would never be the case. [15:15:03] yes, sorry, 100ms [15:15:07] but it doesn't matter, it's arbitrary [15:15:10] ok [15:15:28] right now it's two paths to the same site [15:15:31] one has 840, the other has 1820 [15:15:50] could have set the latter to 900, wouldn't have made a difference :) [15:16:30] in theory though, if the metrics for esams->knams->eqord->codfw added up to less than the metrics for esams->eqiad->codfw (which would never be the case with the +100ms metric), we'd take the MPLS for the traffic, right? [15:16:43] yes [15:16:45] all the above after mpls theoretically moves to eqord [15:16:58] ok [15:17:09] also esams<->knams metrics aren't latency-based right now [15:17:24] but it's close enough to not matter :) [15:17:31] right [15:17:41] I kind of tend to assume that's the case for eqdfw+codfw too [15:17:49] yes [15:17:51] yes [15:19:10] in the long run, especially post-asia, it might be nice to have a more algorithmic approach to geodns decisions (even if it's a manual algorithm). but for what we have now, we could map it out manually into a simple decision chart any ops can follow [15:19:29] e.g. a list of scenarios like: "esams + eqord are down: mark blah DCs offline in geodns" [15:19:55] which is "obvious" for most cases, maybe not for all 2x sites down cases where some of them might be network-only, too [15:21:31] then again, I donno, the possible corner-case scenarios are quite diverse [15:21:45] e.g. whether we lost cache contents or the whole site (power) or just 1 or more network links [15:22:20] when those things happen though, someone has to make a decision, and sometimes it's not obvious enough to make it in seconds off the cuff [15:23:14] I think those cases are rare enough that an extra few minutes won't matter [15:23:26] DNS changes take minutes to propagate anyway [15:23:43] assuming someone who can figure it out in a few minutes wakes up and logs in [15:24:19] at the whole-site level, there's only 6 possible combinations of "2x DCs are down", and one of those is eqiad+codfw so that one's obviously "you're screwed" [15:24:58] I think for all the rest the only choice is the obvious one: mark the dead sites dead in geodns [15:26:29] with any one failure of just a link, usually the fallout's not going to warrant a DNS change [15:27:04] but one site being down for maintenance and another random link going down somewhere?
I don't know if some of those situations warrant further geodns changes, in all cases. [15:35:24] 10Traffic, 06Operations, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2553546 (10BBlack) [15:46:26] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2553571 (10ema) [15:46:30] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2553569 (10ema) 05Open>03Resolved [16:35:11] bblack: I might have found a dirty way to carry on with https://gerrit.wikimedia.org/r/#/c/276529 [16:35:25] bblack: we could do: [16:35:27] .backend = be_{{ $parts := split $node "." }}{{ join $parts "_" }}; [16:35:44] or upgrade to a recent confd, which actually supports replace [16:36:06] https://github.com/kelseyhightower/confd/blob/master/docs/templates.md#replace [16:37:30] I've tested the split/join hack on cp1008, output in /tmp/directors.ematest.backend.vcl [16:37:56] split/join seems reasonable. it makes the names longer than necc with the wmnet on the end, but who cares? unknown risks to upgrading confd really. [16:38:10] agreed [16:38:36] making that change will probably require two puppet runs to complete successfully, due to race condition between puppet mod of direct VCL + go template, and confd updating the output from the go template. [16:39:01] you might save some irc spam by doing a salted run on all the caches of 'puppet agent -t; puppet agent -t' [16:39:13] also the current ruby code does not do the same thing as the confd template [16:39:28] in particular, in ruby we replace '-' as well I think [16:40:16] yes, the confd template just splits on '.' and gets the first element (hostname) [16:40:25] I think we do have hostnames with - in them [16:40:42] but you could fix that by splitting on both . and - right? [16:41:15] right, but why does it work now then if we have hostnames with '-'? [16:41:29] because none of them are dynamic, maybe [16:41:34] ah [16:41:43] risky stuff :) [16:41:53] they probably never will be [16:42:13] in practice and in the long-term ideal, the only dynamic directors are the ones that are varnish<->varnish [16:42:34] everything else is static because it's behind LVS to manage pooling at a different level [16:43:00] right now there are cases where it's static and not-LVS, where in theory we could/should use dynamic directors to manage it at the varnish level [16:43:12] but I'd rather we ignore that case, it's too messy anyways [16:44:00] and eventually split up the metadata more-properly too [16:44:17] right now at some level in puppet, we specify varnish<->varnish backends in a very different place than applayer backends [16:44:47] but when you get all the way down to the templating, it's all the same list of directors that has to handle all possible cases: varnish or applayer, dynamic or not. [16:45:30] we should probably preserve that split of things being very different all the way through, and only do "dynamic" behavior for varnish<->varnish and use simpler logic/code for applayer. [16:46:36] so I guess what I'm saying is, ignore the '-' problem, we don't use that in varnish hostnames, which are the only dynamic hostnames, and I don't think we plan to change that. 
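To make the naming concrete: with the split/join approach, a hypothetical dynamic backend cp1099.eqiad.wmnet would be rendered with the dots turned into underscores, roughly like the sketch below (varnish 3 syntax; the hostname, port, weight and director type are illustrative only — the real confd/puppet templates also carry probes and other options):

    backend be_cp1099_eqiad_wmnet {
        .host = "cp1099.eqiad.wmnet";
        .port = "3128";
    }

    director cache_eqiad random {
        { .backend = be_cp1099_eqiad_wmnet; .weight = 100; }
    }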
[16:47:06] I'm not sure that we ever use IP addresses for backends directly either, for that whole ipv4_ mess [16:47:58] oh good because splitting on - as well was making my eyes bleed [16:48:21] yeah confirmed with salt, nothing using the backend 'ipv4_' crap in practice anymore [16:49:14] so we can simplify wikimedia-common.vcl.erb in that respect [16:50:43] nice [16:51:30] I'd leave the dash-mangling part in VCL though, since it's useful for some of the static director backends [16:51:38] we can just ignore it in the directors template for go [17:07:39] ema: why dots to commas? [17:09:51] bblack: because I'm retarded [17:10:10] ok :) [17:11:51] ema: also gsub takes a pattern, like .gsub(/[xy]/, 'z') [17:12:53] bblack: oh so we can use /[-.]/ I guess [17:13:06] yeah [17:21:01] bblack: I have to leave soon, feel free to merge if it looks good to you, otherwise I'll merge tomorrow [17:59:34] ema: ok [18:10:20] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447#2554248 (10BBlack) [22:11:54] 10Traffic, 10Citoid, 10ContentTranslation-CXserver, 06Operations, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2555433 (10Jdforrester-WMF)