[05:15:40] Going to give wikibugs another restart
[06:54:22] reporting in here too
[06:55:14] there were some BFD errors between cr1-eqiad and cr3-knams and cr2-eqdfw, the common problem seems to be GTT transport links
[06:55:39] but I don't see maintenance
[06:55:51] there was an interface down for cr1-eqiad
[06:56:09] but now it seems solved, but the BFD sessions are still down afaics from the cr1-eqiad
[06:57:10] thx looking
[06:57:21] XioNoX: <3
[06:57:34] with Zayo down and that issue, it's not very comfortable
[06:57:48] yeah I agree, I was about to call you if you weren't here
[06:58:31] I see some recoveries for bfd sessions
[06:58:47] maybe GTT had some troubles nearby eqiad
[06:59:39] planned work
[06:59:41] Start: 2020-09-18 06:00:00 GMT
[06:59:42] End: 2020-09-18 10:00:00 GMT
[07:00:07] I see it now yes, I looked for the transport link id in gmail but didn't find it
[07:01:06] #fun, at least it started only a few hours after the CenturyLink outage ended
[07:01:06] it also arrived like 2hs ago
[07:01:09] :)
[07:01:24] otherwise we would have another issue with esams
[07:01:39] CenturyLink = Lumen now
[07:02:32] TIL
[07:03:16] since Monday I think
[09:30:22] there is also an emergency maintenance notification from Telia on the eqord-ulsfo link
[12:27:01] godog: hello, did you see https://phabricator.wikimedia.org/T263206 ?
[12:29:21] elukey, klausman: just FYI - stat1008 shows up on icinga/alerts with a warning that puppet hasn't run in a day
[12:29:43] cdanis: hi, Filippo is off today I think
[12:29:58] Hurm. Did I disable puppet on the wrong machine yesterday?
[12:30:25] No, I did not.
[12:31:00] ah
[12:31:09] well then I guess we’ll just yolo it
[12:32:07] cdanis: this is upload@eqiad fetching from codfw, for reference https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?viewPanel=6&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-layer=backend&var-cluster=upload
[12:35:33] cdanis: so yes, your "probably 1.5Gbps" estimate seems accurate
[12:40:01] I did disable the wrong machine. Re-enabled it just now
[12:40:26] +1
[12:40:27] ema: ah thanks
[12:40:37] ema: do you know of any reason why we shouldn't repool swift at eqiad? :)
[12:41:19] cdanis: I'm rusty on the current state of swift, but "swiftrepl" might be a concern? It used to be in the past
[12:42:15] see https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Media_storage/Swift
[12:42:40] swiftrepl is a contingency for the rare instances where MW FileBackend fails to write to both DCs but reports the upload as successful anyway, AIUI
[12:44:14] just to keep going with the kudos friday session, today klausman reimaged stat1004 to debian preserving all the home dirs of the users (in a separate partition) with the awesome reuse-parts partman recipe created by kormat :)
[12:44:25] *debian buster
[12:44:25] and I thought ... yeah, its direction was flipped for the switchover https://gerrit.wikimedia.org/r/#/q/Iccc4bf5a6e93628d3433bc38a54a00904aaf88e9
[12:44:29] 😅
[12:44:48] cdanis: so maybe we want to revert that, before switching back
[12:45:00] no I don't think so
[12:45:07] swift is supposed to be active/active on reads
[12:45:42] ah of course, swift itself isn't doing the writing :)
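A quick way to see which DC(s) a discovery name like swift.discovery.wmnet currently resolves to is to query it against each site's recursive resolver. This is only a sketch using dnspython; the resolver addresses below are placeholders rather than real production recursors, and it only works from inside the production network.

```python
# Sketch: check what swift.discovery.wmnet resolves to from each core DC's
# recursor, to see where reads are currently being sent. Requires dnspython
# >= 2.0; the nameserver IPs are placeholders, not real production addresses.
import dns.exception
import dns.resolver

RECURSORS = {
    "eqiad": "198.51.100.10",  # placeholder, substitute the site recursor
    "codfw": "198.51.100.20",  # placeholder, substitute the site recursor
}

def discovery_answers(name="swift.discovery.wmnet"):
    """Return the A records for `name` as seen from each site's recursor."""
    results = {}
    for site, ns in RECURSORS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ns]
        try:
            answer = resolver.resolve(name, "A")
            results[site] = sorted(rr.address for rr in answer)
        except dns.exception.DNSException as exc:
            results[site] = f"error: {exc}"
    return results

if __name__ == "__main__":
    for site, addrs in discovery_answers().items():
        print(site, addrs)
```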
[12:47:14] so how many link failures in are we now?
[12:47:43] the only confusing part is the swift-rw service, but I believe that is old and dead
[12:48:16] cdanis: correct
[12:49:20] ATS uses https://swift.discovery.wmnet, no mention of swift-rw
[12:49:52] by that I mean, what's the picture right now in terms of "because of link failure X we did Y, and because of link failure Z we did A", and have any of those recovered now (s/failure/maintenance/ as appropriate)
[12:50:07] bblack: Zayo is back
[12:50:09] bblack: the link failures aren't actually the issue
[12:50:16] Zayo is still splicing many of their fibers but ours has been done
[12:50:26] the centurylink and GTT maintenances are over, as well
[12:50:37] so could we re-depool eqiad at this point?
[12:50:41] the issue is that we put ourselves in a configuration we can't actually sustain with only 10Gbps
[12:50:46] no, the original issue was
[12:50:56] that there's one 10Gbps bottleneck between codfw and eqdfw
[12:51:06] and we were sending 8Gbps of peering/transit traffic that way
[12:51:12] oh ok, I thought that was something we started noticing after
[12:51:14] got it
[12:51:22] no that was the reason we repooled eqiad
[12:51:24] but
[12:51:30] it turns out that we can't repool eqiad and not have swift pooled there
[12:51:37] yeah I saw that above
[12:51:40] because then both eqiad and esams upload-lb are pulling from swift in codfw
[12:51:52] we could also repool swift reads in eqiad
[12:51:59] I have just done that :)
[12:52:05] awesome :)
[12:52:58] so there's probably a few different perspectives you can take on the long term of this situation
[12:53:16] yeah I think there's a very interesting discussion there
[12:53:29] and it kind of all loops back to the long view of what we're trying to accomplish with the switchover testing, and what we're aiming to eventually accomplish with two core DCs in general
[12:53:53] but I need at least another coffee before I could even begin a naive conversation about half of it heh
[12:53:56] the test was successful in that we found bottlenecks we didn't know about / realize existed
[12:54:06] +1
[12:54:15] lol I am on 0.5 coffees right now, after maybe 6h sleep, and making prod changes I don't think I fully understand 🙃
[12:54:25] yeah, that's great to see
[12:54:34] I'd extend this to beyond bottlenecks, though
[12:54:44] I'm not sure we understand what all this traffic is for either
[12:55:15] yeah I'm thinking more in terms of: what parts will always (well, foreseeable future) be a/p, vs aiming to be viably a/a, vs already are a/a
[12:55:42] and what scenarios we're hoping to be able to handle by having the two core DCs and the various switching/failover tools at our disposal
[12:55:43] I want internal netflow :)
[12:56:15] yeah +1
[12:56:42] it's possible the answers to some of this is just that the scenario that triggered the problem wasn't a realistic one (in the sense that we shouldn't be aiming to cover that case in that way at all)
[12:57:17] I also wonder how you might solve the problem of when IP addresses don't tell you the whole story (on ganeti nodes, k8s nodes, etc etc)
[12:57:29] oh I guess ganeti is fine but not k8s
[12:58:19] bblack: yeah, I think given how our waves are set up, having specifically eqiad edge pooled with swift depooled is probably a non-goal as a configuration, it doesn't make a ton of sense and will always stress the eqiad/codfw connection (so long as that is 1*10G)
[12:58:28] from what I can tell from esams librenms+turnilo, it seems like 25% of our edge traffic turns up in the backhaul link
[12:58:43] is that normal from a traffic/caching perspective?
[12:58:44] bytes or reqs?
[12:58:47] bytes
[12:58:52] and all of just text or just upload or?
[12:58:59] cdanis: can install netflow on linux boxes (eg. k8s) to sample closer to the source
[12:59:00] I don't have that data right now
[12:59:08] XioNoX: 👀
[12:59:10] ok
[12:59:24] bblack: we were doing some turnilo'ing from webrequest_sampled_128 yesterday, it's tricky though
[12:59:40] so from a network pov, the most-useful graph in the grafana stuff is "Terminal Layer"
[12:59:41] I'm just looking at wmf_netflow now
[12:59:47] https://grafana.wikimedia.org/d/000000500/varnish-caching?viewPanel=1&orgId=1&refresh=15m
[13:00:03] bblack: but does that account for ats-be?
[13:00:06] but it's reqs, not bytes, so ymmv comparing the two
[13:00:07] just looking at exporter IP: esams+knams, split time (5 minutes), exclude as dst 64600, measure bytes
[13:00:08] or is that 'local varnish'?
[13:00:21] and then at librenms for the esams-eqiad link
[13:00:26] the graph is old, but the data is still correctly-layered, just poorly labeled
[13:00:38] "front varnish" is the varnish-fe layer, and "local varnish" is the ats-be layer
[13:00:41] ack
[13:00:45] if you want bytes, see https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?viewPanel=6&orgId=1&var-datasource=esams%20prometheus%2Fops&var-layer=backend&var-cluster=upload
[13:00:54] at noon UTC, esams edge traffic was 15G, while eqiad-esams was 3.5G
[13:00:56] paravoid: would it be the same to look at librenms (es|kn)ams traffic bill?
[13:01:33] noon UTC as in an hour ago?
[13:01:38] yes
[13:02:06] actually noon UTC was over 4G for some reason
[13:02:08] so on the reqs graph, we're claiming 8.6% of reqs went through to eqiad at that time, from esams edge
[13:02:09] cdanis: should be, yes
[13:02:31] there was a spike at ~11:50
[13:02:35] still ongoing
[13:02:41] but ok, even ignoring that
[13:02:54] for text specifically it's 11%, and for upload it's more like 3.9%
[13:03:10] but, without the Bytes info, it's hard to translate that to what you want on the network links
[13:03:41] upload@esams could be sending only 1% of upload reqs through to swift at the core DCs, but the wrong 1% could be worth almost all the traffic
[13:03:55] yeah
[13:04:03] last night there was confusion around "pass" as well
[13:04:17] what confusion?
[13:04:47] how much of upload traffic is pass etc.
[13:04:58] ah
[13:05:14] more than I would expect, for sure
[13:05:30] but I'm not sure "pass" is always realistically differentiable from "miss" anymore, in our v+ats world
[13:05:44] ema would know better where we stand on that
[13:05:59] sure, but from an octets-over-wire perspective, you care about the sum of both
[13:06:11] hm, upload traffic seems bursty
[13:06:36] https://w.wiki/cdo
[13:06:42] right
[13:06:55] the "terminal layer" graph doesn't care about miss-vs-pass
[13:07:17] just "did this request stop at v-fe, or ats-be, or applayer"
[13:07:35] https://w.wiki/cdp
[13:07:52] cp3 miss -> cp3 pass, uri host: upload
[13:07:56] in ATS we cannot distinguish between miss and pass, no
[13:08:08] that's ~1.5G of traffic or so
[13:08:32] what is the question we're trying to answer?
[13:08:36] so in that graph, you're looking at one specific combo
[13:08:41] "what is this traffic" :D
[13:08:48] miss->pass, but not e.g. pass->miss, pass->pass, miss->miss ?
[13:09:01] pass->pass doesn't seem to happen on upload
[13:09:02] did you narrow in on that combo because the others were lesser?
[13:09:21] because there were no good matches on the others really, there was only hit -> pass
[13:09:51] you can change the regexp to (miss|pass) fwiw
[13:10:17] yeah that is what it is in my link
[13:10:23] the order is confusing too
[13:10:58] btw: https://librenms.wikimedia.org/graphs/to=1600434600/id=11595/type=port_bits/from=1600413000/
[13:11:04] this is what repooling swift in eqiad did
[13:11:16] took 4.5Gbps off the link
[13:11:18] ema: to your earlier question, the higher order level question is "in terms of bytes, what would be a "normal" value of the edge:backhaul ratio"
[13:11:27] cdanis: nice, I'll re-enable the checks
[13:11:59] I think we'd have to stitch together multiple data sources to find edge:backhaul bytes ratio
[13:12:05] (and yes I realize normal is a bit of a weird question)
[13:12:21] bblack: I think it's all in webrequest_128?
[13:12:45] oh maybe
[13:12:55] yeah I guess so
[13:13:19] you would just want to match an x-cache of something like cp3... .*, cp3... .*
[13:13:35] anyways, back on the x-cache regexes, that "miss, pass" pattern I believe means the ats-be recorded a miss, and the v-fe recorded the pass
[13:13:53] v-fe can pass for lots of reasons (especially objects that are too-large for an fe mem cache)
[13:14:36] range requests that miss are passes in the fe too
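For reference, a rough sketch of classifying the X-Cache combos being discussed, following the ordering described above (the ats-be entry first, the varnish-fe entry last). The regex, the helper names, the "int" handling, and the example hostnames (made up to match the cp3... pattern from the queries above) are illustrative assumptions, not the production analytics code.

```python
# Illustrative parser for X-Cache strings like "cp3062 miss, cp3050 pass".
# Per the discussion above: the first entry is the ats-be layer, the last one
# is varnish-fe, and a request only generates backhaul/origin traffic when no
# layer recorded a hit (or a synthetic "int" response).
import re

ENTRY_RE = re.compile(r"(cp\d+)\s+(hit(?:/\d+)?|miss|pass|int)")

def parse_x_cache(x_cache):
    """Return (host, status) tuples, backend-most first, 'hit/7' -> 'hit'."""
    return [(h, s.split("/")[0]) for h, s in ENTRY_RE.findall(x_cache)]

def terminal_layer(x_cache):
    """Rough equivalent of the 'Terminal Layer' view: where did it stop?"""
    statuses = [s for _, s in parse_x_cache(x_cache)]
    if statuses and statuses[-1] in ("hit", "int"):
        return "front varnish"      # answered by varnish-fe
    if any(s in ("hit", "int") for s in statuses[:-1]):
        return "local varnish"      # answered by ats-be (old label)
    return "applayer"               # miss/pass all the way to the origin

assert terminal_layer("cp3062 miss, cp3050 pass") == "applayer"
assert terminal_layer("cp3062 hit/2, cp3050 miss") == "local varnish"
```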
[13:14:47] paravoid: so the esams<->codfw/eqiad part should be easy to find, it's https://w.wiki/ce4 (upload) + https://w.wiki/ce5 (text)
[13:15:23] paravoid: somewhere we should have the equivalent "varnish bytes sent" for the varnish-fe part at the edge
[13:16:04] yeah those figures match librenms -- somewhere in the 3.5Gbps backhaul traffic
[13:16:17] (at peak)
[13:16:32] with edge being ~15-16Gbps
[13:16:41] is that ratio "normal"?
[13:16:56] edge: https://grafana.wikimedia.org/d/000000304/prometheus-varnish-dc-stats?viewPanel=18&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_upload&var-layer=frontend
[13:18:16] yup, roughly in the same ballpark as turnilo (yay our tools work!?)
[13:18:34] they do and we also seem to know how to use them!
[13:18:44] so at roughly noon, looking at those, we're seeing ~9.6Gbps at the edge outbound, and ~3.6Gbps backhaul, right?
[13:19:02] no
[13:19:28] oh upload and text are separate on the edge graph, right
[13:19:33] yes that
[13:19:55] so ~14.4Gbps at edge + ~3.6Gbps backhaul
[13:20:01] yup
[13:20:27] paravoid: https://w.wiki/ceD
[13:20:36] which matches netflow etc.
[13:20:47] usually backhaul is 15-20% of edge for esams upload-lb
[13:20:53] it's higher right now apparently
[13:20:56] rolling both graphs out to ~7d, the patterns don't seem crazy-anomalous
[13:21:15] there might be some variation, but it's in the right realm
[13:21:28] and ~33% for text?
[13:21:34] well except for just today
[13:21:53] that jump today for esams backhaul starting just before noon UTC
[13:21:59] from those varnish graphs above 200MB/s origin vs. 600MB/s edge for text
[13:22:13] for text, the ratio varies diurnally
[13:22:24] https://w.wiki/ceF
[13:22:28] which is also kind of what you would expect
[13:23:00] nice queries :)
[13:24:13] bblack: anomalous compared to what?
[13:25:04] what I meant is: the edge-vs-backhaul pair of graphs above, the present values seem consistent, roughly, with the past 7d, other than the upload@esams backhaul increase that suddenly started around 11:40
[13:25:55] right, but is the state 7d ago to be expected in general?
[13:26:10] or do we have some kind of weird traffic pattern that causes lots of misses that we have to track down etc.
[13:26:38] if you look further back in history, there are events like this all the time :)
[13:26:40] looking at past 90d, it seems like mostly, although there was some sort of event around Aug 23-Aug 28
[13:27:11] text looks very consistent over past 90d
[13:27:17] aside from something else on 8/25
[13:27:37] we've investigated some events on upload before, there's a lot of weird "legitimate" uses that can happen
[13:28:11] even the spikes in june and july, look bigger than the present one so far
[13:28:21] https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?viewPanel=6&orgId=1&from=now-90d&to=now&var-datasource=esams%20prometheus%2Fops&var-layer=backend&var-cluster=upload
[13:28:38] to my completely naive view 25% total / 33% text of origin:edge in bytes feels a little fishy
[13:28:54] but if you guys say that yup, it's how it's supposed to be, that's the best we can absorb at the edge
[13:29:00] then that can feed into network planning
[13:29:10] well, you're saying two different things there in those two lines
[13:29:21] i.e. that's not the ratio of transport:edge capacity right now
[13:29:34] are you asking what's normal, or what's the best we could possibly do by making changes to software? :)
[13:29:34] hm
[13:29:42] these aren't 'total' numbers
[13:29:45] let's compute total
[13:30:01] best with a reasonable amount of effort, not year-long projects
[13:30:29] honestly, I don't think we've ever really focused hard on looking at bytes ratios, mostly request ratios
[13:30:39] especially on cache_text
[13:31:03] on upload we did some bytes-focused things in the various cache admission policy stuff, especially the size cutoffs and that exponential function work
[13:31:11] https://w.wiki/ceL total bytes isn't as good as you would hope paravoid
[13:31:35] what is that number?
[13:31:42] upload+text combined
[13:31:50] not differentiating between the two, just, bytes
[13:32:03] origin / edge?
[13:32:06] yes
[13:32:28] so in bytes terms, the aggregate total is 25% bytes missed, right?
[13:32:30] so up to ~40%?
[13:32:52] there are spikes there as high as 60%
[13:32:53] paravoid: so, the good news is that 40% is at local nadir
[13:32:57] well 70
[13:33:19] and this is all-dcs, right?
[13:33:23] no this is just esams bblack
[13:33:34] oh ok
[13:33:46] since we're concerned about transport link capacity, it'd be tricky to write a meaningful all-DCs query for that
[13:34:13] text is least-cacheable in esams between 0100-0500 UTC, which makes some sense
[13:34:36] but that means we don't have to provision for 40% of *peak*
[13:34:58] so the equivalent at the per-request level is:
[13:34:58] yeah that's just bots making edits, with no real users to balance the %
[13:35:04] (we perhaps do have to have a discussion about provisioning for turning up a cold DC in zenith)
[13:35:17] we could definitely save lots of bytes by compressing text before sending it over the backhaul
[13:35:30] ema: was there a cpu and/or latency concern there?
[13:35:30] https://grafana.wikimedia.org/d/000000500/varnish-caching?viewPanel=1&orgId=1&refresh=15m&var-cluster=All&var-site=esams&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-30d&to=now
[13:35:52] cdanis: right now we're explicitly unsetting Accept-Encoding due to https://phabricator.wikimedia.org/T125938 and similar
[13:35:55] right
[13:35:56] where we're averaging around 11% reqs missed, peaking more like 25% missed
[13:36:02] so reqs and bytes definitely don't add up
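A worked version of that comparison, using the round numbers quoted above; this is just arithmetic on the figures from the graphs, not new data.

```python
# Why the request-level and byte-level miss ratios don't add up, using the
# approximate esams figures quoted above.
edge_gbps = 14.4        # ~15-16G at the edge (upload + text), around noon UTC
backhaul_gbps = 3.6     # esams backhaul towards the core DCs at the same time

print(f"bytes ratio, total:  {backhaul_gbps / edge_gbps:.0%}")  # ~25%
print(f"bytes ratio, text:   {200 / 600:.0%}")                  # ~33% (200 vs 600 MB/s)

# The request-level graph shows only ~11% of requests missing on average
# (peaking around 25%), so a small fraction of requests accounts for a much
# larger fraction of the bytes -- consistent with misses/passes skewing
# towards larger and, for text, uncompressed responses.
print(f"reqs ratio, average: {0.11:.0%}")
```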
[13:36:06] ema: wait, so we compress at the edge, but carry the misses uncompressed through the backhaul?
[13:36:13] yes
[13:36:21] hah
[13:36:38] yeah so that would account for a lot of the miss ratio in req vs. miss ratio in bytes wouldn't it
[13:36:43] wow this bug is gross
[13:36:43] because http compression interop is hard :P
[13:36:49] who knows, perhaps now the applayer isn't as bad at compressing things as it used to be
[13:36:59] bblack: so my next question is, does Envoy do a better job??
[13:37:11] I'm not sure that it's an envoy-level problem
[13:37:39] but who knows, maybe envoy can salvage things in a way that makes it work
[13:38:03] especially if we had mediawiki+pals not-compressing to their local envoy, and envoy handling all the compression
[13:38:07] I think it'd be neat to explore this a bit further again
[13:38:10] that is what I was imagining
[13:38:20] it's mediawiki compressing that has caused past problems, IIRC
[13:38:26] yeah, let's just not do that
[13:38:27] esp. before we go and spend thousands of $$$ in transport links :D
[13:40:05] so, the general nature of the compression problem is this (regardless of whose bug where makes it bad), IIRC:
[13:40:12] I suspect this wouldn't make a huge difference in upload right? jpeg and everything wouldn't compress all that well
[13:40:20] yeah I would only bother for text
[13:40:47] If, as an http origin, I emit a compressed response and then die halfway through sending it...
[13:40:57] it more or less hoses the whole connection unrecoverably
[13:41:20] that's why I want to do this in Envoy ;)
[13:41:29] but both ends don't necessarily see the situation the same way at that point, which then also hoses the next request being sent to the origin on that connection, which results in a timeout/500 on a *different* request
[13:41:52] and then if that connection we're discussing is a shared connection (cache backhaul for many users)
[13:42:15] the problem bleeds even between unrelated users, causing all kinds of havoc-at-a-distance which has been historically difficult to debug
[13:43:20] but I think most of the root of its prevalence is about how Mediawiki and/or its fronting Apache httpd handles this situation as an origin, and yes, envoy could make a diff.
[13:43:46] historically, we did have shared backhaul with compression working fine for e.g. varnish<->varnish
[13:44:01] so in the old layered setup, we'd have compression on this link, because esams varnish-be talked to eqiad varnish-be
[13:44:19] and then eqiad varnish-be would strip compression from the request to make sure mediawiki didn't compress, but that was all DC-local
[13:44:32] and in the new model, now that's obviously remote (ats-be@esams -> mediawiki@eqiad/codfw)
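To make the potential text-backhaul saving concrete, a self-contained gzip demo on a synthetic HTML-like payload. The payload and the resulting ratio are illustrative only; real wiki responses are far less repetitive than this sample (roughly 3-5x reduction is a common ballpark for HTML), and this says nothing about the interop problem described above.

```python
# Rough illustration of why compressing text misses on the backhaul matters:
# markup-heavy text compresses very well. Synthetic payload, not a measurement
# of real MediaWiki responses (which compress less than this repetitive sample).
import gzip

html = ('<li><a href="/wiki/Example_article" title="Example article">'
        'Example article</a> - some rendered paragraph text goes here</li>\n') * 2000

raw = html.encode("utf-8")
compressed = gzip.compress(raw, compresslevel=6)

print(f"uncompressed: {len(raw) / 1024:.0f} KiB")
print(f"gzip -6:      {len(compressed) / 1024:.0f} KiB "
      f"({len(compressed) / len(raw):.0%} of original)")
```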
[13:45:54] envoy's probably already configured to compress if asked to?
[13:46:14] at least it could be for all we know, but we prevent it at our origin-requesting side in ats-be
[13:46:27] https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/compressor_filter
[13:46:34] (by stripping AE:gzip from the request before we ask the origin)
[13:46:43] surprised it doesn't already support other kinds of compression
[13:47:03] yeah...
[13:47:15] it's probably "no one has done the work" but that's still funny for envoy
[13:47:21] ats does too, it becomes a thorny problem with a cache when the world goes through an algorithm change
[13:47:23] i expected at least brotli
[13:47:52] because ideally you want to store your cache contents compressed as well, to save space and avoid re-encoding for every hit from the commonly AE:gzip clients
[13:48:09] mmm
[13:48:24] but in a gzip+brotli world, you either waste the space to store both, or store in the most-popular and decode->recode for the less-popular
[13:48:26] do caches like ATS ever store a few differently-compressed flavors of popular documents?
[13:48:28] right
[13:48:43] storing both costs a lot in terms of byte hit ratio potential too
[13:49:38] for now we just haven't deployed brotli at the edge at all and stuck with gzip, at least in part over that general complication and varnish's lack of support and being new to the ats-be transition
[13:50:12] but by now I bet the stats support going brotli-native throughout, and re-encoding for the gzippers
[13:50:26] since chrome dominates the real UAs
[13:50:34] I donno about API traffic though
[13:51:26] you could also theoretically do more-advanced things, like track how many times a brotli cache object has been requested as gzip, and store an extra copy in gzip after the Nth time, to get some more-subtle tradeoff
[13:52:06] mmm, cache-text cpu is a bit hotter than I thought it would be
[13:52:15] there's a lot happening there :)
[13:52:22] we're regularly hitting 50% cpu util in esams text
[13:52:35] and esams is pretty heavy client load, too
[13:52:47] yeah, I looked at it first :)
[13:53:27] if we don't get the 2nd EU DC, ideally we'd expand the esams cluster, it has some raw scaling worries that have been noted months ago
[13:53:46] but the decision at the time was not to ask for all that extra capex and focus on getting the 2nd EU DC
[13:53:56] 2nd EU DC helps a lot of things anyway
[13:54:11] it's the better solution
[13:54:31] there are software changes coming that may help too
[13:54:44] oh?
[14:01:19] yeah I think we can do better on TLS termination at least
[14:01:27] but that's all up-in-the-air stuff
[14:01:37] we'll see how Q2 goes
[14:01:51] envoy compression might shift a little CPU load out too
[14:02:04] ack
[14:03:40] I summarized a bit on https://phabricator.wikimedia.org/T263206
[14:04:04] which I think can be resolved now, because I opened it to not forget about the immediate-term situation, which we've resolved
[14:04:36] but I want to take a stretch and another coffee and file some tracking tasks and maybe an umbrella epic for the general topic of rethinking backbone provisioning
[14:04:50] (which potentially has a *lot* of software stuff)
[14:05:03] well, like pvoid said, it's hard to judge how text compression might change the link capacity stuff
[14:05:17] yeah exactly
[14:05:38] that's the lowest-hanging fruit here. it's probably easy to turn it on and give it a spin, and then we'll see if the situation is tenable or we have horrible issues we can't track down and fix
[14:05:41] and there's a lot to think about between eqiad/codfw specifically, and what "core DC" means
[14:05:52] as we kind of handwaved at before
[14:06:22] compression doesn't help with cache_upload, but there we can look again at admission policies. e.ma's far more up to date on that, than I am.
[14:06:45] but we have a variety of admission policy tools already at our disposal, and the tunables on the exponential one to attempt to re-tune
[14:06:59] and all of that can have a dramatic impact on bytes-vs-reqs tradeoffs for cache_upload hitrates
[14:07:06] the admission policy stuff is just for varnishfe, right?
[14:07:14] or are we also tuning stuff on ats-be?
[14:07:27] I don't think we've tried an advanced one in ats-be-land
[14:07:30] but we could
[14:07:32] right
[14:07:52] I think that, at least in the case of upload, we should be looking at bytes as well as requests
[14:10:18] the only bytes-tradeoff admission policy that I can find in ATS lua right now is:
[14:10:21] elseif content_length and tonumber(content_length) > 1024 * 16 * 16 * 16 * 16 * 16 then
[14:10:24] -- Do not cache files bigger than 1GB
[14:10:52] which we had historically with varnish-be as well, I believe
[14:11:31] that's a pretty arbitrary cutoff, and I don't have a good grasp of the modern object size distribution (either in storage, or in request-popularity sense)
[14:12:37] the 1GB number has probably been around since our backend cache size was smaller, too
[14:12:48] yeah
[14:12:55] with modern configs like current-esams, each of the 8x upload servers has 1.4TB, and combined that's 11.2TB
[14:13:03] we can probably afford to cache some 1GB+ files :)
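The pasted cutoff works out to exactly 1 GiB (1024 * 16^5 = 2^30 bytes). Below is a small sketch of how a size-aware admission policy with the "exponential" flavor mentioned above might look; the probability form and the scale constant are assumptions for illustration, not the production tunables.

```python
# Sketch of a size-aware cache admission decision: the hard 1 GiB cutoff from
# the ATS Lua quoted above, plus an exponential size-based admission
# probability in the spirit of the "exponential function work" mentioned
# earlier. SCALE is a made-up tunable, not a production value.
import math
import random

HARD_CUTOFF = 1024 * 16 ** 5   # == 2**30 bytes == 1 GiB, same as the Lua snippet
SCALE = 256 * 1024             # hypothetical characteristic object size

def admit(size_bytes):
    """Decide whether to admit an object of this size into the cache."""
    if size_bytes > HARD_CUTOFF:
        return False           # never cache objects bigger than 1 GiB
    # Small objects are almost always admitted, very large ones only rarely:
    # this is one knob on the bytes-vs-reqs hitrate tradeoff discussed above.
    return random.random() < math.exp(-size_bytes / SCALE)

assert 1024 * 16 ** 5 == 2 ** 30   # the "1GB" comment really is 1 GiB
```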
[14:14:08] and then another angle of attack here, which was always somewhere in the plans but never gotten-around-to
[14:14:21] was combining the cache backend pools, if that works out ok
[14:14:28] there's a lot of tradeoffs, but I tend to think it's a net win
[14:14:49] right now the ats-be in cache_upload and cache_text are configured identically, and both can handle either cluster's traffic in config terms
[14:15:02] but they're split up artificially into clusters so they don't bleed into each other
[14:15:20] ah interesting
[14:15:45] cache_upload could gain a lot of storage space from combining them, and cache_text more connection-handling volume (total daemons taking care of all the pass/miss traffic through text)
[14:15:58] makes a lot of sense
[14:16:06] at least that's the theory. it could also go horribly wrong as they compete with each other in the same storage pool :)
[14:16:42] how often are we evicting stuff for cache-text rather than it just expiring?
[14:16:56] I don't know the answer to that, and it might be tricky to find
[14:17:14] my general impression is that text doesn't have eviction problems in the backend, but upload might.
[14:17:53] (evicting for space reasons I assume we both mean)
[14:18:04] yeah
[14:18:10] (because we'd have to separate out purge evictions and probably some other cases)
[15:17:18] are we capturing actions somewhere?
[15:17:31] I was going to soon
[15:17:38] did you have more you wanted to add?
[15:17:54] no, just want to make sure we don't lose all those ideas :)
[15:19:01] 1) internal netflow 2) explore compression 3) upgrade eqdfw/codfw transport (T263210) 4) better loadbalancing over existing transports (T263212) 5) potentially more transports
[15:19:02] T263212: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212
[15:19:11] yeah
[15:19:19] 6) that annoying turnilo rate bug :D
[15:19:22] 7) 'unified' ats-be pool
[15:20:04] maybe something around grafana/turnilo, standardize some fields or make some graphs or something, dunno
[15:20:39] seems like we were struggling to figure out the right data and understand traffic flows, not sure what exactly we could improve though (besides (1) and (6))
[15:21:03] yes, I agree -- I'm still trying to crystallize it into something meaningful but "have better ideas about backbone provisioning" is on my mind
[15:24:02] can be "transport:edge capacity ratio should approach origin:edge ratios" or some variation of that
[15:24:24] yeah but I think that's not the whole story -- that's the story in the steady-state
[15:24:58] if you don't *over*provision transport there, you still either add an operational cost of not being able to repool a cache-cold edge around zenith, or needing stuff like fractional pooling in geoDNS
[15:25:05] of course lots of caveats apply -- transport capacity is hard to calculate, edge even more so, origin:edge ratios vary during the day, and also in cases of cold turn-ups, or attacks and whatnot
[15:25:10] yes
[15:25:14] :)
[15:36:20] the unified ats-be pool is already covered somewhere
[15:36:39] compression is easy to try, we can make a new ticket maybe instead of re-re-re-reviving old ones
[15:38:03] yeah that was my plan, reference the old ticket but only as a mention of "let's not enable it at the applayer, just envoy"
[15:38:29] [trying to find the somewhere for the other]
[15:40:02] we're way overdue for some serious time spent cleaning up the traffic phab backlog
[15:40:10] s/we're/I'm/ heh
[15:40:53] I bet a large chunk of our open tickets can be handwaved away as outdated/outmoded/inapplicable/already-done/etc
[15:41:01] maybe not half, but something significant
[15:43:02] we need some improvements to how we organize, too. the topical approach we adopted however-many years ago isn't working great
[15:43:40] maybe time to at least try more of a kanban-style approach for a while
[15:44:45] (and create a column within that system that covers the large amount of tickets that aren't merely short-term backlog, but longer-term big-idea tickets that are more like: "this is a record of a maybe-great idea if we ever get around to putting it in an annual/quarterly plan"
[15:44:51] )
[15:45:10] "Wishlist" maybe
[15:46:14] i kind of think you want a 'wishlist' and 'epics' parking zone, yeah
[15:47:00] yeah, maybe Epics for active ones that are driving a flow of shorter-term tickets through the queues
[15:47:05] and Wishlist for big ideas that are inactive
[15:47:19] and a default Triage queue for random incoming
[15:47:23] (like we have now)
[15:48:02] and then some standard kanban flow columns for the rest. backlog->in-progress->review/testing->done, blocked
[15:48:05] or whatever
[16:15:23] bblack: cdanis: every couple weeks or so I spend a couple hours to review old tasks for the few projects I am a lead for
[16:15:32] maybe for 3 or 4 hours, usually on friday afternoon ;]
[16:15:44] not ideal, but that helps dismiss a bunch of old/obsolete/already done tasks
[16:16:09] I usually look at tasks ordered by creation date, oldest first
[16:20:40] out for dinner etc *waves*
[18:19:47] apergos: just to loop you in, looks like after deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/622342/ last weekend the old cron entry wasn't cleaned up so we were running two instances of the wikidata rdf dumps concurrently, I've removed the stale cron entry (see the SAL) and will kill the in-flight wikidata dump jobs since they're in an unknown state
[18:20:02] hmmm
[18:20:25] did the name of the entry change?
[18:20:35] because yeah if you change the name of it, puppet won't remove the old one
[18:20:49] ryankemper:
[18:20:58] yes it did
[18:21:05] that would be it then
[18:21:25] previously was known as `wikidatardf-dumps` and now split into 3 crons with different names
[18:21:27] if you keep the name the same puppet can find its entry in the crontab and just alter the subsequent line...
[18:21:42] right
[18:21:46] that makes sense. looks like in the future if renaming we need to keep the old entry too with ensure absent
[18:22:07] apergos: so I've removed the cron entry and am about to kill the inflight processes, are you aware of anything else I should do?
[18:22:28] you can definitely do that, I prefer to manually check by running puppet on the host, looking at the crontab, removing any extra entry, and running puppet again to make sure it stays gone
[18:22:50] there are no final output files, right?
[18:22:55] from this week's run?
[18:23:07] only temp files?
[18:24:46] ryankemper:
[18:24:56] apergos: looking into that now. right now we have `/usr/local/bin/dumpwikibaserdf.sh wikidata all ttl nt` (correct) and `/bin/sh -c /usr/local/bin/dumpwikibaserdf.sh wikidata all ttl nt; /usr/local/bin/dumpwikibaserdf.sh wikidata truthy nt; /usr/local/bin/dumpwikibaserdf.sh wikidata lexemes ttl nt` (incorrect) running
[18:25:37] need to look at what the dumps script does to be sure but if it follows the "do work in a temp file and move it over only when complete" then at least the `all ttl nt` should be fine
[18:25:45] yes that's what it does
[18:25:47] trying to find out if `lexemes` and `truthy` already ran or not
[18:26:25] you can see how the maintenance script is being called to figure that out
[18:27:14] if I am telling you stuff you already know, I apologize. I'm now running on autopilot, which is fine, I'm here in the channel, just means slightly reduced functionality :-)
[18:27:46] I'm still learning all the ropes so the pythonic approach of explicit is better than implicit is great with me :P
[18:28:04] 👍
[18:28:37] apergos: what maintenance script were you referring to btw?
[18:28:48] or did you just mean `/usr/local/bin/dumpwikibaserdf.sh`
[18:28:54] so the way all these dumps run is that there's a bash script that calls some mediawiki maintenance script
[18:30:03] right now as I look at what's running, I see a gzip -dc of /mnt/dumpsdata/xmldatadumps/temp/wikidatattl-all.5-batch6.gz and a pipe to serdi
[18:30:39] so there's no actual maintenance script running at the moment, it's just the serdi format conversion
[18:31:38] /bin/bash /usr/local/bin/dumpwikibaserdf.sh wikidata all ttl nt that seems to be the one running, no lexemes nor truthy
[18:32:33] if you look at the bash script you can see that extensions/Wikibase/repo/maintenance/dumpRdf.php is the maintenance script invoked earlier in the run
[18:32:52] right
[18:33:09] and based off the cron timing I'd have expected truthy to have run, but not lexemes yet
[18:33:52] I was thinking I'd expect to see truthy stuck as well but given the old cron that's causing problems is just sequentially doing all->truthy->lexeme
[18:34:01] /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20200916 as I look here
[18:34:09] that would be the directory for this week's run I guess
[18:34:13] or for part of the run
[18:34:22] I see some truthy-beta-nt in there
[18:34:29] Cool, that lines up with what I'd expect
[18:34:35] which is to see truthy but not the other two
[18:34:58] Okay, going to kill the currently running wikidata jobs, sec
[18:34:59] in /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20200914 I see all.json
[18:35:09] and that's it for the week
[18:35:26] k
[18:38:51] okay, everything's killed
[18:39:09] `ps axuww | grep dumpsgen | grep -i wikidata` shows nothing now
[18:39:30] awesome
[18:40:47] and I see the old cron cruft has been removed from the crontab, probably by you
[18:40:49] so perfect
[18:41:39] yup
[18:41:44] So I killed all of these:
[18:41:52] https://www.irccloud.com/pastebin/ktyu348H/
[18:42:10] yeah good riddance
[18:42:23] you need to get rid of the serdi and gzip -9 ones too
[18:42:28] those are yours
[18:42:35] ack
[18:42:43] question: wouldn't the serdi/gzip have been children of the above
[18:42:48] i.e. why did they not get recursively killed?
[18:42:51] yeah but it doesn't mean they
[18:43:05] necessarily get the axe at the same time
[18:43:17] or do they actually just get inherited by init but stay alive
[18:43:22] * ryankemper is rusty on his linux process management fundamentals
[18:44:13] depends on what the calling shell did
[18:45:22] those are the middle and end bits of a pipe, my experience is that those can continue to hang around
[18:45:28] great, thanks for all the help apergos! hope I didn't disturb your friday too much
[18:45:34] not at all
[18:45:42] the dangling PIDs are dead now so we should be all done with cleanup
[18:45:50] if I wasn't in a space to answer, I would have just handwaved to monay
[18:45:53] great
[18:45:59] I'm sending out an e-mail to the wdqs mailing list letting them know that this week's dumps are shot
[18:46:01] *to monday
[18:46:06] good idea
[18:46:16] they would be looking to import them...
[18:46:30] no time to catch them up unfortunately
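On the serdi/gzip question above: with `sh -c "a | b | c"`, killing only the shell leaves the pipeline members running (they typically get reparented to init and hang around), which is why the serdi and gzip -9 processes had to be cleaned up separately. A minimal, generic illustration of the difference, unrelated to the actual dump tooling:

```python
# Generic illustration of the serdi/gzip question: killing the shell that
# started a pipeline does not kill the pipeline's members, but signalling the
# whole process group does.
import os
import signal
import subprocess
import time

# start_new_session=True puts the shell and everything it spawns into a new
# session/process group, so they can all be signalled together later.
proc = subprocess.Popen(
    "sleep 300 | cat | gzip -9 > /dev/null",
    shell=True,
    start_new_session=True,
)
time.sleep(1)

# proc.kill() would only SIGKILL the /bin/sh wrapper; sleep/cat/gzip would be
# reparented and keep running, much like the leftover serdi and gzip -9 above.
# Signalling the process group takes out the whole pipe:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
proc.wait()
```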