[09:11:48] Traffic, Observability-Metrics, SRE, Patch-For-Review, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) p:High→Low In the last 24 hours we had just one overrun on 4 nodes: ` Dec 05 20:59:55 cp3060 varn...
[10:46:33] netops, Data-Engineering, Infrastructure-Foundations, SRE, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) @JAllemandou This is great, thanks! Note that we can tune sampling to adapt. What would be the next steps?
[14:32:50] netops, Data-Engineering, Infrastructure-Foundations, SRE, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) > @BTullis thanks! Real-time, would be a nice plus, but a hard requirement (unlike netflow). Did you mean _not_ a hard require...
[15:40:10] netops, Infrastructure-Foundations, SRE, SRE-tools, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (ayounsi) In progress→Stalled Waiting for Capirca upstream to merge PRs.
[15:55:32] netops, Data-Engineering, Infrastructure-Foundations, SRE, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) > Did you mean _not_ a hard requirement?
Yep, my bad :)
[16:23:56] (EdgeTrafficDrop) firing: 22% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[16:28:56] (EdgeTrafficDrop) resolved: 55% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[16:59:37] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ayounsi) Open→Resolved Alright, closing this for now then :)
[17:38:22] repeating my question from last week :) I'm currently helping with https://phabricator.wikimedia.org/T223053 and would like a bit of clarity on how cookies affect frontend caching for MediaWiki. I see that MW emits a `Vary: Cookie` header, does that mean the caches are already split based on the values of cookies? Specifically, we're thinking about adding a "variant" cookie for anonymous users that caches should be split on.
[17:38:39] bblack, vgutierrez: ^
[19:42:04] Wikimedia-Apache-configuration, Fundraising-Backlog, SRE, Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (LGoto)
[19:43:51] Wikimedia-Apache-configuration, Fundraising-Backlog, SRE, Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (Tsevener) a:Ejegg→Tsevener
[20:22:26] legoktm: those are huge topics, sorry for the slowness!
[20:23:52] no worries, I guessed as much when I wasn't able to obviously figure it out myself :) but in exchange for an answer, I'll document it on-wiki somewhere
[20:23:57] legoktm: the MW Vary:Cookie scheme is mostly about anon vs logged-in. If a URI's output would be different for anons than users (or maybe more precisely, if a URI's output is custom per-user, and then you'd think of anon as the default/blank/no-user option), then MW should emit Vary:Cookie.
[20:24:27] we don't actually take the Vary:Cookie literally. there's sort of an implicit contract between traffic+MW that V:C is only about the session cookies for the above.
[20:24:56] we don't vary on other cookies because of its presence.
[20:25:32] ok, and that's https://gerrit.wikimedia.org/g/operations/puppet/+/5251c8caee84f2ec1f7415bf0d225b227474c135/modules/profile/files/trafficserver/default.lua#151 right?
[20:25:40] and there's a pre-warning I'll give, because it eventually comes up in every conversation about this, and it's very non-obvious why it's so important, until you've been through it a few times:
[20:26:36] in general (any software, any cache, anywhere) - A given URI should either always emit Vary:Foo, or never emit Vary:Foo. Variably-deciding whether the same URI should emit Vary:Foo under different input conditions is a recipe for disastrous buggy behavior.
[20:27:39] * legoktm nods
[20:28:10] legoktm: it's that lua code, and also (perhaps more-importantly) a bunch of similar-looking VCL code in the varnish frontends (which does more with it than the Lua does)
[20:29:23] in modules/varnish/templates/text-frontend.inc.vcl.erb
[20:29:37] evaluate_cookie and some related methods
[20:30:04] cluster_fe_backend_fetch has some more of the mechanism
[20:30:23] (sorry, it's hard to organize code sanely in VCL)
[20:31:27] the critical optimization in the VCL code is: we don't actually fully Vary even on the session cookie, because that would mean one cache entry per-user, per-page (even if that cache entry is just a virtual entry that says "please don't cache this")
[20:32:18] we swap their real session cookie for the special cookie "Token=1" temporarily, and have all sessions share that value for Vary-slotting purposes, so that there's only one shared "please don't cache this" per URI, and then swap the real cookie back into place before the request is forwarded further inwards in the stack.
[20:32:59] interesting
[20:33:01] this prevents a lot of potential perf problems at this layer
[20:33:33] (because we don't have pre-knowledge of which URIs emit V:C, so every cache "miss" looking for the "please don't cache this" would otherwise have to stack up and serialize on each other)
[20:33:40] I didn't realize varnish held onto "please don't cache this" entries
[20:33:46] ahh
[20:33:54] yeah, they're more-formally called "hit for pass" or "hit for miss" in Varnish parlance
[20:34:19] this gets at one of the core concepts in all of the tricky things in our varnish part of the stack:
[20:35:13] an important optimization (for our traffic volume + patterns) is that if two or more [anonymous] users are all requesting /wiki/Foo at the same time, overlapping, Varnish coalesces them - it makes them all wait together while it only fetches from within once, to satisfy them all.
[20:35:31] which works great for cacheable items, like anon pageviews
[20:35:49] our current idea/proposal is to add a "variant" cookie for anonymous users. MediaWiki would generate different HTML based on the selected variant, but it would be cacheable based on the variant selected.
[20:36:04] but by default that's a terrible pattern for an uncacheable URI that's popular (because it serializes everyone and re-discovers it's uncacheable for each fetch)
[20:36:39] so that's why we have things like the idea of a "hit-for-pass" cache entry in Varnish, and that's also why we coalesce them across all users in this case, with Token=1
[20:37:26] * legoktm nods
[20:37:38] but yeah, stepping back out to your language issues:
[20:38:31] we could probably concoct a scheme to also Vary the cache of anonymous pageviews based on a cookie, but it might be rather complex and brittle, and creates more "business logic tied together between MW+Varnish" problems (which we'll have no matter what, but we at least hope to minimize/reduce)
[20:39:31] but we could ignore this (and ignore the cookie), if for example the behavior of the variant-selection cookie was just to cause an uncacheable redirect to the correct URI for the variant (/wiki/Foo -> /wiki/zh-tw/Foo or whatever it is)
[20:40:10] it sounds like that's not a very easy path either, though
[20:40:34] I don't have a magic answer, but this is definitely in Here Be Dragons territory :)
[20:40:54] one of the considerations was that we generally want people to use and end up on /wiki/ URLs so when resharing they're neutral, and MediaWiki can use Accept-Language to give another user their ideal variant
[20:41:21] AL negotiation is different, and I think we have it working in at least some cases
[20:41:55] maybe we did it for restbase at one point?
[20:42:06] but it sounds on the ticket like AL isn't enough for all your cases
[20:42:23] right, my understanding is that on language-converter wikis MediaWiki will use it to pick a variant, but yeah, not all browsers easily support it
[20:43:48] (I think the way AL support works is that, since it's separate from this whole session cookie mess, we simply support Vary:AL as emitted by the backends, and try to normalize it on inbound in some cases to reduce duplication)
[20:44:12] but if people share URLs like /zh-tw/ they'll end up on that variant, even if it's not ideal for them, so we'd rather people mostly use /wiki/ URLs
[20:44:42] but since the HTML content is different per variant we need to split the caches if we're going to read from a variant cookie on the same URL
[20:45:06] not that it helps in the moment, but this sounds like Yet Another Thing that would be helped by moving forward on some kind of content-composition plan :)
[20:46:08] heh, yep
[20:46:16] the crux of why it's painful (but probably not impossible) to support more than one Vary:Cookie use case is that Vary:Cookie doesn't tell you which cookie keys matter and which don't :)
[20:46:55] Tim actually had a proposed better-than-Vary:Cookie header for this, which I think he attempted to standardize at one point, and that we supported many many years ago in Squid before we migrated to Varnish
[20:47:02] I'm trying to recall the name of it
[20:47:08] X-Vary-Options?
[20:47:20] yeah, that
[20:49:01] so the impression I'm getting is that there's no immediate "this is a bad idea", but we have to evaluate whether implementing this feature is worth the additional linking between MW+Varnish behaviors
[20:49:39] and the complexity it will cause in Varnish in general
[20:49:41] and the technical complexity of doing so, the cost of further splitting the anon zhwp (and other lang converter wikis) caches
[20:49:42] yeah
[20:50:18] the splitting cost probably isn't terrible.
it's real, but we tend to overestimate it in these conversations.
[20:50:41] (because that's what cache eviction algorithms handle for us, is re-shaping the curve to minimize the impact of that in the global aggregate)
[20:51:00] gotcha
[20:52:57] my current recommendation is to move forward with a gadget-based implementation that rewrites links, I think that would give us good user feedback and let us see how much usage it gets, and then we will have a better (or worse) argument to make for taking on the work of doing it properly in varnish
[20:54:08] thank you for explaining this all to me :) I'll write up a summary on the task and some notes on-wiki too
[20:55:11] ok, thanks :)
[21:24:05] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q2), Sustainability (Incident Followup): Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (herron)
[22:11:02] Wikimedia-Apache-configuration, Fundraising-Backlog, SRE, Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (Tsevener) Fix proposal for issue above is in https://github.com/wikimedia/wikipedia-ios/pull/4081.
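Editor's sketch, not part of the log: the "Token=1" swap described at 20:32:18 can be illustrated roughly as below. This is a minimal Python model under stated assumptions, not the actual VCL or Lua. The helper names (`has_session_cookie`, `vary_slot_cookie`, `cache_key`, `forward_headers`), the rule that a session cookie is any cookie whose name ends in "Session", and the cookie names used are all hypothetical; the real matching rules live in the code referenced in the conversation.

```python
from typing import Dict, Optional, Tuple


def has_session_cookie(cookie_header: Optional[str]) -> bool:
    """Hypothetical rule: a cookie whose name ends in 'Session' is a session cookie."""
    if not cookie_header:
        return False
    for part in cookie_header.split(";"):
        name = part.strip().split("=", 1)[0]
        if name.endswith("Session"):
            return True
    return False


def vary_slot_cookie(cookie_header: Optional[str]) -> str:
    """Cookie value used only for Vary-slotting inside the cache.

    Every logged-in session collapses to the shared placeholder "Token=1";
    requests without a session cookie collapse to the empty value, so other
    cookies do not split the cache.
    """
    return "Token=1" if has_session_cookie(cookie_header) else ""


def cache_key(uri: str, cookie_header: Optional[str]) -> Tuple[str, str]:
    """Lookup key for a URI whose response carries Vary: Cookie."""
    return (uri, vary_slot_cookie(cookie_header))


def forward_headers(cookie_header: Optional[str]) -> Dict[str, str]:
    """Headers sent further inwards: the real cookie is swapped back in."""
    return {"Cookie": cookie_header} if cookie_header else {}


if __name__ == "__main__":
    # Two different logged-in users share one slot per URI.
    print(cache_key("/wiki/Foo", "enwikiSession=abc"))
    print(cache_key("/wiki/Foo", "enwikiSession=xyz"))
    # An anonymous user with a (hypothetical) variant cookie lands in the
    # same slot as an anonymous user with no cookies at all.
    print(cache_key("/wiki/Foo", "variant=zh-tw"))
    print(cache_key("/wiki/Foo", None))
```

In this model a variant cookie on an anonymous request does not change the cache key, which is why the proposal discussed above would need extra Vary logic (or a redirect to a variant-specific URI) before caches could be split per variant.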