[01:44:58] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Huji) It seems like since 8/25, WD maxlag has rarely approached 5 seconds ([[ https://grafana.wikimedi...
[06:58:36] 10netops, 10Operations, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10ayounsi) Would another day in the October 26th week be possible otherwise? I want to make sure we have time to schedule a followup work if something doesn't go as planned...
[08:13:10] 10netops, 10Operations: Upgrade Routinator 3000 to 0.8.0 - https://phabricator.wikimedia.org/T266001 (10ayounsi) 05Open→03Resolved All done.
[09:25:57] 10netops, 10Operations, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10aborrero) Ok, new proposal: 2020-10-29
[09:31:10] 10netops, 10Operations, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10ayounsi) WFM, thanks!
[10:06:14] 10Traffic, 10Operations, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6564517, @gerritbot wrote: > Change 635298 **merged** by Ema: > [operations/puppet@production] varnish: fix websockets on...
[10:39:46] 10Traffic, 10Operations, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10ema) Instead of using hfp vs hfm, I think we might want to distinguish between requests that definitely cannot be cached at the ats-be layer (eg: those with `req.htt...
[10:54:19] 10Traffic, 10Operations, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10BBlack) That is what I was thinking too, but I'm not sure if the VCL state diagram allows us to see that at the right point in time to make the decision or not. We'...
[12:38:13] 10Traffic, 10Operations, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10ema) >>! In T266040#6567585, @BBlack wrote: > 1. User requests /foo/bar -> frontend cache cp1234 miss -> chash to cp9999 > 2. Response from cp9999 indicates CL:500KB...
[13:16:30] ema: I've tried and deleted like 5 different lengthy responses on the ticket
[13:16:34] it's challenging!
[13:16:52] maybe we should just hash out the meat of the discussion here, but I have a meeting for this hour block
[13:19:02] 10Traffic, 10Operations, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) I found that there's a significant difference between the [[https://grafana.wikimedia.org/d/EiAVq3FGz/t264398?viewPanel=13&orgId=1&from=...
[13:19:31] bblack: I've gotta go afk for a while anyways, let's chat later on today!
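A rough toy model (Python; made-up names, TTLs and cutoff, not the production VCL) of the hit-for-pass vs hit-for-miss distinction the T266040 comments above are discussing, and of what each marker means for the next request:

    # Toy model of Varnish hit-for-pass (HFP) vs hit-for-miss (HFM) markers.
    # Not the production VCL: names, TTLs and the 256KB cutoff are illustrative only.
    import time

    SIZE_CUTOFF = 256 * 1024  # hypothetical frontend "large object" cutoff

    cache = {}  # url -> ("obj" | "hfp" | "hfm", expiry)

    def lookup(url):
        entry = cache.get(url)
        if not entry or entry[1] < time.time():
            return None
        return entry[0]

    def on_response(url, status, content_length, request_was_pass):
        """How a frontend *might* choose between the markers (the question debated below)."""
        if request_was_pass:
            return  # pass fetches never create cache entries
        if status == 200 and content_length <= SIZE_CUTOFF:
            cache[url] = ("obj", time.time() + 3600)   # real cache object
        elif content_length > SIZE_CUTOFF:
            cache[url] = ("hfm", time.time() + 600)    # hit-for-miss marker
        else:
            cache[url] = ("hfp", time.time() + 600)    # hit-for-pass marker

    def handle_request(url):
        """What the next client request does, given what's in cache."""
        kind = lookup(url)
        if kind == "obj":
            return "hit"    # served from cache
        if kind == "hfp":
            return "pass"   # fetch as pass; the response will not be cached
        # "hfm" or nothing: fetch as a miss; the response MAY still be cached
        return "miss"

    on_response("/wiki/Example", 200, 500 * 1024, request_was_pass=False)
    print(handle_request("/wiki/Example"))  # -> "miss": an HFM marker still allows a later cache fill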
[13:19:44] meanwhile lulz @ n_objecthead ^
[14:15:13] https://github.com/varnishcache/varnish-cache/issues/2864 + https://github.com/varnishcache/varnish-cache/issues/2865 too are probably required reading
[14:15:27] (linked from the n_objecthead ticket; they go into some gory details related to both tickets)
[14:19:36] it doesn't look like we've got any MAIN.cache_hitmiss going on on either v5 or v6 though
[14:26:35] but, MAIN.cache_hit_grace *is* very different between cp3052 and all other text@esams
[14:26:56] ~80 rps vs ~30
[14:27:51] see https://w.wiki/hue
[14:28:48] there are only two cases where we have HFM in VCL:
[14:29:10] the exp policy that's turned off, and the odd case of cacheable CL:0 200's in cache_upload being converted into HFM, if they even exist
[14:29:18] (they were some old bug workaround that may no longer be a problem)
[14:29:30] so we'd expect no HFMs I think
[14:34:12] ema: is it possible this is just an accounting shift? that some reqs which were accounted as hits before are now accounted as hit_grace, and this explains both your graph and the supposed drop in hit/(hit+miss) ?
[14:35:31] bblack: that's exactly what I'm suspecting
[14:36:20] I'm scratching my head while looking at https://github.com/varnishcache/varnish-cache/pull/2705/
[14:37:18] the doc change here: https://github.com/varnishcache/varnish-cache/pull/2705/commits/537ef614f1bf514e865ea5e4907dcde4a4c899aa seems to suggest that when grace runs out they now call vcl_miss instead of vcl_hit
[14:39:05] that's not mentioned anywhere in https://varnish-cache.org/docs/6.0/whats-new/index.html grrr
[14:39:07] yeah, that makes more sense anyways
[14:39:12] as much as *any* of this makes sense
[14:40:07] the shift from vcl_hit to vcl_miss is what affects our X-CDIS-based accounting
[14:41:02] ok the change is not mentioned in the what's new docs because it happened between 6.0.0 and 6.0.1 https://github.com/varnishcache/varnish-cache/blob/6.0/doc/changes.rst#fixed-bugs-which-may-influence-vcl-behavior
[14:41:12] lol
[14:41:28] it's not funny!
[14:41:50] they should've called that changes.rst#point-release-changes-that-shouldve-warranted-a-version-bump-but-meh
[14:42:51] anyways, they're clearly not using semver even for other releases
[14:43:09] they add supposedly-non-breaking new features in point releases all the time
[14:43:40] yeah
[14:43:53] apparently that patch was intended to address https://github.com/varnishcache/varnish-cache/issues/1799
[14:43:59] which has.. a lot of words in it, is what I can say about it so far
[14:44:11] hahah
[14:44:18] yeah we've stumbled through links to #1799 in many past cases over the past several years...
[14:44:33] and it would have even more words in it, if the link to their old Trac instance still worked
[14:45:05] * ema waits for the moment when cdanis reads the word "catflap"
[14:45:35] ... ahahah
[14:46:16] > A catflap_fini_f, if present, gets a final say on the object beauty contest. For example, a catflap_f may have saved the most beautiful object in the private pointer, but the decision is only final after all objects were considered.
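A tiny Python restatement of the change as described in the chat above: an object whose grace has run out goes through vcl_miss on 6.0.1+ where it previously went through vcl_hit, which is what shifts X-CDIS-based accounting. This paraphrases the PR #2705 doc change; it is not lifted from the varnishd source:

    # Toy model of which VCL sub a stale object is routed to, before vs after
    # the 6.0.1-era change discussed above. Illustrative only.
    def vcl_sub_for_object(age, ttl, grace, post_2705=True):
        if age < ttl:
            return "vcl_hit"            # fresh: always vcl_hit
        if age < ttl + grace:
            return "vcl_hit"            # within grace: still a (grace) hit
        # grace has run out: per the PR #2705 doc change this now runs
        # vcl_miss, where it previously still ran vcl_hit with obj.ttl <= 0s
        return "vcl_miss" if post_2705 else "vcl_hit"

    # If VCL records X-CDIS as "hit" in vcl_hit and "miss" in vcl_miss, the
    # same request flips from a recorded hit to a recorded miss across the upgrade:
    for post in (False, True):
        print("post_2705" if post else "pre_2705", "->",
              vcl_sub_for_object(age=7200, ttl=60, grace=3600, post_2705=post))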
[14:46:29] I think I should take the day off, I'm clearly unwell
[14:47:06] when I get sucked into issues like these, it makes me really wonder if anyone who's actively working on this stuff even has a coherent and complete design+implementation picture in their head, or if the source of varnishd has reached "escape complexity" (which is like "escape velocity", but for things that can no longer be captured by the gravity of human understanding)
[14:49:17] I have an absolute maximum depth when starting to click from varnishcache/varnish-cache/issues/N - it's 3
[14:49:22] after 3 clicks I stop
[14:49:38] [ to which the other me might reply: "Your supposed comprehensive understanding of simpler things was an illusion anyways", cf https://en.wikisource.org/wiki/I,_Pencil ]
[14:49:46] ema: Catflap chaining has deliberately been left out of the core implementation. vmods wishing to chain catflaps can still do so (by calling req->catflap_init and deploying a custom cat which knows how to operate additional catflaps).
[14:50:22] cdanis: BTW 1799 is a bit like the wikipedia game that takes you to Philosophy starting from a random article
[14:50:35] 1799 seems like the Kevin Bacon of Varnish
[14:52:16] 10netops, 10Operations, 10ops-codfw: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) ` papaul@cr2-eqdfw> show interfaces xe-0/1/2 descriptions Interface Admin Link Description xe-0/1/2 up up Reserved for Facebook PNI - no-mon
[14:54:18] 10netops, 10Operations, 10ops-codfw: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) ` Physical interface: xe-0/1/2 Laser bias current : 37.980 mA Laser output power : 0.3080 mW / -5.11 dBm Module temperatur...
[14:56:27] 10netops, 10Operations, 10ops-codfw: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul)
[14:56:30] I'm adding more graphs to https://grafana.wikimedia.org/d/EiAVq3FGz/t264398?orgId=1
[14:56:44] yes, it starts looking like varnish-machine-stats
[14:57:51] unfortunately hit_grace difference does not explain hit-front difference to my tired eyes
[14:59:20] I don't think we necessarily have to understand all of the deep changes from 5 to 6, it may not be worth the effort
[14:59:40] upstream doesn't either, apparently, so we can't do any better than them
[14:59:48] :)
[15:00:08] 10netops, 10Operations, 10ops-codfw: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) 05Open→03Resolved complete
[15:18:15] bblack: I'm wondering if, at the end of the day, pass_random does more harm than good
[15:19:01] surely for the size-based cutoff case it doesn't do much good
[15:19:43] well in the case we're looking at right now, it's very actively harmful
[15:20:13] but, we've had real incidents repeatedly in the past with the hyper-focus of some uncacheable URL on a specific chashed backend causing an outage
[15:20:22] just hasn't happened in a long time now, because pass_random is in place
[15:20:26] 10Traffic, 10Operations, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was returning hits on things that it shoul...
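A small Python simulation of the tradeoff described above; the backend names and the hash-mod stand-in for the real consistent hashing are made up. It shows both the protection pass_random provides (a hot uncacheable URL no longer focuses on one chashed backend) and the T266040 downside (a large object the frontend refuses to cache gets spread across backends too):

    # Toy simulation of the pass_random tradeoff. Backend names and hashing are made up.
    import random
    from collections import Counter

    BACKENDS = [f"cp10{i:02d}" for i in range(1, 9)]

    def pick_backend(url, is_pass, pass_random=True):
        if is_pass and pass_random:
            return random.choice(BACKENDS)           # spread pass traffic
        return BACKENDS[hash(url) % len(BACKENDS)]   # stand-in for consistent hashing

    def simulate(url, is_pass, pass_random, n=10000):
        return Counter(pick_backend(url, is_pass, pass_random) for _ in range(n))

    # Hot uncacheable URL: without pass_random every request lands on one backend.
    print(simulate("/w/api.php?uncacheable", is_pass=True, pass_random=False))
    # With pass_random the load spreads evenly over all backends.
    print(simulate("/w/api.php?uncacheable", is_pass=True, pass_random=True))
    # The downside: a large article the fe declines to cache is also pass
    # traffic, so each request may hit (and fill) a different ats-be cache.
    print(simulate("/wiki/Donald_Trump", is_pass=True, pass_random=True))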
[15:22:46] so long as fe+be agree on what's cacheable and what isn't, none of this is a big problem
[15:23:12] but even if you get rid of the obvious case of the size cutoff, there will be other disagreements
[15:23:49] (ats vs varnish internal interpretations of standards and their design tradeoffs, plus any differences, in spite of our best efforts, in request-parsing or response-parsing that affect cacheability)
[15:26:10] most of those, hopefully minimal?
[15:26:58] the best signal for pass_random would be if ats-be could signal explicitly (in response headers) that it considered an item to be cacheable (which I guess implies either it was a hit, or it was a miss that just created an object).
[15:27:28] but that's already too late, we already picked the backend in order to observe that response, and pass_random affects picking backends
[15:29:56] 10Traffic, 10Operations, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6568588, @Gilles wrote: > As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was...
[15:32:08] bblack: yeah, it's tricky
[15:32:43] sorry I'm multitasking, or these sentences would be less-spaced-out :)
[15:33:19] my humble proposal in the patch I put up for scorn and derision
[15:33:36] is that Pass/HFP should be reserved for cases where we at least hope all the layers agree
[15:33:57] and HFM for cases where the frontend layer willfully chooses to not cache a cacheable object
[15:34:09] and then pass_random can still be reasonably-applied to all pass-traffic
[15:34:50] and the size cutoff HFM case should probably be wrapped in the same miss+200 logic that the exp policy had, which avoids applying it to wildly inappropriate cases (like request-based pass of an Authorization header getting caught up in the size cutoff and converted from HFP to HFM)
[15:36:08] although that last case raises the question: what's going on with the new stuff above putting grace requests through vcl_miss and how does that impact the above thinking (etc)
[15:42:21] ema: I think one of the ways our thinking on this diverges, is you tend to blame/scorn pass_random more, and I tend to blame/scorn the size cutoff more.
[15:42:45] we could also disable the size cutoff at the fe layer and see what happens!
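A Python sketch of the policy proposed above: HFP only where all layers should agree the response is uncacheable, HFM where the frontend alone declines to cache (the size cutoff), guarded by the same miss+200 condition the exp policy used. The real implementation would be VCL; the cutoff value and field names here are illustrative, not taken from the actual patch:

    # Sketch of the proposed frontend policy split between HFP and HFM.
    FE_SIZE_CUTOFF = 256 * 1024   # hypothetical frontend cutoff

    def frontend_cache_decision(req_is_pass, cache_state, status,
                                content_length, resp_cacheable):
        """Return 'cache', 'hfp' or 'hfm' for the frontend layer.

        req_is_pass:    request-based pass (e.g. Authorization header present)
        cache_state:    'miss', 'hit', 'pass', ... (X-CDIS-style)
        resp_cacheable: whether the response is cacheable by shared HTTP rules
        """
        if req_is_pass or not resp_cacheable:
            # every layer should agree: hit-for-pass, and pass_random applies
            return "hfp"
        if cache_state == "miss" and status == 200 and content_length > FE_SIZE_CUTOFF:
            # cacheable object the fe willfully declines to store: hit-for-miss,
            # so a later response can still be cached if circumstances change
            return "hfm"
        return "cache"

    print(frontend_cache_decision(False, "miss", 200, 1_800_000, True))  # -> 'hfm'
    print(frontend_cache_decision(True, "pass", 200, 1_800_000, True))   # -> 'hfp'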
[15:42:51] on one node
[15:45:18] but yeah I think the hfm/hfp idea is very smart, I was trying to come up with a much dumber solution really
[15:45:27] I suspect it's much more important on upload than text, the size cutoff
[15:45:42] but it would be good to have *some* sanity limit size cutoff, in any case
[15:46:05] * ema nods
[15:46:13] even the backends have one
[15:47:15] and then it has to be implemented in some way that doesn't exacerbate the impact of that rare very-large-object case, due to this pass_random conflict
[15:47:32] and so we're still back to the same problem, we're just limiting the impact by changing the numeric cutoffs
[15:48:05] (or kill pass_random, but I'm convinced we'd end up putting it back some weeks or months later after a series of annoyances from not having it)
[15:49:58] an example of a very popular page that we cut off at the fe layer is https://en.wikipedia.org/wiki/Donald_Trump
[15:51:43] arguably, we actually do want pass_random at the even-larger backend size cutoff
[15:52:19] by that I mean: we might want to HFM at >=256KB to protect fe storage, but if the BE has a pass-limit at >=1GB, then the frontend should probably pass (implying pass_random) at >=1GB as well.
[15:52:51] gotta go, sorry!
[15:52:57] cya
[15:53:09] see you tomorrow :)
[15:59:34] ^ 15:45 < ema> but yeah I think the hfm/hfp idea is very smart, I was trying to come up with a much dumber solution really
[16:00:38] I'm taking the words smart and dumb there in the sense of "cleverness" from the quote at the top of https://www.linusakesson.net/programming/kernighans-lever/index.php
[16:01:03] and yes, I tend to agree, but I don't see a dumber solution that I can't readily poke holes in, either
[16:28:14] for tomorrow, back on the question of "does the fe cache get fully full"... if we zoom to 3054's last restart:
[16:28:24] https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=4&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3054&from=1601389308775&to=1601478769605
[16:28:57] I think the discontinuities there are other peer caches restarting and jolting traffic around a little, but it basically takes ~24h to fill in this case
[16:29:46] whereas FE hitrate seems to recover on a very different shape/timescale, where it recovers much faster and then has a long tail
[16:30:22] which might imply that just because the graph looks "fully used" for all the days after that ramp-in, that doesn't mean that it's all usefully-used. It could just be full of junk waiting to be pushed out.
[16:31:09] in some sense, it might be experimentally valid to home in on a large_objects_cutoff which causes the hitrate and storage-used graphs to recover from restart on a roughly similar curve.
[16:31:35] in any case, I think this line of thinking argues the cutoff should be higher than it is, on text.
[16:40:30] it's a complicated question, but one angle to approach it from is: what popular wiki pages are we never caching in the frontend?
[16:41:30] [[Barack Obama]] and [[Donald Trump]] are both well over the cutoff (1444kiB and 1796kiB respectively)
[16:50:20] the "exp" policy encapsulates the optimal tradeoff strategy
[16:50:42] but it requires certain tunable parameters, which are difficult to derive (but can be derived, from e.g. data we have in hive)
[16:52:02] we could shove the same data through some calculation and arrive at a fixed cutoff as well, but it does make for a strange discontinuity and isn't fully optimal
[16:52:37] exp has the advantage that it's proportionately probabilistic. It defines a curve over which the smaller the file is, the more likely it is to get cached.
[16:52:58] maybe some 200MB outputs are rarely seen and rarely deserve cache space. but others that are hit more frequently, do deserve it.
[16:53:26] the exp policy also reacts better to the sudden popularity of a previously cold file that might've been just "a little" over the fixed cutoff value
[16:53:57] but, we'd need to make a real project out of deriving tunables for the text+upload cases from real data, and updating them regularly without too much effort.
[16:54:57] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Ladsgroup) @Lydia_Pintscher with the {T258354} being done, Can this be called done now?
[16:55:03] in theory we have all the breadcrumbs in the old ticket
[16:55:54] and we have the runtime code implemented in VCL/C already as well, just currently-disabled and with tunable values hardcoded that were derived a very long time ago from only cache_upload
[16:57:17] https://phabricator.wikimedia.org/T144187
[16:58:03] (which started out as an "ideas" level ticket about bloom filters to handle one-hit-wonder cases, then some amazing researcher fortuitously stepped in and steered us in better directions)
[17:01:41] e.ma recently retried and didn't have great success with it in terms of hitrate, but the tunables probably make all the difference
[17:02:58] (and even if the fixed 256KB were optimal, exp has the advantage of being adaptable to that suddenly-very-popular 259KB article)
[17:04:30] it would be even better if we could derive the tunables dynamically, but I don't know how reasonable that is
[17:05:42] (as in, have a vmod or vcl-C-code that keeps some kind of sliding-window data over the past ~24h of response size/freq data and constantly retunes the params)
[17:07:26] more-reasonable would be some script that can be executed against a month of recent hive data to generate per-cluster tunables, and re-running it once per X (month? quarter?) to update the numbers we plug into the current VCL solution.
[17:10:11] (another thing, I think the calculations that were run before were optimizing the parameters for byte-hit-ratio, not object-hit-ratio. we probably want bytes on upload, but objects on text?)
[17:10:37] I think that's right
[17:10:55] I think objects comes closer to minimizing origin server costs, on text
[17:13:14] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lydia_Pintscher) 05Open→03Resolved a:03Lydia_Pintscher Yeah let's call this resolved. Additional...
[17:41:14] https://engineering.fb.com/networking-traffic/how-facebook-is-bringing-quic-to-billions/
[21:12:05] 10Traffic, 10Operations, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Ladsgroup) Varnish http requests and "Prometheus varnish http reque...
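On the exp policy and tunables discussed above (16:50–17:10): as far as I recall the T144187 work, admission is probabilistic with roughly P(admit) = exp(-size/c), c being the per-cluster tunable. A rough Python sketch of that rule and of the kind of offline script that could derive c from a month of request data; the record layout, storage budget and object-vs-byte objective below are assumptions for illustration, not the actual job:

    # (a) exp-style probabilistic admission; (b) crude offline tunable derivation.
    import math
    import random

    def admit(size_bytes, c):
        """Exp-style admission: smaller objects are more likely to be cached."""
        return random.random() < math.exp(-size_bytes / c)

    def score_tunable(records, c, byte_weighted=False):
        """records: iterable of (size_bytes, requests_per_day).

        Returns (expected fraction of requests/bytes admitted, expected bytes of
        unique storage consumed) for candidate c -- a crude stand-in for the real
        optimization of object- or byte-hit-ratio under a storage budget.
        """
        admitted = total = storage = 0.0
        for size, reqs in records:
            p = math.exp(-size / c)
            weight = reqs * (size if byte_weighted else 1)
            admitted += p * weight
            total += weight
            storage += p * size
        return admitted / total, storage

    def pick_c(records, storage_budget_bytes, byte_weighted=False):
        """Grid-search the admission fraction that fits the storage budget."""
        best = None
        for c in (2**k * 1024 for k in range(0, 16)):   # 1KiB .. 32MiB
            frac, storage = score_tunable(records, c, byte_weighted)
            if storage <= storage_budget_bytes and (best is None or frac > best[1]):
                best = (c, frac)
        return best

    # toy data: (object size, requests/day); on text we'd likely optimize the
    # object-hit side (byte_weighted=False), on upload the byte-hit side.
    toy = [(50 * 1024, 1000), (300 * 1024, 400), (1800 * 1024, 250), (20 * 1024**2, 3)]
    print(pick_c(toy, storage_budget_bytes=512 * 1024**2))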