[06:58:40] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3257675 (10elukey) Two snipped of nginx error log set to debug, from captured from the `close http upstream connection` event onwards: With Connection: close ``` 2017/0... [07:09:06] FYI, I've removed the no_nfs_hack from puppet and pruned rpcbind/nfs-common from cp/lvs (our d-i setup has been fixed to no longer install these by default as well) [08:40:38] <_joe_> moritzm: great [09:02:35] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3257925 (10elukey) Opened a task to upstream to confirm: https://trac.nginx.org/nginx/ticket/1270#ticket [10:42:26] 10netops, 06Operations, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3258144 (10fgiunchedi) @ayounsi I'm guessing the cause was a network-intensive hadoop job, so yeah that might not be a one-time reoccurence... [11:46:53] 10netops, 06Operations: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258312 (10ayounsi) [11:47:46] 10netops, 06Operations: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258314 (10faidon) Thanks @Josve05a for relaying the report, that's very useful. The reverse path was going via PCCW to Singapore(!) and then to Jio in India: ``` Host... [11:48:04] XioNoX: oops! [11:48:15] XioNoX: collision in flight :) [11:48:27] haha yeah [11:48:42] XioNoX: so there is a way to fix this and I just did [11:49:10] I don't think we should have changed anything though, to me the issue is not on our side [11:49:35] it kinda is [11:51:28] but peering would be ideal, yeah [11:51:42] they do prepend like 10 times, the fact that we select that path is a limitation on our part [11:52:52] I understand, but it's normal practice to prefer the routes learned via our IXP peers [11:54:03] yeah, although ideally we'd only do so when the path difference is huge, but yeah hard to do that on the Junipers [11:54:13] to me the issue is that they are in a weird situation, where they are multihomed 1/ via telia, 2/ via pccw [11:54:41] if they only had transits, that would be fine [11:55:10] or the other way where they would be 100% a customer of PCCW [11:55:42] they are multihomed via telia and pccw, but in singapore for pccw [11:56:16] probably to get all of asia from there [11:56:33] they're not really doing anything wrong [11:56:47] also PCCW doesn't really peer in IXPs usually, we're the exception [11:57:27] these are all limitations of a) BGP, where ASNs count as distance, not latency b) our own setup where IXP routes are preferred over transit no matter the AS path length [11:58:45] so we have the workaround on a case-by-case basis of downprefing paths [12:01:42] 10Traffic, 10DBA, 06Operations: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3258345 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:03:06] them, not too much, pccw a bit more [12:03:57] basically advertising someone else's IP at the IXP [12:04:23] 10netops, 06Operations: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258352 (10Reedy) >>! In T165103#3258314, @faidon wrote: > I wonder how many of those cases we have and how to find them in our very big haystack. Any ideas? 
Presumably, that's what things like the... [12:32:21] 10netops, 06Operations: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258433 (10Josve05a) >>! In T165103#3258314, @faidon wrote: > @Josve05a, will you convey the fix back to them or should we? If you can, please do, otherwise, I'll just quote all comments made to t... [14:07:53] ema: what's the state of the transient-related things? [14:08:24] (also, I was thinking: it doesn't necessarily have to be a rate increase in hfp creation, they could just be living longer TTL-wise (perhaps indefinitely)) [14:18:44] bblack: hey :) [14:19:23] still trying to find out more, cp4010 (running 4.1.5) is now kinda reaching the point where the storage usage goes above what it used to be before May the 4th [14:19:53] ok [14:19:53] which seems to indicate that 4.1.6 isn't the culprit [14:20:54] I think there may be something fundamentally wrong/buggy with some of our related VCL, and some more-recent change that seems simpler has exacerbated some edge-case into a more-prominent issue somehow [14:21:41] e.g. maybe there's some really fundamental logical flaws in our 601s hit-for-pass block in wikimedia-common, and it's now being aggravated by the TTL changes or some backend behavior change, etc [14:21:48] yesterday I've read that in case of (normal) malloc storage allocation failures, the allocation would then happen in transient storage [14:22:01] not that I think that's what happening, but good to know [14:22:11] heh [14:22:38] how does normal malloc even fail? failing to nuke out space at the nuke_limit? [14:23:06] yeah I imagine that's how it would fail [14:23:25] I've never been 100% on malloc storage engine's operation at the limits, because it's never really been an issue [14:24:17] but given its speed, you'd think nuking wouldn't be hard, unless we were really going over nuke_limit to make space and failing there [14:24:32] I've added a plot for malloc alloc failures, they did use to happen before the 4th https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-30d&to=now&panelId=60&fullscreen&var-server=cp4010&var-datasource=ulsfo%20prometheus%2Fops [14:24:38] but having malloc fall back to transient is fascinating, especially since transient is a malloc :) [14:25:29] for that matter, I don't think I've ever dug deep enough to understand exactly what to make of "storage allocator failure" even for file [14:25:30] right, but they're separate so while malloc might be full and have trouble nuking I guess transient could be used instead [14:25:59] does a statistical "allocation failure" just mean there was no immediate room and we had to nuke? [14:26:24] no idea, let's see the code [14:26:27] or does it mean something else, like we hit nuke_limit and still didn't find space, so gave up and didn't store the object on the way through [14:28:18] https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/storage/storage_malloc.c#L74 [14:28:35] https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/storage/storage_malloc.c#L105 [14:30:49] surely I must be missing something from the storage base class about nuking? 
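[editor's note] A minimal C sketch of what the check in the storage_malloc.c lines linked above amounts to, using the field and counter names that come up in the next few messages (sma_alloc, sma_max, c_fail). This is a simplified paraphrase, not the upstream source (there the counters are shared VSC stats updated under a lock): the point is that the malloc stevedore's "allocation failure" is a bare out-of-space check with no nuking at this layer; eviction lives in the generic storage code above it.

```c
/*
 * Simplified sketch (not verbatim varnish-cache source) of the malloc
 * stevedore's allocation check in storage_malloc.c: c_fail just means
 * "the configured -s malloc size would be exceeded".
 */
#include <stddef.h>
#include <stdlib.h>

struct sma_stats {
	unsigned long c_req;	/* allocation attempts */
	unsigned long c_fail;	/* attempts refused for lack of space */
	unsigned long c_bytes;	/* bytes handed out over time */
};

struct sma_sc {
	struct sma_stats stats;
	size_t sma_alloc;	/* bytes currently allocated */
	size_t sma_max;		/* the -s malloc,<size> limit */
};

void *
sma_alloc_sketch(struct sma_sc *sc, size_t size)
{
	sc->stats.c_req++;
	if (sc->sma_alloc + size > sc->sma_max) {
		sc->stats.c_fail++;	/* no eviction attempted here */
		return NULL;
	}
	sc->sma_alloc += size;
	sc->stats.c_bytes += size;
	return malloc(size);
}

int main(void)
{
	struct sma_sc sc = { {0, 0, 0}, 0, 1024 };	/* a tiny 1 KB stevedore */
	void *p = sma_alloc_sketch(&sc, 2048);		/* too big: c_fail is bumped */
	return (p == NULL && sc.stats.c_fail == 1) ? 0 : 1;
}
```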
[14:31:38] mmh so c_fail seems to be incremented when sma_sc->sma_alloc + size > sma_sc->sma_max [14:31:53] sma_max is I think the size specified when defining the storage [14:32:24] right [14:32:37] it's the basic "out of space" when it's full, without any kind of attempt to free up space [14:32:44] so when that happens that's when nuke is needed right? [14:33:04] right, which I think is abstracted outside of specific storage drivers [14:33:29] I think the generic storage code calls (storage_module)->alloc, and when that fails it does the nuke_limit stuff and tries again [14:33:35] right [14:33:57] and probably if that also fails (nuke limit?) the object then gets stored in transient instead [14:36:16] bblack: what do you think about https://gerrit.wikimedia.org/r/#/c/353274/? [14:36:35] not sure what we should set as the limit, nor what happens exactly when transient storage is full [14:36:52] but I guess it starts nuking :) [14:39:02] well [14:39:22] yeah I mean I think that patch is where we should be going in general, to set some sanity limits [14:40:04] and I kind of assume for truly-transient objects (hfp and such) it would just nuke and thus some of them expire earlier than expected, which is no big deal [14:40:24] but it would be good to know wtf is going on here before we clamp down the limits [14:40:31] indeed [14:42:00] in my quest to find more about transient storage and its historical development I've bumped into some interesting (solved) bugs such as https://github.com/varnishcache/varnish-cache/issues/1821 [14:43:02] and https://github.com/varnishcache/varnish-cache/issues/958 [14:44:24] https://www.varnish-cache.org/lists/pipermail/varnish-commit/2011-August/006789.html [14:44:33] yeah,,, [14:44:44] I think things have changed even since then [14:45:04] the code that ends up calling stv_pick_stevedore() (and handling nuke_limit right below)... [14:45:10] STV_NewObject [14:45:38] I'd have to dig a little deeper, but at first glace it looks likely that it really only treats our backend hint as a hint [14:46:13] and its generic algorithm is on initial allocator failure (as in, space is used up, but no attempt to evict) it may walk other random stevedore options, probably ultimately trying transient at the end? [14:46:36] ooh [14:46:38] which probably screws with storage binning pretty badly :) [14:47:05] varnishlog might help! [14:47:07] varnishlog -n frontend -q 'Storage eq "malloc Transient"' [14:47:35] it would explain a lot of things if storage is automatically waterfalling down all the stevedores on simple failure [14:47:51] there's a lot of stuff there though, we should probably filter out synth responses from the output [14:48:04] yeah storage hint has always just been a hint [14:48:06] not guaranteed [14:48:16] hi mark :) [14:48:29] hi :) [14:49:41] mark: yeah I was expecting it to only fall back to a different stevedore if it couldn't find space even after trying eviction [14:49:56] the issues we're seeing only affect frontends though, so no binning involved [14:49:57] the code at a glace looks like it might fail over stevedores before trying eviction, though :/ [14:50:54] ema: yeah, but it would explain easy push of fe-malloc objects out to transient. and it might offer a hint as to why mailbox lag goes crazy after a time in the other case. 
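[editor's note] Continuing the sketch above: a toy model of the layer bblack describes at [14:33:29]–[14:33:57], where the generic storage code retries the stevedore's alloc after nuking, bounded by nuke_limit, before the fetch path falls back to Transient. The function names, the 8 KB-per-nuke figure and the sizes in main() are invented for illustration; this models the behaviour described in the chat, it is not varnish-cache code.

```c
/*
 * Toy model of the generic storage layer's allocate/nuke/retry loop as
 * described in the log: try the hinted stevedore, evict one LRU object
 * and retry on failure, give up after nuke_limit attempts -- at which
 * point the fetch code may fall back to Transient with a capped TTL.
 */
#include <stdio.h>
#include <stddef.h>

struct stevedore {
	size_t used, max;	/* toy bookkeeping instead of real storage */
};

/* Toy alloc: succeed only if there is room, like sma_alloc(). */
static int toy_alloc(struct stevedore *stv, size_t size)
{
	if (stv->used + size > stv->max)
		return 0;	/* the stevedore-level "c_fail" case */
	stv->used += size;
	return 1;
}

/* Toy nuke: pretend each eviction frees an ~8 KB LRU object. */
static size_t toy_nuke_one(struct stevedore *stv)
{
	size_t freed = stv->used < 8192 ? stv->used : 8192;
	stv->used -= freed;
	return freed;
}

static int stv_alloc_sketch(struct stevedore *stv, size_t size, unsigned nuke_limit)
{
	unsigned fails;

	for (fails = 0; ; fails++) {
		if (toy_alloc(stv, size))
			return 1;	/* got space on the hinted stevedore */
		if (fails >= nuke_limit)
			return 0;	/* give up: caller may try Transient */
		if (toy_nuke_one(stv) == 0)
			return 0;	/* nothing left to evict */
	}
}

int main(void)
{
	/* a completely full toy frontend, default nuke_limit=50 */
	struct stevedore fe = { 100u * 1024 * 1024, 100u * 1024 * 1024 };

	if (stv_alloc_sketch(&fe, 256 * 1024, 50))
		printf("fit after nuking down to %zu bytes used\n", fe.used);
	else
		printf("gave up: fetch code would fall back to Transient\n");
	return 0;
}
```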
[14:51:38] (maybe our bin hinting only really applies until they initially fill up, then it's kinda out the window) [14:52:29] unrelated good news: prometheus folks seem excited about the qdisc collector :) https://github.com/prometheus/node_exporter/pull/580 [14:53:21] cool :) [14:55:52] hmmm ok maybe I'm wrong. it seems like the pick_stevedore code does stick to the hint, so long as it's valid. it only iterates stevedores if no hint was supplied or the hint was invalid... [14:57:48] ok so back to your point about failover directly to transient (which mustn't be via the pick_stevedore above) [14:57:55] vbf_allocobj() does: [14:58:11] if (bo->uncacheable || lifetime < cache_param->shortlived) [14:58:11] storage_hint = TRANSIENT_STORAGE; [14:58:12] else [14:58:12] storage_hint = bo->storage_hint; [14:58:41] (so that's where short and hfp objects go transient) [14:58:47] right [14:58:52] but then it tries the real allocation, and if that fails [14:59:01] if the explicit hint was transient, it fails out completely [14:59:35] otherwise it caps the object ttl to the shortlived ttl, resets grace/keep to zero, and attempts transient storage as a last chance at success [15:00:32] if that also fails, then we eventually fail out hard with: [15:00:34] (void)VFP_Error(bo->vfc, "Could not get storage"); [15:01:08] which we've seen a few times already on file storage failures [15:03:28] so just from cache_fetch.c code, we know transient on the frontend could come from: hfp, shortlived, or failure to evict space via nuke_limit (which creates a ttl-capped object in transient instread) [15:03:41] but there's a couple of other places in other files that touch transient, too [15:04:24] there's one in cache_req_fsm.c that's for the generation of synthetic responses [15:05:14] and then VRB_Cache in cache_req_body.c, I don't really understand that one [15:05:44] oh but it's (duh) the request body, not response. we haven't even turned on that feature [15:05:54] it's meant for the optional/scary "cacheable POST" stuff [15:07:27] https://phabricator.wikimedia.org/P5436$68 [15:07:33] how did this end up in transient, for example? [15:08:32] TTL is > shortlived, doesn't look like a hfp [15:11:30] that was cp4017 fe? [15:12:09] cp4010 [15:12:29] (fe) [15:13:55] do you understand the TTL lines in that paste? [15:14:08] - BerespHeader Cache-control: s-maxage=1209600, must-revalidate, max-age=0 [15:14:17] - BerespHeader Age: 44181 [15:14:27] - TTL RFC 1209600 10 -1 1494601356 1494557175 1494557174 0 1209600 [15:14:29] vsl(7) explains the meaing [15:14:32] *meaning [15:14:32] ​- TTL VCL 1209600 10 604800 1494557175 [15:14:32] 66 ​- TTL VCL 130581 10 604800 1494557175 [15:14:33] 67 ​- TTL VCL 130581 3600 604800 1494557175 [15:14:33] 68 ​- TTL VCL 44181 3600 604800 1494557175 [15:14:41] we use qdisc on varnish? [15:14:46] (this can wait, sorry :) [15:14:53] mark: we will soon, not yet [15:15:16] awesome [15:15:21] mark: https://phabricator.wikimedia.org/T147569 (TCP BBR - which requires FQ qdisc to work) [15:16:21] "RFC or VCL" - TTL - Grace - Keep - Reference time for TTL [15:16:59] right so it starts with the RFC TTL of 1209600 from CC [15:17:28] err, wait, I'm misunderstanding reference time [15:17:50] where does 130581 come from? 
[15:17:53] 1494601356 - 1494557175 = 44181 [15:18:01] cool [15:18:09] now we need to tie those BBR parameters into geodns [15:18:22] 130581 = 86400 + 44181 [15:19:42] so current time is 1494601356 [15:19:55] reference time is that - age (44181) = 1494557175 [15:20:21] RFC header indicates a TTL of 1209600, which is based on the reference time [15:20:47] 130581 happens when we do our 1d cap [15:21:09] (which seems to effectively be capping it off at 1d after arrival, not 1d after Age:0) [15:21:42] then we see the 1h grace applied by VCL (s/10/3600/ in second numeric column) [15:22:05] the last line is the puzzler: VCL 44181 3600 604800 1494557175 [15:22:17] that seems to imply what we'd consider "ttl=0" being set [15:22:34] it's setting a TTL of 44181, with a reference time that's 44181 seconds in the past [15:23:03] it's confusing that "beresp.ttl" doesn't map to any of these colums [15:24:11] oh I thought beresp.ttl was TTL [15:24:37] I think not [15:25:00] I think beresp.ttl is a number that counts from "now", but the TTL column counts from now-age. [15:25:27] so executing "beresp.ttl = 0s" sets the TTL column to the object's Age value, basically [15:25:55] but then is an object shortlived when beresp.ttl < shortlived or TTL < shortlived? [15:26:22] if the former is the case, that explains why this object ended up in transient [15:28:01] ooh wait a minute [15:28:06] lifetime = oc->exp.ttl + oc->exp.grace + oc->exp.keep; [15:28:15] if (bo->uncacheable || lifetime < cache_param->shortlived) [15:28:15] storage_hint = TRANSIENT_STORAGE; [15:28:49] so this object really shouldn [15:29:00] 't end up in transient [15:29:15] right, if nothing else grace and keep save it [15:30:09] unless it's uncacheable somehow? [15:30:40] oh, N-hit wonder maybe [15:30:49] X-Cache-Int: cp1053 miss, cp2019 hit/2, cp4017 miss [15:32:18] heh ok, that makes sense [15:32:35] I guess 4-hit comes after our generic TTL-capping code [15:33:04] so it's a TTL=0s transient pass object that only exists for the life of the request [15:33:08] basically [15:33:27] (that might've been more obvious with the client-side of that reqlog alongside it) [15:33:44] fair enough, TTL=44181 confused me [15:33:49] yeah me too [15:34:19] I do wonder though, about the whole age-offsetting thing and reference times [15:34:38] if we're failing to understand some things about to use/manipulate berest.ttl [15:34:42] *beresp.ttl :) [15:35:07] oc->exp.ttl == beresp.ttl, or the TTL column? [15:36:31] bin/varnishd/cache/cache_expire.c:282:oc->exp.ttl = now + ttl - oc->exp.t_origin; [15:37:11] but it must be set somewhere else too [15:37:26] ok here we go [15:37:39] struct exp { double t_origin; float ttl; float grace; float keep; [15:37:42] }; [15:37:47] so exp has fields t_origin, ttl, grace, keep [15:38:05] and then elsewhere in the code (far away from where it could easily usefully be found!) 
we get this explanation: [15:38:33] hmmm stupid pasting of C code [15:38:49] * In exp.*: [15:38:49] * t_origin is absolute [15:38:49] * ttl is relative to t_origin [15:38:49] * grace&keep are relative to ttl [15:38:49] * In VCL: [15:38:52] * ttl is relative to now [15:38:54] * grace&keep are relative to ttl [15:39:29] I think t_origin is basically "now - age" [15:40:25] so ignore our VCL, in generic terms here's an example [15:40:44] object comes in with CC: s-maxage=86400 + Age: 10000 [15:41:07] exp fields should be t_origin=now-10000, ttl=86400 [15:41:20] beresp.ttl will read as 76400 [15:42:28] and if we're setting grace=300 + keep=900, those values would be the same in both beresp and exp, since they're supposedly always used relative-to-ttl [15:43:26] but that seems to make this wrong: lifetime = oc->exp.ttl + oc->exp.grace + oc->exp.keep; if (lifetime < cache_param->shortlived) [15:43:50] or at least, makes shortlived very counterintuitive [15:44:44] are we sure that hfp objects ignore grace? [15:44:47] shortlived isn't operating on the beresp.ttl concept. it's operating on the absolute ttl (the maxage) [15:45:41] ema: I'm not sure. [15:46:22] I wonder if various varnish developers have gotten confused about the two meanings of "TTL" in some of their commits at some point [15:46:22] https://github.com/varnishcache/varnish-cache/issues/1964 [15:48:48] oh wait but we do return (deliver) when creating hfp, we don't end up in the default vcl [15:49:32] yeah but our 601s-hfp code there is basically doing our version of that default vcl [15:49:45] we do it a little differently, but it's the same purpose: translate ttl=0 -> hfp [15:51:10] anyways, the 4.1.5-vs-4.1.6 check, at this point does it fairly conclusively rule out 4.1.6 as the transient cause? [15:52:32] at this point I'd say it does [15:52:33] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-9d&to=now&panelId=58&fullscreen&var-server=cp4010&var-datasource=ulsfo%20prometheus%2Fops [15:54:29] I think our best bet at this point is that the ttl/keep change did this to the frontends, the question is figuring out the details and how best to solve it [15:54:37] random thoughts on that topic: [15:55:42] as the backends went through those ttl/keep changes, we were basically trading ttl and keep times [15:56:05] but the frontends, they always had ttl_cap=1d and no keep. now they're still at ttl_cap=1d, but keep=7d [15:56:44] that could be just affecting everything in general, that the storage doesn't fully "expire" now in FE until 8d later instead of 1d later [15:57:09] but also, those keep times could somehow apply to various hfp objects or even ttl=0s pass objects (n-hit-wonder)? [15:57:34] we know the big transient ramps somewhat correlated with the 4.1.6 restarts [15:57:43] but could they also correlate with 24h after the ttl/keep change? 
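[editor's note] A small worked example tying together the TTL records from P5436 ([15:14:27]–[15:22:34]) and the exp.* comment quoted at [15:38:49], assuming the reading arrived at above is correct: the TTL record's value is relative to t_origin = now − Age, while VCL's beresp.ttl is relative to now.

```c
/*
 * Worked example of the exp.* vs. beresp.ttl relationship, using the
 * numbers from the P5436 paste and the interpretation worked out in the
 * log (not varnish source).
 */
#include <stdio.h>

int main(void)
{
	double now      = 1494601356;	/* current time worked out in the log */
	double age      = 44181;	/* Age: header from the backend       */
	double t_origin = now - age;	/* 1494557175, matches the records    */

	/* VCL caps beresp.ttl to 1d (relative to now) ... */
	double vcl_ttl_cap = 86400;
	/* ... which the TTL record reports relative to t_origin: */
	double ttl_record = vcl_ttl_cap + age;	/* 130581 = 86400 + 44181 */

	/* VCL later sets beresp.ttl = 0s (the n-hit-wonder pass) ... */
	double vcl_ttl_zero = 0;
	/* ... which shows up as a TTL equal to the object's Age: */
	double ttl_record_0 = vcl_ttl_zero + age;	/* 44181, the "puzzler" line */

	printf("t_origin=%.0f  1d cap -> TTL %.0f  ttl=0s -> TTL %.0f\n",
	    t_origin, ttl_record, ttl_record_0);
	return 0;
}
```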
[15:58:10] maybe the ttl/grace/keep reordering affects this for the frontend in an unpredicted way, too [15:58:43] I think we were mostly thinking through it in a backend-centric way, but it's common to both layers [16:00:51] in general, thinking from this perspective, we probably didn't want keep=7d on frontends in general [16:01:22] but now I'm really curious about side-effects of that in general, and on hfp/n-hit-wonder scenarios in the FE [16:01:39] 10Traffic, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Operations, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#3258878 (10AndyRussG) Hi! Thanks much @Krinkle for explaining this in detai... [16:01:45] it should be easy to see if hfp objects are kept with vtc [16:01:47] (especially the re-order to prevent ttl_cap from limiting keep value, while capping keep relative to applayer ttl) [16:07:14] when you think about it, really it's kind of awful that VCL doesn't have access to exp.ttl [16:07:35] although I guess we can infer it using beresp.http.age [16:08:04] 10Traffic, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Operations, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#3258920 (10AndyRussG) P.S. Since I'm pretty ignorant of service workers, I... [16:12:17] I don't understand [16:12:39] this test passes: https://phabricator.wikimedia.org/P5438 [16:13:14] so it doesn't look like transient objects get removed soon after the request is done? [16:14:43] not all transient objects are per-request [16:15:19] oh but that one should be, you'd think [16:15:25] right? [16:15:35] that's our N-hit wonder basically [16:15:54] what if you extend the vtc to make several request and check after each one? [16:15:59] does it keep stacking up? [16:16:00] keeps on growing [16:16:24] maybe the freeing is just async and not happening in this case for whatever reason [16:16:32] oh god [16:16:37] (needs more delay or other traffic to get to when it finally frees?) [16:16:39] it doesn't keep on growing if I set grace to 0 [16:16:51] what's default grace in that vtc? [16:17:21] the varnishd default is grace=10s [16:17:23] varnishd's default, 10s [16:17:24] right [16:17:39] if you wait >10s do they start dissappearing? [16:18:16] yup [16:19:45] hmmm [16:19:53] that can't be good :) [16:20:08] ok one more question: default keep is probably zero. does setting some keep time keep them alive similarly, even if grace is gone (or on top of grace?) [16:20:34] let's see [16:21:11] if so then not only do we understand that our original ttl=0s longstanding code probably needed fixes and was causing more transient than we should have in general... [16:21:16] 10Traffic, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Operations, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#3258983 (10Krinkle) @AndyRussG Service workers can help us in two ways: As... [16:21:28] but we now also understand that setting keep=7d on frontends might have really exacerbated the effect, too [16:22:25] when we were at ttl_cap/keep of 3+3 it wasn't a big deal because fe cap was still explicit at 1d, and keep was set afterwards and capped down to 1d as well I think [16:23:55] yeah I just double-checked both of the ttl/keep commits. 
the initial effect on frontends was ttl_cap=1d + ttl_keep capped at most 1d. [16:24:08] then on the second patch we went to keep capping at 7d [16:24:41] we need to set both keep and grace to 0 for the space to be freed [16:25:48] I've run the test with varnish v1 -arg "-p default_keep=5 -p default_grace=5" -vcl+backend [16:26:20] and only be explicitly setting both beresp.keep and beresp.grace to 0 does the space get freed [16:26:40] what I mean with "space freed": [16:26:41] varnish v1 -expect SMA.Transient.c_bytes == 400 [16:26:43] varnish v1 -expect SMA.Transient.c_freed == 400 [16:26:43] varnish v1 -expect SMA.Transient.g_bytes == 0 [16:27:10] yeah ok [16:27:50] (400 after two requests, hfp objects in that test are 200b) [16:28:27] bblack: should I go ahead with a VCL patch? [16:29:33] I'm already 75% done :) [16:29:37] ok :) [16:31:52] so also, we should look at making that 7d ttl cap different on FE [16:32:03] either move it to BE-only code or set a different/smaller one for FE? [16:32:20] https://gerrit.wikimedia.org/r/#/c/353567/ ? [16:33:48] sorry I meant "7d keep cap different on FE" [16:35:04] maybe just set the vcl_config keep attribute smaller there, 1d or less [16:35:27] yeah [16:36:25] re: https://gerrit.wikimedia.org/r/#/c/353567/ misc is missing [16:36:30] probably not a big deal, but still [16:36:53] oh wait but those in misc aren't with beresp.ttl=0 [16:38:29] yeah [16:38:55] +1 [16:40:15] updating for a few more [16:40:34] I was looking for "ttl = 0", now considering the other grep sites for "uncacheable" [16:40:52] bblack: I'll reduce the frontend keep cap to 1d meanwhile [16:42:15] ok I only added to the text cookie one in text-common [16:42:26] the others, they may not be important and/or I may not fully understand the impact heh [16:42:39] (the upload size >1G uncacheable for instance, or the misc ones that are using the app ttl) [16:43:03] https://gerrit.wikimedia.org/r/#/c/353567/2/modules/varnish/templates/text-common.inc.vcl.erb [16:43:07] ^ that's what I added in PS2 [16:44:43] hmmm, I'm trying to think of this in terms now of "is this really a varnish bug, or?" [16:45:29] we tend to think: "beresp.ttl=0s + beresp.uncacheable = this object is temporary only for this request", but obviously that's not really true and not meant to be [16:45:51] even if it weren't for the grace/keep complications, other requests can coalesce on that one and share the ttl=0s+uncacheable result [16:46:28] so it makes sense that grace/keep still apply [16:46:40] we also know that varnish does have a real-er concept of per-request objects [16:47:19] e.g. when a request hits an existing shared hit-for-pass object, if you try to manipulate obj.* or beresp.* in that context you're operating on a truly temporary per-request object [16:48:05] I don't know that there's a VCL way to make our ttl=0+uncacheables act like that, or if that's even desirable. [16:51:06] this is all a nice find though [16:51:23] probably even before this new huge ramps in transient, we were wasting lots of excess transient on this stuff [16:51:54] ema: before we merge this stuff up, let's think a sec though [16:52:00] bblack: updated the commit message of https://gerrit.wikimedia.org/r/#/c/353567/, I've opened a new phab to track the actual transient growth issue [16:52:27] on whether the grace is useful and setting it to zero is damaging in some of these cases... 
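[editor's note] A short illustration of why the 7d keep is the expensive part for these beresp.ttl=0s objects, given the exp.* semantics above and the VTC result just described (Transient space is only freed once both grace and keep are zeroed). The grace and keep values are the ones discussed in the log; the only claim encoded here is the qualitative one that the object cannot expire before both its grace and keep windows have run out.

```c
/*
 * Illustration (not varnish source): with beresp.ttl = 0s the object's
 * exp.ttl equals its Age, so t_origin + ttl is "now", and the expiry
 * thread cannot drop it until its grace and keep windows have passed on
 * top of that -- which is why only grace=0 *and* keep=0 freed the
 * Transient space in the VTC test.
 */
#include <stdio.h>

int main(void)
{
	double grace = 3600;		/* the 60m grace the VCL was setting     */
	double keep  = 7 * 86400;	/* the 7d frontend keep cap at the time  */

	/* earliest point (seconds after the fetch) the object can go away:
	 * both windows must be over, so at least the larger of the two */
	double lingers_at_least = (grace > keep) ? grace : keep;

	printf("a beresp.ttl=0s object lingers >= %.0f s (~%.1f days)\n",
	    lingers_at_least, lingers_at_least / 86400.0);
	return 0;
}
```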
[16:52:41] it's the killing of the huge keep value that's the big win here [16:52:59] right [16:53:10] but maybe ttl=0s+uncacheable, which are technically shared but very short-lived objects... maybe that they had a grace-time is preventing some coalesce-stacking issues or whatnot... [16:53:42] mmh [16:55:19] a short grace doesn't seem to hurt, seeing things in this perspective [16:56:41] yeah I'm thinking just on the N-hit-wonder case right now [16:56:49] there's two important sub-cases there: [16:58:06] 1) The true rare-ish objects: I think the grace has little-to-no effect here, on truly-rare objects [16:59:46] 2) New objects that might become hotter: we might be mentally modeling that it would become cacheable on the 4th request (that hits the backend, since the frontends share the BE's hit counter), but really the grace time could mean that it doesn't start caching in the FE until several more requests have passed... probably not a huge deal overall, as it will background-fetch while grace-time is act [16:59:52] ive, etc... and it was always the case that e.g. 10 concurrent requests for a brand-new object might all fail N-hit-wonder, then the 11th shortly after starts caching. [17:00:19] 3) (yeah there's 3) - initial FE cache warmup from empty... where all objects are potentially hot but initially failing N-hit-wonder [17:00:55] in that case given that requests to the BE are probably temporarily backlogged and clogging BE connections, etc... the grace value here is probably doing more of.. something? is it positive or negative? [17:01:48] 1) seems the easy one. If the object isn't popular, the likelihood of thundering herds coalescing is low so whatever [17:03:39] really we already knew (3) was a problem. I think we've talked before about how it would be nice to be able to wrap the N-hit-wonder block in some kind of: if (varnishd.uptime > 300) or something, but there's no such info available at that level [17:03:49] setting grace causes the N-hit wonder to be N+something, IIUIC [17:04:26] it should be N+notmuch you'd think. after all as soon as you "use" an object via grace time, you're already also background-requesting an updated object too [17:04:51] I just worry about the "notmuch" growing a bit larger during warmup of empty cache and slight backlog of backend reqs [17:05:02] and whether it's good that we're killing that with grace=0 on these, or bad [17:07:04] tricky stuff! [17:08:22] tl;dr would be -> drawback of grace: $notmuch could be (possibly) much, drawback of nograce: coalescing on popular objects [17:08:25] ? [17:08:52] coalescing is a good thing if the result is a cache object, of course [17:09:06] right, I mean in the N-hit-wonder case [17:09:16] yeah... [17:09:45] I guess, we were setting grace=60m on those [17:10:06] but in practice it really didn't matter. it was going to fix itself for a hot object pretty quickly and replace that object [17:10:22] at most a few seconds of the grace value mattered [17:10:28] 5m I think [17:10:36] (grace_healthy) [17:10:40] oh right because the healthy limit in vcl_pass [17:10:43] hmmm [17:10:53] heh, that also raises questions... [17:11:51] wm_common_hit_grace *assumes* a non-zero obj.grace [17:12:05] (in the healthy case) [17:15:44] ema: https://gerrit.wikimedia.org/r/#/c/353585/ ? [17:16:29] I'm not sure, I may have mangled the logic there... 
[17:16:44] (bbiab) [17:17:50] In the net, I'm leaning towards thinking that grace=0 is appropriate or at least ok for the n-hit-wonder [17:20:13] the grace=31 I stuck into the ~5m hfp objects, is more of a "meh why not in case it helps" sort of thing, to avoid thinking about issues like the above [17:20:22] and is probably also ok [17:22:17] and I do think my logic on the grace_healthy thing is right, now that I looked more [17:22:51] I know it's late friday (esp for you!) but I think probably merging these 3x patches up is better than dealing with oom's over the weekend that would almost certainly happen otherwise. [17:26:21] beyond that, we can then set transient caps based on the hopefully-saner new transient size data, and then also we probably should re-eximagine the 601s hfp condition in general [17:27:16] I think the latter is probably at least sometimes creating harmful hfp objects in edge cases (cacheable object right at the ttl=0 mark or even a bit under due to grace). I mean, the X-Cache-Int check is supposed to suppress that, but I'm not really sure that it always does all things considered [17:27:45] heh s/re-eximagine/re-eximine/ [17:30:11] ema: ultimately, I'm still going to be somewhat-unavailable the latter half of my today and through most of saturday, so judgement call is more on you about leaving the status quo risks or risking the various fixups [17:30:40] but I lean towards the 3x fixups and stop there (no transient cap yet) [17:30:53] bblack: ok, just one more thought on the N-hit-wonder grace [17:31:05] so far it hasn't created problems as far as we know, right? [17:31:34] (the current setting of 5/60m) [17:31:41] right [17:31:58] so perhaps to be conservative in the changes we push on a friday we could leave it as is now [17:32:16] and almost certainly, the 60m is only really effective for a shorter time, maybe up to "however large backlog of BE fetches could grow in time during warmup" [17:32:32] right [17:32:46] I'd say let's leave grace as is for n-hit-wonders then [17:32:56] ok [17:33:04] what about the grace_healthy capping fixups? 
[17:33:15] (so it doesn't use 5m grace in a case where we had explicitly set obj.grace=0) [17:33:31] that seems like a good idea [17:34:01] ok I'll fix up the other patch [17:34:29] ok, I'll merge the 1d frontend keep patch meanwhile [17:35:14] I'm just gonna comment out the grace=0 part and leave an XXX for later re-examination I think [17:35:43] sounds good [17:36:56] oh and I'll upgrade cp4010 back to 4.1.6 [17:37:41] ok [17:41:38] running puppet agent on everything [17:46:55] on other long-term topics: I do wonder about the default nuke_limit=50 on frontend malloc nodes now [17:47:26] it doesn't matter as much since there's no concept of fragementation or contiguous free space like there is with file [17:48:15] oh, on that topic: https://github.com/varnishcache/varnish-cache/issues/1764 [17:48:17] but still, you could see a frontend scenario where malloc limit is "full", and a new 256KB object comes in, and those 50 nukes all end up being 1K objects, leading to a failure that then spills over to transient [17:48:32] I'm not sure if the fix has been backported to 4.1 [17:48:39] lol [17:48:42] :) [17:49:15] also, I'm sure we had an undocumented reason at some past point [17:49:41] but I noticed our old old varnish3.x + pre-jessie varnishd initscripts were setting nuke_limit=300 for all daemons I think [17:49:47] and we dropped that at some point [17:50:23] so that further reinforces the argument that maybe the FE nuke_limit should be raised (but I guess it doesn't matter for now if it doesn't work anyways) [17:51:13] on the other other hand, I think at least in the text case where objects are commonly streaming with no Content-Length, an eventually-256KB object doesn't allocate the whole storage at once anyways [17:51:24] it grows by chunks which gives time for the nukes to build up as well [17:54:49] bblack: briefly back to transient, do we want to roll-restart the frontends? [17:55:01] anyways, since the VCL fixups are live, I'd expect the big transients to at least level off now right (well, within the next ~hr) [17:55:38] is it fair to assume the FE transient size is near the graphed maximums on most nodes in a cluster? [17:57:29] I don't know [17:57:32] this is the current text usage https://phabricator.wikimedia.org/P5439 [17:57:57] so yeah they're all high, and vary by DC as roughly expected [17:58:02] yup [17:58:12] how long does it take to see the pattern post-restart? 
[17:58:34] I mean, I guess even when it's not yet reached the previous normal, you can kinda see the crazy straight-upwards ramp slope [17:59:12] yeah, this is cp4010, downgraded on the 11th at 9ish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-10d&to=now&panelId=58&fullscreen&var-server=cp4010&var-datasource=ulsfo%20prometheus%2Fops [17:59:43] it went over the 16G threshold after ~24h [17:59:54] hmm yeah so we already have it as a basic test-case [18:00:02] it should ramp up slower this time [18:00:39] then again, different times of day/week vs previous restart [18:00:46] not directly comparable due to traffic patterns [18:00:53] right [18:02:03] so zooming in on the previous restart there, it staircases up from the get-go once a minute in ~27MB chunks [18:02:09] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1494496163341&to=1494498063898&panelId=58&fullscreen&var-server=cp4010&var-datasource=ulsfo%20prometheus%2Fops [18:02:49] and it looks pretty similar on the fresh restart just now, too [18:03:06] which may have raced with the VCL fixes, but they were all in play within a few minutes of startup I think [18:03:18] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1494610924483&to=1494612148103&panelId=58&fullscreen&var-server=cp4010&var-datasource=ulsfo%20prometheus%2Fops [18:04:31] but that may just mean that most of it's the N-hit-wonder and it will level off within the 1h grace there [18:04:52] (as they start actually falling off after grace=1h, instead of after keep=7d) [18:05:29] * ema nods [18:05:53] maybe should've just reduced that to 5m or something [18:06:01] but this is technically a bit safer :) [18:06:06] :) [18:06:42] so maybe wait an hour and see? I've gotta leave soon but will be around tomorrow now and then to see how it goes [18:06:51] ok [18:07:08] I'm probably leaving soon too, but I'll check back in later today [18:07:43] alright! [18:07:51] do you think I should put up a screen session with a slow-rolling FE restart on text+upload? [18:07:56] over a few hours like before? [18:08:08] it will at least buy time even if the patches made things no better [18:08:20] good point, yes [18:08:27] ok I'll do that before I take off [18:09:05] perfect, interesting discoveries today! :) enjoy your weekend [18:09:11] you too :) [18:45:35] ema: in case you drop by later: cp4010 did change at the ~1hr mark. it's still staircasing up, but by much smaller amounts after that point.... [18:49:25] ema: now it seems to have crested around 1h5m in @ ~1.18GB and looks stable or even slightly-dropping. probably just randomly varying a bit at this point but stable-ish [18:49:57] whereas previous transient ramps went as high as ~45GB before we killed them (same host) [18:50:48] (and the restarts are going, ~4h total for the run) [18:52:02] the maximums on the ones still running seem to have leveled or slightly dropped as well as expected, but it's harder to see there at such large stable-ish values [19:59:45] 10Traffic, 10DBA, 06Operations: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3259609 (10Dzahn) a:03Dzahn
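[editor's note] Back-of-the-envelope arithmetic for the nuke_limit worst case bblack sketches at [17:48:17], under the stated (pessimistic) assumption that the LRU tail is all ~1 KB objects; the sizes are purely illustrative.

```c
/*
 * Worst-case nuke_limit arithmetic from the discussion: a full frontend
 * malloc storage, an incoming 256 KB object, and 50 nukes that each free
 * only a ~1 KB object.
 */
#include <stdio.h>

int main(void)
{
	unsigned nuke_limit     = 50;		/* varnishd default mentioned above  */
	unsigned freed_per_nuke = 1024;		/* pessimistic: tiny LRU-tail objects */
	unsigned needed         = 256 * 1024;	/* the incoming object               */

	unsigned freed_at_best = nuke_limit * freed_per_nuke;	/* 51200 bytes */

	printf("%u nukes free at most %u bytes; object needs %u -> %s\n",
	    nuke_limit, freed_at_best, needed,
	    freed_at_best < needed ? "failure, spills to Transient" : "fits");

	/* With the old initscripts' nuke_limit=300 the same worst case frees
	 * ~300 KB, enough for a 256 KB object. Streaming/chunked fetches
	 * soften this further, since the object grows piecewise rather than
	 * allocating 256 KB up front. */
	return 0;
}
```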