[08:34:56] https://varnish-cache.org/lists/pipermail/varnish-misc/2018-February/026340.html
[08:35:36] you can get an incomplete body (with 200) if nuke_limit is reached, apparently
[08:36:43] which really is not cool, especially if it happens at our varnish-backendmost layer and the incomplete response gets cached throughout the CDN
[09:40:28] <_joe_> can you add a header to the response in that case?
[09:45:53] it would be better to turn that "fake" 200 into a 503
[10:03:58] <_joe_> vgutierrez: yes, that's why I asked about a header, but I get that it's not possible from reading a bit about nuke_limit
[10:21:52] well... https://gerrit.wikimedia.org/r/#/c/429394/ we will have a new counter to graph and check when strange things happen :)
[10:30:50] yes!
[10:32:02] and cache_hit_grace to track step 2 of T192368#4153519 -> https://gerrit.wikimedia.org/r/#/c/429395/
[10:32:03] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368
[12:13:07] ema: nuke_limit issues make sense. it's a fundamentally intractable problem given the parameters of the situation.
[12:14:23] ema: varnish is streaming, so it has no advance knowledge of the final content length. The headers have to go through to the client early (200 OK, and now no chance to add an X-Oops-Nuke header either). Storage nuking for allocation can eventually fail while streaming through the unknown bytes.
[12:15:43] ema: fundamentally it's no different from the backend applayer segfaulting halfway through sending us the body bytes. Once we've streamed the "success" header through, anything going wrong on the backend side is unreportable to the client, just a broken/partial body.
[12:18:31] ema: (which is another good reason to go back to do_stream=false on the applayer-facing varnish, but last time we tried that things failed spectacularly, possibly due to applayer misbehaviors from our pov)
[12:20:02] maybe we could put an asterisk on "unreportable" - technically there are trailer headers, and technically we can at least say that whole TE: chunked chunks are still the unit of successful transmission.
[12:20:57] but even if varnish can set trailer headers, I don't know if we can usefully use them to fix this.
[12:22:53] it would be nice if there were a way to explicitly signal to the applayer that we want a non-streamed whole response with a CL.
[12:23:35] e.g. Varnish bereq header X-Must-Send-CL: true -> apache is forced to buffer the response and set a CL before transmitting anything.
[12:24:11] (you'd think maybe forcing it to be HTTP/1.0 would do this, but since 1.0 allows a close-delimited "streaming" response, that's probably all it would trigger)
[12:25:10] in general, non-streaming (buffering the whole thing) on either side of the varnish<->apache connection is both healthy and not much of a perf cost (since it's DC-local, the bw is high and the latency is low)
[12:26:02] the real benefits of streaming are at the be->fe->user end of things, where at least we have some level of stronger guarantees about potential misbehaviors.
[12:54:14] yeah, so collecting and monitoring n_lru_limited seems like a good low-hanging fruit
[12:54:37] re: disabling applayer<->varnish-be streaming, I forgot what the failures were
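To make the X-Must-Send-CL idea above concrete, here is a minimal VCL sketch. The header is hypothetical (Apache has no support for it today, as the discussion notes), and the backend address is a placeholder; the do_stream=false fallback is just the stock Varnish knob mentioned earlier, not the actual production VCL.

```vcl
vcl 4.0;

# Placeholder backend so the sketch compiles standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_fetch {
    # Hypothetical signal to the applayer: buffer the whole response and
    # send a Content-Length instead of a chunked/streamed body. The header
    # name is illustrative only; nothing honours it today.
    set bereq.http.X-Must-Send-CL = "true";
}

sub vcl_backend_response {
    # If no Content-Length came back, fall back to buffering on the Varnish
    # side: with do_stream disabled, the object is fetched in full before
    # the client sees any bytes, so a mid-body failure (nuke_limit, applayer
    # crash, ...) can still surface as a proper error instead of a truncated 200.
    if (!beresp.http.Content-Length) {
        set beresp.do_stream = false;
    }
}
```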
[13:12:00] so, there's more to #1799
[13:12:07] > when an object expires after the request was received but before lookup AND the object was found (due to keep or a lagging expiry thread), then we want VCL to see this as a hit. This is the central assumption of this PR, and it fixes a known problem uncovered after #2519 was discussed and tested.
[13:12:24] https://github.com/varnishcache/varnish-cache/pull/2555#issuecomment-363121812
[13:14:47] now the "known problem" isn't explicitly mentioned, but I assume it must be that the patch fixing #1799 made VCL consider objects expiring after the request but before lookup as misses
[13:18:59] ok, I think the "known problem" must be r02555.vtc
[13:19:20] "Expiry during processing"
[13:19:52] > All VMODs that currently use HSH_purge must change to using VRT_purge.
[13:20:43] (changes affecting VMODs in a bugfix release /o\)
[13:26:27] not that we use HSH_purge, but still
[13:46:35] yeah, this is all symptomatic of the design, really
[13:47:23] the parallelism of reqs/resps via N,000 threads, optimizing to minimize locking so it can scale at all, etc.... it means there are probably lots of hidden issues related to the flow of time or sequences.
[13:48:05] things are happening and changing all the time, while other things are in progress for others. invariably you run into a situation where the programmer's intended logic only makes sense if everything else were static, which it never is.
[13:48:48] req->resp takes time. things change even in the middle of a single function call. lots of assumptions are carried across lock boundaries about the state of the objects, implicitly by the calling path and its arguments, etc.
[15:33:44] haha
[15:33:59] r02555.vtc from master only passes on varnish > 5.2
[15:34:16] >= 5.2, that is
[15:34:32] r02555.vtc from 4.1 only passes on varnish 4.1
[15:34:46] not bad
[15:35:05] see https://github.com/varnishcache/varnish-cache/commit/33143e05c0720e5be0df4ac5b20a0fcc8a6874e8 (master) and https://github.com/varnishcache/varnish-cache/commit/a02e4f277c3ff12c8ef05b692713f87813a85da9 (4.1)
[15:36:29] IOW, to fully backport the patch from master (or forward-port it from 4.1, hehe) the test needs to be rewritten
[15:37:15] specifically, both 'import vtc;' and 'import ${vmod_std/debug};' fail on 5.1
[15:44:59] why is this stuff changing every 2 months
[15:45:31] goal: call barrier_sync
[15:45:51] howto (5.2): import vtc; vtc.barrier_sync
[15:46:02] howto (5.1): import debug; debug.barrier_sync
[15:46:20] howto (4.1): import ${vmod_debug}; debug.barrier_sync
[15:46:25] ffs
[15:47:13] this is a neat feature https://cloudplatform.googleblog.com/2018/04/introducing-VPC-Flow-Logs-network-transparency-in-near-real-time.html
[16:29:20] interesting, I'm done backporting the patch to 5.1 and now ./bin/varnishtest/tests/c00041.vtc fails (but passes occasionally)
[16:36:06] bblack, vgutierrez: a fresh pair of eyes comparing https://gerrit.wikimedia.org/r/#/c/429440/2/debian/patches/0016-new-ttl-in-vcl-calculation.patch with 33143e0 and a02e4f2 would be great when/if you have the time!
[19:35:42] 10Wikimedia-Apache-configuration, 10Discovery, 10Reading-Infrastructure-Team-Backlog, 10Zero, and 2 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#4164731 (10LGoto)
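For context on the hit semantics in the #1799/#2555 discussion above: an object that expires after the request arrives but before lookup can still reach vcl_hit (kept visible by grace/keep), so obj.ttl may already be <= 0s there. Below is a minimal VCL sketch of how such grace hits could be tagged for counting, in the spirit of the cache_hit_grace counter and the unconditional return(deliver) from T192368; the header name, backend, and structure are illustrative assumptions, not the actual Wikimedia VCL.

```vcl
vcl 4.0;

# Placeholder backend so the sketch compiles standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_hit {
    # With the post-#2555 semantics, landing here does not imply
    # obj.ttl > 0s: the object may have expired between request
    # receipt and lookup and only grace/keep kept it around.
    if (obj.ttl <= 0s && obj.ttl + obj.grace > 0s) {
        # Stale-but-within-grace hit: mark it so it can be counted
        # separately (illustrative header, not the real instrumentation).
        set req.http.X-Cache-Hit = "hit-grace";
    } else {
        set req.http.X-Cache-Hit = "hit";
    }
    # Unconditional deliver, as in T192368.
    return (deliver);
}
```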