[07:44:28] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2291872 (10ema) >>! In T134989#2290254, @BBlack wrote: > 1. We've got 2x upstream bugfixes applied to our varnishd on cache_mi... [07:54:04] gehel: good morning :) [07:54:10] shall we give it another try? [08:10:28] Sure... [08:11:04] ema: test started... [08:11:23] bonjour! [08:11:31] hashar: o/ [08:15:38] ema: I still got an error. Age: 120, so seems to be on TTL boundary again. [08:16:02] damn! [08:16:04] but next GET is fine... [08:16:32] https://www.irccloud.com/pastebin/u0ti8RQe/ [08:17:03] Age header sent twice (not really an issue, but maybe an indication of something?) [08:17:28] Single error on 100 requests. [08:17:56] gehel: do you have the full headers output perhaps? [08:18:49] damn, I deleted that output too fast. Luckily (or not) I just got another error. [08:19:05] https://phabricator.wikimedia.org/P3067 [08:20:04] ema: what does "hit+miss" mean in this context? It sounds a bit like an oxymoron... [08:20:59] I now get only errors, seems the zero body is cached on cp3009 [08:21:05] gehel: it's a recent change https://github.com/wikimedia/operations-puppet/commit/b32f7e85f609620eff0456866fe1e416c5b442b5 [08:21:59] yes now it got cached as 0-length [08:22:15] (I'm also hitting cp3009) [08:23:19] yeah, caching seems easy, until you actually try to implement it... [08:28:27] hi, I was peeking at graphite and there's a 'too many creates' alert, looks like we got plenty of varnish stats in the form of varnish.client.jsessionid. known? [08:28:49] aha [08:29:34] godog: hostnames? [08:30:25] ema: you mean what machines are sending those?
[08:30:40] that's what I mean yes, if there's a way to find out :) [08:31:11] we might have started sending them with the v4 upgrade [08:31:30] yup, checking with ngrep :( [08:31:40] sec [08:37:57] maybe more than one sec heh, not sure how frequently those are sent [08:42:42] gehel: back to the hit+miss [08:42:54] in case of cache hit we do this: [08:42:55] https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb#L288 [08:43:12] setting X-CDIS to hit [08:43:52] called from here: https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L290 [08:44:04] then we call cluster_fe_hit, which in the case of misc is: [08:44:10] https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/misc-frontend.inc.vcl.erb#L35 [08:44:46] so hit+miss = hit with expired ttl [08:45:16] it's a miss basically, but it happened "going through hit" [08:45:37] ema: thanks, that makes it a bit more clear... [08:49:23] gehel: our VCL flow is a little hard to follow, this might be better to understand the idea: https://github.com/varnish/Varnish-Cache/blob/master/bin/varnishd/builtin.vcl#L101 [08:51:27] ema: Thanks! That one, I mostly know. [08:51:34] now my question is: do we only get repros with hit+miss in x-cache? [08:52:47] cp3009's frontend cached a bad response now [08:53:30] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2291971 (10ema) Issue reproduced: https://phabricator.wikimedia.org/P3067 [08:54:31] ema: in the sample I have, I always see hit+miss, but that's skewed by the fact that the zero size body is cached by cp3009. [08:54:46] I can run a few more tests and see if I get errors without hit+miss... 
[08:55:10] that would be great, if you want to avoid hitting cached responses just add some query params [08:55:19] eg: style.css?blabla [08:55:36] yep, can do... [08:57:59] ema: no error so far. I'll keep running and see if I get a failure at some point... [08:58:41] thanks, for some reason I'm unable to reproduce this while you seem to be quite effective at triggering the issue [08:58:49] May 13 10:57:53 < Content-Length: 6304 [08:58:49] May 13 10:57:53 < Age: 120 [08:58:49] May 13 10:57:53 < X-Cache: cp1045 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit(123) [08:58:52] May 13 10:57:53 SIZE: 6304 [08:58:55] May 13 10:57:54 < Content-Length: 6304 [08:58:55] May 13 10:57:54 < Age: 0 [08:58:55] May 13 10:57:54 < X-Cache: cp1045 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit+miss(0) [08:58:58] May 13 10:57:54 SIZE: 6304 [08:59:37] May 13 10:57:55 < Content-Length: 6304 [08:59:37] May 13 10:57:55 < Age: 1 [08:59:37] May 13 10:57:55 < X-Cache: cp1045 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit(1) [08:59:38] my magic power: breaking things! [08:59:40] May 13 10:57:55 SIZE: 6304 [08:59:58] could you share your script BTW? [09:02:00] ema: sure, it's so hacky and written in < 2 minutes that I'm not sure why anyone would want to look at it, but here it is: [09:02:05] https://www.irccloud.com/pastebin/wdAejCXU/ [09:03:10] thanks! [09:15:14] And now I can't seem to reproduce the issue anymore... might be just luck or an indication that the issue is related to hit+miss... [09:24:51] perhaps! I'm now trying to reproduce it by using the same random URL for the whole test duration sleeping for 1 minute [09:26:13] no repro so far with: [09:26:17] x-cache:cp1061 miss(0), cp3010 miss(0), cp3009 frontend miss(0) [09:26:22] x-cache:cp1061 miss(0), cp3010 miss(0), cp3009 frontend hit(1) [09:26:28] x-cache:cp1061 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit+miss(0) [09:48:45] <_joe_> so it seems it's not a file vs persistent storage problem? 
[09:48:52] right [09:49:04] <_joe_> sigh I kinda hoped that was the case [09:52:07] me too [10:05:12] I think somehow it must be related, yesterday I managed to repro quite quickly in ulsfo (deprecated_persistent) and couldn't really in esams (file) [10:05:47] today we're running with file everywhere and still we managed to get a bunch of repros, although "less frequently" [10:24:07] alright I'll depool cp1061 to try reproducing there [10:50:08] /win 19 [10:50:13] yeah :) [10:54:14] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2292155 (10Jonas) Issue appearing again ``` curl -v 'https://query.wikidata.org/i18n/en.json' * Hostname was NOT found in DNS... [11:04:11] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2292181 (10ema) >>! In T134989#2292155, @Jonas wrote: > Issue appearing again [...] > < Age: 58 > < X-Cache: cp1045 hit+miss(... [11:11:58] hit+miss is going to be common. in the current misc VCL, objects we're testing tend to have 120s TTL and then 60m grace. the object can probably be kicked out anytime after grace, but probably only opportunistically, not aggressively. [11:12:36] so object goes in, it's a hit for 120s, after that it's a hit+miss for 60m+, unless it gets replaced/refreshed from a 304.
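The object lifecycle just described can be sketched as a toy state machine (plain Python; the 120s TTL and 60m grace come from the discussion, the function name is made up, and the real logic of course lives inside varnishd):

```python
# Toy model of the cache dispositions discussed above: an object is a
# plain "hit" while inside its TTL, a "hit+miss" while in grace (varnish
# serves/refreshes it via a conditional fetch), and a full "miss" once
# it's gone. Hypothetical helper, not varnish code.

TTL = 120        # seconds, the current misc-cluster TTL mentioned above
GRACE = 60 * 60  # the 60m grace mentioned above

def disposition(age):
    """Classify a cache lookup by the object's age in seconds."""
    if age < TTL:
        return "hit"
    if age < TTL + GRACE:
        return "hit+miss"  # expired: revalidate against the next backend
    return "miss"          # evicted: full fetch from the next backend
```

So a loop hitting the same URL should see hit for ~120s, then hit+miss until the object is refreshed or evicted, matching the X-Cache samples in this log.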
[11:13:07] (hit+miss means varnish is at least going to ask for the next backend down the chain to try to give it a 304 based on IMS/INM) [11:13:22] so I'm trying to reproduce on cp1061 (fe+be on cp1061 only) [11:13:28] without much luck [11:13:45] I ran some long loops yesterday too [11:14:03] I'm testing the same URL with a sleep 3, restarting the frontend at every iteration to force frontend misses [11:14:06] over serialized query params for style.css, through the TTL boundaries, across esams+ulsfo, etc... [11:14:26] why force FE miss? [11:14:51] because I'd like to verify whether backend hit+miss are somehow the issue [11:15:02] through two layers of varnish, not through curl [11:15:18] well... [11:15:35] cp1061 probably isn't the place to be, better off in ulsfo or esams [11:16:03] it's the middle-tier that's most often implicated. really the only one that I've witnessed being the bad actor myself [11:16:37] I've never seen the direct-backend give a faulty response, and I've never seen a frontend generate a faulty response except due to a middle-backend telling it to do so. [11:16:56] mmh [11:17:05] and you can test that directly in this case for things like style.css that aren't complicated.
loop your test query to foo:3128 and hit the backend directly [11:17:54] yeah I wanted to briefly test varnish->varnish rather than curl->varnish [11:18:29] re: objects being kicked out anytime after grace, but probably only opportunistically, not aggressively [11:18:32] another thought: wherever you're testing, you can hack -p default_ttl down to like 10s too, and maybe set obj.grace universally to 15s in VCL [11:18:43] so that you can cycle through the states quicker [11:18:49] < Age: 0 [11:18:49] < X-Cache: cp1061 hit+miss(0), cp1061 frontend hit+miss(0) [11:18:49] SIZE: 6304 [11:19:00] < Age: 7 [11:19:00] < X-Cache: cp1061 hit(1), cp1061 frontend miss(0) [11:19:00] SIZE: 6304 [11:19:15] at least in that case it got kicked out relatively quickly [11:19:25] you restarted it :P [11:19:36] that's a great point! :P [11:20:00] also, hit+miss->hit is a natural consequence of refreshing the object [11:20:15] I mean, as serial results at the same layer [11:20:24] the backend in your case [11:20:42] (either via 304 or 200) [11:20:47] right [11:20:56] OK I'll try lowering ttl and grace [11:21:13] and keep I guess? I donno what that even defaults to [11:21:25] it'd be nice if the objects would die on their own in our testing too, though [11:21:48] default_keep seems to be 0 [11:21:56] but still, you may not get a repro in eqiad at all. [11:23:01] when we were testing mostly in esams, it was always the esams backend where you could catch the real screwup, not the eqiad one. [11:23:26] alright I'll move to an esams host [11:23:31] the esams (middle) backend, you could actually observe the first failure in varnishlog outputs. watch it get a legitimate response and emit a faulty response to the next one up [11:24:32] part of the magic of being able to reproduce before was inducing a blend of misses and hit+miss and such though, by cycling through a bunch of query params over and over and letting them vary in which BEs they hit...
[11:24:55] so when locked into a single host, it will take a lot more iterations, but the lowered default_ttl will help [11:27:05] or you could put back persistent storage and CL-sensitive VCL, apparently those make it much easier to hit [11:52:35] anyways, if we're back to the drawing board, that -persist doesn't really kill the bug... [11:52:58] (and it really seemed like it did... wtf is up with the failures after such a long silence of repro) [11:53:30] we really need to find a way to reliably repro [11:53:45] the two patches + no-CL also seemed to fix it for a bit [11:53:59] just because of restarts / storage wipes? [11:54:28] the other fix periods didn't last long [11:55:18] anyways, since we're sure the bug is still there in the same basic form [11:55:57] well, I'd keep the two patches, they're legit and may really help [11:56:06] yes [11:56:10] but maybe put the CL-VCL and the persistent back in play on the test host? [11:56:17] it may make it easier to repro? [11:56:21] and persistent [11:56:48] sounds good [11:56:59] it would be nice if we could get it reproducing relatively-easily, and then isolate it down to a sequence of actions that triggers the bug reliably in isolation... [11:57:18] lunch soon, then I'll get an esams host and do all the above (CL-VCL, persistent, lowered ttl...) [11:57:51] if it's not storage-specific, and not VCL specific, we should even be able to VTC this [11:58:23] based on the observed failures in cp3010:3128 yesterday, a VTC repro should look something like: [11:58:24] yep! [11:59:04] 1) Fetch object through varnish (200 OK server->varnish->client, all normal on first miss) [11:59:22] 2) Wait for object's base TTL to expire [12:00:15] 3) Fetch object via client again, expect varnish to use IMS/INM against server, have server return a 304. So it's server 304->varnish, varnish 200 -> client [12:00:33] I mean that's a basic 304 refresh of an expired cache object.
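Steps 1-3 just described could look roughly like the following varnishtest (VTC) case. This is only a sketch: the ETag, max-age and default_keep values are invented, and the final bodylen expectation is what the bug would violate (a 200 with CL:0 would fail it).

```
varnishtest "304 refresh of an expired object"

server s1 {
	# step 1: first miss, ordinary 200 with a short TTL
	rxreq
	txresp -hdr {ETag: "abc"} -hdr "Cache-Control: max-age=1" -body "hello"
	# step 3: revalidation, answer the INM with a 304
	rxreq
	expect req.http.If-None-Match == {"abc"}
	txresp -status 304 -hdr {ETag: "abc"}
} -start

# keep the expired object around so varnish revalidates instead of refetching
varnish v1 -arg "-p default_keep=60" -vcl+backend {} -start

client c1 {
	txreq
	rxresp
	expect resp.status == 200
	expect resp.bodylen == 5
} -run

# step 2: wait for the object's base TTL to expire
delay 2

client c1 {
	txreq
	rxresp
	expect resp.status == 200
	# a faulty 304+stale-object merge would deliver 200 with CL:0 here
	expect resp.bodylen == 5
} -run
```

As the discussion below notes, the missing ingredients are probably protocol details (gzip, TE:chunked, keep-alive) on either side of varnish, which this sketch doesn't exercise yet.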
I'm sure that scenario's already in some stock test [12:00:51] the trick is, we also need to play with protocol details on both sides of varnish to induce it, probably [12:01:11] whether gzip and/or TE:chunked is involved on either side, CL header present/missing in 304, etc? [12:01:42] connection: close vs connection: keepalive [12:01:54] there's some combination of factors there... [12:02:33] or a race condition on object's TTL expiry boundary [13:03:58] I've made up an extended version of the loop-tests I was doing yesterday, to try to catch a live failure via a ulsfo frontend [13:04:01] FTR: [13:04:03] bblack@palladium:~$ while [ 1 ]; do for x in {1..3}; do curl -v -w 'SIZE: %{size_download}' -s -H 'Host: query.wikidata.org' -H 'X-Forwarded-Proto: https' "http://cp4002.ulsfo.wmnet/style.css?bblackx=${x}" 2>&1 |tee test.out |egrep 'SIZE'|grep -q 'SIZE: 6304' || break 99; done; done; echo FAILED [13:04:13] that's running continuously now, will leave it for hours or until it fails [13:04:49] what that does is always request through 4002, with a unique query string nobody else is using so I have my own cache slots. it iterates through ?bblackx=1, 2, 3 over and over. [13:05:08] and if the output ever fails to have 6304 bytes, it will halt and leave me a test.out with full headers/output. [13:05:23] by looping fast over and over, it should eventually trip TTL edge cases [13:05:42] ok [13:05:43] I think the main diff from gehel's tests is that his are randomized on every request, so it's always a miss [13:06:05] almost always, he got some hits too https://phabricator.wikimedia.org/P3067 [13:06:22] heh and now that I pasted it, I realize the double-grep is pointless. fallout from so much re-editing and re-purposing previous commandlines [13:06:45] oh I thought he had a random query param?
[13:07:03] I thought so too, at least the bash script he pasted uses $RANDOM [13:07:34] yeah but the curl output says not-random [13:07:36] hmmm [13:08:21] GET /style.css [13:10:06] I'll depool cp3007 and start playing with it [13:10:19] ok [13:10:37] I fixed my double-grep, and changed my query param in case anyone else pastes from above :) [13:11:19] if someone wants to use your leaked secret param :) [13:13:28] of course for all I know the bug is strange in transitiveness [13:13:37] I only know for sure we hit it on an esams backend in normal traffic flow [13:13:56] it may not be that it hits all mid-backends, only the first, and then the second corrects it [13:14:05] we've seen this auto-correct itself through layers before I think [13:14:31] I'll make more loops, against direct ulsfo backend, and direct to an esams backend (not yours!) [13:15:04] ok [13:22:48] so much terminology overload [13:23:09] frontend/backend varnishes, frontend/backend workers, backend servers... [13:23:19] yup [13:23:26] we should invent new words [13:23:44] we can't change varnish's front/back worker terminology [13:24:09] but we can change ours [13:24:31] we could call our two varieties of varnish daemon edge and mid [13:24:33] like skinny varnish and fat varnish :) [13:24:43] or edge and disk [13:24:47] I donno [13:25:01] 10Traffic, 06Operations: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292421 (10fgiunchedi) [13:25:06] I've been calling backend servers 'appservers' lately [13:25:13] godog: thanks! [13:25:27] yeah I like 'appservers' [13:25:54] jsessionid? [13:25:55] wtf [13:27:24] only varnishrls and varnishxcps mess with varnish.clients that are in puppet, I think [13:27:52] I think it's xcps [13:28:11] ema: np!
[13:28:16] yeah it is xcps [13:28:19] def vsl_callback(transaction_id, tag, record, remote_party): for k, v in key_value_pairs.findall(record): [13:28:33] varnish.clients.d.166682 t=1463145218955438:1|c [13:28:33] varnish.clients.d.5358 t=1463145673022109:1|c [13:28:33] varnish.clients.jsessionid.1c69 [13:28:35] it assumes its input is already filtered down just to X-Connection-Properties [13:28:37] but it's not [13:29:12] also those single-little messed-up keys showed up a day or two ago in our tls stats too, which also come from xcps [13:29:31] look at the cipher list at the bottom of https://grafana.wikimedia.org/dashboard/db/tls-ciphers [13:29:54] we now have ciphers: m, n, s, d, m_3bo_a, etc... [13:30:22] I think varnishxcps is getting more input loglines than just X-Connection-Properties, and parsing whatever junk it gets with its automatic k=v -> graphite logic [13:30:30] ugh, yeah also phpsessid in there [13:32:05] so the last change there was mine on May 4: 2739346a30864f07084514c01a854fdb62b4dba7 [13:32:08] err [13:32:12] varnishxcps: fix CP header parsing for H2= [13:32:28] another trigger date is of course cache_misc turning on xcps4 (before was just maps) [13:32:34] a few days ago, makes more sense [13:32:54] I think maybe xcps's use of varnishlog got it only the XCP header, and xcps4 is pulling in more junk [13:34:04] will confirm on a maps server... [13:39:16] heh ok [13:39:29] so the diff from xcps to xcps4 is to replace RxHeader with BerespHeader... [13:39:49] I think that should be ReqHeader [13:40:29] but then also, the varnishlog thing still doesn't filter on XCP [13:40:37] on a misc server anyways [13:40:43] does on maps, where it all works fine anyways [13:40:46] very puzzling [13:41:57] anyways, I'm going to sort this out and make it work right... [13:46:18] bblack: thanks!
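The suspected failure mode — the k=v parsing being applied to whatever records varnishlog hands over, not just X-Connection-Properties — can be sketched in a few lines of Python. This is a simplified, hypothetical stand-in for the varnishxcps callback, not the actual script:

```python
import re

# k=v pairs such as "H2=1; SSR=0; C=ECDHE-..." in a header value
KEY_VALUE = re.compile(r'(\w+)=([^;\s]+)')

def xcp_stat_keys(record):
    """Turn one logged header record into graphite-style stat keys.

    The header-name filter is the crucial part: without it, any header
    containing k=v junk (e.g. a Set-Cookie with a jsessionid) gets
    counted too, which is exactly the metric spam described above.
    """
    name, _, value = record.partition(':')
    if name.strip().lower() != 'x-connection-properties':
        return []
    return ['varnish.clients.%s.%s' % (k.lower(), v.lower())
            for k, v in KEY_VALUE.findall(value)]
```

The equivalent fix in the real script is to filter the varnishlog input (or check the header name) before feeding records to the k=v parser.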
[13:46:20] 10Traffic, 06Operations: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292474 (10fgiunchedi) more data from a captured tcpdump, see also the big udp packet size ```lines=5 13:44:08.922449 IP cp4002.ulsfo.wmnet.9705 > graphite1001.eqiad.wmnet.8125: UDP, le... [13:48:17] I'm also currently looking into why at 12.00 the carbon relay started dropping metrics towards codfw, odd https://grafana.wikimedia.org/dashboard/db/graphite-eqiad [13:51:40] the varnish3 version seems to work right and filter right [13:51:57] the varnish4 version is grabbing all kinds of junk it shouldn't [13:53:21] the varnishlog module is just very different between the two in practice, on how -i and -I work, not just the field names [13:53:44] 10Traffic, 06Operations, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292421 (10Krinkle) > varnish.clients.d.42387 t=1463146627656326:1|c Looks like these don't belong, either. Collecting it in some way could be useful (T131894) - b... [13:56:13] godog: I assume after I stop the flow of junk, you can just rm the bad keys right? [13:56:44] bblack: yup [13:57:01] ema: can we stop varnishxcps on 3007 too? I assume it has puppet disabled and won't get the fix yet [13:57:09] sure [13:57:14] just reproduced the issue btw [13:57:29] awesome [13:57:46] see eg: neon.wikimedia.org:~ema/test.210 [13:57:53] my 4 different loops against different layers/scenarios have been running for nearly an hour and not hit it :( [13:57:57] I saved varnishlog output on cp3007 [13:58:19] -p default_ttl=10 -p default_grace=20 are the key to repro faster I guess [13:59:25] ok so.... [13:59:31] do you have the one that wasn't an FE hit?
[13:59:46] test.210 is presumably the 3rd failure in that series with FE hit(2) [14:00:06] test.204 [14:00:09] < X-Cache: cp1051 hit+miss(0), cp3007 miss(0), cp3007 frontend miss(0) [14:00:34] ah 102 is an initial fail too [14:00:48] yep [14:00:53] test.102:SIZE: 0 [14:00:55] test.103:SIZE: 0 [14:00:55] test.104:SIZE: 0 [14:00:55] test.105:SIZE: 0 [14:00:55] test.107:SIZE: 0 [14:00:56] test.108:SIZE: 0 [14:00:59] test.111:SIZE: 0 [14:01:01] test.196:SIZE: 0 [14:01:04] test.200:SIZE: 0 [14:01:06] test.203:SIZE: 0 [14:01:09] test.204:SIZE: 0 [14:01:11] test.207:SIZE: 0 [14:01:14] test.210:SIZE: 0 [14:01:23] so now the million dollar question - can you correlate varnishlog on FE and/or BE during the initial fail (102 or 204)? [14:01:35] heh :( [14:01:41] I suspect the BE one will be more interesting [14:01:58] I mean it didn't take *too* long to repro so I guess we'll manage [14:02:08] if you can repro like that again, maybe just keep a huge varnishlog of everything, then go back and grep for varnish transactions with the X-Varnish number? [14:02:27] since only you are on that host [14:02:48] I guess you need no regexes, just a varnishlog on each of FE and BE running and logging everything [14:03:07] makes sens [14:03:12] sense [14:03:19] and also the -c and -b to get both sides of the transaction on each? 
varnishlog3 always claimed it could do both, but never did for me before heh [14:03:23] maybe varnishlog4 can :) [14:04:06] godog: junk output should be terminated now [14:04:43] oh [14:04:49] purge is going to spam your logs a lot pointlessly [14:04:59] shut off vhtcpd daemon to kill the purge flow [14:05:05] -q 'not ReqMethod eq PURGE' [14:05:08] or that [14:05:59] OK if you want to take a look at the frontend varnishlog for those failed requests (eg: test.204) it's on my home on cp3007, perhaps using X-Varnish to correlate [14:06:17] in the meantime I try to repro with a varnishlog of both frontend and backend [14:06:25] ok [14:07:13] ~ema/varnishlog is the file [14:07:28] now I'm creating varnishlog-frontend and varnishlog-backend and try to reproduce [14:08:54] bblack: thanks! I'll see what metrics stop updating and I've moved the obvious junk out of the way for ssl_ciphers [14:12:33] ema: from your previous log, this was the backend's response to the frontend on initial uncached failure: [14:12:36] https://phabricator.wikimedia.org/P3079 [14:12:47] so cp3007's output was bad, the error didn't originate in the frontend [14:12:54] heh [14:13:01] s/cp3007/cp3007:3128/ [14:13:10] right [14:13:47] note the Content-Encoding: gzip [14:13:53] yeah, but we expect that [14:14:08] it's gzippable in our VCL via do_gzip, and I think even the applayers gzips this output too [14:14:14] on successful requests there is no Content-Encoding [14:14:49] actually on curl output there is no C-E [14:15:48] well curl doesn't ask for gzip by default, unless you give it --compressed [14:15:55] and I just re-checked, and wdqs doesn't gzip when asked to either [14:16:01] so this is a "do_gzip does something" case [14:16:36] when backend-most requests from wdqs, it will AE:gzip, but the applayer won't gzip. 
because it's a compressible content-type, varnish in eqiad will gzip it on the way into the object storage [14:16:37] oh, right, --compressed [14:17:08] and then the next backend up (cp3007) should also request with AE:gzip, and eqiad should respond with the (already) gzipped content [14:17:11] and so-on [14:17:23] and then it gets de-compressed by cp3007 frontend at client request time, just for that client [14:17:30] (if curl w/o --compressed) [14:17:37] that's the behavior I'd expect, anyways [14:18:25] in any case, I've done past repros with both curl --compressed and without [14:18:32] FTR trying to repro with: [14:18:34] the output was slightly different, but same effect [14:18:35] varnishlog -g request -n frontend -q 'not ReqMethod eq PURGE' | tee ~ema/varnishlog-frontend [14:18:41] varnishlog -g request -q 'not ReqMethod eq PURGE and not ReqURL eq "/check"' |tee ~ema/varnishlog-backend | tee ~ema/varnishlog-backend [14:18:56] with --compressed curl gets a TE:chunked response, no content, 5s later connection terminates (so it makes a big pause in your test) [14:19:08] without --compressed, it's not TE:chunked, and explicit CL:0 [14:19:55] but if we have some variance on CE on the 3007->eqiad fetch+response... yeah that's something to look into and understand for sure [14:20:19] and if we can reliably repro this with your new method in a relatively short time... [14:20:28] immediate repro on first try [14:20:32] nice! [14:20:32] < X-Cache: cp1051 hit+miss(0), cp3007 hit+miss(0), cp3007 frontend hit+miss(0) [14:20:47] ~ema/run-2/test.0 [14:21:30] ok wait a sec [14:21:45] the request that generated test.0, that was curl without --compressed, to cp3007:80 ? [14:21:54] yes [14:22:06] why is the frontend giving you gzip output when you didn't ask for it? [14:22:10] or claiming to anyways [14:22:25] interesting [14:22:47] I curled as follows: [14:22:49] maybe with the CL:0 bug already in play, everything else is suspect. 
it never thinks to decompress empty content for you [14:22:52] curl -v -w 'SIZE: %{size_download}' -H 'X-Forwarded-Proto: https' -H 'Host: query.wikidata.org' $url 2>&1 [14:22:57] so this could be effect rather than cause [14:23:21] anyways, let's correlate this to cp3007's view of the front and back transactions on the hit+miss [14:23:32] sorry, cp3007:3128's view... [14:23:47] I've stopped the test and varnishlogs [14:25:02] well it should be the first line of varnishlog-backend right? [14:25:10] * << Request >> 3562758 [14:25:17] < X-Varnish: 209077939, 3562758, 2947323 [14:25:59] - RespUnset Content-Length: 1829 [14:26:02] ah, but gzipped [14:26:33] yeah trying to sort through where the client and backend halves are... [14:26:48] I guess those first two chunks, Request and BeReq [14:28:56] so, eqiad gave a 304 [14:29:05] 3007:3128 gave a 304 to 3007:80 [14:29:12] 3007:80 gave you a bad output [14:30:49] -- ObjHeader Content-Length: 1829 [14:30:49] -- BackendReuse 25 boot.be_cp1051 [14:30:49] -- Timestamp BerespBody: 1463149155.857507 0.168052 0.000114 [14:30:49] -- Length 0 [14:31:05] so CL=1829 but Length=0 [14:31:08] I think in this case, the frontend made the mistake [14:31:45] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2292703 (10Ottomata) We didn't get a chance to fully restart each broker with `inter.broker.protocol.version=0.9.0.X` this w... [14:31:50] you're looking at 3007:3128's request to 1051 [14:32:04] 1051 returned a 304-NotModified, so it does have Length:0 (no content with a 304) [14:32:13] ObjHeader: CL:1829 is from the cached object it's reusing on 304 [14:32:23] the expired-but-now-revalidated one [14:32:31] aha! 
[14:32:35] ok [14:33:51] so then Request 2947323 (first one in the frontend varnishlog) is the culprit [14:34:06] - RespStatus 200 [14:34:09] - RespHeader Content-Length: 0 [14:34:38] so a 304 is not supposed to contain a content-length [14:34:49] I'm just tracing back through to verify, but I think the FE made the mistake [14:35:32] well the backend responded with 304 and CL:0 [14:35:46] I don't think it did [14:36:01] in frontend, section << BeReq >> [14:36:20] 113 -- BerespStatus 304 [14:36:20] BeRespProtocol and so-on there, that's it receiving the response from the backend... [14:36:24] 134 -- BerespHeader Content-Length: 0 [14:36:31] (first number is the line number) [14:36:32] there's no CL header there [14:37:14] well let's start back in the backend log, before getting into that mystery... [14:37:33] in the top section there, the client-request side [14:37:36] the response on that [14:37:51] it starts out as : [14:37:52] - RespProtocol HTTP/1.1 [14:37:52] - RespStatus 200 [14:37:52] - RespReason OK [14:37:59] - RespHeader Content-Length: 1829 [14:38:01] etc [14:38:14] then during deliver-time processing it tacks on: [14:38:15] - RespProtocol HTTP/1.1 [14:38:15] - RespStatus 304 [14:38:15] - RespReason Not Modified [14:38:15] - RespReason Not Modified [14:38:17] - RespUnset Content-Length: 1829 [14:38:38] so internally, it first views it as a 200, then converts to a 304 w/o CL because the etag matches or whatever [14:39:05] now in the backend side of the frontend log, where we're receiving that... [14:39:21] it starts with: [14:39:21] -- Timestamp Beresp: 1463149155.857729 0.168497 0.168402 [14:39:21] -- BerespProtocol HTTP/1.1 [14:39:21] -- BerespStatus 304 [14:39:22] -- BerespReason Not Modified [14:39:24] ... 
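Correlating repros like these means reading a lot of X-Cache headers; the "host [frontend] state(hits)" entries used throughout this log can be parsed mechanically. A hypothetical helper, assuming the format shown in the pastes above:

```python
import re

# One X-Cache entry, e.g. "cp1045 hit+miss(0)" or "cp3009 frontend hit(123)"
ENTRY = re.compile(r'(\S+)(?: (frontend))? ([a-z+]+)\((\d+)\)')

def parse_x_cache(header):
    """Split an X-Cache header into (host, layer, state, hits) tuples,
    ordered backend-most first as in the log output above."""
    out = []
    for part in header.split(','):
        m = ENTRY.match(part.strip())
        if m:
            host, layer, state, hits = m.groups()
            out.append((host, layer or 'backend', state, int(hits)))
    return out
```

With something like this, the test loops could flag which layer first went wrong instead of grepping saved varnishlog files by hand.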
[14:39:32] with no Content-Length on reception (correct) [14:39:56] then when that block of headers ends, this is at the bottom (including the last line): [14:40:00] -- BerespHeader Connection: keep-alive [14:40:02] -- TTL RFC 10 20 -1 1463149156 1463149156 1463149155 0 0 [14:40:05] -- BerespProtocol HTTP/1.1 [14:40:07] -- BerespStatus 200 [14:40:09] -- BerespReason OK [14:40:12] -- BerespHeader Content-Length: 0 [14:40:14] -- VCL_call BACKEND_RESPONSE [14:40:17] -- TTL VCL 10 3600 0 1463149156 [14:40:42] that looks to me like that last bit is varnish internally-modifying the response. it's turning the beresp==304 + internal expired object into a 200 response for (refreshing the cache object and/or responding to the client) [14:40:53] and it sets CL:0, which is the first time we see explicit CL:0 [14:41:08] and then sure enough back up in the request, it's served with explicit CL:0 [14:41:34] (and I think the lack of gunzipping, which leaves CE:gzip in the response headers, is because it probably never even thinks to decode gzip for a non-gzip client if there's no content-length) [14:42:12] so I think this happened in the frontend cache, in the logic that processed a legitimate 304 response from the backend cache which lacked CL [14:42:26] so for some reason, sometimes, 304 responses from the backend are returned as 200 (good) but with CL:0 (bad) [14:42:35] when it tried to combine that 304 with its existing expired object that had the content, for some reason it set explicit CL:0 incorrectly [14:42:42] y [14:45:12] digging in varnish4 source for a bit... [14:46:23] FTR another repro is https://phabricator.wikimedia.org/P3081 [14:46:58] staring at varnishlog for a bit [14:47:59] that one lacks CE:gzip on the final output [14:48:20] maybe in that case, the error happened in the backend not the frontend (but similarly), and that changes things a bit? 
not sure [14:50:20] so the gzip code has in one place: [14:50:20] if (http_HdrIs(vc->http, H_Content_Length, "0")) { [14:50:20] http_Unset(vc->http, H_Content_Encoding); [14:50:20] return (VFP_NULL); [14:50:37] it's supposed to skip gunzip/gzip and clear CE header if CL:0 [14:50:54] I'm not sure what all cases that ends up applying to, and maybe not directly relevant [14:52:19] in varnishlog-frontend I can find only one 304 [14:52:27] $ grep 304 varnishlog-frontend |grep Stat [14:52:27] -- BerespStatus 304 [14:55:24] so in grep analysis of varnish4 source (with all our patches applied) [14:55:48] "git grep H_Content_Length" -> a bunch of lines, but every reference there is not a write-operation [14:55:58] well sorry that's not quite right [14:56:17] there's a lot of http_GetHdr and http_Unset and http_HdrIs [14:56:25] but no setting of any explicit new value [14:56:32] just get/check, or complete unset [14:56:44] but: [14:56:45] git grep -i 'Content-Length' bin/varnishd include [14:56:57] bin/varnishd/cache/cache_req_body.c: http_PrintfHeader(req->http0, "Content-Length: %ju", [14:57:00] bin/varnishd/cache/cache_req_body.c: http_PrintfHeader(req->http, "Content-Length: %ju", [14:57:03] bin/varnishd/cache/cache_req_fsm.c: "Content-Length: %jd", req->resp_len); [14:57:06] bin/varnishd/cache/cache_req_fsm.c: http_PrintfHeader(req->resp, "Content-Length: %zd", [14:57:12] now req_body is basically for POST (body content in a request) I think [14:57:34] which means really only those two lines in cache_req_fsm could possibly be setting explicit CL:0 ? 
[14:58:46] the second one there is in synthetic responses only [14:59:05] and the other's context looks like: [14:59:05] http_Unset(req->resp, H_Content_Length); [14:59:05] if (req->resp_len >= 0 && sendbody) [14:59:05] http_PrintfHeader(req->resp, [14:59:06] "Content-Length: %jd", req->resp_len); [15:00:01] maybe we're hitting this block in this case, with req->resp_len == 0 (from the 304), and then it overwrites the existing good CL header with CL:0 [15:00:17] now the question is, what does the code look like when it comes through here for this case... [15:00:41] (well, maybe a better way to state that is: what does the state of the variables look like, and the calling stack, when we get here in this case) [15:04:12] so earlier in that function ( cnt_vdp() in cache_req_fsm.c ) [15:04:16] there's a protection against this: [15:04:22] } else if (status < 200 || status == 204 || status == 304) { [15:04:23] req->resp_len = -1; [15:04:23] sendbody = 0; [15:04:56] but I don't think that works right in this case, when we translate a backend's 304 to a 200 [15:05:06] I think at this point, the translation to 200 already happened [15:05:09] https://phabricator.wikimedia.org/P3082$59 [15:05:26] it does an explicit RespUnset CL [15:05:55] and RespHeader CL:0 [15:06:11] yeah I think that's from my code block above [15:06:22] in this second example, the cp3007 backend made the mistake, not the frontend [15:06:28] I think? [15:06:47] yes that's from varnishlog-backend [15:10:14] ok I'm still tracing through code, I'm gonna stop saying random things till I get somewhere... 
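The suspicion above can be condensed into a toy model of that cnt_vdp() decision (hypothetical Python, names loosely following the C quoted above — not varnish source): once the backend's 304 has been translated to a 200, the status == 304 guard no longer fires, and a resp_len still stuck at 0 from the 304 clobbers the good Content-Length.

```python
def delivered_content_length(status, resp_len):
    """Toy model of the Content-Length decision in cnt_vdp() quoted above.

    status:   response status at delivery time
    resp_len: the body length varnish believes it will send
    Returns the Content-Length value emitted, or None for no CL header.
    """
    sendbody = True
    if status < 200 or status in (204, 304):
        resp_len = -1   # the protection block quoted above
        sendbody = False
    # models: http_Unset(CL); if (resp_len >= 0 && sendbody) set CL:resp_len
    if resp_len >= 0 and sendbody:
        return resp_len
    return None
```

A real 304 to the client takes the guard and emits no CL; but a backend 304 already rewritten to a 200, with resp_len still 0, falls through and re-emits an explicit CL:0 over the stale object's correct CL:1829.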
[15:10:22] :) [15:16:41] ok I'm pretty sure I basically know what's going on, but I'm not sure how to carry the state through to make a fix [15:17:04] essentially, the merge of the 304+stale_object => new virtual response happens in one place [15:18:10] then later in cache_req_fsm.c cnt_vdp(), we arrive in a status where 'status' (from status = http_GetStatus(req->resp);) is 200, req->resp's Content-Length is already-correct from header merges... [15:18:20] but resp_len is still zero from the 304 [15:18:25] and we thus re-set to zero there [15:19:50] but resp_len comes from parsing req->resp's existing Content-Length earlier in the function... [15:20:43] perhaps something went wrong here? https://github.com/varnishcache/varnish-cache/commit/4954ed3b6183bdf712324506fe7bc6074b20c0fe [15:23:21] yeah all of the code I'm looking at is new from that patch [15:24:01] that patch needs a much better commit message, and/or broken up into smaller changes :P [15:24:52] ema: how hard is it to try a stabby random patch, in terms of re-build and install on 3007 and repro? [15:25:06] not hard [15:25:37] I'll just build it on my workstation, scp, install [15:25:41] https://phabricator.wikimedia.org/P3083 [15:26:32] that's from looking at your debug output and reading source. I think it compiles. I have no idea what I'm doing. [15:26:37] call it a hunch [15:27:49] it would also be interesting, if we think we understand the problem this well, to try to build a VTC test case. I'm not even sure what all the conditions are to make the VTC reliably fail though. [15:29:06] I think the patch is worth a shot though, it feels right [15:30:07] compiling [15:31:28] 0007-134989-random-fix-attempt.patch applied, installing on cp3007 [15:33:23] trying to reproduce [15:33:54] out of curiosity, did the rebuild re-run all the stock VTC tests? did it break anything? 
[15:34:01] Error: 503, Backend fetch failed [15:34:07] heh nice [15:35:02] nope, no tests [15:35:43] ok so that didn't work, sorry :) [15:36:12] it was a nice try! [15:36:18] what does AZ do [15:36:19] ? [15:36:23] AZ(bo->was_304); [15:38:16] AZ(x) asserts that x is zero, basically [15:38:20] I think [15:38:38] is that what made the 503? a crash on that AZ? [15:39:12] nope, I was just grepping through the code for was_304 and noticed an AZ(bo->was_304) in bin/varnishd/cache/cache_fetch.c [15:39:50] if you want to humor one more random stab: [15:39:51] https://phabricator.wikimedia.org/P3084 [15:40:13] why not [15:42:56] building [15:43:53] another avenue of exploration now that you have a way to repro fairly easily, of course, is from yesterday: start commenting out lots of our complex shared VCL, in case we do have some odd VCL interaction [15:44:13] almost none of it is necessary for this basic test case [15:45:16] 503 again [15:45:22] oh well [15:46:33] ah! [15:46:36] it's me being silly! [15:46:47] after a package upgrade we need to redefine the systemd unit [15:47:01] well the second fix seems righter than the first [15:47:07] try the second one first if we're going back through these [15:47:07] alright [15:47:19] nothing is listening on 3120 :) [15:48:16] to avoid this and some ambiguity BTW we could call it varnish-backend.service instead of varnish.service [15:48:35] I had a task about some of that at one point [15:48:47] to do a couple of inter-related things, but it's a complex change to roll out all over the place [15:49:08] 1) Yes, stop using the default instance and call it varnish-backend for us, and have puppetization stop/disable the default one [15:49:28] that also cleans up a bunch of VCL that has to look at frontend-or-not to decide if we need -n flags on various things or not: now we always use -n [15:49:50] 2) but really, name them even more-explicitly: varnish-text-backend, varnish-maps-backend, etc [15:50:31] 3) Switch (by first adding, then later
post-transition removing the old) ports to per-cluster too. each cluster uses a unique set of fe and be listening ports, none of which are port80 [15:50:57] test running [15:50:58] well the port80 part also involves: move port 80 traffic to go into nginx like 443, and have nginx do the quick 301->HTTPS for all traffic. [15:51:25] but that's predicated on finishing up the HTTPS transition to that level of completeness. right now we still have junk-domains and POST traffic legitimately hitting varnish on 80 and not wanting 301 [15:52:12] the per-cluster instance names and ports thing gives us the capability to put several caches in one machine, which was an idea for the mini-pops [15:52:43] (deploy 1-2 beefy boxes, with one nginx in front splitting traffic on hostnames to all 4x different varnish-foo-frontend daemons/ports) [15:53:03] mini-pops being very small edge pops that backend to existing real cache pops, that we can deploy cheaply in lots of places [15:53:09] reproduced [15:53:43] but I think faidon shot that down pretty well, by noting that most of the cost in the cache pop is having rackspace + network links at all, not the cache machine count, so why bother with the complexity.
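Item (1) from the naming list above might end up looking roughly like this as a unit file. The path, ports, VCL filename, and storage flags here are all invented for illustration, not the actual puppetized config:

```ini
# Hypothetical /lib/systemd/system/varnish-backend.service sketch
[Unit]
Description=Varnish HTTP accelerator (backend instance)
After=network.target

[Service]
Type=forking
# Explicit -n instance name: tools and VCL can always pass -n,
# with no special-casing of "is this the default instance?"
ExecStart=/usr/sbin/varnishd -n varnish-backend -a :3128 \
    -f /etc/varnish/wikimedia_misc-backend.vcl -s malloc,1G

[Install]
WantedBy=multi-user.target
```

The point of the rename is exactly the ambiguity hit above: with a non-default instance name, a stale unit definition fails loudly instead of silently serving the wrong config.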
[15:54:23] ok so fix-attempt-2 doesn't work [15:54:33] varnishlogs on cp3007:~ema/varnishlog-frontend-2 and ~ema/varnishlog-backend-2 [15:55:20] curl output here: https://phabricator.wikimedia.org/P3085 [15:55:41] _joe_'s presentation starts in 5 minutes btw :) [15:56:31] yeah [15:56:46] we may actually have two distinct bug patterns here, although maybe one real underlying code bug [15:56:53] <_joe_> yeah but you are the two people who should already know most of what I'm saying [15:57:17] the one that does RespUnset of CL before setting it to zero, and the one that just sets it to zero without a previous Unset [15:57:29] I think my fixes, if they worked at all, would only fix the former not the latter [15:58:43] ema: if fix#1 doesn't work either, I'd say our next best shot in terms of effort:reward-odds is probably stripping down VCL for repro [15:59:00] (and back on the no-hacky-fixes package) [15:59:34] surely if we can repro with stripped VCL, we can make a VTC that emulates this [16:01:09] also, IMHO if we don't find a fix soon today, we should consider reverting the bulk of them back to varnish3 to stop causing errors for users [16:01:39] we could keep 1x eqiad + 1x esams depooled and on varnish4 with an explicit backending-path to repro [16:03:00] +1 [16:03:16] oh hey, while I've been busy 2/4 of my loop-tests finally hit failures [16:04:03] and because of my screwup, one of those doesn't have a valid capture output [16:04:07] the other does though [16:04:16] it was a TTL-boundary case again, through a ulsfo frontend [16:05:27] https://phabricator.wikimedia.org/P3086 [16:05:33] perhaps there are multiple code-paths triggering the same (or similar) bug? 
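For the VTC idea mentioned earlier, a starting point emulating this scenario (a cached object going stale, then being revalidated via 304) might look like the sketch below. It is entirely hypothetical and untested; as noted in the discussion, the exact conditions needed to make it fail reliably weren't known, and the keep/grace tuning here is a guess:

```
varnishtest "CL:0 after 304-revalidation of a stale object (sketch)"

server s1 {
	rxreq
	txresp -hdr {ETag: "xyzzy"} -hdr "Cache-Control: max-age=1" -bodylen 1829
	rxreq
	expect req.http.If-None-Match == {"xyzzy"}
	txresp -status 304 -hdr {ETag: "xyzzy"}
} -start

varnish v1 -vcl+backend {
	sub vcl_backend_response {
		# keep the stale object so expiry triggers a conditional fetch
		set beresp.keep = 10m;
	}
} -start

client c1 {
	txreq
	rxresp
	expect resp.bodylen == 1829

	delay 2

	txreq
	rxresp
	# the bug under investigation would make these fail
	expect resp.http.Content-Length == "1829"
	expect resp.bodylen == 1829
} -run
```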
[16:05:51] other failures we got were not TTL-boundary AFAICT [16:06:01] Age:120, CL:0, no CE:gzip [16:06:21] yeah I think TTL boundary-cases just aggravate it easier, for some types of testing [16:06:38] well or maybe it's not really the TTL-boundary [16:07:27] perhaps just the miss [16:07:29] so in my paste, we have an Age:120 hit from codfw backend, miss in ulsfo backend + ulsfo frontend in front of that [16:07:37] it's just the slight timing race on expiry between those 3 [16:07:45] codfw still had the live object, the next two had just expired it [16:08:20] it was probably the expiry of the grace that it happened on, and not a 304 (no hit+miss, which seems to be a prereq for 304-conditions) [16:08:33] so, 304 is not really a requirement [16:08:43] I think [16:08:47] hmmmmmmmmm [16:12:09] ema: before you get into crazy stripping down of random VCL [16:12:28] maybe just kill the do_gzip = true line in common VCL template on 3007 [16:12:50] for both frontend and backend [16:12:52] it might be one of the most-likely things [16:12:53] yeah [16:12:56] alright [16:13:24] I donno how much effect it will have. we're assuming the eqiad backend always emits good output (I've never observed differently yet) [16:13:42] it would be nicer to have a whole depooled stack to play with including the eqiad backend of course [16:14:01] but I think this scenario is still useful for now [16:14:19] well yeah we could depool one eqiad machine too and point cp3007 to that one only [16:14:37] if we suspect eqiad backends as well [16:14:57] I don't suspect them yet, just noting it's a hole in our test isolation [16:15:24] the output from wdqs->eqiad is pretty consistent. 
it does not honor AE:gzip or conditional-request (it's always ungzipped 200 OK) [16:19:18] trying to repro without do_gzip [16:22:47] bblack: reproduced https://phabricator.wikimedia.org/P3087 [16:24:19] Content-Encoding: gzip (from cp1051 I guess) [16:25:24] I don't see CE in that paste [16:25:33] nope, I've seen it in varnishlog [16:25:39] I'm creating a paste [16:25:42] ok [16:25:55] I guess the frontend gunzipped to zero length here [16:26:06] (but probably the input to gunzip was already screwed) [16:26:31] https://phabricator.wikimedia.org/P3088 [16:27:43] https://phabricator.wikimedia.org/P3088$60 [16:27:53] the bereq section at the bottom looks sane... [16:28:18] the beresp 304 stuff, the translation to 200, and the objheader stuff all do [16:29:37] even the initial respheader in the top request section looks sane [16:29:57] then: [16:29:57] - VCL_return deliver [16:29:58] - Timestamp Process: 1463156480.737622 0.083821 0.000028 [16:29:58] - RespHeader Accept-Ranges: bytes [16:29:58] - RespUnset Content-Length: 1829 [16:30:00] - RespHeader Content-Length: 0 [16:30:15] (and also, lack of gunzip for the client, but maybe that's being blocked by the CL:0 bug doing the above) [16:30:46] maybe the set of Accept-Ranges there is a hint to look for where this happens in code? [16:31:24] I still think that specific block mentioned earlier in cache_req_fsm is actually doing the CL:0 RespHeader line there [16:31:37] I just don't know where the bug lies before we get there and how to fix it [16:32:28] other than setting CL:0 it also avoids sending the body right?
[16:32:42] well, I think [16:32:47] looks like [16:32:59] it may not be explicit, it may just be that setting CL:0 kills the body later elsewhere, too [16:33:15] right [16:33:23] but then there's also that sendbody variable [16:33:27] yeah [16:34:10] but again, it could be that the logic in there is perfect, and the state of bo, req, and req->resp is already buggy on function entry, from a bug elsewhere before we get there. [16:34:36] how come nobody else is screaming about this, I'm wondering [16:35:07] yeah that's what says to me we should still try to strip down VCL [16:35:45] but really, who knows how many high-volume sites where people scream about intermittent errors, with multi-layer varnishes, and a good mix of randomly-different applayer backend behaviors, exist yet... [16:35:51] that have switched to v4 yet [16:36:40] I mean really as a first-test, I'd get rid of virtually all of our VCL [16:37:09] make two new simple single-file VCLs for cp3007 fe + be, which just set the appropriate backend and leave all VCL falling through to defaults? [16:37:59] why not [16:38:53] so basically just the backend_hints [16:38:57] for this isolated test URL case, none of the VCL should matter to get a functional response [16:57:47] heh Accept-Ranges: bytes is set in that same function [16:58:28] that just further confirms, the CL:0 is being set by bin/varnishd/cache/cache_req_fsm.c line 105 [16:58:31] the: [16:58:34] } else { [16:58:36] http_Unset(req->resp, H_Content_Length); [16:58:39] if (req->resp_len >= 0 && sendbody) [16:58:41] http_PrintfHeader(req->resp, [16:58:44] "Content-Length: %jd", req->resp_len); [16:58:49] we've still got two distinct trace cases in varnishlog...
[16:59:05] one where we see that Unset first before Set CL:0, the other we don't see unset [16:59:16] I'm guessing the only difference is that the unset goes unlogged if there was no existing header to unset [17:02:09] it also means we know sendbody is true when we set CL:0 [17:02:28] so we know status already == 200 in there [17:02:58] but this if-case fails: } else if (resp_len >= 0 && resp_len == req->resp_len) { [17:03:08] and this one succeeds: if (req->resp_len >= 0 && sendbody) [17:03:29] and we know req->resp_len is zero (because that's what's used as the value 0 in the set header) [17:03:53] so, resp_len isn't zero (or the first if above would succeed rather than fail) [17:04:49] basically, resp_len has the right body size value. the response already has content-length: $resp_len. but because of the logic here, we wipe it out and replace it with req->resp_len, which is zero (which is probably from a 304, I think?) [17:08:10] req->resp_len became zero when set at the top of that function [17:08:36] I'm butchering our VCL in the meantime [17:08:40] I would think bo!=NULL in our case with the 304s (because it's a busyobj) [17:09:28] which would set req->resp_len from the existing req->resp's Content-Length header (which might be zero in this case) [17:10:11] and in http_GetContentLength, if the header didn't exist at all it would've set resp_len to -1, not 0 [17:10:58] so our faulty req->resp_len==0 can get set in two distinct cases: [17:11:16] 1) req->resp already had a Content-Length:0 header [17:11:59] but then req->resp and resp_len would both be zero, and we would never reach http_PrintfHeader(req->resp, "Content-Length: %jd", req->resp_len); [17:12:20] 2) bo==NULL and it came from req->resp_len = ObjGetLen(req->wrk, req->objcore); [17:12:36] I'm really not sure what req->objcore means at this point [17:13:56] I mean I guess it's the actual object (private object, or cache object) that we're planning to use to generate this response [17:14:18] but is it at
that moment the expired 304 object in cache? the already-remangled-to-200 object? [17:17:43] maybe the fault in my earlier patch attempts, is that at this point bo is NULL, and we don't have a busyobj, at least not in all cases [17:19:57] do we have a log where we're sure (other than X-Cache:miss instead of hit+miss, which I think is maybe dubious for this purpose) that 304 was not involved in the screwup? [17:20:43] mmmh I think I've only seen screwups with 304s in varnishlog [17:33:07] https://phabricator.wikimedia.org/P3089 [17:33:16] fix attempt #3, if -VCL hasn't gotten anywhere [17:33:43] that's almost certainly not an appropriate/correct source patch from upstream POV. it probably breaks some other case we don't care about like ESI or something. [17:33:56] but it sure seems like it should paper over this one stupid case... [17:34:37] I've just finished the butchering [17:34:53] see simplified-backend.vcl and simplified-frontend.vcl [17:35:00] does it fix it? [17:35:10] don't know yet :) [17:35:32] ok :) [17:35:59] mmh [17:36:02] Log abandoned [17:36:02] Log reacquired [17:36:02] Log abandoned [17:36:02] Log reacquired [17:36:07] when running varnishlog [17:36:09] that can't be good [17:36:27] re: source fixups, I still think the error lies deeper, maybe not in the logic I'm editing. so any fix there may really just be a bandaid even if it does work [17:36:36] usually that means crash? [17:36:48] oh, it keeps on exploding with Assert error in vmod_vslp_backend(), vmod_vslp.c ...46: Condition((vslpd) != NULL) not true.... [17:38:12] PEBKAC [17:38:27] btw: vslp bug was another random theory the other day btw. I think I at least minimized the odds of that by looking at it, but technically it's still possible. 
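The simplified-backend.vcl / simplified-frontend.vcl mentioned above presumably reduce to little more than a backend definition. Something like the sketch below (hypothetical backend name/address taken from the hosts discussed; every subroutine left unimplemented so requests fall through to builtin.vcl):

```vcl
vcl 4.0;

# simplified-backend.vcl (sketch): none of the wikimedia-common logic,
# just point at the eqiad backend and let builtin.vcl handle the rest.
backend cp1051 {
    .host = "cp1051.eqiad.wmnet";
    .port = "3128";
}

sub vcl_recv {
    # "so basically just the backend_hints"
    set req.backend_hint = cp1051;
}
```

If the bug still reproduces with this in place, our VCL is off the hook and the remaining suspects are varnishd itself and the vslp director.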
[17:38:38] after this test, might try just using random backend and not vslp, too [17:38:47] I don't think it will help, just something to try [17:41:29] test running without VCL basically [17:42:01] neon.wikimedia.org:~ema/run-4/ for live updates [17:43:34] no fail so far! [17:43:53] and pretty boring X-Cache! [17:44:16] heh [17:45:57] so, probably our VCL is triggering [17:46:16] or the builtin VCL is not triggering [17:46:48] if that holds up, I guess put back the whole normal VCL, and start commenting out dangerous ugly chunks in the wikimedia VCL files (but without killing the call chain down to cluster_[fb]e_ functions) [17:47:01] well either way, it's our VCL blocking default VCL [17:47:51] repro [17:48:04] heh [17:48:33] now the question is: did cp10xx emit bad output to cause the repro, or did cp3007 screw it up on its own [17:48:43] let's see [17:49:00] < X-Varnish: 236942524 223044366 [17:51:08] looks ok on the cp1051 side to me so far [17:51:45] oh I'm looking at the wrong xid from that cache hit, it's all over the place [17:52:53] X-Varnish isn't enough to go on, we hit that pair of values multiple times... [17:52:59] yep [17:53:26] wait there are two X-Varnish [17:53:33] this one as well: < X-Varnish: 136002 [17:54:02] https://phabricator.wikimedia.org/P3090 [17:55:11] looks like 3007:3128 made the mistake [17:55:12] - RespHeader Accept-Ranges: bytes [17:55:12] - RespHeader Content-Length: 0 [17:55:45] of course, streaming plays into this too... [17:56:00] how does it even know resp_len when streaming, in time to set a header? [17:57:49] I think the error here was in translating cp1051's 304 into a 200 by cp3007:3128 [17:58:24] the request from fe->be in 3007 had no IMS/INM headers, which is why it wasn't a 304->fe [17:59:02] have we ever seen this happening except for 304->200?
[17:59:13] not that I'm aware of [17:59:24] well, not that I can definitely verify with varnishlog [17:59:45] a simpler variant of my fix #3 that could also work, in light of streaming and blah blah: [18:00:52] https://phabricator.wikimedia.org/P3092 [18:01:19] so, two patches to try to repro with blank-ish VCL: that one or https://phabricator.wikimedia.org/P3089 [18:01:57] a good sanity check would be to run the stock VTC with these patches too. that would tell us if they royally screw up some other kind of request. [18:02:14] yep [18:02:23] I'll have to leave soon unfortunately :( [18:02:56] yeah [18:03:09] perhaps tomorrow I can take another look, not sure though [18:03:44] I'll depool an eqiad and hack 3007 to only look at it [18:03:58] and I think, try to downgrade to v3 on the rest, and see if that goes easily or not [18:04:31] let me give a try to those two patches as a last attempt for today [18:04:31] otherwise we're risking weekend reports of broken cache_misc sites randomly [18:09:49] bblack: so this would be it right? https://phabricator.wikimedia.org/P3093 [18:10:59] oh, was it either one or the other? [18:11:17] they're independent [18:12:29] of the two, https://phabricator.wikimedia.org/P3089 is more likely to fix our case without breaking others, I *think* [18:12:48] I'm not sure if there's some case where CL:0 is legitimately supposed to be set there [18:13:02] if there's not, https://phabricator.wikimedia.org/P3092 is far simpler [18:13:21] I donno though, https://phabricator.wikimedia.org/P3092 may not do much good [18:13:28] 3089 is a better test [18:13:47] ok, trying 3089 [18:18:26] oh the icinga downtime for cp3007 is over [18:18:58] OK, cp3007 is running with P3089 and simplified vcl now [18:18:59] P3089 fix attempt 3 for T134989 - https://phabricator.wikimedia.org/P3089 [18:20:30] live updates on neon.w.o:~ema/run-5/ [18:21:18] heh [18:21:28] it's a fascinating test! 
[18:21:35] bblack@neon:/home/ema/run-5$ grep SIZE test.*|grep -v 6304 [18:21:36] test.0: margin-rightSIZE: 1829 [18:21:42] I think 1829 is the gzipped size [18:21:56] it emitted ungzipped content + gzipped size in those failures [18:22:02] I think [18:22:10] there's a few of those [18:22:26] getting closer! :) [18:22:36] yup [18:22:47] there's no CE:gzip, and the full output gets sent, but then it sends a CL:gzip-size [18:23:02] let it run a bit, I'm curious if that's the only kind of fail we get [18:23:12] sure [18:23:54] varnishlog-frontend-5 and varnishlog-backend-5 in my home for this run [18:24:02] oh, duh [18:24:29] interesting [18:24:51] upon failures it's now noticeably slower to get the response [18:25:04] failures as in size=0 [18:25:26] ah, damn [18:26:17] right now [18:26:21] if it was only the 1829 problem, I think I had a workaround [18:26:35] but not for size==0 [18:27:24] the size==0 ones are different now [18:27:28] curl: (18) transfer closed with 1829 bytes remaining to read [18:27:32] < Content-Length: 1829 [18:27:42] yeah but they still sent zero content [18:27:47] yep [18:27:51] I can fix the 1829 thing [18:28:02] it's because resp_len was taken from before gzipping, right in the same function heh [18:28:52] oh [18:29:19] ok now I really have to go. Do you want me to keep the test running? [18:31:25] nah [18:31:30] have a fun weekend! [18:31:50] you too! [18:33:46] I'm gonna step away for an hour or so and reset my brain from debug-mode a bit [18:34:18] when I get back I'm gonna isolate an eqiad depooled, and then save that eqiad + cp3007 as v4 cache_misc tests for further work here, and revert the rest to varnish3 for live traffic [20:23:20] gehel: bblack is there any point in reporting more URL having issues with misc ? 
:d [20:23:59] had one that served me a blank page, and after curl + force refresh in my browser it is not being served properly :-} [20:24:17] https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease from T135238 [20:24:17] T135238: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238 [20:24:33] served from bromine.eqiad.wmnet if one wants to look at apache logs [20:25:00] it's still good to merge them in [20:25:07] we have enough testcases to know how to debug at this point though [20:25:39] (and I'm in the process of reverting the majority of the cluster to v3 to temporarily-resolve this for the weekend, and then we can debug more on 2x varnish4 servers left depooled for testing next week) [20:25:41] I can imagine [20:25:48] will update the main ticket when I'm done [20:26:13] at least it has been very stable all day long. I haven't encountered any issue besides the page above [20:26:34] well we've been doing various things that "seem to help" and they at least reduce the incidences of it [20:26:47] but they're not really fixing the bug, they're just making it harder to find the bug :) [20:27:11] unapply the 0007-dust-under-carpet.patch in the .deb [20:27:37] good luck with the v4 -> v3 rollback :( [20:28:58] thanks! [20:48:15] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2293942 (10BBlack) Instructions for downgrading nodes to varnish3 (trialed on cache_misc): 1. disable puppet on affected nodes: 2. Update hieradata to remove varnis... [20:52:48] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2293963 (10BBlack) I forgot one of our temporary hacks in the list above in T134989#2290254: 4. https://gerrit.wikimedia.org/r...
[21:12:58] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2294011 (10BBlack) [21:13:02] 10Traffic, 10Varnish, 06Operations: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2294012 (10BBlack) [21:13:05] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2294013 (10BBlack) [21:13:08] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2294007 (10BBlack) 05Resolved>03Open T134989 couldn't be resolved in a reasonable timeframe, and is corrupting some responses (zero body content length). I've reverted... [21:13:11] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2294014 (10BBlack) [21:13:14] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10BBlack) [21:16:34] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2294023 (10BBlack) Current State: * cp3007 and cp1045 are depooled from user traffic, icinga-downtimed for several days, and... [21:26:03] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2294057 (10TerraCodes) [22:12:13] bblack: and congratulations ! [22:15:31] thanks! 
[22:20:38] have a good week-end and rest well [22:46:00] 07HTTPS, 10Traffic, 06Operations, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2294355 (10BBlack) Announcement email (finally) sent! The cutoff dates/process are: * 2016-06-12 - We'll randomly reject 10% of insecure POST with "403 - Inse... [23:54:31] bblack: yay! re ^ :D