[07:44:28] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2291872 (10ema) >>! In T134989#2290254, @BBlack wrote: > 1. We've got 2x upstream bugfixes applied to our varnishd on cache_mi... [07:54:04] gehel: good morning :) [07:54:10] shall we give it another try? [08:10:28] Sure... [08:11:04] ema: test started... [08:11:23] bonjour! [08:11:31] hashar: o/ [08:15:38] ema: I still got an error. Age: 120, so seems to be on TTL boundary again. [08:16:02] damn! [08:16:04] but next GET is fine... [08:16:32] https://www.irccloud.com/pastebin/u0ti8RQe/ [08:17:03] Age header sent twice (not really an issue, but maybe an indication of something?) [08:17:28] Single error on 100 requests. [08:17:56] gehel: do you have the full headers output perhaps? [08:18:49] damn, I deleted that output too fast. Luckily (or not) I just got another error. [08:19:05] https://phabricator.wikimedia.org/P3067 [08:20:04] ema: what does "hit+miss" mean in this context? It sounds a bit like an oxymoron... [08:20:59] I now get only errors, seems the zero body is cached on cp3009 [08:21:05] gehel: it's a recent change https://github.com/wikimedia/operations-puppet/commit/b32f7e85f609620eff0456866fe1e416c5b442b5 [08:21:59] yes now it got cached as 0-length [08:22:15] (I'm also hitting cp3009) [08:23:19] yeah, caching seems easy, until you actually try to implement it... [08:28:27] hi, I was peeking at graphite and there's a 'too many creates' alert, looks like we got plenty of varnish stats in the form of varnish.client.jsessionid. known? [08:28:49] aha [08:29:34] godog: hostnames? [08:30:25] ema: you mean what machines are sending those?
[08:30:40] that's what I mean yes, if there's a way to find out :) [08:31:11] we might have started sending them with the v4 upgrade [08:31:30] yup, checking with ngrep :( [08:31:40] sec [08:37:57] maybe more than one sec heh, not sure how frequently those are sent [08:42:42] gehel: back to the hit+miss [08:42:54] in case of cache hit we do this: [08:42:55] https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb#L288 [08:43:12] setting X-CDIS to hit [08:43:52] called from here: https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L290 [08:44:04] then we call cluster_fe_hit, which in the case of misc is: [08:44:10] https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/misc-frontend.inc.vcl.erb#L35 [08:44:46] so hit+miss = hit with expired ttl [08:45:16] it's a miss basically, but it happened "going through hit" [08:45:37] ema: thanks, that makes it a bit more clear... [08:49:23] gehel: our VCL flow is a little hard to follow, this might be better to understand the idea: https://github.com/varnish/Varnish-Cache/blob/master/bin/varnishd/builtin.vcl#L101 [08:51:27] ema: Thanks! That one, I mostly know. [08:51:34] now my question is: do we only get repros with hit+miss in x-cache? [08:52:47] cp3009's frontend cached a bad response now [08:53:30] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2291971 (10ema) Issue reproduced: https://phabricator.wikimedia.org/P3067 [08:54:31] ema: in the sample I have, I always see hit+miss, but that's skewed by the fact that the zero size body is cached by cp3009. [08:54:46] I can run a few more tests and see if I get errors without hit+miss... 
[08:55:10] that would be great, if you want to avoid hitting cached responses just add some query params [08:55:19] eg: style.css?blabla [08:55:36] yep, can do... [08:57:59] ema: no error so far. I'll keep running and see if I get a failure at some point... [08:58:41] thanks, for some reason I'm unable to reproduce this while you seem to be quite effective at triggering the issue [08:58:49] May 13 10:57:53 < Content-Length: 6304 [08:58:49] May 13 10:57:53 < Age: 120 [08:58:49] May 13 10:57:53 < X-Cache: cp1045 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit(123) [08:58:52] May 13 10:57:53 SIZE: 6304 [08:58:55] May 13 10:57:54 < Content-Length: 6304 [08:58:55] May 13 10:57:54 < Age: 0 [08:58:55] May 13 10:57:54 < X-Cache: cp1045 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit+miss(0) [08:58:58] May 13 10:57:54 SIZE: 6304 [08:59:37] May 13 10:57:55 < Content-Length: 6304 [08:59:37] May 13 10:57:55 < Age: 1 [08:59:37] May 13 10:57:55 < X-Cache: cp1045 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit(1) [08:59:38] my magic power: breaking things! [08:59:40] May 13 10:57:55 SIZE: 6304 [08:59:58] could you share your script BTW? [09:02:00] ema: sure, it's so hacky and written in < 2 minutes that I'm not sure why anyone would want to look at it, but here it is: [09:02:05] https://www.irccloud.com/pastebin/wdAejCXU/ [09:03:10] thanks! [09:15:14] And now I can't seem to reproduce the issue anymore... might be just luck or an indication that the issue is related to hit+miss... [09:24:51] perhaps! I'm now trying to reproduce it by using the same random URL for the whole test duration sleeping for 1 minute [09:26:13] no repro so far with: [09:26:17] x-cache:cp1061 miss(0), cp3010 miss(0), cp3009 frontend miss(0) [09:26:22] x-cache:cp1061 miss(0), cp3010 miss(0), cp3009 frontend hit(1) [09:26:28] x-cache:cp1061 hit+miss(0), cp3010 hit+miss(0), cp3009 frontend hit+miss(0) [09:48:45] <_joe_> so it seems it's not a file vs persistent storage problem? 
[09:48:52] right [09:49:04] <_joe_> sigh I kinda hoped that was the case [09:52:07] me too [10:05:12] I think somehow it must be related, yesterday I managed to repro quite quickly in ulsfo (deprecated_persistent) and couldn't really in esams (file) [10:05:47] today we're running with file everywhere and still we managed to get a bunch of repros, although "less frequently" [10:24:07] alright I'll depool cp1061 to try reproducing there [10:50:08] /win 19 [10:50:13] yeah :) [10:54:14] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2292155 (10Jonas) Issue appearing again ``` curl -v 'https://query.wikidata.org/i18n/en.json' * Hostname was NOT found in DNS... [11:04:11] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2292181 (10ema) >>! In T134989#2292155, @Jonas wrote: > Issue appearing again [...] > < Age: 58 > < X-Cache: cp1045 hit+miss(... [11:11:58] hit+miss is going to be common. in the current misc VCL, objects we're testing tend to have 120s TTL and then 60m grace. the object can probably be kicked out anytime after grace, but probably only opportunistically, not aggressively. [11:12:36] so object goes in, it's a hit for 120s, after that it's a hit+miss for 60m+, unless it gets replaced/refreshed from a 304.
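The object lifecycle just described can be sketched as a toy state machine (plain Python; the 120s TTL and 60m grace come from the discussion, the function name is made up, and the real logic of course lives inside varnishd):

```python
# Toy model of the cache dispositions discussed above: an object is a
# plain "hit" while inside its TTL, a "hit+miss" while in grace (varnish
# serves/refreshes it via a conditional fetch), and a full "miss" once
# it's gone. Hypothetical helper, not varnish code.

TTL = 120        # seconds, the current misc-cluster TTL mentioned above
GRACE = 60 * 60  # the 60m grace mentioned above

def disposition(age):
    """Classify a cache lookup by the object's age in seconds."""
    if age < TTL:
        return "hit"
    if age < TTL + GRACE:
        return "hit+miss"  # expired: revalidate against the next backend
    return "miss"          # evicted: full fetch from the next backend
```

So a loop hitting the same URL should see hit for ~120s, then hit+miss until the object is refreshed or evicted, matching the X-Cache samples in this log.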
[11:13:07] (hit+miss means varnish is at least going to ask for the next backend down the chain to try to give it a 304 based on IMS/INM) [11:13:22] so I'm trying to reproduce on cp1061 (fe+be on cp1061 only) [11:13:28] without much luck [11:13:45] I ran some long loops yesterday too [11:14:03] I'm testing the same URL with a sleep 3, restarting the frontend at every iteration to force frontend misses [11:14:06] over serialized query params for style.css, through the TTL boundaries, across esams+ulsfo, etc... [11:14:26] why force FE miss? [11:14:51] because I'd like to verify whether backend hit+miss are somehow the issue [11:15:02] through two layers of varnish, not through curl [11:15:18] well... [11:15:35] cp1061 probably isn't the place to be, better off in ulsfo or esams [11:16:03] it's the middle-tier that's most often implicated. really the only one that I've witnessed being the bad actor myself [11:16:37] I've never seen the direct-backend give a faulty response, and I've never seen a frontend generate a faulty response except due to a middle-backend telling it to do so. [11:16:56] mmh [11:17:05] and you can test that directly in this case for things like style.css that aren't complicated.
loop your test query to foo:3128 and hit the backend directly [11:17:54] yeah I wanted to briefly test varnish->varnish rather than curl->varnish [11:18:29] re: objects being kicked out anytime after grace, but probably only opportunistically, not aggressively [11:18:32] another thought: wherever you're testing, you can hack -p default_ttl down to like 10s too, and maybe set obj.grace universally to 15s in VCL [11:18:43] so that you can cycle through the states quicker [11:18:49] < Age: 0 [11:18:49] < X-Cache: cp1061 hit+miss(0), cp1061 frontend hit+miss(0) [11:18:49] SIZE: 6304 [11:19:00] < Age: 7 [11:19:00] < X-Cache: cp1061 hit(1), cp1061 frontend miss(0) [11:19:00] SIZE: 6304 [11:19:15] at least in that case it got kicked out relatively quickly [11:19:25] you restarted it :P [11:19:36] that's a great point! :P [11:20:00] also, hit+miss->hit is a natural consequence of refreshing the object [11:20:15] I mean, as serial results at the same layer [11:20:24] the backend in your case [11:20:42] (either via 304 or 200) [11:20:47] right [11:20:56] OK I'll try lowering ttl and grace [11:21:13] and keep I guess? I donno what that even defaults to [11:21:25] it'd be nice if the objects would die on their own in our testing too, though [11:21:48] default_keep seems to be 0 [11:21:56] but still, you may not get a repro in eqiad at all. [11:23:01] when we were testing mostly in esams, it was always the esams backend where you could catch the real screwup, not the eqiad one. [11:23:26] alright I'll move to an esams host [11:23:31] the esams (middle) backend, you could actually observe the first failure in varnishlog outputs. watch it get a legitimate response and emit a faulty response to the next one up [11:24:32] part of the magic of being able to reproduce before was inducing a blend of misses and hit+miss and such though, by cycling through a bunch of query params over and over and letting them vary in which BEs they hit...
[11:24:55] so when locked into a single host, it will take a lot more iterations, but the lowered default_ttl will help [11:27:05] or you could put back persistent storage and CL-sensitive VCL, apparently those make it much easier to hit [11:52:35] anyways, if we're back to the drawing board, that -persist doesn't really kill the bug... [11:52:58] (and it really seemed like it did... wtf is up with the failures after such a long silence of repro) [11:53:30] we really need to find a way to reliably repro [11:53:45] the two patches + no-CL also seemed to fix it for a bit [11:53:59] just because of restarts / storage wipes? [11:54:28] the other fix periods didn't last long [11:55:18] anyways, since we're sure the bug is still there in the same basic form [11:55:57] well, I'd keep the two patches, they're legit and may really help [11:56:06] yes [11:56:10] but maybe put the CL-VCL and the persistent back in play on the test host? [11:56:17] it may make it easier to repro? [11:56:21] and persistent [11:56:48] sounds good [11:56:59] it would be nice if we could get it reproducing relatively-easily, and then isolate it down to a sequence of actions that triggers the bug reliably in isolation... [11:57:18] lunch soon, then I'll get an esams host and do all the above (CL-VCL, persistent, lowered ttl...) [11:57:51] if it's not storage-specific, and not VCL specific, we should even be able to VTC this [11:58:23] based on the observed failures in cp3010:3128 yesterday, a VTC repro should look something like: [11:58:24] yep! [11:59:04] 1) Fetch object through varnish (200 OK server->varnish->client, all normal on first miss) [11:59:22] 2) Wait for object's base TTL to expire [12:00:15] 3) Fetch object via client again, expect varnish to use IMS/INM against server, have server return a 304. So it's server 304->varnish, varnish 200 -> client [12:00:33] I mean that's a basic 304 refresh of an expired cache object.
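Steps 1-3 just described could look roughly like the following varnishtest (VTC) case. This is only a sketch: the ETag, max-age and default_keep values are invented, and the final bodylen expectation is what the bug would violate (a 200 with CL:0 would fail it).

```
varnishtest "304 refresh of an expired object"

server s1 {
	# step 1: first miss, ordinary 200 with a short TTL
	rxreq
	txresp -hdr {ETag: "abc"} -hdr "Cache-Control: max-age=1" -body "hello"
	# step 3: revalidation, answer the INM with a 304
	rxreq
	expect req.http.If-None-Match == {"abc"}
	txresp -status 304 -hdr {ETag: "abc"}
} -start

# keep the expired object around so varnish revalidates instead of refetching
varnish v1 -arg "-p default_keep=60" -vcl+backend {} -start

client c1 {
	txreq
	rxresp
	expect resp.status == 200
	expect resp.bodylen == 5
} -run

# step 2: wait for the object's base TTL to expire
delay 2

client c1 {
	txreq
	rxresp
	expect resp.status == 200
	# a faulty 304+stale-object merge would deliver 200 with CL:0 here
	expect resp.bodylen == 5
} -run
```

As the discussion below notes, the missing ingredients are probably protocol details (gzip, TE:chunked, keep-alive) on either side of varnish, which this sketch doesn't exercise yet.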
I'm sure that scenario's already in some stock test [12:00:51] the trick is, we also need to play with protocol details on both sides of varnish to induce it, probably [12:01:11] whether gzip and/or TE:chunked is involved on either side, CL header present/missing in 304, etc? [12:01:42] connection: close vs connection: keepalive [12:01:54] there's some combination of factors there... [12:02:33] or a race condition on object's TTL expiry boundary [13:03:58] I've made up an extended version of the loop-tests I was doing yesterday, to try to catch a live failure via a ulsfo frontend [13:04:01] FTR: [13:04:03] bblack@palladium:~$ while [ 1 ]; do for x in {1..3}; do curl -v -w 'SIZE: %{size_download}' -s -H 'Host: query.wikidata.org' -H 'X-Forwarded-Proto: https' "http://cp4002.ulsfo.wmnet/style.css?bblackx=${x}" 2>&1 |tee test.out |egrep 'SIZE'|grep -q 'SIZE: 6304' || break 99; done; done; echo FAILED [13:04:13] that's running continuously now, will leave it for hours or until it fails [13:04:49] what that does is always request through 4002, with a unique query string nobody else is using so I have my own cache slots. it iterates through ?bblackx=1, 2, 3 over and over. [13:05:08] and if the output ever fails to have 6304 bytes, it will halt and leave me a test.out with full headers/output. [13:05:23] by looping fast over and over, it should eventually trip TTL edge cases [13:05:42] ok [13:05:43] I think the main diff from gehel's tests is that his are randomized on every request, so it's always a miss [13:06:05] almost always, he got some hits too https://phabricator.wikimedia.org/P3067 [13:06:22] heh and now that I pasted it, I realize the double-grep is pointless. fallout from so much re-editing and re-purposing previous commandlines [13:06:45] oh I thought he had a random query param?
[13:07:03] I thought so too, at least the bash script he pasted uses $RANDOM [13:07:34] yeah but the curl output says not-random [13:07:36] hmmm [13:08:21] GET /style.css [13:10:06] I'll depool cp3007 and start playing with it [13:10:19] ok [13:10:37] I fixed my double-grep, and changed my query param in case anyone else pastes from above :) [13:11:19] if someone wants to use your leaked secret param :) [13:13:28] of course for all I know the bug is strange in transitiveness [13:13:37] I only know for sure we hit it on an esams backend in normal traffic flow [13:13:56] it may not be that it hits all mid-backends, only the first, and then the second corrects it [13:14:05] we've seen this auto-correct itself through layers before I think [13:14:31] I'll make more loops, against direct ulsfo backend, and direct to an esams backend (not yours!) [13:15:04] ok [13:22:48] so much terminology overload [13:23:09] frontend/backend varnishes, frontend/backend workers, backend servers... [13:23:19] yup [13:23:26] we should invent new words [13:23:44] we can't change varnish's front/back worker terminology [13:24:09] but we can change ours [13:24:31] we could call our two varieties of varnish daemon edge and mid [13:24:33] like skinny varnish and fat varnish :) [13:24:43] or edge and disk [13:24:47] I donno [13:25:01] 10Traffic, 06Operations: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292421 (10fgiunchedi) [13:25:06] I've been calling backend servers 'appservers' lately [13:25:13] godog: thanks! [13:25:27] yeah I like 'appservers' [13:25:54] jsessionid? [13:25:55] wtf [13:27:24] only varnishrls and varnishxcps mess with varnish.clients that are in puppet, I think [13:27:52] I think it's xcps [13:28:11] ema: np!
[13:28:16] yeah it is xcps [13:28:19] def vsl_callback(transaction_id, tag, record, remote_party): for k, v in key_value_pairs.findall(record): [13:28:33] varnish.clients.d.166682 t=1463145218955438:1|c [13:28:33] varnish.clients.d.5358 t=1463145673022109:1|c [13:28:33] varnish.clients.jsessionid.1c69 [13:28:35] it assumes its input is already filtered down just to X-Connection-Properties [13:28:37] but it's not [13:29:12] also those single-little messed-up keys showed up a day or two ago in our tls stats too, which also come from xcps [13:29:31] look at the cipher list at the bottom of https://grafana.wikimedia.org/dashboard/db/tls-ciphers [13:29:54] we now have ciphers: m, n, s, d, m_3bo_a, etc... [13:30:22] I think varnishxcps is getting more input loglines than just X-Connection-Properties, and parsing whatever junk it gets with its automatic k=v -> graphite logic [13:30:30] ugh, yeah also phpsessid in there [13:32:05] so the last change there was mine on May 4: 2739346a30864f07084514c01a854fdb62b4dba7 [13:32:08] err [13:32:12] varnishxcps: fix CP header parsing for H2= [13:32:28] another trigger date is of course cache_misc turning on xcps4 (before was just maps) [13:32:34] a few days ago, makes more sense [13:32:54] I think maybe xcps's use of varnishlog got it only the XCP header, and xcps4 is pulling in more junk [13:34:04] will confirm on a maps server... [13:39:16] heh ok [13:39:29] so the diff from xcps to xcps4 is to replace RxHeader with BerespHeader... [13:39:49] I think that should be ReqHeader [13:40:29] but then also, the varnishlog thing still doesn't filter on XCP [13:40:37] on a misc server anyways [13:40:43] does on maps, where it all works fine anyways [13:40:46] very puzzling [13:41:57] anyways, I'm going to sort this out and make it work right... [13:46:18] bblack: thanks!
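The suspected failure mode — the k=v parsing being applied to whatever records varnishlog hands over, not just X-Connection-Properties — can be sketched in a few lines of Python. This is a simplified, hypothetical stand-in for the varnishxcps callback, not the actual script:

```python
import re

# k=v pairs such as "H2=1; SSR=0; C=ECDHE-..." in a header value
KEY_VALUE = re.compile(r'(\w+)=([^;\s]+)')

def xcp_stat_keys(record):
    """Turn one logged header record into graphite-style stat keys.

    The header-name filter is the crucial part: without it, any header
    containing k=v junk (e.g. a Set-Cookie with a jsessionid) gets
    counted too, which is exactly the metric spam described above.
    """
    name, _, value = record.partition(':')
    if name.strip().lower() != 'x-connection-properties':
        return []
    return ['varnish.clients.%s.%s' % (k.lower(), v.lower())
            for k, v in KEY_VALUE.findall(value)]
```

The equivalent fix in the real script is to filter the varnishlog input (or check the header name) before feeding records to the k=v parser.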
[13:46:20] 10Traffic, 06Operations: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292474 (10fgiunchedi) more data from a captured tcpdump, see also the big udp packet size ```lines=5 13:44:08.922449 IP cp4002.ulsfo.wmnet.9705 > graphite1001.eqiad.wmnet.8125: UDP, le... [13:48:17] I'm also currently looking into why at 12.00 the carbon relay started dropping metrics towards codfw, odd https://grafana.wikimedia.org/dashboard/db/graphite-eqiad [13:51:40] the varnish3 version seems to work right and filter right [13:51:57] the varnish4 version is grabbing all kinds of junk it shouldn't [13:53:21] the varnishlog module is just very different between the two in practice, on how -i and -I work, not just the field names [13:53:44] 10Traffic, 06Operations, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2292421 (10Krinkle) > varnish.clients.d.42387 t=1463146627656326:1|c Looks like these don't belong, either. Collecting it in some way could be useful (T131894) - b... [13:56:13] godog: I assume after I stop the flow of junk, you can just rm the bad keys right? [13:56:44] bblack: yup [13:57:01] ema: can we stop varnishxcps on 3007 too? I assume it has puppet disabled and won't get the fix yet [13:57:09] sure [13:57:14] just reproduced the issue btw [13:57:29] awesome [13:57:46] see eg: neon.wikimedia.org:~ema/test.210 [13:57:53] my 4 different loops against different layers/scenarios have been running for nearly an hour and not hit it :( [13:57:57] I saved varnishlog output on cp3007 [13:58:19] -p default_ttl=10 -p default_grace=20 are the key to repro faster I guess [13:59:25] ok so.... [13:59:31] do you have the one that wasn't an FE hit?
[13:59:46] test.210 is presumably the 3rd failure in that series with FE hit(2) [14:00:06] test.204 [14:00:09] < X-Cache: cp1051 hit+miss(0), cp3007 miss(0), cp3007 frontend miss(0) [14:00:34] ah 102 is an initial fail too [14:00:48] yep [14:00:53] test.102:SIZE: 0 [14:00:55] test.103:SIZE: 0 [14:00:55] test.104:SIZE: 0 [14:00:55] test.105:SIZE: 0 [14:00:55] test.107:SIZE: 0 [14:00:56] test.108:SIZE: 0 [14:00:59] test.111:SIZE: 0 [14:01:01] test.196:SIZE: 0 [14:01:04] test.200:SIZE: 0 [14:01:06] test.203:SIZE: 0 [14:01:09] test.204:SIZE: 0 [14:01:11] test.207:SIZE: 0 [14:01:14] test.210:SIZE: 0 [14:01:23] so now the million dollar question - can you correlate varnishlog on FE and/or BE during the initial fail (102 or 204)? [14:01:35] heh :( [14:01:41] I suspect the BE one will be more interesting [14:01:58] I mean it didn't take *too* long to repro so I guess we'll manage [14:02:08] if you can repro like that again, maybe just keep a huge varnishlog of everything, then go back and grep for varnish transactions with the X-Varnish number? [14:02:27] since only you are on that host [14:02:48] I guess you need no regexes, just a varnishlog on each of FE and BE running and logging everything [14:03:07] makes sens [14:03:12] sense [14:03:19] and also the -c and -b to get both sides of the transaction on each? 
varnishlog3 always claimed it could do both, but never did for me before heh [14:03:23] maybe varnishlog4 can :) [14:04:06] godog: junk output should be terminated now [14:04:43] oh [14:04:49] purge is going to spam your logs a lot pointlessly [14:04:59] shut off vhtcpd daemon to kill the purge flow [14:05:05] -q 'not ReqMethod eq PURGE' [14:05:08] or that [14:05:59] OK if you want to take a look at the frontend varnishlog for those failed requests (eg: test.204) it's on my home on cp3007, perhaps using X-Varnish to correlate [14:06:17] in the meantime I try to repro with a varnishlog of both frontend and backend [14:06:25] ok [14:07:13] ~ema/varnishlog is the file [14:07:28] now I'm creating varnishlog-frontend and varnishlog-backend and try to reproduce [14:08:54] bblack: thanks! I'll see what metrics stop updating and I've moved the obvious junk out of the way for ssl_ciphers [14:12:33] ema: from your previous log, this was the backend's response to the frontend on initial uncached failure: [14:12:36] https://phabricator.wikimedia.org/P3079 [14:12:47] so cp3007's output was bad, the error didn't originate in the frontend [14:12:54] heh [14:13:01] s/cp3007/cp3007:3128/ [14:13:10] right [14:13:47] note the Content-Encoding: gzip [14:13:53] yeah, but we expect that [14:14:08] it's gzippable in our VCL via do_gzip, and I think even the applayers gzips this output too [14:14:14] on successful requests there is no Content-Encoding [14:14:49] actually on curl output there is no C-E [14:15:48] well curl doesn't ask for gzip by default, unless you give it --compressed [14:15:55] and I just re-checked, and wdqs doesn't gzip when asked to either [14:16:01] so this is a "do_gzip does something" case [14:16:36] when backend-most requests from wdqs, it will AE:gzip, but the applayer won't gzip. 
because it's a compressible content-type, varnish in eqiad will gzip it on the way into the object storage [14:16:37] oh, right, --compressed [14:17:08] and then the next backend up (cp3007) should also request with AE:gzip, and eqiad should respond with the (already) gzipped content [14:17:11] and so-on [14:17:23] and then it gets de-compressed by cp3007 frontend at client request time, just for that client [14:17:30] (if curl w/o --compressed) [14:17:37] that's the behavior I'd expect, anyways [14:18:25] in any case, I've done past repros with both curl --compressed and without [14:18:32] FTR trying to repro with: [14:18:34] the output was slightly different, but same effect [14:18:35] varnishlog -g request -n frontend -q 'not ReqMethod eq PURGE' | tee ~ema/varnishlog-frontend [14:18:41] varnishlog -g request -q 'not ReqMethod eq PURGE and not ReqURL eq "/check"' |tee ~ema/varnishlog-backend | tee ~ema/varnishlog-backend [14:18:56] with --compressed curl gets a TE:chunked response, no content, 5s later connection terminates (so it makes a big pause in your test) [14:19:08] without --compressed, it's not TE:chunked, and explicit CL:0 [14:19:55] but if we have some variance on CE on the 3007->eqiad fetch+response... yeah that's something to look into and understand for sure [14:20:19] and if we can reliably repro this with your new method in a relatively short time... [14:20:28] immediate repro on first try [14:20:32] nice! [14:20:32] < X-Cache: cp1051 hit+miss(0), cp3007 hit+miss(0), cp3007 frontend hit+miss(0) [14:20:47] ~ema/run-2/test.0 [14:21:30] ok wait a sec [14:21:45] the request that generated test.0, that was curl without --compressed, to cp3007:80 ? [14:21:54] yes [14:22:06] why is the frontend giving you gzip output when you didn't ask for it? [14:22:10] or claiming to anyways [14:22:25] interesting [14:22:47] I curled as follows: [14:22:49] maybe with the CL:0 bug already in play, everything else is suspect. 
it never thinks to decompress empty content for you [14:22:52] curl -v -w 'SIZE: %{size_download}' -H 'X-Forwarded-Proto: https' -H 'Host: query.wikidata.org' $url 2>&1 [14:22:57] so this could be effect rather than cause [14:23:21] anyways, let's correlate this to cp3007's view of the front and back transactions on the hit+miss [14:23:32] sorry, cp3007:3128's view... [14:23:47] I've stopped the test and varnishlogs [14:25:02] well it should be the first line of varnishlog-backend right? [14:25:10] * << Request >> 3562758 [14:25:17] < X-Varnish: 209077939, 3562758, 2947323 [14:25:59] - RespUnset Content-Length: 1829 [14:26:02] ah, but gzipped [14:26:33] yeah trying to sort through where the client and backend halves are... [14:26:48] I guess those first two chunks, Request and BeReq [14:28:56] so, eqiad gave a 304 [14:29:05] 3007:3128 gave a 304 to 3007:80 [14:29:12] 3007:80 gave you a bad output [14:30:49] -- ObjHeader Content-Length: 1829 [14:30:49] -- BackendReuse 25 boot.be_cp1051 [14:30:49] -- Timestamp BerespBody: 1463149155.857507 0.168052 0.000114 [14:30:49] -- Length 0 [14:31:05] so CL=1829 but Length=0 [14:31:08] I think in this case, the frontend made the mistake [14:31:45] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2292703 (10Ottomata) We didn't get a chance to fully restart each broker with `inter.broker.protocol.version=0.9.0.X` this w... [14:31:50] you're looking at 3007:3128's request to 1051 [14:32:04] 1051 returned a 304-NotModified, so it does have Length:0 (no content with a 304) [14:32:13] ObjHeader: CL:1829 is from the cached object it's reusing on 304 [14:32:23] the expired-but-now-revalidated one [14:32:31] aha! 
[14:32:35] ok [14:33:51] so then Request 2947323 (first one in the frontend varnishlog) is the culprit [14:34:06] - RespStatus 200 [14:34:09] - RespHeader Content-Length: 0 [14:34:38] so a 304 is not supposed to contain a content-length [14:34:49] I'm just tracing back through to verify, but I think the FE made the mistake [14:35:32] well the backend responded with 304 and CL:0 [14:35:46] I don't think it did [14:36:01] in frontend, section << BeReq >> [14:36:20] 113 -- BerespStatus 304 [14:36:20] BeRespProtocol and so-on there, that's it receiving the response from the backend... [14:36:24] 134 -- BerespHeader Content-Length: 0 [14:36:31] (first number is the line number) [14:36:32] there's no CL header there [14:37:14] well let's start back in the backend log, before getting into that mystery... [14:37:33] in the top section there, the client-request side [14:37:36] the response on that [14:37:51] it starts out as : [14:37:52] - RespProtocol HTTP/1.1 [14:37:52] - RespStatus 200 [14:37:52] - RespReason OK [14:37:59] - RespHeader Content-Length: 1829 [14:38:01] etc [14:38:14] then during deliver-time processing it tacks on: [14:38:15] - RespProtocol HTTP/1.1 [14:38:15] - RespStatus 304 [14:38:15] - RespReason Not Modified [14:38:15] - RespReason Not Modified [14:38:17] - RespUnset Content-Length: 1829 [14:38:38] so internally, it first views it as a 200, then converts to a 304 w/o CL because the etag matches or whatever [14:39:05] now in the backend side of the frontend log, where we're receiving that... [14:39:21] it starts with: [14:39:21] -- Timestamp Beresp: 1463149155.857729 0.168497 0.168402 [14:39:21] -- BerespProtocol HTTP/1.1 [14:39:21] -- BerespStatus 304 [14:39:22] -- BerespReason Not Modified [14:39:24] ... 
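Correlating repros like these means reading a lot of X-Cache headers; the "host [frontend] state(hits)" entries used throughout this log can be parsed mechanically. A hypothetical helper, assuming the format shown in the pastes above:

```python
import re

# One X-Cache entry, e.g. "cp1045 hit+miss(0)" or "cp3009 frontend hit(123)"
ENTRY = re.compile(r'(\S+)(?: (frontend))? ([a-z+]+)\((\d+)\)')

def parse_x_cache(header):
    """Split an X-Cache header into (host, layer, state, hits) tuples,
    ordered backend-most first as in the log output above."""
    out = []
    for part in header.split(','):
        m = ENTRY.match(part.strip())
        if m:
            host, layer, state, hits = m.groups()
            out.append((host, layer or 'backend', state, int(hits)))
    return out
```

With something like this, the test loops could flag which layer first went wrong instead of grepping saved varnishlog files by hand.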
[14:39:32] with no Content-Length on reception (correct) [14:39:56] then when that block of headers ends, this is at the bottom (including the last line): [14:40:00] -- BerespHeader Connection: keep-alive [14:40:02] -- TTL RFC 10 20 -1 1463149156 1463149156 1463149155 0 0 [14:40:05] -- BerespProtocol HTTP/1.1 [14:40:07] -- BerespStatus 200 [14:40:09] -- BerespReason OK [14:40:12] -- BerespHeader Content-Length: 0 [14:40:14] -- VCL_call BACKEND_RESPONSE [14:40:17] -- TTL VCL 10 3600 0 1463149156 [14:40:42] that looks to me like that last bit is varnish internally-modifying the response. it's turning the beresp==304 + internal expired object into a 200 response for (refreshing the cache object and/or responding to the client) [14:40:53] and it sets CL:0, which is the first time we see explicit CL:0 [14:41:08] and then sure enough back up in the request, it's served with explicit CL:0 [14:41:34] (and I think the lack of gunzipping, which leaves CE:gzip in the response headers, is because it probably never even thinks to decode gzip for a non-gzip client if there's no content-length) [14:42:12] so I think this happened in the frontend cache, in the logic that processed a legitimate 304 response from the backend cache which lacked CL [14:42:26] so for some reason, sometimes, 304 responses from the backend are returned as 200 (good) but with CL:0 (bad) [14:42:35] when it tried to combine that 304 with its existing expired object that had the content, for some reason it set explicit CL:0 incorrectly [14:42:42] y [14:45:12] digging in varnish4 source for a bit... [14:46:23] FTR another repro is https://phabricator.wikimedia.org/P3081 [14:46:58] staring at varnishlog for a bit [14:47:59] that one lacks CE:gzip on the final output [14:48:20] maybe in that case, the error happened in the backend not the frontend (but similarly), and that changes things a bit? 
not sure [14:50:20] so the gzip code has in one place: [14:50:20] if (http_HdrIs(vc->http, H_Content_Length, "0")) { [14:50:20] http_Unset(vc->http, H_Content_Encoding); [14:50:20] return (VFP_NULL); [14:50:37] it's supposed to skip gunzip/gzip and clear CE header if CL:0 [14:50:54] I'm not sure what all cases that ends up applying to, and maybe not directly relevant [14:52:19] in varnishlog-frontend I can find only one 304 [14:52:27] $ grep 304 varnishlog-frontend |grep Stat [14:52:27] -- BerespStatus 304 [14:55:24] so in grep analysis of varnish4 source (with all our patches applied) [14:55:48] "git grep H_Content_Length" -> a bunch of lines, but every reference there is not a write-operation [14:55:58] well sorry that's not quite right [14:56:17] there's a lot of http_GetHdr and http_Unset and http_HdrIs [14:56:25] but no setting of any explicit new value [14:56:32] just get/check, or complete unset [14:56:44] but: [14:56:45] git grep -i 'Content-Length' bin/varnishd include [14:56:57] bin/varnishd/cache/cache_req_body.c: http_PrintfHeader(req->http0, "Content-Length: %ju", [14:57:00] bin/varnishd/cache/cache_req_body.c: http_PrintfHeader(req->http, "Content-Length: %ju", [14:57:03] bin/varnishd/cache/cache_req_fsm.c: "Content-Length: %jd", req->resp_len); [14:57:06] bin/varnishd/cache/cache_req_fsm.c: http_PrintfHeader(req->resp, "Content-Length: %zd", [14:57:12] now req_body is basically for POST (body content in a request) I think [14:57:34] which means really only those two lines in cache_req_fsm could possibly be setting explicit CL:0 ? 
[14:58:46] the second one there is in synthetic responses only [14:59:05] and the other's context looks like: [14:59:05] http_Unset(req->resp, H_Content_Length); [14:59:05] if (req->resp_len >= 0 && sendbody) [14:59:05] http_PrintfHeader(req->resp, [14:59:06] "Content-Length: %jd", req->resp_len); [15:00:01] maybe we're hitting this block in this case, with req->resp_len == 0 (from the 304), and then it overwrites the existing good CL header with CL:0 [15:00:17] now the question is, what does the code look like when it comes through here for this case... [15:00:41] (well, maybe a better way to state that is: what does the state of the variables look like, and the calling stack, when we get here in this case) [15:04:12] so earlier in that function ( cnt_vdp() in cache_req_fsm.c ) [15:04:16] there's a protection against this: [15:04:22] } else if (status < 200 || status == 204 || status == 304) { [15:04:23] req->resp_len = -1; [15:04:23] sendbody = 0; [15:04:56] but I don't think that works right in this case, when we translate a backend's 304 to a 200 [15:05:06] I think at this point, the translation to 200 already happened [15:05:09] https://phabricator.wikimedia.org/P3082$59 [15:05:26] it does an explicit RespUnset CL [15:05:55] and RespHeader CL:0 [15:06:11] yeah I think that's from my code block above [15:06:22] in this second example, the cp3007 backend made the mistake, not the frontend [15:06:28] I think? [15:06:47] yes that's from varnishlog-backend [15:10:14] ok I'm still tracing through code, I'm gonna stop saying random things till I get somewhere... 
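The suspicion above can be condensed into a toy model of that cnt_vdp() decision (hypothetical Python, names loosely following the C quoted above — not varnish source): once the backend's 304 has been translated to a 200, the status == 304 guard no longer fires, and a resp_len still stuck at 0 from the 304 clobbers the good Content-Length.

```python
def delivered_content_length(status, resp_len):
    """Toy model of the Content-Length decision in cnt_vdp() quoted above.

    status:   response status at delivery time
    resp_len: the body length varnish believes it will send
    Returns the Content-Length value emitted, or None for no CL header.
    """
    sendbody = True
    if status < 200 or status in (204, 304):
        resp_len = -1   # the protection block quoted above
        sendbody = False
    # models: http_Unset(CL); if (resp_len >= 0 && sendbody) set CL:resp_len
    if resp_len >= 0 and sendbody:
        return resp_len
    return None
```

A real 304 to the client takes the guard and emits no CL; but a backend 304 already rewritten to a 200, with resp_len still 0, falls through and re-emits an explicit CL:0 over the stale object's correct CL:1829.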
[15:10:22] :) [15:16:41] ok I'm pretty sure I basically know what's going on, but I'm not sure how to carry the state through to make a fix [15:17:04] essentially, the merge of the 304+stale_object => new virtual response happens in one place [15:18:10] then later in cache_req_fsm.c cnt_vdp(), we arrive in a status where 'status' (from status = http_GetStatus(req->resp);) is 200, req->resp's Content-Length is already-correct from header merges... [15:18:20] but resp_len is still zero from the 304 [15:18:25] and we thus re-set to zero there [15:19:50] but resp_len comes from parsing req->resp's existing Content-Length earlier in the function... [15:20:43] perhaps something went wrong here? https://github.com/varnishcache/varnish-cache/commit/4954ed3b6183bdf712324506fe7bc6074b20c0fe [15:23:21] yeah all of the code I'm looking at is new from that patch [15:24:01] that patch needs a much better commit message, and/or broken up into smaller changes :P [15:24:52] ema: how hard is it to try a stabby random patch, in terms of re-build and install on 3007 and repro? [15:25:06] not hard [15:25:37] I'll just build it on my workstation, scp, install [15:25:41] https://phabricator.wikimedia.org/P3083 [15:26:32] that's from looking at your debug output and reading source. I think it compiles. I have no idea what I'm doing. [15:26:37] call it a hunch [15:27:49] it would also be interesting, if we think we understand the problem this well, to try to build a VTC test case. I'm not even sure what all the conditions are to make the VTC reliably fail though. [15:29:06] I think the patch is worth a shot though, it feels right [15:30:07] compiling [15:31:28] 0007-134989-random-fix-attempt.patch applied, installing on cp3007 [15:33:23] trying to reproduce [15:33:54] out of curiosity, did the rebuild re-run all the stock VTC tests? did it break anything? 
[15:34:01] Error: 503, Backend fetch failed [15:34:07] heh nice [15:35:02] nope, no tests [15:35:43] ok so that didn't work, sorry :) [15:36:12] it was a nice try! [15:36:18] what does AZ do [15:36:19] ? [15:36:23] AZ(bo->was_304); [15:38:16] AZ(x) asserts that x is zero, basically [15:38:20] I think [15:38:38] is that what made the 503? a crash on that AZ? [15:39:12] nope, I was just grepping through the code for was_304 and noticed an AZ(bo->was_304) in bin/varnishd/cache/cache_fetch.c [15:39:50] if you want to humor one more random stab: [15:39:51] https://phabricator.wikimedia.org/P3084 [15:40:13] why not [15:42:56] building [15:43:53] another avenue of exploration now that you have a way to repro fairly easily, of course, is from yesterday: start commenting out lots of our complex shared VCL, in case we do have some odd VCL interaction [15:44:13] almost none of it is necessary for this basic test case [15:45:16] 503 again [15:45:22] oh well [15:46:33] ah! [15:46:36] it's me being silly! [15:46:47] after a package upgrade we need to redefine the systemd unit [15:47:01] well the second fix seems righter than the first [15:47:07] try the second one first if we're going back through these [15:47:07] alright [15:47:19] nothing is listening on 3120 :) [15:48:16] to avoid this and some ambiguity BTW we could call it varnish-backend.service instead of varnish.service [15:48:35] I had a task about some of that at one point [15:48:47] to do a couple of inter-related things, but it's a complex change to roll out all over the place [15:49:08] 1) Yes, stop using the default instance and call it varnish-backend for us, and have puppetization stop/disable the default one [15:49:28] that also cleans up a bunch of VCL that has to look at frontend-or-not to decide if we need -n flags on various things or not: now we always use -n [15:49:50] 2) but really, name them even more-explicitly: varnish-text-backend, varnish-maps-backend, etc [15:50:31] 3) Switch (by first adding, then later
post-transition removing the old) ports to per-cluster too. each cluster uses a unique set of fe and be listening ports, none of which are port80 [15:50:57] test running [15:50:58] well the port80 part also involves: move port 80 traffic to go into nginx like 443, and have nginx do the quick 301->HTTPS for all traffic. [15:51:25] but that's predicated on finishing up the HTTPS transition to that level of completeness. right now we still have junk-domains and POST traffic legitimately hitting varnish on 80 and not wanting 301 [15:52:12] the per-cluster instance names and ports thing gives us the capability to put several caches in one machine, which was an idea for the mini-pops [15:52:43] (deploy 1-2 beefy boxes, with one nginx in front splitting traffic on hostnames to all 4x different varnish-foo-frontend daemons/ports) [15:53:03] mini-pops being very small edge pops that backend to existing real cache pops, that we can deploy cheaply in lots of places [15:53:09] reproduced [15:53:43] but I think faidon shot that down pretty well, by noting that most of the cost in the cache pop is having rackspace + network links at all, not the cache machine count, so why bother with the complexity.
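Item (1) from the naming list above might end up looking roughly like this as a unit file. The path, ports, VCL filename, and storage flags here are all invented for illustration, not the actual puppetized config:

```ini
# Hypothetical /lib/systemd/system/varnish-backend.service sketch
[Unit]
Description=Varnish HTTP accelerator (backend instance)
After=network.target

[Service]
Type=forking
# Explicit -n instance name: tools and VCL can always pass -n,
# with no special-casing of "is this the default instance?"
ExecStart=/usr/sbin/varnishd -n varnish-backend -a :3128 \
    -f /etc/varnish/wikimedia_misc-backend.vcl -s malloc,1G

[Install]
WantedBy=multi-user.target
```

The point of the rename is exactly the ambiguity hit above: with a non-default instance name, a stale unit definition fails loudly instead of silently serving the wrong config.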
[15:54:23] ok so fix-attempt-2 doesn't work [15:54:33] varnishlogs on cp3007:~ema/varnishlog-frontend-2 and ~ema/varnishlog-backend-2 [15:55:20] curl output here: https://phabricator.wikimedia.org/P3085 [15:55:41] _joe_'s presentation starts in 5 minutes btw :) [15:56:31] yeah [15:56:46] we may actually have two distinct bug patterns here, although maybe one real underlying code bug [15:56:53] <_joe_> yeah but you are the two people who should already know most of what I'm saying [15:57:17] the one that does RespUnset of CL before setting it to zero, and the one that just sets it to zero without a previous Unset [15:57:29] I think my fixes, if they worked at all, would only fix the former not the latter [15:58:43] ema: if fix#1 doesn't work either, I'd say our next best shot in terms of effort:reward-odds is probably stripping down VCL for repro [15:59:00] (and back on the no-hacky-fixes package) [15:59:34] surely if we can repro with stripped VCL, we can make a VTC that emulates this [16:01:09] also, IMHO if we don't find a fix soon today, we should consider reverting the bulk of them back to varnish3 to stop causing errors for users [16:01:39] we could keep 1x eqiad + 1x esams depooled and on varnish4 with an explicit backending-path to repro [16:03:00] +1 [16:03:16] oh hey, while I've been busy 2/4 of my loop-tests finally hit failures [16:04:03] and because of my screwup, one of those doesn't have a valid capture output [16:04:07] the other does though [16:04:16] it was a TTL-boundary case again, through a ulsfo frontend [16:05:27] https://phabricator.wikimedia.org/P3086 [16:05:33] perhaps there are multiple code-paths triggering the same (or similar) bug? 
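For the VTC idea mentioned earlier, a starting point emulating this scenario (a cached object going stale, then being revalidated via 304) might look like the sketch below. It is entirely hypothetical and untested; as noted in the discussion, the exact conditions needed to make it fail reliably weren't known, and the keep/grace tuning here is a guess:

```
varnishtest "CL:0 after 304-revalidation of a stale object (sketch)"

server s1 {
	rxreq
	txresp -hdr {ETag: "xyzzy"} -hdr "Cache-Control: max-age=1" -bodylen 1829
	rxreq
	expect req.http.If-None-Match == {"xyzzy"}
	txresp -status 304 -hdr {ETag: "xyzzy"}
} -start

varnish v1 -vcl+backend {
	sub vcl_backend_response {
		# keep the stale object so expiry triggers a conditional fetch
		set beresp.keep = 10m;
	}
} -start

client c1 {
	txreq
	rxresp
	expect resp.bodylen == 1829

	delay 2

	txreq
	rxresp
	# the bug under investigation would make these fail
	expect resp.http.Content-Length == "1829"
	expect resp.bodylen == 1829
} -run
```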
[16:05:51] other failures we got were not TTL-boundary AFAICT [16:06:01] Age:120, CL:0, no CE:gzip [16:06:21] yeah I think TTL boundary-cases just aggravate it easier, for some types of testing [16:06:38] well or maybe it's not really the TTL-boundary [16:07:27] perhaps just the miss [16:07:29] so in my paste, we have an Age:120 hit from codfw backend, miss in ulsfo backend + ulsfo frontend in front of that [16:07:37] it's just the slight timing race on expiry between those 3 [16:07:45] codfw still had the live object, the next two had just expired it [16:08:20] it was probably the expiry of the grace that it happened on, and not a 304 (no hit+miss, which seems to be a prereq for 304-conditions) [16:08:33] so, 304 is not really a requirement [16:08:43] I think [16:08:47] hmmmmmmmmm [16:12:09] ema: before you get into crazy stripping down of random VCL [16:12:28] maybe just kill the do_gzip = true line in common VCL template on 3007 [16:12:50] for both frontend and backend [16:12:52] it might be one of the most-likely things [16:12:53] yeah [16:12:56] alright [16:13:24] I donno how much effect it will have. we're assuming the eqiad backend always emits good output (I've never observed differently yet) [16:13:42] it would be nicer to have a whole depooled stack to play with including the eqiad backend of course [16:14:01] but I think this scenario is still useful for now [16:14:19] well yeah we could depool one eqiad machine too and point cp3007 to that one only [16:14:37] if we suspect eqiad backends as well [16:14:57] I don't suspect them yet, just noting it's a hole in our test isolation [16:15:24] the output from wdqs->eqiad is pretty consistent. 
it does not honor AE:gzip or conditional-request (it's always ungzipped 200 OK) [16:19:18] trying to repro without do_gzip [16:22:47] bblack: reproduced https://phabricator.wikimedia.org/P3087 [16:24:19] Content-Encoding: gzip (from cp1051 I guess) [16:25:24] I don't see CE in that paste [16:25:33] nope, I've seen it in varnishlog [16:25:39] I'm creating a paste [16:25:42] ok [16:25:55] I guess the frontend gunzipped to zero length here [16:26:06] (but probably the input to gunzip was already screwed) [16:26:31] https://phabricator.wikimedia.org/P3088 [16:27:43] https://phabricator.wikimedia.org/P3088$60 [16:27:53] the bereq section at the bottom looks sane... [16:28:18] the beresp 304 stuff, the translation to 200, and the objheader stuff all do [16:29:37] even the initial respheader in the top request section looks sane [16:29:57] then: [16:29:57] - VCL_return deliver [16:29:58] - Timestamp Process: 1463156480.737622 0.083821 0.000028 [16:29:58] - RespHeader Accept-Ranges: bytes [16:29:58] - RespUnset Content-Length: 1829 [16:30:00] - RespHeader Content-Length: 0 [16:30:15] (and also, lack of gunzip for the client, but maybe that's being blocked by the CL:0 bug doing the above) [16:30:46] maybe the set of Accept-Ranges there is a hint to look for where this happens in code? [16:31:24] I still think that specific block mentioned earlier in cache_req_fsm is actually doing the CL:0 RespHeader line there [16:31:37] I just don't know where the bug lies before we get there and how to fix it [16:32:28] other than setting CL:0 it also avoids sending the body right?
[16:32:42] well, I think [16:32:47] looks like [16:32:59] it may not be explicit, it may just be that setting CL:0 kills the body later elsewhere, too [16:33:15] right [16:33:23] but then there's also that sendbody variable [16:33:27] yeah [16:34:10] but again, it could be that the logic in there is perfect, and the state of bo, req, and req->resp is already buggy on function entry, from a bug elsewhere before we get there. [16:34:36] how come nobody else is screaming about this, I'm wondering [16:35:07] yeah that's what says to me we should still try to strip down VCL [16:35:45] but really, who knows how many high-volume sites where people scream about intermittent errors, with multi-layer varnishes, and a good mix of randomly-different applayer backend behaviors, exist yet... [16:35:51] that have switched to v4 yet [16:36:40] I mean really as a first-test, I'd get rid of virtually all of our VCL [16:37:09] make two new simple single-file VCLs for cp3007 fe + be, which just set the appropriate backend and leave all VCL falling through to defaults? [16:37:59] why not [16:38:53] so basically just the backend_hints [16:38:57] for this isolated test URL case, none of the VCL should matter to get a functional response [16:57:47] heh Accept-Ranges: bytes is set in that same function [16:58:28] that just further confirms, the CL:0 is being set by bin/varnishd/cache/cache_req_fsm.c line 105 [16:58:31] the: [16:58:34] } else { [16:58:36] http_Unset(req->resp, H_Content_Length); [16:58:39] if (req->resp_len >= 0 && sendbody) [16:58:41] http_PrintfHeader(req->resp, [16:58:44] "Content-Length: %jd", req->resp_len); [16:58:49] we've still got two distinct trace cases in varnishlog...
[16:59:05] one where we see that Unset first before Set CL:0, the other we don't see unset [16:59:16] I'm guessing the only difference is that the unset goes unlogged if there was no existing header to unset [17:02:09] it also means we know sendbody is true when we set CL:0 [17:02:28] so we know status already == 200 in there [17:02:58] but this if-case fails: } else if (resp_len >= 0 && resp_len == req->resp_len) { [17:03:08] and this one succeeds: if (req->resp_len >= 0 && sendbody) [17:03:29] and we know req->resp_len is zero (because that's what's used as the value 0 in the set header) [17:03:53] so, resp_len isn't zero (or the first if above would succeed rather than fail) [17:04:49] basically, resp_len has the right body size value. the response already has content-length: $resp_len. but because of the logic here, we wipe it out and replace it with req->resp_len, which is zero (which is probably from a 304, I think?) [17:08:10] req->resp_len became zero when set at the top of that function [17:08:36] I'm butchering our VCL in the meantime [17:08:40] I would think bo!=NULL in our case with the 304s (because it's a busyobj) [17:09:28] which would set req->resp_len from the existing req->resp's Content-Length header (which might be zero in this case) [17:10:11] and in http_GetContentLength, if the header didn't exist at all it would've set resp_len to -1, not 0 [17:10:58] so our faulty req->resp_len==0 can get set in two distinct cases: [17:11:16] 1) req->resp already had a Content-Length:0 header [17:11:59] but then req->resp and resp_len would both be zero, and we would never reach http_PrintfHeader(req->resp, "Content-Length: %jd", req->resp_len); [17:12:20] 2) bo==NULL and it came from req->resp_len = ObjGetLen(req->wrk, req->objcore); [17:12:36] I'm really not sure what req->objcore means at this point [17:13:56] I mean I guess it's the actual object (private object, or cache object) that we're planning to use to generate this response [17:14:18] but is it at
that moment the expired 304 object in cache? the already-remangled-to-200 object? [17:17:43] maybe the fault in my earlier patch attempts, is that at this point bo is NULL, and we don't have a busyobj, at least not in all cases [17:19:57] do we have a log where we're sure (other than X-Cache:miss instead of hit+miss, which I think is maybe dubious for this purpose) that 304 was not involved in the screwup? [17:20:43] mmmh I think I've only seen screwups with 304s in varnishlog [17:33:07] https://phabricator.wikimedia.org/P3089 [17:33:16] fix attempt #3, if -VCL hasn't gotten anywhere [17:33:43] that's almost certainly not an appropriate/correct source patch from upstream POV. it probably breaks some other case we don't care about like ESI or something. [17:33:56] but it sure seems like it should paper over this one stupid case... [17:34:37] I've just finished the butchering [17:34:53] see simplified-backend.vcl and simplified-frontend.vcl [17:35:00] does it fix it? [17:35:10] don't know yet :) [17:35:32] ok :) [17:35:59] mmh [17:36:02] Log abandoned [17:36:02] Log reacquired [17:36:02] Log abandoned [17:36:02] Log reacquired [17:36:07] when running varnishlog [17:36:09] that can't be good [17:36:27] re: source fixups, I still think the error lies deeper, maybe not in the logic I'm editing. so any fix there may really just be a bandaid even if it does work [17:36:36] usually that means crash? [17:36:48] oh, it keeps on exploding with Assert error in vmod_vslp_backend(), vmod_vslp.c ...46: Condition((vslpd) != NULL) not true.... [17:38:12] PEBKAC [17:38:27] btw: vslp bug was another random theory the other day btw. I think I at least minimized the odds of that by looking at it, but technically it's still possible. 
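The simplified-backend.vcl / simplified-frontend.vcl mentioned above presumably reduce to little more than a backend definition. Something like the sketch below (hypothetical backend name/address taken from the hosts discussed; every subroutine left unimplemented so requests fall through to builtin.vcl):

```vcl
vcl 4.0;

# simplified-backend.vcl (sketch): none of the wikimedia-common logic,
# just point at the eqiad backend and let builtin.vcl handle the rest.
backend cp1051 {
    .host = "cp1051.eqiad.wmnet";
    .port = "3128";
}

sub vcl_recv {
    # "so basically just the backend_hints"
    set req.backend_hint = cp1051;
}
```

If the bug still reproduces with this in place, our VCL is off the hook and the remaining suspects are varnishd itself and the vslp director.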
[17:38:38] after this test, might try just using random backend and not vslp, too [17:38:47] I don't think it will help, just something to try [17:41:29] test running without VCL basically [17:42:01] neon.wikimedia.org:~ema/run-4/ for live updates [17:43:34] no fail so far! [17:43:53] and pretty boring X-Cache! [17:44:16] heh [17:45:57] so, probably our VCL is triggering [17:46:16] or the builtin VCL is not triggering [17:46:48] if that holds up, I guess put back the whole normal VCL, and start commenting out dangerous ugly chunks in the wikimedia VCL files (but without killing the call chain down to cluster_[fb]e_ functions) [17:47:01] well either way, it's our VCL blocking default VCL [17:47:51] repro [17:48:04] heh [17:48:33] now the question is: did cp10xx emit bad output to cause the repro, or did cp3007 screw it up on its own [17:48:43] let's see [17:49:00] < X-Varnish: 236942524 223044366 [17:51:08] looks ok on the cp1051 side to me so far [17:51:45] oh I'm looking at the wrong xid from that cache hit, it's all over the place [17:52:53] X-Varnish isn't enough to go on, we hit that pair of values multiple times... [17:52:59] yep [17:53:26] wait there are two X-Varnish [17:53:33] this one as well: < X-Varnish: 136002 [17:54:02] https://phabricator.wikimedia.org/P3090 [17:55:11] looks like 3007:3128 made the mistake [17:55:12] - RespHeader Accept-Ranges: bytes [17:55:12] - RespHeader Content-Length: 0 [17:55:45] of course, streaming plays into this too... [17:56:00] how does it even know resp_len when streaming, in time to set a header? [17:57:49] I think the error here was in translating cp1051's 304 into a 200 by cp3007:3128 [17:58:24] the request from fe->be in 3007 had no IMS/INM headers, which is why it wasn't a 304->fe [17:59:02] have we ever seen this happening except for 304->200?
[17:59:13] not that I'm aware of [17:59:24] well, not that I can definitely verify with varnishlog [17:59:45] a simpler variant of my fix #3 that could also work, in light of streaming and blah blah: [18:00:52] https://phabricator.wikimedia.org/P3092 [18:01:19] so, two patches to try to repro with blank-ish VCL: that one or https://phabricator.wikimedia.org/P3089 [18:01:57] a good sanity check would be to run the stock VTC with these patches too. that would tell us if they royally screw up some other kind of request. [18:02:14] yep [18:02:23] I'll have to leave soon unfortunately :( [18:02:56] yeah [18:03:09] perhaps tomorrow I can take another look, not sure though [18:03:44] I'll depool an eqiad and hack 3007 to only look at it [18:03:58] and I think, try to downgrade to v3 on the rest, and see if that goes easily or not [18:04:31] let me give a try to those two patches as a last attempt for today [18:04:31] otherwise we're risking weekend reports of broken cache_misc sites randomly [18:09:49] bblack: so this would be it right? https://phabricator.wikimedia.org/P3093 [18:10:59] oh, was it either one or the other? [18:11:17] they're independent [18:12:29] of the two, https://phabricator.wikimedia.org/P3089 is more likely to fix our case without breaking others, I *think* [18:12:48] I'm not sure if there's some case where CL:0 is legitimately supposed to be set there [18:13:02] if there's not, https://phabricator.wikimedia.org/P3092 is far simpler [18:13:21] I donno though, https://phabricator.wikimedia.org/P3092 may not do much good [18:13:28] 3089 is a better test [18:13:47] ok, trying 3089 [18:18:26] oh the icinga downtime for cp3007 is over [18:18:58] OK, cp3007 is running with P3089 and simplified vcl now [18:18:59] P3089 fix attempt 3 for T134989 - https://phabricator.wikimedia.org/P3089 [18:20:30] live updates on neon.w.o:~ema/run-5/ [18:21:18] heh [18:21:28] it's a fascinating test! 
[18:21:35] bblack@neon:/home/ema/run-5$ grep SIZE test.*|grep -v 6304 [18:21:36] test.0: margin-rightSIZE: 1829 [18:21:42] I think 1829 is the gzipped size [18:21:56] it emitted ungzipped content + gzipped size in those failures [18:22:02] I think [18:22:10] there's a few of those [18:22:26] getting closer! :) [18:22:36] yup [18:22:47] there's no CE:gzip, and the full output gets sent, but then it sends a CL:gzip-size [18:23:02] let it run a bit, I'm curious if that's the only kind of fail we get [18:23:12] sure [18:23:54] varnishlog-frontend-5 and varnishlog-backend-5 in my home for this run [18:24:02] oh, duh [18:24:29] interesting [18:24:51] upon failures it's now noticeably slower to get the response [18:25:04] failures as in size=0 [18:25:26] ah, damn [18:26:17] right now [18:26:21] if it was only the 1829 problem, I think I had a workaround [18:26:35] but not for size==0 [18:27:24] the size==0 ones are different now [18:27:28] curl: (18) transfer closed with 1829 bytes remaining to read [18:27:32] < Content-Length: 1829 [18:27:42] yeah but they still sent zero content [18:27:47] yep [18:27:51] I can fix the 1829 thing [18:28:02] it's because resp_len was taken from before gzipping, right in the same function heh [18:28:52] oh [18:29:19] ok now I really have to go. Do you want me to keep the test running? [18:31:25] nah [18:31:30] have a fun weekend! [18:31:50] you too! [18:33:46] I'm gonna step away for an hour or so and reset my brain from debug-mode a bit [18:34:18] when I get back I'm gonna isolate an eqiad depooled, and then save that eqiad + cp3007 as v4 cache_misc tests for further work here, and revert the rest to varnish3 for live traffic [20:23:20] gehel: bblack is there any point in reporting more URL having issues with misc ? 
:d [20:23:59] had one that served me a blank page, and after curl + force refresh in my browser it is not being served properly :-} [20:24:17] https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease from T135238 [20:24:17] T135238: This file can not be downloaded: https://releases.wikimedia.org/debian/dists/jessie-mediawiki/InRelease - https://phabricator.wikimedia.org/T135238 [20:24:33] served from bromine.eqiad.wmnet if one wants to look at apache logs [20:25:00] it's still good to merge them in [20:25:07] we have enough testcases to know how to debug at this point though [20:25:39] (and I'm in the process of reverting the majority of the cluster to v3 to temporarily-resolve this for the weekend, and then we can debug more on 2x varnish4 servers left depooled for testing next week) [20:25:41] I can imagine [20:25:48] will update the main ticket when I'm done [20:26:13] at least it has been very stable all day long. I haven't encountered any issue besides the page above [20:26:34] well we've been doing various things that "seem to help" and they at least reduce the incidences of it [20:26:47] but they're not really fixing the bug, they're just making it harder to find the bug :) [20:27:11] unapply the 0007-dust-under-carpet.patch in the .deb [20:27:37] good luck with the v4 -> v3 rollback :( [20:28:58] thanks! [20:48:15] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2293942 (10BBlack) Instructions for downgrading nodes to varnish3 (trialed on cache_misc): 1. disable puppet on affected nodes: 2. Update hieradata to remove varnis... [20:52:48] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2293963 (10BBlack) I forgot one of our temporary hacks in the list above in T134989#2290254: 4. https://gerrit.wikimedia.org/r...
[21:12:58] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2294011 (10BBlack) [21:13:02] 10Traffic, 10Varnish, 06Operations: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2294012 (10BBlack) [21:13:05] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2294013 (10BBlack) [21:13:08] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2294007 (10BBlack) 05Resolved>03Open T134989 couldn't be resolved in a reasonable timeframe, and is corrupting some responses (zero body content length). I've reverted... [21:13:11] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2294014 (10BBlack) [21:13:14] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10BBlack) [21:16:34] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2294023 (10BBlack) Current State: * cp3007 and cp1045 are depooled from user traffic, icinga-downtimed for several days, and... [21:26:03] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2294057 (10TerraCodes) [22:12:13] bblack: and congratulations ! [22:15:31] thanks! 
[22:20:38] have a good week-end and rest well [22:46:00] 07HTTPS, 10Traffic, 06Operations, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2294355 (10BBlack) Announcement email (finally) sent! The cutoff dates/process are: * 2016-06-12 - We'll randomly reject 10% of insecure POST with "403 - Inse... [23:54:31] bblack: yay! re ^ :D