[02:33:57] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283687 (10BBlack) @Yurik - ok. FYI, we separately cap 4xx lifetimes anyways. Currently at 1 minute on all clusters, although IMHO we could probabl... [02:35:47] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283690 (10Yurik) Thanks @bblack. Are there any headers I should add to the regurar 200 replies beyond setting maxage? [02:36:04] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2283693 (10Danny_B) For cross reference: - {T96848} - {T134817} [02:40:05] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10Danny_B) [02:42:14] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283715 (10BBlack) Well, that's a complex topic, but what you're doing now will work fine for now :) Someday we'll have a guideline document to poin... [03:09:19] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10BBlack) I'm not sure we need a whole separate tag for HTTP/2. [03:10:08] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2283745 (10BBlack) While there's a possibility this is an HTTP/2 issue, it's also possible that's a number of other things, including some generic bug in Firefox that's not protocol specific. [03:29:37] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10Krenair) I'm sceptical as well. What existing tasks would go into this project? [05:17:22] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283861 (10Danny_B) I think all new(ly used) technologies deserve tag since they may be buggy or cause issues (typically compatibility). For this cases we have i.e. #newphp Also, we have #https too... AT... [09:58:43] mmmh I never seem to be able to get geo-routed to ulsfo [09:59:00] even from hong kong and singapore I get to codfw [10:04:11] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2284310 (10Samtar) Hi @BBlack thanks again for the above - I did remember to turn this off, not keen to be surfing around without it! At work (Win 8.1 Pro + FF 46.0.1) the issue reappeared until changing to... [10:13:23] <_joe_> ema: I think bblack and robh did set ulsfo down yesterday [10:13:35] <_joe_> for robh's work in the dc [10:13:38] <_joe_> check the dns [10:13:45] <_joe_> if ulsfo is set down [10:14:01] mmmh, but I've seen traffic on text caches [10:14:53] you're right though, the last dns commit sets ulsfo down [10:16:44] perhaps the text traffic can be explained with broken dns caches [10:17:12] <_joe_> yes [10:17:42] thank you! 
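For quick manual checks like the geo-routing question above, something along these lines shows which site you are actually being routed to (a rough sketch; en.wikipedia.org is just an example hostname, and reading the DC out of the cp host numbers in X-Cache is an informal convention rather than a stable interface):

```
# Which LVS address does GeoDNS currently hand your resolver?
dig +short en.wikipedia.org

# Which cache hosts actually served the request? cp1xxx/cp2xxx/cp3xxx/cp4xxx
# roughly correspond to eqiad/codfw/esams/ulsfo.
curl -sI https://en.wikipedia.org/ | grep -i '^x-cache'
```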
[10:19:05] ulsfo misc looked way too silent :) [11:15:40] bblack, ema: I'm setting kernel.unprivileged_bpf_disabled=1 on the Linux 4.4 systems as a hardening measure (https://lwn.net/Articles/660331/ for background, triggered by the recent https://bugs.chromium.org/p/project-zero/issues/detail?id=808) [11:16:05] I'll start with two cp* and lvs* systems initially and proceed with the rest if no problems arise (which I don't expect) [11:16:20] but better safe than sorry, since that sysctl value can only be reset with a reboot [11:16:44] unprivileged bpf is really new after all [11:17:38] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284392 (10BBlack) #HTTPS is a little different: that transition is a huge, long-term effort spanning years. HTTP/2 is only a slight change from the SPDY/3 that we supported for quite a while before it.... [11:18:29] any preferences on those canary systems so that I don't mess with ongoing maintenance work? [11:18:48] (once that is rolled out initially, I'll also puppetise it) [11:27:19] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2284408 (10BBlack) We may have been seeing the same bug here as https://bugzilla.mozilla.org/show_bug.cgi?id=1271301 - you might want to see if the test in that bug work from FF 47 but not 46 as well. Ther... [11:27:44] moritzm: when it comes to the v4 upgrade, we've done misc esams and ulsfo [11:29:08] codfw and eqiad are next, so perhaps it would be better not to use misc codfw/eqiad as canaries [11:30:04] hit cp1065 and cp3038 if you want a pair of canaries to be extra safe [11:30:11] that's one text in eqiad and one upload in esams [11:30:26] I think it will be fine to set the sysctl everywhere though, seems low risk [11:30:58] oh you need ones already on 4.4 to test with, though :) [11:31:55] yesterday I've rebooted cp1065 and cp1066, so those are running 4.4 for sure [11:32:18] are they running the newest 4.4 now? I think moritzm uploaded a new one just last week [11:32:50] I haven't upgraded anything, just rebooted them. 4.4.0-1-amd64 [11:33:00] bblack: that one is currently building on copper (I cherrypicked an additional fix for one security bug not in 4.4.9) [11:33:12] ok [11:33:16] {'cp3038.esams.wmnet': '4.4.0-1-amd64 #1 SMP Debian 4.4.2-3+wmf1 (2016-02-29)'} [11:33:20] so that when ema starts upgrading we'll have the latest one upfront [11:33:34] {'cp1065.eqiad.wmnet': '4.4.0-1-amd64 #1 SMP Debian 4.4.2-3+wmf1 (2016-02-29)'} [11:33:44] so yeah, the ones I said earlier, use those two for canaries :) [11:34:18] ok, so I'll use cp3038 and cp1065, as said I don't expect any risk, but better play safe [11:34:25] yup [11:34:47] and I'll pick lvs2001 and lvs4001 in addition [11:35:29] back on the general topic of kernel updates, with the disk startup race gone it should be easy at most sites [11:35:56] but we still have the strange ipmi_si power issue in esams. I think it's still going to bite us, and fixing it may involve doing some remote BMC upgrades [11:36:53] there's only at most 20 affected machines (and I'm not sure they're all affected) - cp30[34]x [11:38:05] lacking a real fix (probably BMC update): they may not reboot when software/kernel tells the machine to do so, without racadm intervention for powercycle. 
and the kernel may not boot successfully afterwards without editing grub cmdline entry to blacklist ipmi_si [11:38:39] it's also possible the "fix" is just to fully reset the BMC by pulling out the power cables of the machines, but there's nobody there to do it :) [11:40:40] https://phabricator.wikimedia.org/T126062 [11:43:21] I'm not sure if that's the same issue, but when I mass-rebooted systems for the keyctl bug, we had at least 20 of those in codfw as well (except sympton 4). Papaul upgraded the idrac firmware and all further reboot tests with a depooled server were fine [11:45:00] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2284438 (10Yurik) 05Open>03Resolved a:03Yurik ok, will close for now, thx for your help! [11:45:20] but the lack of dedicated ops-esams also makes that hard.. [11:47:52] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284449 (10Danny_B) OK, another: {T117682} There is bunch of tasks regarding SPDY, so maybe #HTTP2-SPDY then? [11:58:33] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284470 (10BBlack) So, recapping the strange evidence above, because I've been thinking about this off and on all night and it still makes no sense: * The only change in... [12:02:56] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284474 (10BBlack) SPDY is dead. It was an experimental protocol that predated HTTP/2, and HTTP/2 was derived from it. We dropped support for SPDY when we turned on HTTP/2 last week. Chrome's dropping... [12:18:28] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284589 (10BBlack) Trying a patch that will test the first two ideas above... [12:55:21] 10Traffic, 10Varnish, 06Operations: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2284895 (10ema) [12:55:24] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2284894 (10ema) [12:55:28] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2284896 (10ema) [12:55:31] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2284893 (10ema) 05Open>03Resolved [12:56:42] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284900 (10BBlack) I'm out of good ideas for now. I've tried both of the obvious alternatives (1.1 w/o explicit Connection setting, and 1.1 w/o explicit Connection setti... [13:05:04] bblack: curiosity about ---^ and http format error - is Varnish getting garbage from the backend (mediawiki I suppose) or is it considering for some reason a legitimate response as garbage? [13:05:21] the third option might be that I need to re-read the phab task [13:05:56] it's a pretty complicated jumble of things, yeah [13:06:34] because those 503s are rare in the stream of traffic, and varnish doesn't actually log a dump of the response it doesn't like... correlating that to e.g. 
sniff of traffic is hard [13:06:45] I don't yet have a sample of the response it rejected [13:07:47] I think it's probably true that varnish is getting garbage from MW/Apache in these cases. The question is: what kind of garbage, and why does the distant action of switching protocol versions for nginx->varnish-fe cause this way down at varnish_be->mw/apache ? [13:08:17] the next crazy theory on my list that I'm trying to run down now is that this is related to ETag/If-None-Match sorts of things [13:08:41] but it's a pretty unlikely theory [13:09:54] and since this behavior is not all that un-similar from what happened when we tried keepalives with 1.9.4 (so apparently those issues weren't really about keepalives, but about turning on 1.1 as a prerequisite for keepalives)... [13:10:10] if this is some pure nginx bug when using 1.1, it's been there for a long time and I don't see other reports, so that's also unlikely. [13:10:21] * elukey nods [13:10:38] I think any way we slice this though, if we find the bug it's going to be something that was incredibly unlikely [13:12:09] it's sort of like on House (if you remember that show), it seems like every episode there was this conversational exchange something like "That can't be it, the odds of that are 1 in millions" and someone else would say "yeah but if it was a common problem, we wouldn't even be dealing with it, it would already be solved" [13:15:45] misc on varnish 4 I see? [13:15:48] congrats :) [13:16:07] \o/ [13:17:55] \o/ [13:18:44] this time ganglia didn't explode, I've chmodded the VSM files immediately after upgrading [13:18:55] ema: one thing that's puzzling me.... [13:19:06] would be awesome to find a neat way to add ganglia to the varnish group in puppet [13:19:21] well hold that thought [13:19:29] I never chmodded VSM files for cache_maps servers I don't think heh [13:19:42] uh? [13:20:46] when I upgraded the 16x new maps servers to varnish4, all I did was: https://phabricator.wikimedia.org/T126206#2237213 [13:21:30] https://ganglia.wikimedia.org/latest/?c=Maps%20caches%20ulsfo&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [13:21:36] root@cp3004:~# ls -l /var/lib/varnish/*/*.vsm [13:21:36] -rw-r----- 1 root varnish 84934656 Apr 25 22:30 /var/lib/varnish/cp3004/_.vsm [13:21:38] -rw-r----- 1 root varnish 84934656 Apr 25 22:38 /var/lib/varnish/frontend/_.vsm [13:21:54] explains a lot :) [13:22:00] :) [13:22:02] so what's the chmod command again? 
[13:22:17] I think when I saw the ganglia mess, I just said "ganglia sucks as usual" again :P [13:22:47] I just did 644, which are the perms v3 was using [13:22:55] ok [13:23:28] fixing [13:24:58] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2285011 (10BBlack) One more thing to remember when upgrading nodes above: ``` chmod 644 /var/lib/varnish/*/*.vsm service ganglia-monitor restart ``` [13:25:09] my theory is that if gmond sends screwed up stuff to the aggregator 'for a while', things get messed up and a gmond restart on the host is not enough, the relevant aggregator also needs to be restarted [13:25:23] maybe [13:25:30] they seem ok so far in ulsfo [13:25:34] nice [13:25:48] anyways back on my puzzling thing [13:26:03] in misc frontend VCL we have: [13:26:05] sub cluster_fe_backend_response { [13:26:05] // hit_for_pass on objects >= ~10MB or no CL [13:26:05] if (!beresp.http.Content-Length || beresp.http.Content-Length ~ "^[0-9]{8}") { [13:26:08] <% if @varnish_version4 -%> [13:26:10] set beresp.uncacheable = true; [13:26:13] return (deliver); [13:26:15] <% else -%> [13:26:18] return (hit_for_pass); [13:26:20] <% end -%> [13:26:23] } [13:26:48] right, because of https://www.varnish-cache.org/docs/4.0/whats-new/upgrading.html#hit-for-pass-objects-are-created-using-beresp-uncacheable [13:26:57] and we have this WDQS test URL: https://query.wikidata.org/sparql?query=SELECT%20%3Fqid%20%3Fimdb%20WHERE%20%7B%3Fqid%20wdt%3AP345%20%3Fimdb.%20FILTER%20regex%28str%28%3Fimdb%29%2C%20%22nm%7Ctt%22%29%7D%20ORDER%20BY%20%3Fqid [13:27:07] which is a ~70MB object [13:27:31] when I test it now, I get: [13:27:32] < Content-Length: 76518133 [13:27:45] < X-Cache: cp1045 miss(0), cp2025 miss(0), cp2025 frontend hit(2) [13:28:16] which means either that block doesn't really work right, or our X-Cache stuff doesn't work the same in V4 as it did in V3 (but I think it's ok) [13:29:02] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10Gehel) [13:30:07] gehel: the misc caches were just upgraded, and there was also nginx experimentation going on at the same time [13:30:24] you may want to re-test in a few minutes, and also see above that's mildly related [13:31:23] * gehel is reading back... [13:31:53] bblack: good to know that you are already fixing this! I could not understand what was going on and was getting ready to ask for your help... [13:33:53] the stuff above isn't directly related [13:34:04] but still, the test is just before/during various cache_misc upgrades [13:34:23] the response I get for your test URL has Content-Length:0 currently and is a cache miss [13:34:26] err sorry, cache hit [13:34:44] I see wdqs1001 returns a correct result [13:35:49] trying to ban it and see what happens [13:37:45] test URL works now [13:38:05] probably some temporary fallout from nginx experimentation and/or the upgrade process [13:39:35] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10BBlack) Assuming there was no transient issue (which became cached) on the wdqs end of things, then t... [13:41:27] ema: new 4.4.9 kernel is built and working fine in my test instance, will upload to carbon next [13:43:27] thanks! 
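A sketch of how to double-check what that size-based hit-for-pass logic actually sees for the WDQS test URL above (hedged: plain curl reports the post-gunzip size, which is not necessarily the Content-Length the varnish layers exchange for a compressible object):

```
# The ~70MB WDQS test query quoted above
URL='https://query.wikidata.org/sparql?query=SELECT%20%3Fqid%20%3Fimdb%20WHERE%20%7B%3Fqid%20wdt%3AP345%20%3Fimdb.%20FILTER%20regex%28str%28%3Fimdb%29%2C%20%22nm%7Ctt%22%29%7D%20ORDER%20BY%20%3Fqid'

# Cache disposition and the Content-Length actually returned to the client:
curl -s -o /dev/null -D - "$URL" | grep -iE '^(x-cache|content-length|age):'

# Body size on the wire, without and with gzip:
curl -s -o /dev/null -w 'plain: %{size_download} bytes\n' "$URL"
curl -s -o /dev/null -H 'Accept-Encoding: gzip' -w 'gzip:  %{size_download} bytes\n' "$URL"
```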
[13:47:34] ema: ok I found my mystery puzzle answer [13:47:43] Content-Length: 3613019 [13:47:54] ^ is the gzipped content-length being passed around between varnishes, and what counts for that check [13:48:03] oh, and that's 7 digits [13:48:06] the ~70MB size is after it gunzips for curl w/o --compressed of course [13:48:27] since we store it in the cache in gzipped form too, applying the size limits that way makes sense anyways [13:48:47] I was trying to reproduce the situation in VTC but got stuck at '---- c1 0.7 Assert error in http_rxchar(), vtc_http.c line 406: Condition(hp->prxbuf + n < hp->nrxbuf) not true.' [13:48:52] just have to remember that uncompressed sizes for compressible things may not seem to behave intuitively [13:50:12] bblack: was this different on v3? [13:50:18] probably not [13:51:26] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285161 (10Gehel) p:05Triage>03High a:03Gehel [13:54:15] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285184 (10Gehel) I can't reproduce the issue. Since it was random, I might just be lucky. Closing this anyway f... [13:54:26] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285185 (10Gehel) 05Open>03Resolved [13:56:32] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2285204 (10ema) 05Open>03Resolved a:03ema varnishstatsd has been behaving properly for a while now, logging a me... [14:02:07] ema: btw I saw some varnishrls Flush message spam the other day, did we ever fix that up? [14:05:50] bblack: oh, right! I forgot [14:06:47] what I'm not sure about is whether it makes sense to call flush_stat *that* often [14:07:41] it pops items from a dict and sends them to statsd, so actually it should be called only once I think [14:09:35] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285279 (10Gehel) 05Resolved>03Open [14:12:02] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285284 (10Gehel) Seems the issue (or a similar one is still happening), I'm trying to reproduce... [14:16:30] bblack: testing on T134989, I see a bunch of other URLs returning empty content. I have identified a few CSS and JS files. What magic did you do to fix it before? [14:16:30] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [14:18:31] gehel: WDQS URLs? [14:19:32] bblack: not sure I have the complete list yet [14:20:00] well, the "bunch of other URLs returning empty content" - are they all WDQS URLs so far, or are you talking about other URLs on unrelated services/clusters? 
[14:20:16] because I didn't think WDQS did much css/js [14:21:31] bblack: there is a UI, so yes it does a bit of css / js [14:21:35] https://www.irccloud.com/pastebin/TLX4nZQF/ [14:21:42] bblack: I've got a similar report for https://stats.wikimedia.org/EN/WikipediaStatistics15.js [14:22:09] elukey: Thanks! I feel less alone! [14:22:52] bblack: I'm not really sure how to find an exhaustive list. The list above was me poking around and looking for broken things, but I'm pretty sure there is more... [14:23:17] I am trying to curl the URL and it is intermittent, sometimes I get the payload and sometimes not [14:23:43] if the solution was clearing the varnish cache, for WDQS we can probably get away with clearing everything (low enough traffic). If it is more general, probably not. [14:24:02] elukey: yep, sounds just like what I saw... [14:24:34] I already cleared all of WDQS back when I fixed that one URL [14:24:47] so if it's still happening with WDQS URLs, it's not fixed by clearing the cache [14:24:58] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: varnishmedia: repeated calls to flush_stats() - https://phabricator.wikimedia.org/T132474#2285323 (10ema) If flush_stats is not 'fast enough' sending stats to statsd, it happens that other transactions call it again, thus leading to the 'Flush' messag... [14:25:47] gehel: are you getting consistent failure on your pastebin list, or intermittent, or? are they all cache hits and still intermittent? from the cache frontend cache? [14:26:10] I've tried elukey's stats link and it's ok for me in repeated tests so far [14:26:50] although I still think something's off with our size-based pass logic [14:26:54] but that shouldn't cause empty output [14:27:01] bblack: it is intermitent and actually fairly rare, probably 1 in 10 or 1 in 20 is failing [14:27:38] bblack: I don't have more analysis yet. Let me script this a bit ad see if I can get something useful [14:28:59] bblack: I can repro now - https://phabricator.wikimedia.org/P3033 [14:31:13] and another one : https://phabricator.wikimedia.org/P3034 [14:32:34] I'm thinking this is esams-specific for some reason. I can only repro when I switch to the frontend servers you guys hit there [14:32:49] looks like WikipediaStatistics15.js got cached as empty on cp3007 [14:33:09] yeah, and it's recent too, because I banned all objects with host==query.wikidata.org earlier [14:33:22] (after all upgrades done) [14:33:40] bblack: newbie question, but how do I tell if it was served from cache? I guess that Age header < s-max-age is a good indication, but there is probably something more direct [14:33:43] so, generally speaking cache_misc caching isn't all that critical [14:33:48] gehel: X-Cache [14:33:59] if it says "hit" anywhere in that line, it was a cache hit at some layer [14:34:10] anyways [14:34:16] stop hitting test URLs for a few [14:34:34] * gehel stopping [14:34:34] let's wipe the caches in case there's some strange problem related to the cluster having been half-upgraded over the past ~24H [14:34:41] and then we can try some of these again [14:35:28] note that cp3007 is the machine I upgraded yesterday [14:36:01] so it's been running v4 for ~1 day while the rest of the cluster was running v3 [14:37:17] the wipe will take a few mins... [14:42:58] ok caches wiped, try again! 
[14:43:18] I'm noting a few things wrong with our pass on content-length stuff, but again that shouldn't cause broken objects, just inefficient caching [14:44:02] 1) layer-differential pass and using varnish-be-rand for passes doesn't interact well. if we're only passing for !Content-Length or CL>X, we should still be using chash [14:44:34] uh varnishkafka died on a bunch of hosts [14:44:41] 2) apparently it's not true that do_stream=false in the backend-most cache gives Content-Length to upper cache layers for applayer chunked responses, because they're all passing on those except for the backend-most [14:44:52] s/a bunch/all the misc/ [14:44:58] ema: yeah I think it has crashed on varnishd restart before [14:45:04] ema: cleanly or screaming horrible things? [14:45:24] it exited with 0 [14:45:31] but screaming things :) [14:45:42] example please :) [14:45:50] or maybe host [14:45:55] bblack: I'm trying again, can't reproduce the issue yet... [14:45:59] going to puppet them all to get stats daemons going again [14:46:22] elukey: cp1045 for example [14:47:51] ah a lot of kafka horrible messages! This might be due to the fact that we upgraded now to 0.9 [14:48:16] more precisely: the cluster is still running with 0.9 with 0.8 compatibility mode settings [14:49:10] varnishkafka died?! [14:49:29] bblack: issue appearing again (no good trace, I was trying from Chrome) [14:49:31] ottomata: it doesn't like it when we restart varnish apparently [14:49:34] i was watching vk logs on one host and i saw it notice when i restarted hosts, and then it seemed fine [14:49:37] oh varnish [14:49:39] hm [14:49:52] hm [14:49:57] hm [14:50:05] so, related to kafka cluster restarts/upgrade? [14:50:07] or not? [14:50:08] hm [14:50:23] not [14:50:42] we restarted varnishd processes on all cache_misc, and varnishkafka crashes when you do that, at least lately [14:50:59] varnishreqstats also crashed [14:51:04] yeah [14:51:25] on 1045 I can see [14:51:26] May 11 14:24:56 cp1045 varnishkafka[23054]: KAFKAERR: Kafka error (-195): kafka1020.eqiad.wmnet:9092/bootstrap: Failed to connect to broker at kafka1020.eqiad.wmnet:9092: Connection refused [14:51:31] gehel: elukey: still no reproduction since wipe? [14:51:44] yes, I did reproduce... [14:51:44] yep! [14:51:51] yep == didn't repro [14:51:59] gehel: details? [14:52:57] ema: did you see something different? because it does look like related to the kafka upgrade rather than the restart itself (I mean, the restart might have triggered this behavior because of the kafka cluster change at the same time) [14:53:00] I was doing the test from chrome. Bunch of static resources empty. Cache hits for all [14:53:31] reproduce it in curl! :) [14:54:19] elukey: well the kafka errors happened at 14:24, vk exited at 14:41 [14:54:40] bblack: just did ! [14:54:50] so I guess it's two different issues: vk screaming because of the kafka upgrade and vk exploding when varnish gets restarted [14:55:05] https://phabricator.wikimedia.org/P3036 [14:55:24] ema: true, didn't notice the diff with the TERM [14:56:00] hm ok [14:56:09] elukey: we should fix both problems though :) [14:56:53] ema: the kafka one might be only garbage because the producer code does emit some errors when the brokers change [14:58:02] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285393 (10Ottomata) All analytics brokers are now upgraded to confluent 0.9! 
Tomorrow we will switch off 0.8 inter broker... [14:58:04] so that one might be librdkafka emitting some "oh noes!" [14:58:14] yeah, that's ok [14:58:16] gehel: even using your frontend (cp3008), I can't get a repro [14:58:38] that's just it noticing that the broker layout has changed, immediately after it issues a metadata request, figures out the new layout, and starts using it [14:58:45] bblack,ema: also a note about the kafka upgrade - https://grafana.wikimedia.org/dashboard/db/kafka?panelId=32&fullscreen [14:58:49] but: I really think we should solve the pass-consistency issues first anyways, because they make testing more confusing/inconsistent [14:59:04] bblack: it seems to be fairly rare... but it still happens. I have no idea how to reproduce, except trying again and again... [15:01:06] awesome now I have to tell my curl to use http/1.1 otherwise it defaults to h2 :) [15:01:24] so, the notion we had that do_stream==false would get us a content-length that would pass upwards seems to be false. I think it only passes upwards one layer [15:01:38] and, the great surprise: can repro gehel's issue with http1.1, not with h2 [15:02:06] no, sorry about that, it was just a coincidence [15:02:06] ema: consistently? [15:02:09] ok [15:02:14] let's stop supporting http1.1, h2 is so much better anyway :P [15:02:26] ema: what X-Cache line on the repro hit? [15:03:11] bblack: I didn't copied it, got too excited about the h2 thing [15:03:16] ok :) [15:03:57] so basically, I think do_stream possibly converts a response with content-length to one without it [15:04:05] (which is on by default in v4) [15:05:03] (or in other words: if varnish do_stream a request and the backend response is not TE:chunked, the output response will be TE:chunked still) [15:05:18] that's my working theory anyways [15:05:49] does wikidata speak http/1.0 by any chance? :) http://www.gossamer-threads.com/lists/varnish/misc/38048 [15:06:01] whereas explicitly setting do_stream=false does the opposite: it will convert a backend response that is TE:chunked to a non-TE:chunked output [15:07:11] heh [15:07:11] sub vcl_backend_fetch { [15:07:12] set bereq.proto = "HTTP/1.1"; [15:07:12] } [15:07:16] is that it? [15:07:20] maybe [15:07:36] can try a patch anyways [15:07:49] if it works on misc, should probably promote that to if varnish_version4 in common VCL [15:08:12] does that even exist in v3? setting bereq.proto? [15:08:45] no idea [15:09:05] apparently the varialbe exists in v3 [15:09:29] this could (just to increasing how confusing and complex my week is!) also be relevant to the strange nginx http 1.0 vs 1.1 issues too [15:09:42] s/increasing/increase/ [15:09:45] :) [15:09:56] should we add this to misc-backend.inc.vcl.erb? [15:09:56] hmmmm [15:10:05] well [15:10:09] I'm not sure yet [15:10:19] or perhaps it's needed also between varnishes [15:10:22] there's also req.proto [15:11:17] where they're talking about setting bereq.proto (vcl_backend_response aka vcl_fetch), we've already talked to the backend and parsed its response headers [15:11:30] but the setting may affect interpretation of those headers when processing the body, I guess [15:12:00] bblack: I just run a 100 curls with a 3 second delay. On the 100 requests, I got 2 failures: https://phabricator.wikimedia.org/P3037 and https://phabricator.wikimedia.org/P3038 [15:14:51] ema: when I query wdqs1001 directly for style.css, it response with HTTP/1.1 and an explicit Content-Length [15:15:22] mmh, can varnish be silly and use 1.0 instead? 
[15:15:47] I tried that [15:15:56] if I do the request with HTTP/1.0, I still get the same HTTP/1.1 response [15:16:03] which is probably illegal, but also probably reasonable [15:16:11] :) [15:17:14] so some other things: [15:17:27] 1. in examples like P3037, we do see: < Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4 [15:17:27] P3037 (An Untitled Masterwork) - https://phabricator.wikimedia.org/P3037 [15:17:40] so varnish at least claims it's using http/1.1 for inter-cache and the app request [15:18:02] also, that paste shows triplicate Age: headers, which seems a little buggy [15:19:57] mmh, on cp1045 I see http/1.0 everywhere [15:20:00] maps.wm.o tile fetches with curl, regardless of cache-hit or cache-miss, seem to have only a single Age: header [15:20:05] $ sudo varnishlog -n frontend| egrep -i 'Host|BereqProto' [15:20:05] - BereqProtocol HTTP/1.0 [15:20:05] - BereqHeader Host: piwik.wikimedia.org [15:20:06] ema: where? [15:20:23] fascinating [15:20:45] shocking [15:21:06] presumably the leak of HTTP/1.0-ness starts at nginx, which uses 1.0 to query varnish-fe [15:21:47] of course if 1.0 carried all the way through the varnish stack, how do we ever get TE:chunked from a backend? [15:21:57] but of course, wdqs answers with HTTP/1.1 even on a 1.0 request... [15:22:22] right [15:22:43] I just don't know if setting bereq.proto is enough to really fix all the fallout that can cause [15:23:14] it would be better to fix nginx not to use 1.0, which I've been trying to do for two days but causes inexplicable varnish3 intermittent 503's on cache_text backends :P [15:23:45] with proxy_http_version 1.1? [15:23:49] yeah [15:23:53] mmh [15:24:14] https://phabricator.wikimedia.org/T107749#2284470 [15:24:27] ^ the latest meaningful analysis of that in isolation [15:25:41] I'll need a coffee for this one [15:25:45] and then I did experiment with the first two "We can try ..." ideas at the bottom of that comment. they didn't fix anything. [15:25:50] me too [15:26:05] I'm inclined when I get back, to try proxy_http_version 1.1 only on varnish4 clusters :) [15:30:59] I'm glad there are smarter people than me around! I can understand the analysis, but pretty sure I would have had no idea how to get there! [15:34:25] https://gerrit.wikimedia.org/r/288213 [15:36:31] ema: you ok with trying that as a next experiment, and seeing if it at least kills BereqProtocol HTTP/1.0 ? [15:38:05] I think this has to be about special backends that respond with 1.1 to 1.0 or other such things, too. or else by now people would be screaming about the higher-traffic sites on cache_misc being intermittently-broke [15:38:09] like phab [15:40:33] on that note, I tested phab's backend (iridium) directly [15:40:42] it's behavior seems to be (for requests to /): [15:40:53] bblack: +1 [15:41:07] HTTP/1.1 req -> HTTP/1.1 resp, TE:chunked, implicit default keepalive connection [15:41:47] HTTP/1.0 req -> HTTP/1.1 resp, non-chunked, no content-length, explicit Connection:close (which emulates, other than proto ver, what an HTTP/1.0 Content-Length-less transfer looks/works like) [15:42:30] which is of course totally unreliable, you can't tell end-of-content from "oops connection closed early", but that's why we made HTTP/1.1 (among other reasons) [15:45:18] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285494 (10Nuria) [15:47:38] uh, I just found one request with both 1.0 and 1.1?!? 
[15:47:49] * << BeReq >> 1704079 [15:47:49] - BereqProtocol HTTP/1.0 [15:47:49] - BereqProtocol HTTP/1.1 [15:48:21] nice! [15:48:33] but I think varnishlog always logs every change, not the final state [15:48:40] we see that with multiple header updates too [15:48:45] right [15:48:54] it's nice for tracing, but annoying too [15:49:45] 10Traffic, 06DC-Ops, 06Operations, 10hardware-requests, 10ops-esams: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#2285533 (10RobH) [15:51:51] so piwik also seems to respond with 1.1 to 1.0 requests [15:52:02] - BereqProtocol HTTP/1.0 [15:52:06] - BerespProtocol HTTP/1.1 [15:52:21] so 1.1 for nginx is applied/reloaded on cache_misc + cache_maps via varnish_version4 flag for all now... interesting things to note: 1) does it solve BereqProtocol strangeness? 2) Does that immediately fix these CL:0 issues? 3) If not, maybe after another cache wipe? 4) Does it cause some elevated background rate of 503s on misc or maps compared to before? [15:52:36] I have a meeting starting in 8 minutes, so semi-absent in a few [15:53:36] ok, let's see what BereqProtocol looks like now [15:53:46] 1.1 [15:53:57] oh also, if results seem inconsistent, may need to wait a few minutes [15:54:08] nginx does slow-reload where old child procs slowly die off as their persistent conns do... [15:54:33] on cp1045 all backend requests are using 1.1 at the moment [15:55:21] ok so that answers one question. at least under varnish4, nginx http/1.0 infects everything beneath and can probably really screw things up in varnish/appserver -land [15:56:49] getting frontend pass on https://query.wikidata.org/style.css [15:57:07] * bblack grumbles something about how if varnish had just fscking implemented TLS at both edges long ago, we wouldn't be in this mess and many others, probably.... :P [15:57:26] ema: I think the unexpected FE pass goes back to that stuff about TE:chunked and do_stream [15:57:51] 15:05 < bblack> (or in other words: if varnish do_stream a request and the backend response is not TE:chunked, the output response will be TE:chunked still) [15:57:59] 15:06 < bblack> whereas explicitly setting do_stream=false does the opposite: it will convert a backend response that is TE:chunked to a non-TE:chunked output [15:58:14] and then we have FE and middle-BE's always do_stream==true and passing on !Content-Length [15:58:24] right [15:58:28] and backmost-BE do_stream==false iff !Content-Length [15:58:35] gehel: can you try to repro if you're still around? [15:58:57] I'll run my script again and see if it crashes... [15:59:02] thanks [15:59:37] there might still be cache effects in play too, but I hope not (may have to roll through wiping down disk caches then frontends: eqiad bes, codfw bes, esams+ulsfo bes, then all fes. I wiped with salt batch-size 1 on stop varnish; rm -f /srv/sd*/varnish*; sync; sleep 1; start varnish, basically) [15:59:58] BBL [16:00:31] ema: reproducing at first try [16:01:30] on 22 tries, I get 5 errors, so it seems worse than before... [16:01:56] can you paste the output somewhere? [16:02:41] https://phabricator.wikimedia.org/P3039 <- just one of the failure [16:03:30] still has that strange triple Age header too [16:06:37] on 100 tries, I now get 53 failures. Not good... 
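For reference, a minimal sketch of the kind of loop behind counts like "2 errors in 100" / "53 in 100" above (the URL is one of the affected ones from this log, the 3-second delay matches gehel's description, and %{http_code}/%{size_download} are standard curl --write-out variables; a 200 with size 0 is the broken case):

```
URL='https://query.wikidata.org/style.css'

for i in $(seq 1 100); do
    # Record status code and body size per attempt, with a small delay between requests
    curl -s -o /dev/null -w '%{http_code} %{size_download}\n' "$URL"
    sleep 3
done | sort | uniq -c
```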
[16:06:55] should probably wipe caches then re-test [16:07:06] I'll start on that, I can multi-task for now :) [16:07:11] and I have the commands in up-arrow history [16:07:35] awesome :) [16:07:53] can you also share the commands if you have enough multi-tasking bandwidth? [16:07:55] bblack: thanks! [16:07:58] yes [16:08:14] it's not the kind of thing I would do on text/upload, but ok for cache_misc ;) [16:08:19] sure [16:08:33] root@neodymium:~# salt -v -t 5 -b 1 -C 'G@cluster:cache_misc and G@site:eqiad' cmd.run 'service varnish stop; rm -f /srv/sd*/varnish*; sync; sleep 3; service varnish start' [16:08:48] ; service varnishkafka start [16:08:51] root@neodymium:~# salt -v -t 5 -b 1 -C 'G@cluster:cache_misc and G@site:codfw' cmd.run 'service varnish stop; rm -f /srv/sd*/varnish*; sync; sleep 3; service varnish start' [16:09:23] root@neodymium:~# salt -v -t 5 -b 1 -C 'G@cluster:cache_misc and ( G@site:ulsfo or G@site:esams )' cmd.run 'service varnish stop; rm -f /srv/sd*/varnish*; sync; sleep 3; service varnish start' [16:09:31] (I think vk dies on frontend restart, not backend) [16:09:42] (so I'll amend the upcoming final command for that...) [16:09:48] great [16:10:54] root@neodymium:~# salt -v -t 5 -b 2 -G cluster:cache_misc cmd.run 'service varnish-frontend stop; sleep 2; service varnish-frontend start; puppet agent -t' [16:11:04] ^ puppet above should fix the 2x broken daemons [16:11:14] yep [16:11:31] it's going to slow down my command a lot though. oh well. [16:11:51] I was trying to be nice with the -b 2 before, could've done more, but it will just take longer. [16:13:34] would this be correct in order to ban wikidata's style.css? [16:13:35] salt -v -t 30 -C 'G@cluster:cache_misc and G@site:eqiad' cmd.run 'varnishadm ban req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"' [16:13:39] salt -v -t 30 -C 'G@cluster:cache_misc and not G@site:eqiad' cmd.run 'varnishadm ban req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"' [16:13:42] salt -v -t 30 -C 'G@cluster:cache_misc' cmd.run 'varnishadm -n frontend ban req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"' [16:15:13] not anymore, that wiki page is outdated, and fixing it for the general case is complicated [16:15:38] it's all about the backend ordering [16:15:54] for misc+text+upload, the backend ordering is now: eqiad, codfw, ulsfo+esams [16:16:04] for maps, it's: codfw, eqiad, ulsfo+esams [16:16:20] and all of those are subject to change based on cache::route_table's current contents at the time of the ban... [16:16:37] right so one would need to find the proper routing order for the given cluster and start from the backends, left to right I guess [16:17:05] yeah the first question is "which (single, unless we're in some transition test) DC is actually talking to the backend: purge that first" [16:17:31] but then also, we have 3-tier caching in some cases now, such as ulsfo->codfw->eqiad for text/upload/misc, and esams->eqiad->codfw for maps [16:17:47] which basically means "whichever of eqiad|codfw was the answer to the first question, purge the other one next before doing the last two" [16:18:29] cache_misc wipe finished [16:18:32] sounds like the task for a script! [16:18:35] still reproduces? [16:18:59] I'm hesitant to make banning easier. bans are dangerous and don't scale. they can stick around way longer than we'd expect... [16:20:17] I'm getting proper responses from wikidata [16:20:19] gehel: ? [16:21:12] ema: let's try again... 
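Putting bblack's ordering rules into one concrete sketch for cache_misc (hedged: the eqiad, then codfw, then esams/ulsfo backend order is the one stated above and changes with cache::route_table, so check that first; also note the whole ban expression has to be quoted inside cmd.run, otherwise the remote shell treats && as a command separator and splits the command there):

```
# Ban one object from cache_misc, walking from the backend-most tier outwards,
# then clearing the frontends last.
EXPR='req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"'

salt -v -t 30 -C 'G@cluster:cache_misc and G@site:eqiad' cmd.run "varnishadm 'ban ${EXPR}'"
salt -v -t 30 -C 'G@cluster:cache_misc and G@site:codfw' cmd.run "varnishadm 'ban ${EXPR}'"
salt -v -t 30 -C 'G@cluster:cache_misc and ( G@site:esams or G@site:ulsfo )' cmd.run "varnishadm 'ban ${EXPR}'"
salt -v -t 30 -C 'G@cluster:cache_misc' cmd.run "varnishadm -n frontend 'ban ${EXPR}'"
```

As bblack notes just above, bans are dangerous and don't scale, so treat this as an illustration of the tier ordering rather than routine tooling.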
[16:23:27] ema, bblack: not able to reproduce the issue yet... [16:23:41] * bblack refuses to get hopes up yet :P [16:24:26] and bblack is right! as always: https://phabricator.wikimedia.org/P3040 [16:24:33] again the 3x Age: [16:24:46] funny side note: cache_fsm.c in varnish4 does this unconditionally: [16:24:48] http_SetHeader(req->resp, "Via: 1.1 varnish-v4"); [16:24:54] so the Via 1.1 means nothing :P [16:24:54] yeah there you go [16:25:33] x-cache line for a proper wikidata response: [16:25:34] < x-cache:cp1061 hit(1), cp3008 hit(6), cp3007 frontend pass(0) [16:25:44] broken response: [16:25:45] < x-cache:cp1061 hit(1), cp3007 hit(1), cp3007 frontend miss(0) [16:27:03] I'll go afk for a bit [16:27:10] I've got only success so far, but with cp3008 frontend pass(0) at the end of them all [16:27:23] so I guess it's when the hit-for-pass object expires [16:27:50] yup, just hit the bug, and: [16:27:57] < X-Cache: cp1061 hit(5), cp3010 miss(0), cp3008 frontend miss(0) [16:28:04] < Content-Length: 0 [16:28:14] (all my reqs were forced to 3008 as FE) [16:31:33] so the differences there that I can think of on FE miss vs pass: [16:31:43] miss uses chashed backend, pass uses random [16:33:01] well and whether we're doing the initial set beresp.uncacheable on a fresh miss, or re-using the hit-for-pass object to pass (but then also probably still hitting the same beresp.uncacheable logic) [16:57:32] I looked through the vmod_vslp code we use for v4 chashing, just in case [16:57:51] I don't see anything there where (because of its role as a director) it actually interferes in backend request protocol/logic stuff [16:59:48] I'm guessing if we remove the beresp.uncacheable stuff there (which allows giant objects to thrash frontend caches, too, but not relevant to these little CSS tests) [17:00:02] it would make them all fail? [17:01:33] mmh [17:01:58] did you also manage to reproduce without forcing 3008 as FE? [17:09:23] I didn't try [17:10:13] OK so it looks like I'm getting 3 Age headers when there are 2 passes in x-cache [17:10:23] < Age: 16 [17:10:23] < Age: 16 [17:10:23] < Age: 16 [17:10:23] < X-Cache: cp1051 hit(1), cp3010 pass(0), cp3007 frontend pass(0) [17:10:23] yeah [17:10:39] and 2 with a single pass [17:10:45] < Age: 49 [17:10:45] < Age: 49 [17:10:45] < X-Cache: cp1061 hit(2), cp3008 hit(8), cp3007 frontend pass(0) [17:11:09] that's hopefully an unrelated and relatively harmless bug [17:11:52] FWIW, curl -sv https://config-master.wikimedia.org/ also shows 3xAge: [17:12:09] and it's one of the req hostnames for which we do explicit return (pass) on all requests [17:12:16] so it's not special to this CL/chunked stuff [17:13:29] https://gerrit.wikimedia.org/r/288231 is something we could try and maybe later revert (partially or fully), just to see what the effect is [17:15:49] yeah, we could also remove the v4 conditionals on misc eventually [17:16:10] yeah I almost prepped a patch for that earlier [17:16:21] then this stuff came up. if we can't fix it, we may have to revert [17:17:39] +1 for trying to get rid of CL stuff in the meantime [17:19:23] looking at varnish4 code, the Age: header is unconditionally set in cnt_deliver [17:19:44] in bin/varnishd/cache/cache_req_fsm.c [17:20:13] also interesting in the same function is: [17:20:13] http_ForceField(req->resp, HTTP_HDR_PROTO, "HTTP/1.1"); [17:20:30] so apparently *responses* are always forced to 1.1, whatever else is going on [17:20:54] which is I guess what the 1.1 in Via means too. 
It means via 1.1 response, not necessarily via 1.1 fetch [17:22:06] ema: yeah given all related things, I really wish we didn't have CL-sensitive magic at all [17:22:39] uh so responses are always 1.1 [17:22:44] the concern is that without it, some cache_misc service will send a 5GB file in a response, and it will force out thousands of objects in the tiny frontend mem cache. [17:23:05] (well, relatively-tiny) [17:23:34] and I'm pretty sure we have multiple cache_misc services that are all about downloading large data dumps [17:24:58] and even if we don't care for cache_misc (worst case: we identify the giant-file services and force explicit return (pass) on whole hostnames or request paths that can have giant files)... [17:25:09] cache_upload has similar concerns that can't be worked around so easily... [17:25:28] right [17:25:44] well let's see if CL stuff is actually the culprit first [17:26:23] I think it will change the behavior, I donno if it will really be a direct culprit. probably more like a bug-aggravator [17:29:51] ok so finished puppet and did some style.css checks: [17:30:13] multiple reqs to cp3008 came back fine, all as cache hits on the frontend: [17:30:16] < X-Cache: cp1061 hit(1), cp3008 miss(0), cp3008 frontend hit(7) [17:30:40] but then I restarted varnish-frontend on cp3008 to wipe the cache entry there, and: [17:30:42] < Content-Length: 0 [17:30:45] < Age: 233 [17:30:48] < X-Cache: cp1061 hit(1), cp3010 hit(2), cp3008 frontend miss(0) [17:31:10] works fine again on next request (FE cache hit) [17:31:22] so on fe misses we consistently get broken responses [17:31:32] oh wait I'm wrong, that next request that worked said pass [17:31:37] < X-Cache: cp1061 hit(2), cp3007 hit(1), cp3008 frontend pass(0) [17:31:54] OK, just as earlier on [17:32:16] and then: [17:32:16] < Content-Length: 0 [17:32:17] < Age: 74 [17:32:17] < X-Cache: cp1061 hit(3), cp3010 miss(0), cp3008 frontend miss(0) [17:32:26] and then: [17:32:26] < Content-Length: 0 [17:32:26] < Age: 79 [17:32:27] < X-Cache: cp1061 hit(3), cp3010 miss(0), cp3008 frontend hit(1) [17:32:41] and now it's stuck that way in FE cache: [17:32:42] < Content-Length: 0 [17:32:42] < Age: 83 [17:32:42] < X-Cache: cp1061 hit(3), cp3010 miss(0), cp3008 frontend hit(2) [17:32:59] maybe we still have broken CL:0 responses in some backends? [17:33:04] (cached, I mean) [17:33:31] it seems to be 3010 as the intermediary that caused CL:0 via cp3008 FE [17:33:56] will wipe caches again, I'd rather wait on that than play guessing games [17:34:10] agreed [17:37:18] I wonder why it passed that one time, too [17:37:28] maybe old-VCL threads stick around a little longer now [17:37:57] either that or some aspect of the previous miss response made it uncacheable (other than CL) [17:41:16] < Age: 0 [17:41:16] < X-Cache: cp1061 miss(0), cp3010 miss(0), cp3008 frontend miss(0) [17:41:24] < Content-Length: 6304 [17:41:31] interestingly, not streamed to be with chunked response [17:41:49] oh, so proper response on miss? 
[17:41:58] yeah on full 3-layer miss [17:42:30] and also after another fe restart, on single-layer miss too [17:42:30] < Age: 29 [17:42:31] < X-Cache: cp1061 miss(0), cp3010 hit(1), cp3008 frontend miss(0) [17:42:36] < Content-Length: 6304 [17:43:20] so either the setting of beresp.uncacheable during the miss was affecting this (and then sure we had lingering bad CL:0 cache entries) [17:43:31] vmod_vslp theory is thrown out [17:43:59] or it's the backend-most do_stream=false that was screwing things up intermittently for the next layer up and sometimes creating CL:0 cache entries [17:44:04] assuming this holds [17:44:25] one thing I've noticed is that we're setting uncacheable=true without setting beresp.ttl [17:44:45] yeah but it will just take it from the TTL it would've used if it hadn't been uncacheable [17:45:11] for style.css there's etag info but no CC or Expires [17:45:23] so I guess TTL would use default_ttl, which is small on cache_misc I think [17:45:42] 120s I think? [17:46:10] yeah [17:46:16] ok [17:50:28] it's not intermittent, I've forced 100 triple-cache-miss in a row (by appending ?foo=N for 1..100 to style.css URL) and no failure [17:50:45] should probably put back just the do_stream=false part and try that again [17:52:24] although in this case, surely that's not a factor? [17:52:50] direct http/1.1 queries to wdqs1001 for style.css return a content-length, not chunked [17:53:02] so it must be the other part [17:53:06] right, it must be [17:53:48] well, I guess it still could be that. whatever this is, IMHO it's a varnish bug, just one with strange triggers [17:54:37] on the other hand style.css doesn't trigger beresp.http.Content-Length ~ "^[0-9]{10}" as well [17:55:07] but it may trigger !Content-Length if varnish streams it chunked between layers [17:55:42] so right, can't be do_stream=false [17:57:07] unless Content-Length disappears between layers? [17:58:44] yeah but the do_stream=false thing was only in the bottom-most layer [17:58:57] true! [17:59:04] where's varnish4's default vcl btw? /etc/varnish/default.vcl doesn't have a copy anymore to look at [17:59:37] bin/varnishd/builtin.vcl [18:00:16] I guess /usr/share/doc/varnish/builtin.vcl [18:00:24] ah! [18:01:15] heh [18:01:41] I see they're already partially-honoring Surrogate-Control in there, but not all of it, and only in VCL not in the core code :P [18:01:50] fun fun for when we implement it [18:05:13] ok so https://gerrit.wikimedia.org/r/288242 - revert just the frontend bit (which should be the only bit style.css can trigger anyways, regardless of other questions) [18:05:39] of course that begs the question - how the hell did CL:0 infect a middle-backend in earlier theories [18:05:42] but still [18:07:32] mystery [18:08:29] I'm a bit worn out on mysteries this week. Give me something easy to fix [18:09:10] automatically starting vk after varnishd restarts? :) [18:10:13] nope! [18:11:31] no CL:0 reproduction [18:11:51] but the results are confusing. I see middle-layer pass() behavior, and the frontend alternates between hit/miss/pass [18:12:07] probably X-Cache doesn't work right in varnish4 [18:12:14] mmh [18:12:17] (our X-Cache VCL, I mean) [18:12:35] the flow through hit/miss/pass and such must be different [18:12:44] I had to validate that it worked a certain way in v3 [18:15:31] well I donno, http://book.varnish-software.com/4.0/chapters/VCL_Basics.html looks pretty straightforward [18:15:51] and close enough to v3 for X-Cache construction purposes [18:16:50] as in we set it in vcl_deliver -> done? 
[18:19:33] well as in the X-CDIS headers set in vcl_(hit|miss|pass) that that builds from [18:20:03] oh! [18:20:11] https://gerrit.wikimedia.org/r/288244 is an improvement after staring at the VCL 4 flow chart, but I don't think that's our problem here [18:22:59] time to go. See you tomorrow! [18:25:18] cya! [18:30:06] I'm probably leaving for a while soon too. this seems like at least a mostly-stable state AFAICS, but there's still a lot of mystery [18:30:48] I wonder if we can set a flag in the object when creating hit-for-pass [18:31:00] to differentiate it from pass [18:31:18] well, in general, it would be nice to know at vcl_pass() time whether it was due to return (pass) or hit-for-pass [18:32:48] there must be a way to tell, but barring any easy way, perhaps setting obj.something = hfp when creating the hit-for-pass will make its use detectable after [19:29:39] 10Traffic, 06Operations, 10Pybal: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2286453 (10chasemp) p:05Triage>03High [19:30:19] 10Traffic, 06Operations, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2286456 (10chasemp) p:05Triage>03Normal [19:30:30] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10Jonas) I can confirm I am still having read errors every few requests. ``` curl -v 'https://query.wikidata.org/ven... [19:30:38] 10Traffic, 06Operations, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2286459 (10chasemp) p:05Triage>03Normal [19:33:10] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2286465 (10chasemp) [19:34:25] 07HTTPS, 10Traffic, 06Operations: Secure connection failed when attempting to preview or save pages - https://phabricator.wikimedia.org/T134869#2286483 (10chasemp) p:05Triage>03High [19:53:04] bblack: do you have any idea how much traffic is affected by this no content issue? Is it only WDQS? [19:53:45] or all cache misc? Can I do anything to help? I don't know much about VCL, but let me know if there is anything I can do... [20:03:00] 10Traffic, 06Operations, 06Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286540 (10hashar) #traffic people would sure... [20:09:06] gehel: only cache_misc I'm pretty sure. Likely not only WDQS. [20:10:50] gehel: I think we have it worked-around for now. or worst-case, if it's still there we revert https://gerrit.wikimedia.org/r/288242 and it's probably then worked-around for now. [20:10:57] but a lot of things about this are still mysterious [20:12:17] I checked a few minutes ago, I had > 50% of requests in error... [20:12:25] lemme check again... [20:13:25] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2286611 (10Danny_B) I know... The deal is, that whatever task intended for SPDY will most probably be valid even for HTTP/2. [20:14:11] bblack: nope, still a lot of issues... 
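For tests like these that need to hit one specific frontend (bblack forces cp3008 earlier in the log), pinning the hostname with curl --resolve is one way to do it. A sketch that assumes you can resolve the cp host and reach its HTTPS port directly, e.g. from a production shell:

```
FE_IP=$(dig +short cp3008.esams.wmnet)

# Send the request to that particular frontend while keeping the real TLS name,
# and pull out the headers that matter for this bug.
curl -sv -o /dev/null \
     --resolve "query.wikidata.org:443:${FE_IP}" \
     https://query.wikidata.org/style.css 2>&1 |
  grep -iE '^< (x-cache|content-length|age):'
```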
[20:14:51] ok [20:15:03] it seemed fixed before that partial revert, but we don't know why [20:21:01] reverting the change mentioned above [20:21:08] will probably have to wipe caches again, too [20:21:11] but not sure! [20:21:45] I have flagged a task for #traffic that is related to a download of https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz [20:21:58] I guess that hit misc-varnish and is impacted by whatever you guys are currently dealing with [20:22:11] anyway task is https://phabricator.wikimedia.org/T135038 . Might offers you a reproducible test case [20:22:24] hashar: see other channel [20:22:34] yeah no worries about this one ;} [20:22:42] it's the same one [20:23:34] great ;-} [20:24:51] 10Traffic, 06Operations, 06Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286678 (10hashar) [20:24:55] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2286680 (10hashar) [20:25:30] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10hashar) From T135038: `curl -I https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz` Example res... [20:27:51] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2286694 (10Dzahn) So then same hostname but different port. [20:29:02] so, cache_misc wiped again after reverting the partial revert from earlier, if that makes any sense [20:29:18] please re-test for inducing CL:0 issues again, and I'll be back in an hour or so... [20:29:28] gehel: ^ [20:33:19] bblack: looks good so far ... zero errors on 50 requests... [20:38:28] bblack: 100 requests, zero errors. Look good for real. I'll test again a bit later... [21:39:42] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2287218 (10BBlack) Status update: we've been debugging this off and on all day. It's some kind of bug fallout from cache_misc... [21:42:20] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2287225 (10mmodell) @dzahn yes I believe so, port `22280` by default. See [[ https://secure.phabricator.com/book/phabricator/article/notifications/#terminating-ssl-... [21:56:01] maybe-relevant: https://github.com/varnishcache/varnish-cache/issues/1918 [21:58:13] and from there we reach: https://github.com/varnishcache/varnish-cache/commit/e142a199c53dd9331001cb29678602e726a35690 [21:58:25] bugfix to 4.1, in varnish master branch, 6 days ago [21:58:42] I'd be willing to bet this is related somehow. at least worth trying. [22:01:38] now that I look, the master branch of varnish has a shit-ton of bugfixes since 4.1.2 [22:01:42] about time they cut a release :P [22:11:05] I'm starting to audit all the post-4.1.2 work in varnish/master. it's a lot, and a lot of it's refactoring or new-code. 
but I want to see if there are any other applicable and critical-sounding fixes we might want to backport in the meantime [22:23:53] oh they've already picked over a bunch to the 4.1 branch too [22:24:01] not the one that sounds relevant though, it's pretty new [22:45:47] ema: (for tomorrow) - we should (IMHO) roll up a new varnish4 package, based on what we have now with some upstream (as in varnish, not debian) commits applied on top. [22:47:15] ema: my recommendation would be: start with merging in upstream's 4.1 branch HEAD, which has a few fixes that might be interesting for us, and should be ff-only ( https://github.com/varnishcache/varnish-cache/commits/4.1 ) [22:47:55] ema: and then cherry-pick in 4x bugfixes from master. I filtered through the ones that aren't yet in 4.1, sound relevant/important, and apply cleanly on top of 4.1. [22:49:00] ema: 3215a91 "Make sure hp->body is always initialized." a6c14b0 "Correctly handle EOF responses with HTTP/1.1" 480b9b8 "dont panic on pthread_cond_timedwait being interrupted" 8cc18b8 "check for overrun before looking at the next vsm record" [22:50:29] ema: (and maybe recommend/remind varnish team that those look like potential 4.1 bugfix cherrypicks) [22:51:28] ema: the first two seem potentially directly relevant to the mess today, the latter two look relatively simple, low-risk, correct, and a good idea [22:58:14] ema: heh, sorry, those commit hashes are from after I cherry-picked them in. the correct hashes in upstream master are: [22:59:10] ema: a50c99f6 , e142a199 , 0895d2e6 , 0895d2e6 [22:59:32] ema: heh, grrrr: a50c99f6 , e142a199 , 0895d2e6 , f91c6f90
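A rough sketch of the git side of that plan, under the assumption that the local packaging tree has the upstream varnish-cache repository available as a remote (the four hashes are the corrected upstream-master ones from the last line above):

```
# Bring in the current upstream 4.1 branch head (expected above to be a
# fast-forward-style update on the 4.1.2 base), then cherry-pick the four
# bugfixes that so far only exist on master.
git remote add upstream https://github.com/varnishcache/varnish-cache.git
git fetch upstream

git merge upstream/4.1

for c in a50c99f6 e142a199 0895d2e6 f91c6f90; do
    git cherry-pick -x "$c"   # -x records the upstream hash in the commit message
done
```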