[02:33:57] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283687 (10BBlack) @Yurik - ok. FYI, we separately cap 4xx lifetimes anyways. Currently at 1 minute on all clusters, although IMHO we could probabl... [02:35:47] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283690 (10Yurik) Thanks @bblack. Are there any headers I should add to the regurar 200 replies beyond setting maxage? [02:36:04] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2283693 (10Danny_B) For cross reference: - {T96848} - {T134817} [02:40:05] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10Danny_B) [02:42:14] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283715 (10BBlack) Well, that's a complex topic, but what you're doing now will work fine for now :) Someday we'll have a guideline document to poin... [03:09:19] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10BBlack) I'm not sure we need a whole separate tag for HTTP/2. [03:10:08] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2283745 (10BBlack) While there's a possibility this is an HTTP/2 issue, it's also possible that's a number of other things, including some generic bug in Firefox that's not protocol specific. [03:29:37] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10Krenair) I'm sceptical as well. What existing tasks would go into this project? [05:17:22] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283861 (10Danny_B) I think all new(ly used) technologies deserve tag since they may be buggy or cause issues (typically compatibility). For this cases we have i.e. #newphp Also, we have #https too... AT... [09:58:43] mmmh I never seem to be able to get geo-routed to ulsfo [09:59:00] even from hong kong and singapore I get to codfw [10:04:11] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2284310 (10Samtar) Hi @BBlack thanks again for the above - I did remember to turn this off, not keen to be surfing around without it! At work (Win 8.1 Pro + FF 46.0.1) the issue reappeared until changing to... [10:13:23] <_joe_> ema: I think bblack and robh did set ulsfo down yesterday [10:13:35] <_joe_> for robh's work in the dc [10:13:38] <_joe_> check the dns [10:13:45] <_joe_> if ulsfo is set down [10:14:01] mmmh, but I've seen traffic on text caches [10:14:53] you're right though, the last dns commit sets ulsfo down [10:16:44] perhaps the text traffic can be explained with broken dns caches [10:17:12] <_joe_> yes [10:17:42] thank you! 
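For quick manual checks like the geo-routing question above, something along these lines shows which site you are actually being routed to (a rough sketch; en.wikipedia.org is just an example hostname, and reading the DC out of the cp host numbers in X-Cache is an informal convention rather than a stable interface):

```
# Which LVS address does GeoDNS currently hand your resolver?
dig +short en.wikipedia.org

# Which cache hosts actually served the request? cp1xxx/cp2xxx/cp3xxx/cp4xxx
# roughly correspond to eqiad/codfw/esams/ulsfo.
curl -sI https://en.wikipedia.org/ | grep -i '^x-cache'
```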
[10:19:05] ulsfo misc looked way too silent :) [11:15:40] bblack, ema: I'm setting kernel.unprivileged_bpf_disabled=1 on the Linux 4.4 systems as a hardening measure (https://lwn.net/Articles/660331/ for background, triggered by the recent https://bugs.chromium.org/p/project-zero/issues/detail?id=808) [11:16:05] I'll start with two cp* and lvs* systems initially and proceed with the rest if no problems arise (which I don't expect) [11:16:20] but better safe than sorry, since that sysctl value can only be reset with a reboot [11:16:44] unprivileged bpf is really new after all [11:17:38] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284392 (10BBlack) #HTTPS is a little different: that transition is a huge, long-term effort spanning years. HTTP/2 is only a slight change from the SPDY/3 that we supported for quite a while before it.... [11:18:29] any preferences on those canary systems so that I don't mess with ongoing maintenance work? [11:18:48] (once that is rolled out initially, I'll also puppetise it) [11:27:19] 07HTTPS, 10Traffic, 06Operations: Secure connection failed - https://phabricator.wikimedia.org/T134869#2284408 (10BBlack) We may have been seeing the same bug here as https://bugzilla.mozilla.org/show_bug.cgi?id=1271301 - you might want to see if the test in that bug work from FF 47 but not 46 as well. Ther... [11:27:44] moritzm: when it comes to the v4 upgrade, we've done misc esams and ulsfo [11:29:08] codfw and eqiad are next, so perhaps it would be better not to use misc codfw/eqiad as canaries [11:30:04] hit cp1065 and cp3038 if you want a pair of canaries to be extra safe [11:30:11] that's one text in eqiad and one upload in esams [11:30:26] I think it will be fine to set the sysctl everywhere though, seems low risk [11:30:58] oh you need ones already on 4.4 to test with, though :) [11:31:55] yesterday I've rebooted cp1065 and cp1066, so those are running 4.4 for sure [11:32:18] are they running the newest 4.4 now? I think moritzm uploaded a new one just last week [11:32:50] I haven't upgraded anything, just rebooted them. 4.4.0-1-amd64 [11:33:00] bblack: that one is currently building on copper (I cherrypicked an additional fix for one security bug not in 4.4.9) [11:33:12] ok [11:33:16] {'cp3038.esams.wmnet': '4.4.0-1-amd64 #1 SMP Debian 4.4.2-3+wmf1 (2016-02-29)'} [11:33:20] so that when ema starts upgrading we'll have the latest one upfront [11:33:34] {'cp1065.eqiad.wmnet': '4.4.0-1-amd64 #1 SMP Debian 4.4.2-3+wmf1 (2016-02-29)'} [11:33:44] so yeah, the ones I said earlier, use those two for canaries :) [11:34:18] ok, so I'll use cp3038 and cp1065, as said I don't expect any risk, but better play safe [11:34:25] yup [11:34:47] and I'll pick lvs2001 and lvs4001 in addition [11:35:29] back on the general topic of kernel updates, with the disk startup race gone it should be easy at most sites [11:35:56] but we still have the strange ipmi_si power issue in esams. I think it's still going to bite us, and fixing it may involve doing some remote BMC upgrades [11:36:53] there's only at most 20 affected machines (and I'm not sure they're all affected) - cp30[34]x [11:38:05] lacking a real fix (probably BMC update): they may not reboot when software/kernel tells the machine to do so, without racadm intervention for powercycle. 
and the kernel may not boot successfully afterwards without editing grub cmdline entry to blacklist ipmi_si [11:38:39] it's also possible the "fix" is just to fully reset the BMC by pulling out the power cables of the machines, but there's nobody there to do it :) [11:40:40] https://phabricator.wikimedia.org/T126062 [11:43:21] I'm not sure if that's the same issue, but when I mass-rebooted systems for the keyctl bug, we had at least 20 of those in codfw as well (except sympton 4). Papaul upgraded the idrac firmware and all further reboot tests with a depooled server were fine [11:45:00] 10Traffic, 10Graph, 10Graphoid, 06Operations, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2284438 (10Yurik) 05Open>03Resolved a:03Yurik ok, will close for now, thx for your help! [11:45:20] but the lack of dedicated ops-esams also makes that hard.. [11:47:52] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284449 (10Danny_B) OK, another: {T117682} There is bunch of tasks regarding SPDY, so maybe #HTTP2-SPDY then? [11:58:33] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284470 (10BBlack) So, recapping the strange evidence above, because I've been thinking about this off and on all night and it still makes no sense: * The only change in... [12:02:56] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284474 (10BBlack) SPDY is dead. It was an experimental protocol that predated HTTP/2, and HTTP/2 was derived from it. We dropped support for SPDY when we turned on HTTP/2 last week. Chrome's dropping... [12:18:28] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284589 (10BBlack) Trying a patch that will test the first two ideas above... [12:55:21] 10Traffic, 10Varnish, 06Operations: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2284895 (10ema) [12:55:24] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2284894 (10ema) [12:55:28] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2284896 (10ema) [12:55:31] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2284893 (10ema) 05Open>03Resolved [12:56:42] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284900 (10BBlack) I'm out of good ideas for now. I've tried both of the obvious alternatives (1.1 w/o explicit Connection setting, and 1.1 w/o explicit Connection setti... [13:05:04] bblack: curiosity about ---^ and http format error - is Varnish getting garbage from the backend (mediawiki I suppose) or is it considering for some reason a legitimate response as garbage? [13:05:21] the third option might be that I need to re-read the phab task [13:05:56] it's a pretty complicated jumble of things, yeah [13:06:34] because those 503s are rare in the stream of traffic, and varnish doesn't actually log a dump of the response it doesn't like... correlating that to e.g. 
sniff of traffic is hard [13:06:45] I don't yet have a sample of the response it rejected [13:07:47] I think it's probably true that varnish is getting garbage from MW/Apache in these cases. The question is: what kind of garbage, and why does the distant action of switching protocol versions for nginx->varnish-fe cause this way down at varnish_be->mw/apache ? [13:08:17] the next crazy theory on my list that I'm trying to run down now is that this is related to ETag/If-None-Match sorts of things [13:08:41] but it's a pretty unlikely theory [13:09:54] and since this behavior is not all that un-similar from what happened when we tried keepalives with 1.9.4 (so apparently those issues weren't really about keepalives, but about turning on 1.1 as a prerequisite for keepalives)... [13:10:10] if this is some pure nginx bug when using 1.1, it's been there for a long time and I don't see other reports, so that's also unlikely. [13:10:21] * elukey nods [13:10:38] I think any way we slice this though, if we find the bug it's going to be something that was incredibly unlikely [13:12:09] it's sort of like on House (if you remember that show), it seems like every episode there was this conversational exchange something like "That can't be it, the odds of that are 1 in millions" and someone else would say "yeah but if it was a common problem, we wouldn't even be dealing with it, it would already be solved" [13:15:45] misc on varnish 4 I see? [13:15:48] congrats :) [13:16:07] \o/ [13:17:55] \o/ [13:18:44] this time ganglia didn't explode, I've chmodded the VSM files immediately after upgrading [13:18:55] ema: one thing that's puzzling me.... [13:19:06] would be awesome to find a neat way to add ganglia to the varnish group in puppet [13:19:21] well hold that thought [13:19:29] I never chmodded VSM files for cache_maps servers I don't think heh [13:19:42] uh? [13:20:46] when I upgraded the 16x new maps servers to varnish4, all I did was: https://phabricator.wikimedia.org/T126206#2237213 [13:21:30] https://ganglia.wikimedia.org/latest/?c=Maps%20caches%20ulsfo&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [13:21:36] root@cp3004:~# ls -l /var/lib/varnish/*/*.vsm [13:21:36] -rw-r----- 1 root varnish 84934656 Apr 25 22:30 /var/lib/varnish/cp3004/_.vsm [13:21:38] -rw-r----- 1 root varnish 84934656 Apr 25 22:38 /var/lib/varnish/frontend/_.vsm [13:21:54] explains a lot :) [13:22:00] :) [13:22:02] so what's the chmod command again? 
[13:22:17] I think when I saw the ganglia mess, I just said "ganglia sucks as usual" again :P [13:22:47] I just did 644, which are the perms v3 was using [13:22:55] ok [13:23:28] fixing [13:24:58] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2285011 (10BBlack) One more thing to remember when upgrading nodes above: ``` chmod 644 /var/lib/varnish/*/*.vsm service ganglia-monitor restart ``` [13:25:09] my theory is that if gmond sends screwed up stuff to the aggregator 'for a while', things get messed up and a gmond restart on the host is not enough, the relevant aggregator also needs to be restarted [13:25:23] maybe [13:25:30] they seem ok so far in ulsfo [13:25:34] nice [13:25:48] anyways back on my puzzling thing [13:26:03] in misc frontend VCL we have: [13:26:05] sub cluster_fe_backend_response { [13:26:05] // hit_for_pass on objects >= ~10MB or no CL [13:26:05] if (!beresp.http.Content-Length || beresp.http.Content-Length ~ "^[0-9]{8}") { [13:26:08] <% if @varnish_version4 -%> [13:26:10] set beresp.uncacheable = true; [13:26:13] return (deliver); [13:26:15] <% else -%> [13:26:18] return (hit_for_pass); [13:26:20] <% end -%> [13:26:23] } [13:26:48] right, because of https://www.varnish-cache.org/docs/4.0/whats-new/upgrading.html#hit-for-pass-objects-are-created-using-beresp-uncacheable [13:26:57] and we have this WDQS test URL: https://query.wikidata.org/sparql?query=SELECT%20%3Fqid%20%3Fimdb%20WHERE%20%7B%3Fqid%20wdt%3AP345%20%3Fimdb.%20FILTER%20regex%28str%28%3Fimdb%29%2C%20%22nm%7Ctt%22%29%7D%20ORDER%20BY%20%3Fqid [13:27:07] which is a ~70MB object [13:27:31] when I test it now, I get: [13:27:32] < Content-Length: 76518133 [13:27:45] < X-Cache: cp1045 miss(0), cp2025 miss(0), cp2025 frontend hit(2) [13:28:16] which means either that block doesn't really work right, or our X-Cache stuff doesn't work the same in V4 as it did in V3 (but I think it's ok) [13:29:02] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10Gehel) [13:30:07] gehel: the misc caches were just upgraded, and there was also nginx experimentation going on at the same time [13:30:24] you may want to re-test in a few minutes, and also see above that's mildly related [13:31:23] * gehel is reading back... [13:31:53] bblack: good to know that you are already fixing this! I could not understand what was going on and was getting ready to ask for your help... [13:33:53] the stuff above isn't directly related [13:34:04] but still, the test is just before/during various cache_misc upgrades [13:34:23] the response I get for your test URL has Content-Length:0 currently and is a cache miss [13:34:26] err sorry, cache hit [13:34:44] I see wdqs1001 returns a correct result [13:35:49] trying to ban it and see what happens [13:37:45] test URL works now [13:38:05] probably some temporary fallout from nginx experimentation and/or the upgrade process [13:39:35] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10BBlack) Assuming there was no transient issue (which became cached) on the wdqs end of things, then t... [13:41:27] ema: new 4.4.9 kernel is built and working fine in my test instance, will upload to carbon next [13:43:27] thanks! 
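A sketch of how to double-check what that size-based hit-for-pass logic actually sees for the WDQS test URL above (hedged: plain curl reports the post-gunzip size, which is not necessarily the Content-Length the varnish layers exchange for a compressible object):

```
# The ~70MB WDQS test query quoted above
URL='https://query.wikidata.org/sparql?query=SELECT%20%3Fqid%20%3Fimdb%20WHERE%20%7B%3Fqid%20wdt%3AP345%20%3Fimdb.%20FILTER%20regex%28str%28%3Fimdb%29%2C%20%22nm%7Ctt%22%29%7D%20ORDER%20BY%20%3Fqid'

# Cache disposition and the Content-Length actually returned to the client:
curl -s -o /dev/null -D - "$URL" | grep -iE '^(x-cache|content-length|age):'

# Body size on the wire, without and with gzip:
curl -s -o /dev/null -w 'plain: %{size_download} bytes\n' "$URL"
curl -s -o /dev/null -H 'Accept-Encoding: gzip' -w 'gzip:  %{size_download} bytes\n' "$URL"
```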
[13:47:34] ema: ok I found my mystery puzzle answer [13:47:43] Content-Length: 3613019 [13:47:54] ^ is the gzipped content-length being passed around between varnishes, and what counts for that check [13:48:03] oh, and that's 7 digits [13:48:06] the ~70MB size is after it gunzips for curl w/o --compressed of course [13:48:27] since we store it in the cache in gzipped form too, applying the size limits that way makes sense anyways [13:48:47] I was trying to reproduce the situation in VTC but got stuck at '---- c1 0.7 Assert error in http_rxchar(), vtc_http.c line 406: Condition(hp->prxbuf + n < hp->nrxbuf) not true.' [13:48:52] just have to remember that uncompressed sizes for compressible things may not seem to behave intuitively [13:50:12] bblack: was this different on v3? [13:50:18] probably not [13:51:26] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285161 (10Gehel) p:05Triage>03High a:03Gehel [13:54:15] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285184 (10Gehel) I can't reproduce the issue. Since it was random, I might just be lucky. Closing this anyway f... [13:54:26] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285185 (10Gehel) 05Open>03Resolved [13:56:32] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2285204 (10ema) 05Open>03Resolved a:03ema varnishstatsd has been behaving properly for a while now, logging a me... [14:02:07] ema: btw I saw some varnishrls Flush message spam the other day, did we ever fix that up? [14:05:50] bblack: oh, right! I forgot [14:06:47] what I'm not sure about is whether it makes sense to call flush_stat *that* often [14:07:41] it pops items from a dict and sends them to statsd, so actually it should be called only once I think [14:09:35] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285279 (10Gehel) 05Resolved>03Open [14:12:02] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285284 (10Gehel) Seems the issue (or a similar one is still happening), I'm trying to reproduce... [14:16:30] bblack: testing on T134989, I see a bunch of other URLs returning empty content. I have identified a few CSS and JS files. What magic did you do to fix it before? [14:16:30] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [14:18:31] gehel: WDQS URLs? [14:19:32] bblack: not sure I have the complete list yet [14:20:00] well, the "bunch of other URLs returning empty content" - are they all WDQS URLs so far, or are you talking about other URLs on unrelated services/clusters? 
[14:20:16] because I didn't think WDQS did much css/js [14:21:31] bblack: there is a UI, so yes it does a bit of css / js [14:21:35] https://www.irccloud.com/pastebin/TLX4nZQF/ [14:21:42] bblack: I've got a similar report for https://stats.wikimedia.org/EN/WikipediaStatistics15.js [14:22:09] elukey: Thanks! I feel less alone! [14:22:52] bblack: I'm not really sure how to find an exhaustive list. The list above was me poking around and looking for broken things, but I'm pretty sure there is more... [14:23:17] I am trying to curl the URL and it is intermittent, sometimes I get the payload and sometimes not [14:23:43] if the solution was clearing the varnish cache, for WDQS we can probably get away with clearing everything (low enough traffic). If it is more general, probably not. [14:24:02] elukey: yep, sounds just like what I saw... [14:24:34] I already cleared all of WDQS back when I fixed that one URL [14:24:47] so if it's still happening with WDQS URLs, it's not fixed by clearing the cache [14:24:58] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: varnishmedia: repeated calls to flush_stats() - https://phabricator.wikimedia.org/T132474#2285323 (10ema) If flush_stats is not 'fast enough' sending stats to statsd, it happens that other transactions call it again, thus leading to the 'Flush' messag... [14:25:47] gehel: are you getting consistent failure on your pastebin list, or intermittent, or? are they all cache hits and still intermittent? from the cache frontend cache? [14:26:10] I've tried elukey's stats link and it's ok for me in repeated tests so far [14:26:50] although I still think something's off with our size-based pass logic [14:26:54] but that shouldn't cause empty output [14:27:01] bblack: it is intermitent and actually fairly rare, probably 1 in 10 or 1 in 20 is failing [14:27:38] bblack: I don't have more analysis yet. Let me script this a bit ad see if I can get something useful [14:28:59] bblack: I can repro now - https://phabricator.wikimedia.org/P3033 [14:31:13] and another one : https://phabricator.wikimedia.org/P3034 [14:32:34] I'm thinking this is esams-specific for some reason. I can only repro when I switch to the frontend servers you guys hit there [14:32:49] looks like WikipediaStatistics15.js got cached as empty on cp3007 [14:33:09] yeah, and it's recent too, because I banned all objects with host==query.wikidata.org earlier [14:33:22] (after all upgrades done) [14:33:40] bblack: newbie question, but how do I tell if it was served from cache? I guess that Age header < s-max-age is a good indication, but there is probably something more direct [14:33:43] so, generally speaking cache_misc caching isn't all that critical [14:33:48] gehel: X-Cache [14:33:59] if it says "hit" anywhere in that line, it was a cache hit at some layer [14:34:10] anyways [14:34:16] stop hitting test URLs for a few [14:34:34] * gehel stopping [14:34:34] let's wipe the caches in case there's some strange problem related to the cluster having been half-upgraded over the past ~24H [14:34:41] and then we can try some of these again [14:35:28] note that cp3007 is the machine I upgraded yesterday [14:36:01] so it's been running v4 for ~1 day while the rest of the cluster was running v3 [14:37:17] the wipe will take a few mins... [14:42:58] ok caches wiped, try again! 
[14:43:18] I'm noting a few things wrong with our pass on content-length stuff, but again that shouldn't cause broken objects, just inefficient caching [14:44:02] 1) layer-differential pass and using varnish-be-rand for passes doesn't interact well. if we're only passing for !Content-Length or CL>X, we should still be using chash [14:44:34] uh varnishkafka died on a bunch of hosts [14:44:41] 2) apparently it's not true that do_stream=false in the backend-most cache gives Content-Length to upper cache layers for applayer chunked responses, because they're all passing on those except for the backend-most [14:44:52] s/a bunch/all the misc/ [14:44:58] ema: yeah I think it has crashed on varnishd restart before [14:45:04] ema: cleanly or screaming horrible things? [14:45:24] it exited with 0 [14:45:31] but screaming things :) [14:45:42] example please :) [14:45:50] or maybe host [14:45:55] bblack: I'm trying again, can't reproduce the issue yet... [14:45:59] going to puppet them all to get stats daemons going again [14:46:22] elukey: cp1045 for example [14:47:51] ah a lot of kafka horrible messages! This might be due to the fact that we upgraded now to 0.9 [14:48:16] more precisely: the cluster is still running with 0.9 with 0.8 compatibility mode settings [14:49:10] varnishkafka died?! [14:49:29] bblack: issue appearing again (no good trace, I was trying from Chrome) [14:49:31] ottomata: it doesn't like it when we restart varnish apparently [14:49:34] i was watching vk logs on one host and i saw it notice when i restarted hosts, and then it seemed fine [14:49:37] oh varnish [14:49:39] hm [14:49:52] hm [14:49:57] hm [14:50:05] so, related to kafka cluster restarts/upgrade? [14:50:07] or not? [14:50:08] hm [14:50:23] not [14:50:42] we restarted varnishd processes on all cache_misc, and varnishkafka crashes when you do that, at least lately [14:50:59] varnishreqstats also crashed [14:51:04] yeah [14:51:25] on 1045 I can see [14:51:26] May 11 14:24:56 cp1045 varnishkafka[23054]: KAFKAERR: Kafka error (-195): kafka1020.eqiad.wmnet:9092/bootstrap: Failed to connect to broker at kafka1020.eqiad.wmnet:9092: Connection refused [14:51:31] gehel: elukey: still no reproduction since wipe? [14:51:44] yes, I did reproduce... [14:51:44] yep! [14:51:51] yep == didn't repro [14:51:59] gehel: details? [14:52:57] ema: did you see something different? because it does look like related to the kafka upgrade rather than the restart itself (I mean, the restart might have triggered this behavior because of the kafka cluster change at the same time) [14:53:00] I was doing the test from chrome. Bunch of static resources empty. Cache hits for all [14:53:31] reproduce it in curl! :) [14:54:19] elukey: well the kafka errors happened at 14:24, vk exited at 14:41 [14:54:40] bblack: just did ! [14:54:50] so I guess it's two different issues: vk screaming because of the kafka upgrade and vk exploding when varnish gets restarted [14:55:05] https://phabricator.wikimedia.org/P3036 [14:55:24] ema: true, didn't notice the diff with the TERM [14:56:00] hm ok [14:56:09] elukey: we should fix both problems though :) [14:56:53] ema: the kafka one might be only garbage because the producer code does emit some errors when the brokers change [14:58:02] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285393 (10Ottomata) All analytics brokers are now upgraded to confluent 0.9! 
Tomorrow we will switch off 0.8 inter broker... [14:58:04] so that one might be librdkafka emitting some "oh noes!" [14:58:14] yeah, that's ok [14:58:16] gehel: even using your frontend (cp3008), I can't get a repro [14:58:38] that's just it noticing that the broker layout has changed, immediately after it issues a metadata request, figures out the new layout, and starts using it [14:58:45] bblack,ema: also a note about the kafka upgrade - https://grafana.wikimedia.org/dashboard/db/kafka?panelId=32&fullscreen [14:58:49] but: I really think we should solve the pass-consistency issues first anyways, because they make testing more confusing/inconsistent [14:59:04] bblack: it seems to be fairly rare... but it still happens. I have no idea how to reproduce, except trying again and again... [15:01:06] awesome now I have to tell my curl to use http/1.1 otherwise it defaults to h2 :) [15:01:24] so, the notion we had that do_stream==false would get us a content-length that would pass upwards seems to be false. I think it only passes upwards one layer [15:01:38] and, the great surprise: can repro gehel's issue with http1.1, not with h2 [15:02:06] no, sorry about that, it was just a coincidence [15:02:06] ema: consistently? [15:02:09] ok [15:02:14] let's stop supporting http1.1, h2 is so much better anyway :P [15:02:26] ema: what X-Cache line on the repro hit? [15:03:11] bblack: I didn't copied it, got too excited about the h2 thing [15:03:16] ok :) [15:03:57] so basically, I think do_stream possibly converts a response with content-length to one without it [15:04:05] (which is on by default in v4) [15:05:03] (or in other words: if varnish do_stream a request and the backend response is not TE:chunked, the output response will be TE:chunked still) [15:05:18] that's my working theory anyways [15:05:49] does wikidata speak http/1.0 by any chance? :) http://www.gossamer-threads.com/lists/varnish/misc/38048 [15:06:01] whereas explicitly setting do_stream=false does the opposite: it will convert a backend response that is TE:chunked to a non-TE:chunked output [15:07:11] heh [15:07:11] sub vcl_backend_fetch { [15:07:12] set bereq.proto = "HTTP/1.1"; [15:07:12] } [15:07:16] is that it? [15:07:20] maybe [15:07:36] can try a patch anyways [15:07:49] if it works on misc, should probably promote that to if varnish_version4 in common VCL [15:08:12] does that even exist in v3? setting bereq.proto? [15:08:45] no idea [15:09:05] apparently the varialbe exists in v3 [15:09:29] this could (just to increasing how confusing and complex my week is!) also be relevant to the strange nginx http 1.0 vs 1.1 issues too [15:09:42] s/increasing/increase/ [15:09:45] :) [15:09:56] should we add this to misc-backend.inc.vcl.erb? [15:09:56] hmmmm [15:10:05] well [15:10:09] I'm not sure yet [15:10:19] or perhaps it's needed also between varnishes [15:10:22] there's also req.proto [15:11:17] where they're talking about setting bereq.proto (vcl_backend_response aka vcl_fetch), we've already talked to the backend and parsed its response headers [15:11:30] but the setting may affect interpretation of those headers when processing the body, I guess [15:12:00] bblack: I just run a 100 curls with a 3 second delay. On the 100 requests, I got 2 failures: https://phabricator.wikimedia.org/P3037 and https://phabricator.wikimedia.org/P3038 [15:14:51] ema: when I query wdqs1001 directly for style.css, it response with HTTP/1.1 and an explicit Content-Length [15:15:22] mmh, can varnish be silly and use 1.0 instead? 
[15:15:47] I tried that [15:15:56] if I do the request with HTTP/1.0, I still get the same HTTP/1.1 response [15:16:03] which is probably illegal, but also probably reasonable [15:16:11] :) [15:17:14] so some other things: [15:17:27] 1. in examples like P3037, we do see: < Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4 [15:17:27] P3037 (An Untitled Masterwork) - https://phabricator.wikimedia.org/P3037 [15:17:40] so varnish at least claims it's using http/1.1 for inter-cache and the app request [15:18:02] also, that paste shows triplicate Age: headers, which seems a little buggy [15:19:57] mmh, on cp1045 I see http/1.0 everywhere [15:20:00] maps.wm.o tile fetches with curl, regardless of cache-hit or cache-miss, seem to have only a single Age: header [15:20:05] $ sudo varnishlog -n frontend| egrep -i 'Host|BereqProto' [15:20:05] - BereqProtocol HTTP/1.0 [15:20:05] - BereqHeader Host: piwik.wikimedia.org [15:20:06] ema: where? [15:20:23] fascinating [15:20:45] shocking [15:21:06] presumably the leak of HTTP/1.0-ness starts at nginx, which uses 1.0 to query varnish-fe [15:21:47] of course if 1.0 carried all the way through the varnish stack, how do we ever get TE:chunked from a backend? [15:21:57] but of course, wdqs answers with HTTP/1.1 even on a 1.0 request... [15:22:22] right [15:22:43] I just don't know if setting bereq.proto is enough to really fix all the fallout that can cause [15:23:14] it would be better to fix nginx not to use 1.0, which I've been trying to do for two days but causes inexplicable varnish3 intermittent 503's on cache_text backends :P [15:23:45] with proxy_http_version 1.1? [15:23:49] yeah [15:23:53] mmh [15:24:14] https://phabricator.wikimedia.org/T107749#2284470 [15:24:27] ^ the latest meaningful analysis of that in isolation [15:25:41] I'll need a coffee for this one [15:25:45] and then I did experiment with the first two "We can try ..." ideas at the bottom of that comment. they didn't fix anything. [15:25:50] me too [15:26:05] I'm inclined when I get back, to try proxy_http_version 1.1 only on varnish4 clusters :) [15:30:59] I'm glad there are smarter people than me around! I can understand the analysis, but pretty sure I would have had no idea how to get there! [15:34:25] https://gerrit.wikimedia.org/r/288213 [15:36:31] ema: you ok with trying that as a next experiment, and seeing if it at least kills BereqProtocol HTTP/1.0 ? [15:38:05] I think this has to be about special backends that respond with 1.1 to 1.0 or other such things, too. or else by now people would be screaming about the higher-traffic sites on cache_misc being intermittently-broke [15:38:09] like phab [15:40:33] on that note, I tested phab's backend (iridium) directly [15:40:42] it's behavior seems to be (for requests to /): [15:40:53] bblack: +1 [15:41:07] HTTP/1.1 req -> HTTP/1.1 resp, TE:chunked, implicit default keepalive connection [15:41:47] HTTP/1.0 req -> HTTP/1.1 resp, non-chunked, no content-length, explicit Connection:close (which emulates, other than proto ver, what an HTTP/1.0 Content-Length-less transfer looks/works like) [15:42:30] which is of course totally unreliable, you can't tell end-of-content from "oops connection closed early", but that's why we made HTTP/1.1 (among other reasons) [15:45:18] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285494 (10Nuria) [15:47:38] uh, I just found one request with both 1.0 and 1.1?!? 
[15:47:49] * << BeReq >> 1704079 [15:47:49] - BereqProtocol HTTP/1.0 [15:47:49] - BereqProtocol HTTP/1.1 [15:48:21] nice! [15:48:33] but I think varnishlog always logs every change, not the final state [15:48:40] we see that with multiple header updates too [15:48:45] right [15:48:54] it's nice for tracing, but annoying too [15:49:45] 10Traffic, 06DC-Ops, 06Operations, 10hardware-requests, 10ops-esams: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#2285533 (10RobH) [15:51:51] so piwik also seems to respond with 1.1 to 1.0 requests [15:52:02] - BereqProtocol HTTP/1.0 [15:52:06] - BerespProtocol HTTP/1.1 [15:52:21] so 1.1 for nginx is applied/reloaded on cache_misc + cache_maps via varnish_version4 flag for all now... interesting things to note: 1) does it solve BereqProtocol strangeness? 2) Does that immediately fix these CL:0 issues? 3) If not, maybe after another cache wipe? 4) Does it cause some elevated background rate of 503s on misc or maps compared to before? [15:52:36] I have a meeting starting in 8 minutes, so semi-absent in a few [15:53:36] ok, let's see what BereqProtocol looks like now [15:53:46] 1.1 [15:53:57] oh also, if results seem inconsistent, may need to wait a few minutes [15:54:08] nginx does slow-reload where old child procs slowly die off as their persistent conns do... [15:54:33] on cp1045 all backend requests are using 1.1 at the moment [15:55:21] ok so that answers one question. at least under varnish4, nginx http/1.0 infects everything beneath and can probably really screw things up in varnish/appserver -land [15:56:49] getting frontend pass on https://query.wikidata.org/style.css [15:57:07] * bblack grumbles something about how if varnish had just fscking implemented TLS at both edges long ago, we wouldn't be in this mess and many others, probably.... :P [15:57:26] ema: I think the unexpected FE pass goes back to that stuff about TE:chunked and do_stream [15:57:51] 15:05 < bblack> (or in other words: if varnish do_stream a request and the backend response is not TE:chunked, the output response will be TE:chunked still) [15:57:59] 15:06 < bblack> whereas explicitly setting do_stream=false does the opposite: it will convert a backend response that is TE:chunked to a non-TE:chunked output [15:58:14] and then we have FE and middle-BE's always do_stream==true and passing on !Content-Length [15:58:24] right [15:58:28] and backmost-BE do_stream==false iff !Content-Length [15:58:35] gehel: can you try to repro if you're still around? [15:58:57] I'll run my script again and see if it crashes... [15:59:02] thanks [15:59:37] there might still be cache effects in play too, but I hope not (may have to roll through wiping down disk caches then frontends: eqiad bes, codfw bes, esams+ulsfo bes, then all fes. I wiped with salt batch-size 1 on stop varnish; rm -f /srv/sd*/varnish*; sync; sleep 1; start varnish, basically) [15:59:58] BBL [16:00:31] ema: reproducing at first try [16:01:30] on 22 tries, I get 5 errors, so it seems worse than before... [16:01:56] can you paste the output somewhere? [16:02:41] https://phabricator.wikimedia.org/P3039 <- just one of the failure [16:03:30] still has that strange triple Age header too [16:06:37] on 100 tries, I now get 53 failures. Not good... 
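For reference, a minimal sketch of the kind of loop behind counts like "2 errors in 100" / "53 in 100" above (the URL is one of the affected ones from this log, the 3-second delay matches gehel's description, and %{http_code}/%{size_download} are standard curl --write-out variables; a 200 with size 0 is the broken case):

```
URL='https://query.wikidata.org/style.css'

for i in $(seq 1 100); do
    # Record status code and body size per attempt, with a small delay between requests
    curl -s -o /dev/null -w '%{http_code} %{size_download}\n' "$URL"
    sleep 3
done | sort | uniq -c
```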
[16:06:55] should probably wipe caches then re-test [16:07:06] I'll start on that, I can multi-task for now :) [16:07:11] and I have the commands in up-arrow history [16:07:35] awesome :) [16:07:53] can you also share the commands if you have enough multi-tasking bandwidth? [16:07:55] bblack: thanks! [16:07:58] yes [16:08:14] it's not the kind of thing I would do on text/upload, but ok for cache_misc ;) [16:08:19] sure [16:08:33] root@neodymium:~# salt -v -t 5 -b 1 -C 'G@cluster:cache_misc and G@site:eqiad' cmd.run 'service varnish stop; rm -f /srv/sd*/varnish*; sync; sleep 3; service varnish start' [16:08:48] ; service varnishkafka start [16:08:51] root@neodymium:~# salt -v -t 5 -b 1 -C 'G@cluster:cache_misc and G@site:codfw' cmd.run 'service varnish stop; rm -f /srv/sd*/varnish*; sync; sleep 3; service varnish start' [16:09:23] root@neodymium:~# salt -v -t 5 -b 1 -C 'G@cluster:cache_misc and ( G@site:ulsfo or G@site:esams )' cmd.run 'service varnish stop; rm -f /srv/sd*/varnish*; sync; sleep 3; service varnish start' [16:09:31] (I think vk dies on frontend restart, not backend) [16:09:42] (so I'll amend the upcoming final command for that...) [16:09:48] great [16:10:54] root@neodymium:~# salt -v -t 5 -b 2 -G cluster:cache_misc cmd.run 'service varnish-frontend stop; sleep 2; service varnish-frontend start; puppet agent -t' [16:11:04] ^ puppet above should fix the 2x broken daemons [16:11:14] yep [16:11:31] it's going to slow down my command a lot though. oh well. [16:11:51] I was trying to be nice with the -b 2 before, could've done more, but it will just take longer. [16:13:34] would this be correct in order to ban wikidata's style.css? [16:13:35] salt -v -t 30 -C 'G@cluster:cache_misc and G@site:eqiad' cmd.run 'varnishadm ban req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"' [16:13:39] salt -v -t 30 -C 'G@cluster:cache_misc and not G@site:eqiad' cmd.run 'varnishadm ban req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"' [16:13:42] salt -v -t 30 -C 'G@cluster:cache_misc' cmd.run 'varnishadm -n frontend ban req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"' [16:15:13] not anymore, that wiki page is outdated, and fixing it for the general case is complicated [16:15:38] it's all about the backend ordering [16:15:54] for misc+text+upload, the backend ordering is now: eqiad, codfw, ulsfo+esams [16:16:04] for maps, it's: codfw, eqiad, ulsfo+esams [16:16:20] and all of those are subject to change based on cache::route_table's current contents at the time of the ban... [16:16:37] right so one would need to find the proper routing order for the given cluster and start from the backends, left to right I guess [16:17:05] yeah the first question is "which (single, unless we're in some transition test) DC is actually talking to the backend: purge that first" [16:17:31] but then also, we have 3-tier caching in some cases now, such as ulsfo->codfw->eqiad for text/upload/misc, and esams->eqiad->codfw for maps [16:17:47] which basically means "whichever of eqiad|codfw was the answer to the first question, purge the other one next before doing the last two" [16:18:29] cache_misc wipe finished [16:18:32] sounds like the task for a script! [16:18:35] still reproduces? [16:18:59] I'm hesitant to make banning easier. bans are dangerous and don't scale. they can stick around way longer than we'd expect... [16:20:17] I'm getting proper responses from wikidata [16:20:19] gehel: ? [16:21:12] ema: let's try again... 
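Putting bblack's ordering rules into one concrete sketch for cache_misc (hedged: the eqiad, then codfw, then esams/ulsfo backend order is the one stated above and changes with cache::route_table, so check that first; also note the whole ban expression has to be quoted inside cmd.run, otherwise the remote shell treats && as a command separator and splits the command there):

```
# Ban one object from cache_misc, walking from the backend-most tier outwards,
# then clearing the frontends last.
EXPR='req.http.host == "query.wikidata.org" && req.url ~ "^/style.css"'

salt -v -t 30 -C 'G@cluster:cache_misc and G@site:eqiad' cmd.run "varnishadm 'ban ${EXPR}'"
salt -v -t 30 -C 'G@cluster:cache_misc and G@site:codfw' cmd.run "varnishadm 'ban ${EXPR}'"
salt -v -t 30 -C 'G@cluster:cache_misc and ( G@site:esams or G@site:ulsfo )' cmd.run "varnishadm 'ban ${EXPR}'"
salt -v -t 30 -C 'G@cluster:cache_misc' cmd.run "varnishadm -n frontend 'ban ${EXPR}'"
```

As bblack notes just above, bans are dangerous and don't scale, so treat this as an illustration of the tier ordering rather than routine tooling.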
[16:23:27] ema, bblack: not able to reproduce the issue yet... [16:23:41] * bblack refuses to get hopes up yet :P [16:24:26] and bblack is right! as always: https://phabricator.wikimedia.org/P3040 [16:24:33] again the 3x Age: [16:24:46] funny side note: cache_fsm.c in varnish4 does this unconditionally: [16:24:48] http_SetHeader(req->resp, "Via: 1.1 varnish-v4"); [16:24:54] so the Via 1.1 means nothing :P [16:24:54] yeah there you go [16:25:33] x-cache line for a proper wikidata response: [16:25:34] < x-cache:cp1061 hit(1), cp3008 hit(6), cp3007 frontend pass(0) [16:25:44] broken response: [16:25:45] < x-cache:cp1061 hit(1), cp3007 hit(1), cp3007 frontend miss(0) [16:27:03] I'll go afk for a bit [16:27:10] I've got only success so far, but with cp3008 frontend pass(0) at the end of them all [16:27:23] so I guess it's when the hit-for-pass object expires [16:27:50] yup, just hit the bug, and: [16:27:57] < X-Cache: cp1061 hit(5), cp3010 miss(0), cp3008 frontend miss(0) [16:28:04] < Content-Length: 0 [16:28:14] (all my reqs were forced to 3008 as FE) [16:31:33] so the differences there that I can think of on FE miss vs pass: [16:31:43] miss uses chashed backend, pass uses random [16:33:01] well and whether we're doing the initial set beresp.uncacheable on a fresh miss, or re-using the hit-for-pass object to pass (but then also probably still hitting the same beresp.uncacheable logic) [16:57:32] I looked through the vmod_vslp code we use for v4 chashing, just in case [16:57:51] I don't see anything there where (because of its role as a director) it actually interferes in backend request protocol/logic stuff [16:59:48] I'm guessing if we remove the beresp.uncacheable stuff there (which allows giant objects to thrash frontend caches, too, but not relevant to these little CSS tests) [17:00:02] it would make them all fail? [17:01:33] mmh [17:01:58] did you also manage to reproduce without forcing 3008 as FE? [17:09:23] I didn't try [17:10:13] OK so it looks like I'm getting 3 Age headers when there are 2 passes in x-cache [17:10:23] < Age: 16 [17:10:23] < Age: 16 [17:10:23] < Age: 16 [17:10:23] < X-Cache: cp1051 hit(1), cp3010 pass(0), cp3007 frontend pass(0) [17:10:23] yeah [17:10:39] and 2 with a single pass [17:10:45] < Age: 49 [17:10:45] < Age: 49 [17:10:45] < X-Cache: cp1061 hit(2), cp3008 hit(8), cp3007 frontend pass(0) [17:11:09] that's hopefully an unrelated and relatively harmless bug [17:11:52] FWIW, curl -sv https://config-master.wikimedia.org/ also shows 3xAge: [17:12:09] and it's one of the req hostnames for which we do explicit return (pass) on all requests [17:12:16] so it's not special to this CL/chunked stuff [17:13:29] https://gerrit.wikimedia.org/r/288231 is something we could try and maybe later revert (partially or fully), just to see what the effect is [17:15:49] yeah, we could also remove the v4 conditionals on misc eventually [17:16:10] yeah I almost prepped a patch for that earlier [17:16:21] then this stuff came up. if we can't fix it, we may have to revert [17:17:39] +1 for trying to get rid of CL stuff in the meantime [17:19:23] looking at varnish4 code, the Age: header is unconditionally set in cnt_deliver [17:19:44] in bin/varnishd/cache/cache_req_fsm.c [17:20:13] also interesting in the same function is: [17:20:13] http_ForceField(req->resp, HTTP_HDR_PROTO, "HTTP/1.1"); [17:20:30] so apparently *responses* are always forced to 1.1, whatever else is going on [17:20:54] which is I guess what the 1.1 in Via means too. 
It means via 1.1 response, not necessarily via 1.1 fetch [17:22:06] ema: yeah given all related things, I really wish we didn't have CL-sensitive magic at all [17:22:39] uh so responses are always 1.1 [17:22:44] the concern is that without it, some cache_misc service will send a 5GB file in a response, and it will force out thousands of objects in the tiny frontend mem cache. [17:23:05] (well, relatively-tiny) [17:23:34] and I'm pretty sure we have multiple cache_misc services that are all about downloading large data dumps [17:24:58] and even if we don't care for cache_misc (worst case: we identify the giant-file services and force explicit return (pass) on whole hostnames or request paths that can have giant files)... [17:25:09] cache_upload has similar concerns that can't be worked around so easily... [17:25:28] right [17:25:44] well let's see if CL stuff is actually the culprit first [17:26:23] I think it will change the behavior, I donno if it will really be a direct culprit. probably more like a bug-aggravator [17:29:51] ok so finished puppet and did some style.css checks: [17:30:13] multiple reqs to cp3008 came back fine, all as cache hits on the frontend: [17:30:16] < X-Cache: cp1061 hit(1), cp3008 miss(0), cp3008 frontend hit(7) [17:30:40] but then I restarted varnish-frontend on cp3008 to wipe the cache entry there, and: [17:30:42] < Content-Length: 0 [17:30:45] < Age: 233 [17:30:48] < X-Cache: cp1061 hit(1), cp3010 hit(2), cp3008 frontend miss(0) [17:31:10] works fine again on next request (FE cache hit) [17:31:22] so on fe misses we consistently get broken responses [17:31:32] oh wait I'm wrong, that next request that worked said pass [17:31:37] < X-Cache: cp1061 hit(2), cp3007 hit(1), cp3008 frontend pass(0) [17:31:54] OK, just as earlier on [17:32:16] and then: [17:32:16] < Content-Length: 0 [17:32:17] < Age: 74 [17:32:17] < X-Cache: cp1061 hit(3), cp3010 miss(0), cp3008 frontend miss(0) [17:32:26] and then: [17:32:26] < Content-Length: 0 [17:32:26] < Age: 79 [17:32:27] < X-Cache: cp1061 hit(3), cp3010 miss(0), cp3008 frontend hit(1) [17:32:41] and now it's stuck that way in FE cache: [17:32:42] < Content-Length: 0 [17:32:42] < Age: 83 [17:32:42] < X-Cache: cp1061 hit(3), cp3010 miss(0), cp3008 frontend hit(2) [17:32:59] maybe we still have broken CL:0 responses in some backends? [17:33:04] (cached, I mean) [17:33:31] it seems to be 3010 as the intermediary that caused CL:0 via cp3008 FE [17:33:56] will wipe caches again, I'd rather wait on that than play guessing games [17:34:10] agreed [17:37:18] I wonder why it passed that one time, too [17:37:28] maybe old-VCL threads stick around a little longer now [17:37:57] either that or some aspect of the previous miss response made it uncacheable (other than CL) [17:41:16] < Age: 0 [17:41:16] < X-Cache: cp1061 miss(0), cp3010 miss(0), cp3008 frontend miss(0) [17:41:24] < Content-Length: 6304 [17:41:31] interestingly, not streamed to be with chunked response [17:41:49] oh, so proper response on miss? 
[17:41:58] yeah on full 3-layer miss [17:42:30] and also after another fe restart, on single-layer miss too [17:42:30] < Age: 29 [17:42:31] < X-Cache: cp1061 miss(0), cp3010 hit(1), cp3008 frontend miss(0) [17:42:36] < Content-Length: 6304 [17:43:20] so either the setting of beresp.uncacheable during the miss was affecting this (and then sure we had lingering bad CL:0 cache entries) [17:43:31] vmod_vslp theory is thrown out [17:43:59] or it's the backend-most do_stream=false that was screwing things up intermittently for the next layer up and sometimes creating CL:0 cache entries [17:44:04] assuming this holds [17:44:25] one thing I've noticed is that we're setting uncacheable=true without setting beresp.ttl [17:44:45] yeah but it will just take it from the TTL it would've used if it hadn't been uncacheable [17:45:11] for style.css there's etag info but no CC or Expires [17:45:23] so I guess TTL would use default_ttl, which is small on cache_misc I think [17:45:42] 120s I think? [17:46:10] yeah [17:46:16] ok [17:50:28] it's not intermittent, I've forced 100 triple-cache-miss in a row (by appending ?foo=N for 1..100 to style.css URL) and no failure [17:50:45] should probably put back just the do_stream=false part and try that again [17:52:24] although in this case, surely that's not a factor? [17:52:50] direct http/1.1 queries to wdqs1001 for style.css return a content-length, not chunked [17:53:02] so it must be the other part [17:53:06] right, it must be [17:53:48] well, I guess it still could be that. whatever this is, IMHO it's a varnish bug, just one with strange triggers [17:54:37] on the other hand style.css doesn't trigger beresp.http.Content-Length ~ "^[0-9]{10}" as well [17:55:07] but it may trigger !Content-Length if varnish streams it chunked between layers [17:55:42] so right, can't be do_stream=false [17:57:07] unless Content-Length disappears between layers? [17:58:44] yeah but the do_stream=false thing was only in the bottom-most layer [17:58:57] true! [17:59:04] where's varnish4's default vcl btw? /etc/varnish/default.vcl doesn't have a copy anymore to look at [17:59:37] bin/varnishd/builtin.vcl [18:00:16] I guess /usr/share/doc/varnish/builtin.vcl [18:00:24] ah! [18:01:15] heh [18:01:41] I see they're already partially-honoring Surrogate-Control in there, but not all of it, and only in VCL not in the core code :P [18:01:50] fun fun for when we implement it [18:05:13] ok so https://gerrit.wikimedia.org/r/288242 - revert just the frontend bit (which should be the only bit style.css can trigger anyways, regardless of other questions) [18:05:39] of course that begs the question - how the hell did CL:0 infect a middle-backend in earlier theories [18:05:42] but still [18:07:32] mystery [18:08:29] I'm a bit worn out on mysteries this week. Give me something easy to fix [18:09:10] automatically starting vk after varnishd restarts? :) [18:10:13] nope! [18:11:31] no CL:0 reproduction [18:11:51] but the results are confusing. I see middle-layer pass() behavior, and the frontend alternates between hit/miss/pass [18:12:07] probably X-Cache doesn't work right in varnish4 [18:12:14] mmh [18:12:17] (our X-Cache VCL, I mean) [18:12:35] the flow through hit/miss/pass and such must be different [18:12:44] I had to validate that it worked a certain way in v3 [18:15:31] well I donno, http://book.varnish-software.com/4.0/chapters/VCL_Basics.html looks pretty straightforward [18:15:51] and close enough to v3 for X-Cache construction purposes [18:16:50] as in we set it in vcl_deliver -> done? 
[18:19:33] well as in the X-CDIS headers set in vcl_(hit|miss|pass) that that builds from [18:20:03] oh! [18:20:11] https://gerrit.wikimedia.org/r/288244 is an improvement after staring at the VCL 4 flow chart, but I don't think that's our problem here [18:22:59] time to go. See you tomorrow! [18:25:18] cya! [18:30:06] I'm probably leaving for a while soon too. this seems like at least a mostly-stable state AFAICS, but there's still a lot of mystery [18:30:48] I wonder if we can set a flag in the object when creating hit-for-pass [18:31:00] to differentiate it from pass [18:31:18] well, in general, it would be nice to know at vcl_pass() time whether it was due to return (pass) or hit-for-pass [18:32:48] there must be a way to tell, but barring any easy way, perhaps setting obj.something = hfp when creating the hit-for-pass will make its use detectable after [19:29:39] 10Traffic, 06Operations, 10Pybal: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2286453 (10chasemp) p:05Triage>03High [19:30:19] 10Traffic, 06Operations, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2286456 (10chasemp) p:05Triage>03Normal [19:30:30] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10Jonas) I can confirm I am still having read errors every few requests. ``` curl -v 'https://query.wikidata.org/ven... [19:30:38] 10Traffic, 06Operations, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2286459 (10chasemp) p:05Triage>03Normal [19:33:10] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2286465 (10chasemp) [19:34:25] 07HTTPS, 10Traffic, 06Operations: Secure connection failed when attempting to preview or save pages - https://phabricator.wikimedia.org/T134869#2286483 (10chasemp) p:05Triage>03High [19:53:04] bblack: do you have any idea how much traffic is affected by this no content issue? Is it only WDQS? [19:53:45] or all cache misc? Can I do anything to help? I don't know much about VCL, but let me know if there is anything I can do... [20:03:00] 10Traffic, 06Operations, 06Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286540 (10hashar) #traffic people would sure... [20:09:06] gehel: only cache_misc I'm pretty sure. Likely not only WDQS. [20:10:50] gehel: I think we have it worked-around for now. or worst-case, if it's still there we revert https://gerrit.wikimedia.org/r/288242 and it's probably then worked-around for now. [20:10:57] but a lot of things about this are still mysterious [20:12:17] I checked a few minutes ago, I had > 50% of requests in error... [20:12:25] lemme check again... [20:13:25] 10Traffic, 06Operations, 06Project-Admins: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2286611 (10Danny_B) I know... The deal is, that whatever task intended for SPDY will most probably be valid even for HTTP/2. [20:14:11] bblack: nope, still a lot of issues... 
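For tests like these that need to hit one specific frontend (bblack forces cp3008 earlier in the log), pinning the hostname with curl --resolve is one way to do it. A sketch that assumes you can resolve the cp host and reach its HTTPS port directly, e.g. from a production shell:

```
FE_IP=$(dig +short cp3008.esams.wmnet)

# Send the request to that particular frontend while keeping the real TLS name,
# and pull out the headers that matter for this bug.
curl -sv -o /dev/null \
     --resolve "query.wikidata.org:443:${FE_IP}" \
     https://query.wikidata.org/style.css 2>&1 |
  grep -iE '^< (x-cache|content-length|age):'
```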
[20:14:51] ok [20:15:03] it seemed fixed before that partial revert, but we don't know why [20:21:01] reverting the change mentioned above [20:21:08] will probably have to wipe caches again, too [20:21:11] but not sure! [20:21:45] I have flagged a task for #traffic that is related to a download of https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz [20:21:58] I guess that hit misc-varnish and is impacted by whatever you guys are currently dealing with [20:22:11] anyway task is https://phabricator.wikimedia.org/T135038 . Might offers you a reproducible test case [20:22:24] hashar: see other channel [20:22:34] yeah no worries about this one ;} [20:22:42] it's the same one [20:23:34] great ;-} [20:24:51] 10Traffic, 06Operations, 06Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286678 (10hashar) [20:24:55] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2286680 (10hashar) [20:25:30] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10hashar) From T135038: `curl -I https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz` Example res... [20:27:51] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2286694 (10Dzahn) So then same hostname but different port. [20:29:02] so, cache_misc wiped again after reverting the partial revert from earlier, if that makes any sense [20:29:18] please re-test for inducing CL:0 issues again, and I'll be back in an hour or so... [20:29:28] gehel: ^ [20:33:19] bblack: looks good so far ... zero errors on 50 requests... [20:38:28] bblack: 100 requests, zero errors. Look good for real. I'll test again a bit later... [21:39:42] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2287218 (10BBlack) Status update: we've been debugging this off and on all day. It's some kind of bug fallout from cache_misc... [21:42:20] 10Traffic, 06Operations, 10Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2287225 (10mmodell) @dzahn yes I believe so, port `22280` by default. See [[ https://secure.phabricator.com/book/phabricator/article/notifications/#terminating-ssl-... [21:56:01] maybe-relevant: https://github.com/varnishcache/varnish-cache/issues/1918 [21:58:13] and from there we reach: https://github.com/varnishcache/varnish-cache/commit/e142a199c53dd9331001cb29678602e726a35690 [21:58:25] bugfix to 4.1, in varnish master branch, 6 days ago [21:58:42] I'd be willing to bet this is related somehow. at least worth trying. [22:01:38] now that I look, the master branch of varnish has a shit-ton of bugfixes since 4.1.2 [22:01:42] about time they cut a release :P [22:11:05] I'm starting to audit all the post-4.1.2 work in varnish/master. it's a lot, and a lot of it's refactoring or new-code. 
but I want to see if there are any other applicable and critical-sounding fixes we might want to backport in the meantime [22:23:53] oh they've already picked over a bunch to the 4.1 branch too [22:24:01] not the one that sounds relevant though, it's pretty new [22:45:47] ema: (for tomorrow) - we should (IMHO) roll up a new varnish4 package, based on what we have now with some upstream (as in varnish, not debian) commits applied on top. [22:47:15] ema: my recommendation would be: start with merging in upstream's 4.1 branch HEAD, which has a few fixes that might be interesting for us, and should be ff-only ( https://github.com/varnishcache/varnish-cache/commits/4.1 ) [22:47:55] ema: and then cherry-pick in 4x bugfixes from master. I filtered through the ones that aren't yet in 4.1, sound relevant/important, and apply cleanly on top of 4.1. [22:49:00] ema: 3215a91 "Make sure hp->body is always initialized." a6c14b0 "Correctly handle EOF responses with HTTP/1.1" 480b9b8 "dont panic on pthread_cond_timedwait being interrupted" 8cc18b8 "check for overrun before looking at the next vsm record" [22:50:29] ema: (and maybe recommend/remind varnish team that those look like potential 4.1 bugfix cherrypicks) [22:51:28] ema: the first two seem potentially directly relevant to the mess today, the latter two look relatively simple, low-risk, correct, and a good idea [22:58:14] ema: heh, sorry, those commit hashes are from after I cherry-picked them in. the correct hashes in upstream master are: [22:59:10] ema: a50c99f6 , e142a199 , 0895d2e6 , 0895d2e6 [22:59:32] ema: heh, grrrr: a50c99f6 , e142a199 , 0895d2e6 , f91c6f90
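A rough sketch of the git side of that plan, under the assumption that the local packaging tree has the upstream varnish-cache repository available as a remote (the four hashes are the corrected upstream-master ones from the last line above):

```
# Bring in the current upstream 4.1 branch head (expected above to be a
# fast-forward-style update on the 4.1.2 base), then cherry-pick the four
# bugfixes that so far only exist on master.
git remote add upstream https://github.com/varnishcache/varnish-cache.git
git fetch upstream

git merge upstream/4.1

for c in a50c99f6 e142a199 0895d2e6 f91c6f90; do
    git cherry-pick -x "$c"   # -x records the upstream hash in the commit message
done
```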