[01:49:26] ema: I got distracted by other things and haven't touched cache_misc's v3/v4 state. feel free to start that up in the morning if you get to it before me :) [06:45:48] 10Traffic, 06Operations, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2299986 (10jcrespo) [06:52:32] 10Traffic, 06Operations, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300016 (10jcrespo) [07:20:40] 10Traffic, 06Operations, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300088 (10jcrespo) There were hundreds of que... [07:22:05] 10Traffic, 06Operations, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300092 (10jcrespo) Another problem is that th... [07:42:24] <_joe_> ema: what did fix the varnish4 bug btw? [07:42:42] _joe_: reverting this one https://github.com/varnishcache/varnish-cache/commit/8b7cb51b76ad616143040ee955f1f6d9306251b9 [07:43:35] <_joe_> ema: did you understand what exactly is causing what we see? [07:44:22] the tl;dr is: bug not found in varnish 4.1.0, found in 4.1.1. Reproducible only setting 'grace', so I went digging through grace-related commits between 4.1.0 and 4.1.1. Reverting the commit mentioned above does the trick [07:44:33] but no, I didn't really get the details of what's going on yet [07:44:42] <_joe_> heh ok :) [07:45:48] the idea now would be: let's try the updated packages on misc, we don't expect disasters. In the meantime we'll file a bug upstream, I'm sure they'll be able to provide an explanation [08:56:10] 10Traffic, 10DNS, 06Operations: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Sjoerddebruin) [09:32:49] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2300527 (10ema) I've upgraded cache_misc to Varnish 4 again. We're now running a patched version of Varnish (4.1.2-1wm5) inclu... [10:30:56] 10Traffic, 10DNS, 06Operations: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Joe) The dns name is missing for the mobile version; adding it should be enough. [10:31:09] 10Traffic, 10DNS, 06Operations: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300701 (10Joe) p:05Triage>03Low [10:36:27] 10Traffic, 10DNS, 06Operations: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300711 (10Joe) I am not sure what is our policy about adding mobile records to the dns, but I am posting a change that should enable this. @bblack is there anything... 
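[editor's note: the missing-mobile-site report above boils down to a missing DNS record for the mobile hostname ("The dns name is missing for the mobile version; adding it should be enough"). A quick way to spot wikis in that state is to resolve both the desktop and the ".m." name; the sketch below does that with the standard library. The hostname pattern and the wiki list are illustrative assumptions, not taken from T135480.]
```python
#!/usr/bin/env python3
# Minimal sketch: check whether the ".m." mobile counterpart of a wiki
# hostname resolves. The "insert .m after the subdomain" pattern and the
# wiki list are assumptions for illustration, not taken from the task.
import socket

WIKIS = ["nl.wikipedia.org", "arbcom-nl.wikipedia.org"]  # example inputs

def resolves(hostname):
    """Return True if the hostname has at least one A/AAAA record."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

for wiki in WIKIS:
    # en.wikipedia.org -> en.m.wikipedia.org style mobile hostnames (assumed)
    parts = wiki.split(".")
    mobile = ".".join([parts[0], "m"] + parts[1:])
    print("%-30s desktop=%s mobile=%s" % (wiki, resolves(wiki), resolves(mobile)))
```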
[10:41:48] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300723 (10Joe) a:03Joe [10:44:56] 10Traffic, 06Operations, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300724 (10Joe) p:05Triage>03High [11:18:59] 10Traffic, 06Operations, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2300792 (10fgiunchedi) 05Open>03Resolved cleaned up junk metrics from graphite1001 and graphite2001 [12:41:00] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2300922 (10elukey) So from the past week I can see: - kafka1012 increased steadily its logsize from 12/05 ~20:00 UTC more o... [12:57:49] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2300952 (10elukey) Distribution of the leaders: ``` elukey@kafka1012:~$ kafka topics --describe | grep Leader | awk '{print... [13:00:00] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Peachey88) We should probably do the patchset to do all the arbcom wikis at once or none. Have e checked MobileFront handles security... [13:11:32] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300983 (10BBlack) >>! In T135480#2300953, @Peachey88 wrote: > We should probably do the patchset to do all the arbcom wikis at once or none. Is... [13:23:11] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2301039 (10Peachey88) >>! In T135480#2300983, @BBlack wrote: > Is MobileFrontend installed for them? Is there other configuration required that'... [14:02:22] <_joe_> bblack: oh it's already turning into a debate [14:02:23] <_joe_> sigh [14:16:52] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2301203 (10Ottomata) The increase in log size correlates to the time at which I set `inter.broker.protocol.version=0.9.0.X`.... [15:10:36] ema: have you messed with cache_misc state yet? (I didn't) [15:12:41] bblack: I've upgraded cache_misc this morning, yes [15:13:17] (and cannot repro anymore) [15:14:18] 10Traffic, 06Operations, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2301356 (10BBlack) After running for about a day, cp3048-frontend is now at 163G virtual and 86G resident. I've seen the resident part vary up and down in the ~70-90-ish r... [15:15:03] ema: ok, awesome :) [15:16:07] it would be nice to narrow it down more before filing upstream, but on the other hand, I'd hate to miss a timing race with them doing a new 4.1.x because we didn't get a report to them in time either. 
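[editor's note: the 4.1.1 regression discussed above was only reproducible with grace configured (see 07:44). A driver for that kind of repro (repeatedly fetching one object through a test varnish around its TTL/grace expiry and flagging short or non-200 responses) might look roughly like this. The URL, expected object size and timing are placeholders, not the actual repro script.]
```python
#!/usr/bin/env python3
# Rough sketch of a repro driver for the grace-related regression: keep
# requesting a single object around its TTL/grace expiry and flag
# truncated or non-200 responses. URL, expected length and timing are
# placeholders; the test varnish is assumed to have a short TTL/grace.
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:3128/test-object"  # test varnish frontend (assumed)
EXPECTED_LEN = 1024                        # known size of the test object (assumed)

def fetch():
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            return resp.status, len(resp.read())
    except urllib.error.HTTPError as exc:
        return exc.code, 0
    except OSError:
        return None, 0

while True:  # stop with Ctrl-C
    status, length = fetch()
    if status != 200 or length != EXPECTED_LEN:
        print(time.strftime("%H:%M:%S"), "BAD response:", status, length)
    time.sleep(1)
```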
[15:16:21] maybe file now and refine later, or something :) [15:16:28] (so that they're at least aware) [15:16:59] yes, although not optimal, the repro steps identified (two varnishes+bash script) are probably good enough [15:17:34] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2301358 (10Ottomata) Ok, I believe that when switching `inter.broker.protocol.version` and bouncing brokers, on startup they... [15:17:47] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Krenair) >>! In T135480#2300953, @Peachey88 wrote: > Have we checked MobileFrontend handles security properly for the private wikis?... [15:17:50] I'm off in some deep rabbithole the past day and a half so far [15:18:14] it all started with wanting to make further progress on minimizing TTLs, and our size-based VCL issues, and hit-for-pass behaviors, etc... [15:19:08] and then I remembered that it still sucks that we don't have cache hit stats in grafana anywhere (which usually means I end up inferring them from ganglia in/out request rates, and not nearly often enough to note when things change unexpectedly) [15:19:28] so then I started working on a varnishxcache (ripped from varnishxcps) to log hit stats... [15:19:37] then I decided I should clean up X-Cache output first to make it saner [15:19:57] and then I decided I had to make various puppet refactors before putting more redundancy in place puppeting varnishxcache... [15:20:02] it just keeps going and going [15:20:04] :) [15:20:55] also I think varnishxcps (and maybe some of the others? didn't look) has a strange sort of output bug that I'd like to fix before I copy most of it too [15:21:12] it flushes to statsd only when the counters reach a sum of 10000 counted things [15:21:24] which for text or upload is quite often (a few seconds at most) [15:21:39] but on low-rate clusters, it could be a long time between flushes, which makes the stats kind of jerky and inaccurate [15:21:45] right [15:22:08] so, will add a timer thing to that too so it flushes every 15s or something if stats don't get high enough to do the other trigger [15:25:03] are you planning on making varnishxcache compatible with both v3 and v4? [15:26:32] yeah [15:26:44] well the same way as before, there will just be a varnishxcache4 for now [15:27:14] miss-vs-pass is really tricky. even as a human, I'm not sure how I would categorize every case generically given the various behaviors on the clusters [15:27:32] so for now I'm just binning everything that wasn't hit|internal as "misspass". it's a start anyways. [15:28:01] it's a !hit basically [15:28:08] well !hit and !int :) [15:28:17] right :) [15:28:17] which means at every layer it was either miss|pass [15:28:37] there are strange cases though, because (esp on upload) we have layer-differential pass behaviors [15:28:47] so we can do things like hit,hit,pass and it's legit [15:29:08] (we passed a large object in FE due to size rules, but it hits in all backends) [15:29:57] also, we have internal response at multiple layers too (e.g.
misc cluster backend-most does the 404 on bad domainname, which can do int,miss,miss or int,hit,hit, etc) [15:31:09] the way it's binned up now, there's 7 keys in statsd: int-front, int-local, int-remote, hit-front, hit-local, hit-remote, misspass [15:31:35] hi, sorry to interrupt, I got one report in -operations that the graphite ui fails to load, and indeed I can see a 503 for e.g. https://graphite.wikimedia.org/content/js/ext/ext-all.js in the browser, fetching from graphite1001 itself seems fine (700k gzipped response) [15:31:38] misspass == we had to talk to the applayer. *-remote traffic had to move across a wan link. else it was serviced without leaving the traffic nodes at the DC it arrived at. [15:31:51] should be easy to layer that up in grafana with color sets that make it clear [15:37:27] ema: so cache_misc, we never put back the CL-sensitive code, and we didn't undo the undo of the hit->miss thing either [15:38:04] godog: yep I'm also getting 503s, but only for stuff under /content/js apparently [15:38:20] this one works for example https://graphite.wikimedia.org/content/img/carbon-fiber.png [15:39:16] ema: https://gerrit.wikimedia.org/r/289230 (I had pulled this out during the v3 downgrade) [15:39:25] I doubt related, but who knows [15:40:04] probably unrelated but still a Good Idea [15:41:01] ema: indeed, I'm not seeing 500s served from graphite1001 though, anyways I'm opening a task [15:41:09] thanks! [15:41:45] hehe thanks for taking care of this! [15:47:20] ema: that patch is salted to all, so if it had any effect it already did [15:47:44] (and it didn't affect the 503s) [15:48:20] 10Traffic, 06Operations: graphite.wikimedia.org 503s on some css/js resources - https://phabricator.wikimedia.org/T135515#2301423 (10fgiunchedi) [15:49:43] on those 503's, varnishlog at backendmost says: [15:49:44] - FetchError http read error: EOF [15:50:04] yeah I was looking into varnishlog too [15:51:48] perhaps this behavior has been introduced by reverting https://github.com/varnishcache/varnish-cache/commit/8b7cb51b76ad616143040ee955f1f6d9306251b9 [15:53:17] it shouldn't be, that's really a feature not a bugfix [15:53:47] my hunch is it's indirectly related to turning on HTTP/1.1 for nginx->varnish-fe for the varnish4 case [15:53:58] (same cause as the low-rate 503s on cache_text when doing so with varnish3) [15:56:28] when I get a tcpdump on the request, it's very strange [15:56:44] mmmh, not sure, I can repro with curl localhost:3128 as well as :80 [15:57:15] (and --http1.0) [15:57:20] tcpdump of cp1051->graphite1001:80 for my browser req, says: [15:57:23] 10.64.32.155.80 > 10.64.32.103.59467: Flags [S.], cksum 0x5ce3 (correct), seq 1924731484, ack 278851966, win 28960, options [mss 1460,sackOK,TS val 1339516407 ecr 1812435652,nop,wscale 9], length 0 [15:57:38] (SYN, new connection, not some kind of reuse for varnish->graphte) [15:57:52] ... [15:57:53] 10.64.32.103.59467 > 10.64.32.155.80: Flags [P.], cksum 0x59e1 (incorrect -> 0x08a7), seq 1:1082, ack 1, win 58, options [nop,nop,TS val 1812435652 ecr 1339516407], length 1081 [15:57:59] 0x0030: 4fd7 65f7 4745 5420 2f63 6f6e 7465 6e74 O.e.GET./content [15:57:59] 0x0040: 2f6a 732f 636f 6d70 6f73 6572 5f77 6964 /js/composer_wid [15:57:59] 0x0050: 6765 7473 2e6a 7320 4854 5450 2f31 2e31 gets.js.HTTP/1.1 [15:58:00] 0x0060: 0d0a 582d 436c 6965 6e74 2d49 503a 2031 ..X-Client-IP:.1 [15:58:07] incorrect? 
[15:58:11] so there's us pushing 1082 bytes of requests + headers [15:58:21] (ignore incorrect csum, that's just nic offloading) [15:58:25] ok [15:58:43] next two packets, from graphite1001 on this port: [15:58:46] 10.64.32.155.80 > 10.64.32.103.59467: Flags [.], cksum 0xf85b (correct), seq 1, ack 1082, win 61, options [nop,nop,TS val 1339516407 ecr 1812435652], length 0 [15:58:54] (acks our request data, 1082 bytes) [15:59:06] 10.64.32.155.80 > 10.64.32.103.59467: Flags [P.], cksum 0x5a8b (correct), seq 1:168, ack 1082, win 61, options [nop,nop,TS val 1339516410 ecr 1812435652], length 167 [15:59:15] 0x0000: 4500 00db 77ab 4000 4006 6cf0 0a40 209b E...w.@.@.l..@.. [15:59:15] 0x0010: 0a40 2067 0050 e84b 72b9 125d 109e f5b7 .@.g.P.Kr..].... [15:59:15] 0x0020: 8018 003d 5a8b 0000 0101 080a 4fd7 65fa ...=Z.......O.e. [15:59:15] 0x0030: 6c07 92c4 1f8b 0800 0000 0000 0003 448d l.............D. [15:59:15] 0x0040: 410a c230 1000 c593 04f2 06f7 01ad 6db4 A..0..........m. [15:59:17] 0x0050: 7888 2048 1505 2d7a 08f6 1cd2 b5ac 4a12 x..H..-z......J. [15:59:20] 0x0060: da20 d42f f817 bfa8 bd88 d761 98d9 2975 .../.......a..)u [15:59:22] 0x0070: 4ac4 44c0 344d e1b8 e7ec a0db 1017 aea2 J.D.4M.......... [15:59:25] 0x0080: 0b61 25a1 c42a 8299 8035 9a5e 1119 88b9 .a%..*...5.^.... [15:59:27] 0x0090: cc66 3213 b02d 1467 b9b3 016d 8855 e751 .f2..-.g...m.U.Q [15:59:30] 0x00a0: 82f6 fe4e 4607 7236 b9ea 876e 4d43 3e70 ...NF.r6...nMC>p [15:59:32] 0x00b0: 76d6 4d27 2177 ee46 18c1 ca18 f421 de58 v.M'!w.F.....!.X [15:59:35] 0x00c0: d35f 6cfd 8ffc 8884 fa49 9e33 cec6 afd1 ._l......I.3.... [15:59:37] 0x00d0: e0bd 5c94 c3cf 1700 00ff ff ..\........ [15:59:40] then sends 167 bytes of binary stuff with no response headers. wtf? [16:01:01] on the 503s on the text cluster (varnish3 + nginx http/1.1), varnish in this backend case would say it couldn't parse the response (HTTP/1.1 200 OK line) [16:01:06] which this would match up with [16:01:15] in varnish4 instead I get: - FetchError http read error: EOF [16:01:23] but I suspect they're complaining about the same condition [16:01:49] does something awful about the request make the backend not send response headers? [16:02:11] is timewait reuse causing us to do something horrible and reuse an existing connection thinking it's new? (surely not...) [16:02:29] well I can repro hitting varnish directly with curl [16:02:45] the response from graphite1001 continues on with a bunch more binary data, which is probably the gzipped object [16:02:49] but I never see headers [16:03:13] the setup on graphite1001 is apache talking fastcgi with uwsgi fwiw [16:08:45] hmmm [16:08:46] - BereqHeader X-Forwarded-For: 107.132.130.216, 10.192.48.29, 10.192.48.29, 10.192.48.29, 10.192.48.29, 10.192.16.138, 10.192.16.138 [16:08:50] wtf is that [16:09:13] I think varnish is doing XFF now, redunantly with our VCL XFF handling [16:10:25] I've started a test varnishd on cp1044: [16:10:27] /usr/sbin/varnishd -a :3128 -b graphite1001.eqiad.wmnet:80 -p default_ttl=5 -p default_grace=10 -F [16:10:34] cannot reproduce [16:11:16] perhaps something VCL/VMOD related? [16:11:51] I don't know [16:12:05] the tcpdump looks the same on the graphite1001 side FWIW [16:12:18] it does respond to the HTTP request with binary data with no HTTP response line/headers, at the TCP level [16:12:25] on a fresh TCP connection [16:12:29] wtf [16:16:35] I'm going to fix the XFF thing anyways. 
that code is complex on our side [16:17:33] in our varnish3 vcl, we do a bunch of heavy processing on XFF (and related headers from nginx), and *then* we append ourselves to XFF. In varnish4, they've already appended self to XFF before we enter vcl_recv. So fixing that up is a bit of a mess. [16:18:37] s/self/client.ip/ I mean [16:20:47] in the meantime I've found https://copyninja.info/blog/comressed-response.html which seems to match, mod_wsgi + mod_deflate [16:22:20] I'll give it a try and disable mod_deflate [16:24:03] interesting... [16:24:19] we don't really care about deflate anyways, only gzip, AFAIK [16:24:41] (varnish with do_gzip + https_support normalizes AE:gzip,deflate -> AE:gzip anyways too) [16:25:27] and in fact, with the buggy responses the request does have just plain "AE: gzip" [16:25:54] ah ok, nevermind heh [16:26:02] no, it's still valid [16:26:24] I'm just saying, we didn't *ask* for deflate. mod_deflate + mod_wsgi could still be at fault, with some strange catalyst involved [16:26:36] ema: does your repro on cp1044 work if curl uses --compressed? [16:27:52] *nod*, I'll leave mod_deflate disabled, doesn't seem like it had an effect [16:30:47] oh :/ [16:30:52] bblack: I *cannot* repro on cp1044 without VCL [16:31:08] however, with --compressed I get CL:0 :/ [16:31:22] godog: are you sure? [16:31:36] I get a better backend response now, no funny binary stuff [16:31:54] on the backend fetch on my failing test req: [16:31:54] - ObjProtocol HTTP/1.1 [16:31:54] - ObjStatus 200 [16:31:54] - ObjReason OK [16:32:11] the initiating browser did get CL:0 though [16:32:18] but still, that just means we have two bugs in play [16:32:35] try this: [16:32:37] curl -v -H 'Host: graphite.wikimedia.org' -H "Authorization: Basic BLABLA" http://cp1044.eqiad.wmnet:3128/content/js/ext/ext-all.js [16:32:45] and the same story with --compressed [16:33:28] ok well the CL:0 thing is still separate IMHO [16:33:32] bblack: ish, I still saw the occasional 503s while reloading from my browser for ext-all.js [16:33:54] and I think if godog puts mod_deflate back in, you can get a 503 with your cp1044 VCL-free repro, using --compressed [16:33:59] (that's my theory anyways) [16:34:37] I'm not getting 503s anymore with FF [16:34:55] because mod_deflate is disabled [16:34:58] now you just get CL:0 [16:35:05] err ok let me know what I should do with mod_deflate, was a bit too trigger happy there [16:35:18] godog: keep it off, IMHO [16:35:28] anyways, cp1051 varnishlog on the backend request now says: [16:35:30] - BerespProtocol HTTP/1.1 [16:35:30] - BerespStatus 200 [16:35:30] - BerespReason OK [16:35:32] - BerespHeader Last-Modified: Wed, 31 Dec 2014 16:43:41 GMT [16:35:34] - BerespHeader Content-Type: application/javascript [16:35:37] - BerespHeader Vary: Cookie, Accept-Encoding [16:35:39] - BerespHeader Content-Encoding: gzip [16:35:42] - BerespHeader Date: Tue, 17 May 2016 16:31:03 GMT [16:35:44] and that's it, with zero bytes [16:35:51] I don't think it's legal for varnish to do an HTTP/1.1 response with no content-length and no connection:close, right? [16:35:54] sorry [16:36:06] I don't think it's legal for *graphite* to do an HTTP/1.1 response with no content-length and no connection:close, right? 
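[editor's note: the case being debated above is a close-delimited HTTP/1.1 response: no Content-Length, no Transfer-Encoding, no Connection: close, and the body simply ends when the server closes the socket. The toy origin below reproduces that shape, and Python's stdlib client falls back to reading until EOF. It is a stand-in for illustration, not the real apache/mod_wsgi stack.]
```python
#!/usr/bin/env python3
# Toy origin that answers like graphite1001 appeared to here: HTTP/1.1,
# no Content-Length, no Transfer-Encoding, no Connection: close, body
# delimited only by closing the socket.
import http.client
import socket
import threading

def origin(listener):
    conn, _ = listener.accept()
    conn.recv(65536)  # read (and ignore) the request
    conn.sendall(b"HTTP/1.1 200 OK\r\n"
                 b"Content-Type: application/javascript\r\n"
                 b"\r\n"
                 b"0123456789")   # 10 bytes, length only implied by the close
    conn.close()                  # close-delimited body (RFC 7230 3.3.3)

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=origin, args=(srv,), daemon=True).start()

# The stdlib client treats "no CL, no chunked" as read-until-close.
client = http.client.HTTPConnection("127.0.0.1", port)
client.request("GET", "/content/js/ext/ext-all.js")
resp = client.getresponse()
print(resp.status, resp.getheaders())
print("body bytes read:", len(resp.read()))   # expected: 10, not 0
```
(curl's handling of the same case is quoted further down in the log: "no chunk, no close, no size. Assume close to signal end".)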
[16:37:18] https://github.com/varnishcache/varnish-cache/issues/1918 comes to mind [16:38:09] that's the one we patched, though [16:38:23] right [16:39:07] well, but the patch doesn't fix this case [16:39:20] graphite is not giving Connection:close, which that fix requires to make it work [16:39:23] it's just closing [16:40:18] and I think the nginx 1.1 behavior is triggering too, since otherwise varnish would pass the request down as 1.0 and it would work [16:40:49] but IMHO, the graphite server is at fault here. If it's not going to do chunked and not going to do content-length, and it's HTTP/1.1, doesn't it have to send Connection:close? [16:40:50] cannot repro with curl --http1.0 [16:41:05] ack, I'll leave mod_deflate off, also graphite1003 is running the same software but on jessie, so e.g. mod_wsgi is more recent [16:41:07] or is that not true standards-wise? [16:45:03] mmmh yeah now after godog's appserver change http1.0 requests seem to be fine [16:45:53] so, we have a couple of issues here, to recap: [16:46:30] 1. apache mod_wsgi + deflate buggy under at least some request conditions that varnish can trigger (we should perhaps audit the fleet and disable deflate everywhere, because we never need it and this could be a common bug even in non-wsgi scenarios) [16:47:06] 2. I think varnish is actually buggy on the close-delimited transfer here, and the catalyst is nginx using 1.1 instead of 1.0 (vs unbroken behavior otherwise) [16:47:40] my best read of https://tools.ietf.org/html/rfc7230#section-3.3 doesn't seem to indicate Connection:close is required for it to be "close-delimited" [16:47:52] but the #1918 fix seems to require an explicit Connection:close header to work around the issue [16:48:20] that apache's not setting that header is stupid, but I don't think it actually violates standards. maybe experts would disagree? it's confusing. [16:50:33] technically you could argue that C:keepalive is the HTTP/1.1 default, and C:close should be explicit. But one could interpret graphite's response pattern as "I intended to keepalive, but at the end of that transfer I decided to close anyways for some unrelated reason, and now it's a close-delimited transfer" [16:51:17] graphite is basically behaving like an HTTP/1.0 responder here, even though it's calling it HTTP/1.1 [16:52:22] interesting, does it do the same on graphite1003 too? should be otherwise identical modulo software versions [16:52:52] although [16:52:57] 6.1 of the same RFC says: [16:52:58] " A server that does not support persistent connections MUST send the [16:52:58] "close" connection option in every response message that does not [16:53:01] have a 1xx (Informational) status code. [16:53:25] but also: [16:53:26] 6.6. Tear-down [16:53:26] The Connection header field (Section 6.1) provides a "close" connection option that a sender SHOULD send when it wishes to close the connection after the current request/response pair. 
[16:53:31] not SHOULD not MUST [16:54:37] heh [16:54:42] *note SHOULD not MUST [16:56:00] graphite1003 behaves the same as graphite1001 [16:56:18] (I'm testing with telnet and a copy of legit HTTP/1.1 request+headers copied from my older browser request) [16:56:23] HTTP/1.1 200 OK [16:56:23] Last-Modified: Wed, 31 Dec 2014 16:43:41 GMT [16:56:23] Content-Type: application/javascript [16:56:23] Vary: Cookie, Accept-Encoding [16:56:23] Content-Encoding: gzip [16:56:25] Set-Cookie: sessionid=q8pv9xujaaoi1hfczflzncgjszgt9kya; expires=Tue, 31-May-2016 16:55:32 GMT; httponly; Max-Age=1209600; Path=/ [16:56:28] M;WW[o�6~ϯ8��# [16:56:48] it's a 1.1 response, CE:gzip, no TE:chunked, no Content-Length, server closes after gzip data is done [16:58:12] I just don't know whether that's really varnish's or graphite100x's fault for sure [16:58:23] (the failure to interpret that response as legit) [16:58:35] bblack: so what is our current theory now? :) That varnishd returns empty responses using http/1.1 and no TE:chunked, no CL? [16:59:02] ema: this is different than the other CL:0 bug [16:59:17] varnish isn't translating something else into CL:0 during a 304 or whatever [16:59:51] varnish, when querying the appserver, decides the CL is 0 because they disagree on interpreting the HTTP RFCs in this case [17:00:20] if the appserver responds with HTTP/1.1 and no TE and no CL (which, at best, can only be interpreted as close-delimited content), varnish calls it CL:0 [17:00:26] (right away, when it receives it) [17:00:56] mmmh, but then an appserver response like this one should also trigger right? [17:00:59] HTTP/1.1 200 OK [17:01:01] Content-Type: text/plain [17:01:04] The quick brown fox [17:01:15] yes [17:01:15] well with an empty line which disappeared because of copy-paste fail [17:01:33] if you copy the vtc they did in #1918, but leave off the Connection:close in the server response, we have this case [17:01:58] and we could probably build on #1918's patch and work around it if we're sure it's varnish's fault standards-wise [17:02:29] I tend to assume it is, because we've got disparate versions of existing long-term apache software doing this, and varnish4 is relatively-new [17:03:38] IMHO either way it's asinine of Apache not to insert Connection:close in the headers in this case in the first place. It surely already knows that's what it's doing. But if it's standards-legal-but-stupid, it's still varnish's fault. [17:04:14] <_joe_> bblack: don't assume what apache does conforms to the standards [17:04:24] <_joe_> I found it failing in that respect a lot of times [17:04:42] well if nothing else, existing long-term apache behavior would be a de facto standards-interpretation that varnish would need to support, right?
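[editor's note: the disagreement above, rendered as the RFC 7230 section 3.3.3 response-framing rules in plain Python. This mirrors the intent of the local patch under discussion (an HTTP/1.1 response with no Content-Length and no chunked encoding is delimited by the connection closing, whether or not the server also sent Connection: close); it is not the actual varnish C code.]
```python
#!/usr/bin/env python3
# Response-body framing per RFC 7230 section 3.3.3, written out for
# illustration of the argument above; not the varnish implementation.

def response_framing(version, headers):
    """How the body of a response is delimited.

    version: "HTTP/1.0" or "HTTP/1.1"; headers: lower-cased name -> value.
    """
    assert version in ("HTTP/1.0", "HTTP/1.1")
    te = headers.get("transfer-encoding", "")
    if "chunked" in te.lower():
        return ("chunked", None)
    if "content-length" in headers:
        return ("length", int(headers["content-length"]))
    # No framing information left: the body can only be delimited by the
    # server closing the connection, whether or not it also said
    # "Connection: close". The #1918 fix only accepted this with the
    # explicit header, which is how responses like graphite's came out
    # as CL:0 instead.
    return ("until-close", None)

# The graphite1001 response pasted earlier in the log:
print(response_framing("HTTP/1.1", {
    "content-type": "application/javascript",
    "content-encoding": "gzip",
    "vary": "Cookie, Accept-Encoding",
}))  # ('until-close', None): read until the server closes
```
The actual change went out locally as 4.1.2-1wm6 later in the log and was then reported upstream (see the issue links at the end).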
[17:05:02] so far I can only find confusion even in the update 2014 RFCs for HTTP/1.1 [17:05:08] <_joe_> my experience is that varnish usually doesn't forgive apache's shortcomings [17:05:31] <_joe_> I had to work around those errors a few times already [17:05:45] <_joe_> but they were mostly linked to the horrible mod_proxy_fcgi, to be fair [17:06:14] well in this case, they may be linked to the horrible uwsgi-related modules [17:08:14] ema: I think a varnish-level workaround would be to look at the #1918 patch's: if (hp->protover == 11 && (request || !http_HdrIs(hp, H_Connection, "close"))) [17:08:48] and amend that to also catch the case of "no connection: header, no CL header, no TE header -> assume connection:close-like behavior" [17:08:59] (for http/1.1) [17:09:12] I have to go shortly, thanks for your help so far! [17:09:32] FWIW curl also notices this condition and assumes close, * no chunk, no close, no size. Assume close to signal end [17:14:35] bblack: yep! [17:15:16] working on that... [17:25:52] at least this time the vtc test is straightforward [17:29:37] ema: try https://phabricator.wikimedia.org/P3119 with your VTC? [17:29:58] IMHO, we could/should re-open #1918 and put a pull req in there for that + updated VTC [17:30:07] (if it works) [17:30:33] oh wait, my patch is still wrong [17:30:41] I didn't see it till I saw it in a different view in the paste :) [17:34:01] ema: https://phabricator.wikimedia.org/P3119 updated [17:35:39] bblack: https://phabricator.wikimedia.org/P3120 does it make sense? [17:36:10] yeah well, s/bodylen/body/ [17:36:24] in txresp [17:38:11] is it meant to update the existing one? [17:38:37] so really looking at the VTC side of this... [17:38:39] https://github.com/varnishcache/varnish-cache/commit/e142a199c53dd9331001cb29678602e726a35690 [17:38:48] it's meant as a separate test case [17:39:06] ^ the VTC added here already tests both cases. It explicit checks for CL:0 as the correct behavior if no conn:close in this situation. [17:40:04] ema: updated https://phabricator.wikimedia.org/P3119 again [17:40:19] I think that one-line patch to the existing VTC matches what we're fixing there [17:40:34] the updated VTC should fail w/o the code patch, and succeed with it [17:41:30] it does fail indeed with 4.1.2-1wm5 [17:43:38] hmmm I wonder about VTC's "server" behavior here though [17:43:54] will it close the connection on the first request after the 10 bytes, without asking to set CL:close? [17:46:56] building wm6 now [17:47:39] http1/cache_http1_proto.c:310:23: error: passing argument 1 of 'http_ForceHeader' discards 'const' qualifier from pointer target type [-Werror] http_ForceHeader(hp, H_Connection, "close"); [17:48:00] heh [17:48:33] well there's two ways to fix that I guess, make it non-const for the whole function, or just carry the condition another way [17:48:38] patch probably smaller with the latter [17:50:53] so the goal is to avoid returning BS_NONE in this case right? [17:51:46] this case -> http1.1, no-cl, no chunked... [17:53:02] and return BS_EOF instead [17:53:43] yeah I'm just trying to do it cleanly and efficiently and kinda look like their code style [17:53:52] :) [17:59:53] heh actually [18:00:01] when you work through all the convoluted logic... [18:01:36] the conditions we're looking for are all already implicitly the case, except for the Conn:close bit... [18:02:13] ema: https://phabricator.wikimedia.org/P3119 updated (again!) 
[18:02:35] oh that's nice :) [18:02:41] working through the logic, when that line has been reached, it's implicit that cl==-1 (meaning no CL: header), and TE:chunked was not set [18:02:58] the only thing left is to not care whether they said CL:close or not at that point [18:03:04] (and not affect requests) [18:03:50] I wonder whether requests are special either. maybe http/1.1 clients can just as awfully behave in the real world heh [18:04:03] but for now, this is close enough [18:04:09] (I think!) [18:04:35] I'm still not sure if VTC's little "server" does close implicitly after every response (close the actual tcp conn I mean) [18:04:54] well we always have nc [18:04:59] building [18:07:01] bblack: varnishtest fails, nc+curl works [18:07:05] it may need a "close" instruction between the VTC server's "txresp -nolen -bodylen 10" and the following "rxreq", just not sure [18:07:15] maybe that's always what happens in this case [18:07:35] seems like even the existing tests would be screwing if it didn't already do it though [18:08:32] mmh varnishtest hangs, actually [18:08:54] and fails after HTTP rx timeout (fd:11 30000 ms) [18:08:58] yeah [18:09:05] try the close hack [18:10:12] or maybe close, accept? [18:10:23] close alone isn't enough [18:10:47] close + accept is \o/ [18:11:13] ok [18:11:56] so this new https://phabricator.wikimedia.org/P3119 is what works, right? [18:12:16] correct [18:12:20] I should really set up a good varnish build->patch->test environment for myself heh [18:12:36] it's crazy to be passing this through you while working on my stupid mac commandline for patches :) [18:13:08] anyways, let's see if it breaks or fixes anything on cache_misc as a local package patch? [18:13:19] I'm running the stock tests now [18:13:25] ok [18:13:48] so far, so good [18:19:18] grr, libvmod_debug strikes again [18:19:32] it's not installed by default so all tests depending on it are failing [18:20:25] heh [18:20:39] that's strange [18:20:54] we could start installing it I guess [18:21:02] or is it not really installable and meant for build-time-only? [18:21:34] I don't think it's installed by the ./configure ; make ; make install jazz [18:21:54] but I mean it's a vmod, if you don't use it what can go wrong? [18:25:36] :) [18:27:02] sudo cp ./lib/libvmod_debug/.libs/libvmod_debug.so /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_debug.so [18:27:05] there you go [18:28:03] bblack: re-running the tests, in the meantime: https://gerrit.wikimedia.org/r/289254 [18:30:11] I've gone back to working on the horrible V3-vs-V4 X-Forwarded-For mess for now :) [18:31:33] good luck! [18:32:56] poor wikibugs [18:42:02] no regressions, will build on copper and upload to carbon [18:45:16] yay! [18:46:09] xorg is among the build-deps apparently lol [18:48:55] bblack: 4.1.2-1wm6 is apt-get installable now [18:49:21] I've gotta go soon, feel free to report the bug upstream if you wish [18:49:27] otherwise I'll do that tomorrow :) [18:50:06] see you! Let's see how many other bugs we'll find before Friday! [18:50:45] ema: thanks! cya later :) [20:08:47] 10Traffic, 06Operations, 13Patch-For-Review: graphite.wikimedia.org 503s on some css/js resources - https://phabricator.wikimedia.org/T135515#2302786 (10BBlack) p:05Triage>03Normal Misc-cluster updated to 4.1.2-1wm6, patch above seems to fix (along with the mod_deflate disable). We may need further comm... [22:04:44] ema: https://github.com/varnishcache/varnish-cache/issues/1954 [22:38:00] ema: also https://github.com/varnishcache/varnish-cache/issues/1955
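[editor's note: circling back to the varnishxcache discussion at 15:19-15:31: a rough sketch of binning requests into the seven statsd keys named there, plus the "flush every 15s as well as every 10000 events" behaviour described for low-rate clusters. The per-layer input format, statsd prefix and thresholds are assumptions for illustration; the real tool is driven by varnish's shared log rather than canned tuples.]
```python
#!/usr/bin/env python3
# Sketch of the varnishxcache idea: bin each request into the seven
# statsd keys named in the log and flush every 10000 events *or* every
# 15s, so low-traffic clusters still report smoothly. (A real tool would
# also flush from a timer even when no new events arrive.)
import socket
import time

BINS = ("int-front", "int-local", "int-remote",
        "hit-front", "hit-local", "hit-remote", "misspass")

def classify(entries):
    """entries: [(site, status), ...] ordered backend-most -> frontend.

    'status' is hit/miss/pass/int as in the log's examples
    ("hit,hit,pass", "int,miss,miss"). Returns one of BINS.
    """
    front_site = entries[-1][0]                 # DC the request arrived at
    for i in range(len(entries) - 1, -1, -1):   # walk outermost -> inner
        site, status = entries[i]
        if status not in ("hit", "int"):
            continue
        if i == len(entries) - 1:
            return status + "-front"
        return status + ("-local" if site == front_site else "-remote")
    return "misspass"                           # had to talk to the applayer

class Flusher:
    def __init__(self, host="localhost", port=8125, prefix="varnish.xcache",
                 max_events=10000, max_age=15):
        self.addr, self.prefix = (host, port), prefix
        self.max_events, self.max_age = max_events, max_age
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.counts = dict.fromkeys(BINS, 0)
        self.total, self.last_flush = 0, time.time()

    def add(self, key):
        self.counts[key] += 1
        self.total += 1
        if self.total >= self.max_events or time.time() - self.last_flush >= self.max_age:
            self.flush()

    def flush(self):
        payload = "\n".join("%s.%s:%d|c" % (self.prefix, k, v)
                            for k, v in self.counts.items() if v)
        if payload:
            self.sock.sendto(payload.encode(), self.addr)
        self.counts = dict.fromkeys(BINS, 0)
        self.total, self.last_flush = 0, time.time()

# usage sketch: flusher.add(classify(entries)) once per request, e.g. the
# log's hit,hit,pass case (frontend passed on size, local backend hit):
print(classify([("eqiad", "hit"), ("esams", "hit"), ("esams", "pass")]))  # hit-local
```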