[04:42:32] 07HTTPS, 10Traffic, 06Operations: Secure connection failed when attempting to preview or save pages - https://phabricator.wikimedia.org/T134869#2280304 (10MSJapan) I would just like to add that I've having the same issue, but it occurs when saving almost any edit, as well as trying to XfD in Twinkle, and I'm... [06:39:19] curl -I https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz [06:39:22] [...] [06:39:25] content-length:25175751 [06:39:30] age:203 [06:39:30] x-cache:cp1045 hit(2), cp3010 hit(1), cp3007 frontend miss(0) [06:39:39] with https://gerrit.wikimedia.org/r/#/c/288350/ applied [06:40:06] I've currently upgraded cp3007 only, but I'd say we found a solution [07:32:29] ema: I've uploaded the new Linux package to carbon yesterday, the meta packages don't need to be changed this time [07:32:41] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2288343 (10Gehel) a:05Gehel>03None [07:58:08] moritzm: thanks [07:59:03] gehel: I've upgraded all misc caches to a patched version of varnish that I think solves the issue [07:59:11] please try to reproduce when you have a sec [07:59:20] ema: trying right now... [08:12:08] gehel: and? :) [08:12:43] ema: 2 errors on 100 requests, lemme check [08:14:04] https://phabricator.wikimedia.org/P3048 and https://phabricator.wikimedia.org/P3049 [08:14:20] I still see the 3x Age header... [08:14:55] yeah that is likely unrelated [08:15:45] mmh, both failed requests got misses in esams and a hit on cp1061 [08:19:31] ema: was that with all the CL-sensitive VCL still gone? [08:19:48] bblack: yes I haven't touched that [08:19:57] but I could repro the problem even with CL-stuff gone [08:20:17] oh ok [08:21:05] after upgrading I cannot reproduce the issue with https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz anymore [08:21:42] yeah but gehel still hits it [08:21:47] indeed [08:22:11] right after re-(re-)re-moving the last of the CL-stuff I couldn't reproduce, and everyone else said it was ok at the time too heh [08:22:38] it may be hard to hit at times. I think the CL-stuff just made it easier to repro. [08:22:48] so a good order for cache wiping would be: eqiad, codfw, esams, ulsfo, right? [08:23:43] yeah for misc: eqiad, codfw, (esams+ulsfo) for the backends [08:24:35] try adding a50c99f6 "Make sure hp->body is always initialized." maybe. That one looks related to me too. [08:24:46] anyways, I'm going back to sleep, back in a few hours :) [08:24:52] sleep well! [08:26:24] bblack: and come back rested! [08:27:57] oh, I had to take one more peek. I didn't notice last night that that patch is to varnishtest. so, not related :( [08:28:34] uh? bin/varnishd/http1/cache_http1_proto.c [08:29:13] https://github.com/wikimedia/operations-debs-varnish4/commit/9560c50fbeca1c32a0ddf52ef84ff9776c8d8099 [08:30:30] oh, you mean the hp->body one [08:33:45] let me try to wipe the caches again just to be sure [08:40:33] gehel: could you please give it another go? [08:40:40] sure ... [08:45:40] looks good so far ... [08:50:35] 100 requests, no error. Looks good. (Or I might just be lucky again...) [08:51:02] yeah given the nature of this bug I wouldn't get too excited (just mildly happy) [08:51:48] * gehel is midly happy [08:54:11] gehel: is your test still running or does it stop at 100 reqs? [08:54:32] it stops at 100 requests, but I launched it again, just to be sure... 
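A minimal sketch of the kind of repro loop in use here — the actual test script isn't shown in the backscroll, so the URL, request count, pacing, and output format are assumptions modeled on the curl commands and SIZE/X-Cache lines quoted later in the log:

    #!/bin/bash
    # Repeatedly fetch a test URL and print the headers relevant to the bug,
    # plus the number of body bytes actually received; a truncated or empty
    # response shows up as SIZE smaller than Content-Length (often 0).
    url="https://query.wikidata.org/style.css"   # assumed test URL
    for i in $(seq 1 100); do
        curl -s -o /dev/null -D /tmp/headers.$$ -w 'SIZE: %{size_download}\n' "$url"
        grep -Ei '^(age|x-cache|content-length):' /tmp/headers.$$
        echo '---'
        sleep 1
    done
    rm -f /tmp/headers.$$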
[08:54:38] great, thanks [08:55:29] still no errors... What is the level above "mildly happy"? [08:56:00] Damn, I spoke too soon, 1 error again [08:56:28] https://phabricator.wikimedia.org/P3050 [08:57:33] just 1 error on this batch of 100 requests... [08:59:01] mmh [09:11:41] this also might be relevant https://varnish-cache.org/trac/ticket/1858 [09:12:00] Stale hit-for-pass objects generating 200 responses with missing content [09:17:47] yeah but it should be in 4.1.2 [09:22:56] and now style.css got cached as empty on cp3008 [09:23:08] X-Cache: cp1061 hit(1), cp3010 miss(0), cp3008 frontend hit(9) [09:23:12] curl: (18) transfer closed with 6304 bytes remaining to read [09:24:08] so yeah, certainly not solved :( [09:41:42] cached with Age: 995 BTW [09:41:45] * ema is confused [09:42:39] flushing cp3008's frontend I get a proper response again [10:26:03] there shouldn't be any HFPs for these test URLs right now, though [10:29:28] another thing to look at: this probably doesn't affect all backends/URLs [10:29:39] we have examples in wdqs, downloads, stats [10:29:57] is there a common factor in them? [10:30:37] re: Age, keep in mind that doesn't indicate how long it's been in the FE (or anywhere) [10:31:05] that's just the Age since that response was first generated at some origin (some backends may even send our backend-most cache non-zero Age) [10:35:47] wdqs uses nginx, download uses apache [10:35:59] so it's probably not some common subtle bug in origin output [10:36:27] so for a while I thought there was a difference between T134989 and T135038 given that the latter seemed to be fixed by https://github.com/varnishcache/varnish-cache/commit/e142a199c53dd9331001cb29678602e726a35690 [10:36:27] T135038: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038 [10:36:28] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989 [10:36:37] s/download/releases/ [10:36:52] sorry, I meant the releases one seemed to be fixed [10:37:06] however now I can reproduce both [10:37:19] the trick is triggering misses to find repros faster [10:37:39] I donno if hit->miss at different layers is an important part too, though [10:37:43] triggering full-miss is easy [10:38:46] but if you get an FE cache hit with valid output, and it's a default 120s cache object, you're probably not going to get a reproduction there for 120s [10:38:57] right [10:38:57] which can be a lot of test requests and make you think it's 100x ok and all's good :) [10:39:27] have we ever had a reproduction with miss,miss,miss? 
[10:40:21] (or pass,pass,pass for that matter, using a pass-only backend) [10:40:35] I don't think I've had one yet, no [10:40:38] https://config-master.wikimedia.org/ is explicit return-pass, so it always has 3x pass [10:40:58] oh ok, I was adding query params to get misses :) [10:41:40] well adding query params will get you 3x miss, but doesn't help much with finding a hit-miss-miss [10:42:03] yep I was doing that to trying a repro with 3x miss [10:44:09] for x in {1..1000}; do curl -s "https://query.wikidata.org/style.css?asdf=${x}" >x; ls -l x; done [10:44:18] (but pick your own unique 'asdf') [10:44:37] I don't think I've ever gotten a miss/miss/miss or pass/pass/pass repro, though [10:45:04] so it's likely only triggering on a miss/pass that's fetching a cache hit from beneath [10:46:13] in the style.css case, we know the origin just sends etag, but no expires/CC headers, which means it uses the default 120s TTL [10:46:45] for style.css I get: [10:46:46] X-Cache: cp1058 miss(0), cp3010 miss(0), cp3008 frontend hit(164) [10:47:01] why 164 then? Shouldn't it expire? [10:47:02] to help debug on that URL, could hack in some frontend-only VCL (for that req.http.host / req.url) that caps ttl down to like 30s [10:47:16] not if it's been hit 164 times in under 120s [10:47:31] ah, I forgot to sleep in the loop :) [10:47:46] if backends do it for 120 and frontends only for 30, we'll be able to get more hit-miss-miss reprods [10:47:47] and I got a repro on pass pass miss [10:47:57] what did pass/pass/miss? [10:47:59] < X-Cache: cp1045 miss(0), cp3010 pass(0), cp3008 frontend pass(0) [10:48:04] without Content-Length [10:48:08] what URLs? [10:48:13] https://query.wikidata.org/style.css?xxxxx [10:48:28] why would it ever be pass at all? this is what had me questioning X-Cache the other day [10:48:56] it clearly really was a pass, though [10:49:17] right after that one I got another repro with: [10:49:18] < X-Cache: cp1058 hit(1), cp3007 miss(0), cp3008 frontend pass(0) [10:49:23] because the BE->BE jump there is 3010->1058 on miss, but 3010->1045 on pass (pass is random) [10:49:57] yeah once you hit pass, pass picks random backends, much more likely to find intermediate miss/pass, more likely to repro, etc [10:50:05] the question is, what is causing those to pass at all? [10:53:12] so at a given layer (let's say cp3010 in the middle), we're pretty sure it did a hit at some point in the past (for the miss(0)->cp3008 hit) [10:53:35] then when cp3008 finally expired, it ask 3010 again, and 3010 did a pass to 1058 when we'd expect a miss... [10:53:49] there may have been a miss we didn't record first, but that miss generated a hit-for-pass [10:54:34] same story for cp3008 as a frontend: after the obect expires I always get a pass [10:54:38] the only explicit hit_for_pass we have is in text/upload -specific VCL. and with misc's removal of CL-logic, we don't have hfp/beresp.uncacheable in that code either, and not in shared [10:54:58] the pass will go away in 120s, should get another chance to become a real hit instead of an hfp [10:55:27] something in shared VCL must be doing it somehow, or in v4 builtin VCL? [10:56:08] perhaps! But why do we sometimes get hits and sometimes passes? 
[10:57:34] got a repro with 3x miss [10:57:47] < X-Cache: cp1045 miss(0), cp3008 miss(0), cp3008 frontend miss(0) [10:57:50] and no CL [10:58:06] the request immediately afterwards was good though [10:58:09] < Content-Length: 6304 [10:58:10] < X-Cache: cp1045 miss(0), cp3008 miss(0), cp3008 frontend hit(1) [10:58:14] for https://query.wikidata.org/style.css?xxxx-ema [10:58:39] I'm assuming no-CL == repro here [10:58:55] not necessarily at all [10:59:15] the response can be chunked, too. can only tell by whether final output is zero-length or 6304 [11:00:07] anyways, back on the pass-mystery: I just varnishlogged on cp3010-backend and cp3008-frontend while putting style.css reqs through [11:00:56] where cp3010 indicated pass on the way through... [11:01:25] - Debug "VSLP picked preferred backend 3 for key cf143d05" [11:01:28] - VCL_return hash [11:01:30] - VCL_call HASH [11:01:31] so the pass is due to hit-for-pass [11:01:34] - VCL_return lookup [11:01:36] - Debug "XXXX HIT-FOR-PASS" [11:02:19] and I've just caught 2x reqs going through cp3010 serially that were miss then pass [11:03:16] first one: [11:03:16] - VCL_return lookup [11:03:17] - VCL_call MISS [11:03:17] - ReqHeader X-CDIS: miss [11:03:17] - VCL_return fetch [11:03:19] - Link bereq 19859904 fetch [11:03:29] next one: [11:03:30] - VCL_call HASH [11:03:30] - VCL_return lookup [11:03:31] - Debug "XXXX HIT-FOR-PASS" [11:03:31] - HitPass 19859904 [11:03:33] - VCL_call PASS [11:03:35] - ReqHeader X-CDIS: pass [11:03:38] - VCL_return fetch [11:03:40] - Link bereq 9458032 pass [11:03:45] I assume HitPass 19859904 == Link bereq 19859904 fetch [11:03:55] which means the fetch in the miss above created the HFP object used in the next [11:05:02] the fetch from the eqiad backend (the fetch on miss, which probably created hfp), had: [11:05:05] - RespHeader Age: 226 [11:05:16] which would be a negative-TTL object if we're still dealing with 120s [11:05:33] right [11:05:40] I didn't catch the backend-side, I wonder if it tried to do an ETag/If-None-Match to get that? 
[11:06:15] (as in, I wonder if 3010->cp10xx was a 304-type response for the "hit") [11:06:25] on that note: https://varnish-cache.org/trac/ticket/1858 [11:06:28] even then, hmmm [11:06:33] which should be fixed already though [11:06:53] yeah there's another related thing though that's not fixed already [11:07:36] https://github.com/varnishcache/varnish-cache/commit/d828a042b3fc2c2b4f1fea83021f0d5508649e50 [11:07:43] is in 4.1 branch but no release yet [11:08:12] but regardless of that, I think the IMS issues are probably minor [11:08:32] the real problem here is in appropriate handling of negative TTLs in our VCL code, which worked before and doesn't work now [11:09:19] probably because we don't use the builtin vcl_foo tails, and while that's well-analyzed in the text/upload case, it's probably full of holes (where we don't adequately replace the builtins) in misc/maps [11:10:36] another related VCL nit: shared VCL in the wikimedia-common says this in vcl_backend_response: [11:10:39] /* Don't cache private, no-cache, no-store objects */ [11:10:42] if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") { [11:10:45] set beresp.ttl = 0s; [11:10:47] /* This should be translated into hit_for_pass later */ [11:10:50] } [11:11:08] only text/upload have the custom VCL to translate that into hit_for_pass later [11:11:21] found something perhaps interesting: [11:11:21] < X-Cache: cp1058 hit(4), cp3008 pass(0), cp3008 frontend pass(0) [11:11:21] SIZE: 6304 [11:11:30] < X-Cache: cp1058 hit(5), cp3008 miss(0), cp3008 frontend miss(0) [11:11:30] SIZE: 0 [11:11:45] the second request also had CL set, the first one didn't [11:11:46] I guess that explains why I can't repro on pass/pass/pass :) [11:12:16] how is it possible that a hit on cp1058 is returned with the right size first and the wrong one shortly later? [11:12:37] well we don't know that for sure, at the cp1058 level [11:12:56] probably cp1058 has the same output both times, but cp3008 handles it differently on miss-vs-pass [11:13:17] or, between those two requests the age passed into negative territory [11:13:45] I think we're really missing (anywhere - text/upload have it explicit, builtin VCL probably has it but we skip builtin, but misc has it nowhere...) [11:13:57] the standard if (ttl <= 0s) { hfp } [11:14:14] oh, which is in default vcl [11:14:29] ? 
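For reference, the standard "if (ttl <= 0s) { hfp }" being referred to lives in Varnish 4's builtin VCL rather than in the wikimedia-common code; paraphrased from the 4.x builtin (treat the exact condition list as approximate), it looks roughly like the sketch below. This is how a backend fetch that arrives with its TTL already gone can quietly become a 120s hit-for-pass object — if the builtin tail actually runs here, which is exactly the open question:

    sub vcl_backend_response {
        if (beresp.ttl <= 0s ||
            beresp.http.Set-Cookie ||
            beresp.http.Surrogate-Control ~ "no-store" ||
            (!beresp.http.Surrogate-Control &&
             beresp.http.Cache-Control ~ "no-cache|no-store|private") ||
            beresp.http.Vary == "*") {
            /* Mark as "hit-for-pass" for the next 2 minutes */
            set beresp.ttl = 120s;
            set beresp.uncacheable = true;
        }
        return (deliver);
    }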
[11:14:50] or conversely, we do have it from builtin perhaps, and we need to answer the question of why an Age:226 object made it out of a cache with 120s TTLs [11:15:24] hmmmm [11:15:39] ok so, this object has no expiry info we have to remember (no CC, no Expires) [11:15:47] it only has an etag to match [11:16:02] default TTL applies, which sets 120s object life in a single varnishd [11:16:30] I guess it's still psosible to get values up to 360s at the frontend [11:16:48] if you fetch a nearly expired object at each layet and give it a fresh 120s life in this layer [11:17:07] but also, the TTL is coming from default, not from Age:, so there should be no negative-TTL issues [11:19:04] ok wait, new candidate VCL bug idea [11:20:33] our cluster-shared vcl defines vcl_hit (for both layers) [11:21:17] all we really do there is set X-CDIS for X-Cache and call the cluster-specific hit function, which is empty on misc [11:21:33] we do explicit return in our cluster-shared vcl_hit though, bypassing builting [11:21:36] yes [11:21:49] the builtin for varnish4 is not simple [11:21:51] and we say that default VCL is just "return (deliver)" anyways [11:21:52] # sub vcl_hit { [11:21:52] # if (obj.ttl >= 0s) { [11:21:52] # // A pure unadultered hit, deliver it [11:21:52] # return (deliver); [11:21:52] # } [11:21:55] # if (obj.ttl + obj.grace > 0s) { [11:21:57] # // Object is in grace, deliver it [11:22:00] # // Automatically triggers a background fetch [11:22:02] # return (deliver); [11:22:05] # } [11:22:05] yes [11:22:07] # // fetch & deliver once we get the result [11:22:10] # return (miss); [11:22:12] # } [11:22:15] apparently VCL in varnish4 is what handles hits on expired objects [11:22:27] I don't think we should return! [11:22:48] probably [11:22:57] needs some thinking to be sure if we fix it in shared VCL though [11:23:54] the default vcl_hit's grace logic is simplistic [11:24:05] we probably want a better version anyways, basic bugs aside [11:24:26] our grace logic even in varnish3 was awful/missing anyways [11:24:56] https://info.varnish-software.com/blog/grace-varnish-4-stale-while-revalidate-semantics-varnish [11:25:02] mmh do we ever return(miss) then? [11:25:54] not from vcl_hit [11:26:09] I think the v4 flowchart can reach vcl_miss without going through hit though [11:26:33] the path through hit->miss is just for when the cache has an expired object that hasn't been completed erased from lookup indices/storage whatever [11:28:27] back on the general topic of grace: [11:29:01] we probably won't bother with the health-check logic, as for most backends we don't have health anyways [11:29:11] (well, we do for inter-cache, but not for applications usually) [11:30:31] all of that topic around TTLs is confusing when you consider that TTL != time-to-HTTP-object-expiry [11:33:58] I'm assuming a backend with no healthcheck is "healthy" in the vmod_std healthy() sense [11:34:26] anyways, that's all long-term problems [11:34:52] for now just to help debug on misc, we could just add to its cluster-specific vcl_hit: [11:34:56] should we try if the default vcl_hit changes anything on misc? 
[11:35:03] if (obj.ttl < 0s) { return (miss); } [11:35:14] it would do the same basic thing [11:35:16] it sounds good [11:35:26] I think it will clear up confusion about some things in our testing [11:35:34] I'm not sure it's really going to affect CL:0 [11:36:10] grace doesn't matter much for this scenario, for misc, for now [11:36:18] alright [11:36:20] but serving expired objects as hits might be very confusing [11:36:51] food is needed here, bbiab [11:37:17] ok [11:44:30] https://gerrit.wikimedia.org/r/288372 is salt-puppeting [11:45:06] oh while I'm at it, I'll merge the X-Cache fixup too heh [11:53:00] ok so with those patches in place, I've got an interesting result [11:53:13] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2288816 (10Jonas) Now I get content-length 0 for query.wikidata.org ``` jonkr@C134:~$ curl -v 'https://query.wikidata.org/'... [11:53:54] req #1: directly to cp3010 backend: [11:53:55] bblack@palladium:~$ curl -sv -H 'Host: query.wikidata.org' -H 'X-Forwarded-Proto: https' http://cp3010.esams.wmnet:3128/style.css >x [11:54:33] < Age: 9 [11:54:33] < Via: 1.1 varnish-v4 [11:54:33] < X-Cache: cp1045 hit+miss(0), cp3010 hit(1) [11:54:33] < Accept-Ranges: bytes [11:54:33] < Content-Length: 6304 [11:54:36] < Connection: keep-alive [11:54:38] < [11:54:41] { [data not shown] [11:54:43] * transfer closed with 6304 bytes remaining to read [11:55:14] I wasn't there for the initial miss, but either way this was a hit in cp3010 backend, and it gave CL:6304, but never sent the damn bytes and timed out [11:55:52] req #2: same basic query + result: [11:55:53] < Age: 35 [11:55:53] < Via: 1.1 varnish-v4 [11:55:53] < X-Cache: cp1045 hit+miss(0), cp3010 hit(2) [11:56:01] slightly older hit object, same lack of bytes [11:56:13] req #3: try cp3008: [11:56:14] bblack@palladium:~$ curl -sv -H 'Host: query.wikidata.org' -H 'X-Forwarded-Proto: https' http://cp3008.esams.wmnet/style.css >x [11:56:36] < X-Cache: cp1045 hit+miss(0), cp3010 hit(3), cp3008 frontend miss(0) [11:56:48] < Content-Length: 6304 [11:56:49] < Connection: keep-alive [11:56:57] and it did transfer all 6304 bytes correctly [11:57:25] req #4: back to cp3010-backend directly to re-verify: [11:57:25] < X-Cache: cp1045 hit+miss(0), cp3010 hit(4) [11:57:26] < Accept-Ranges: bytes [11:57:26] < Content-Length: 6304 [11:57:36] and still: [11:57:36] * transfer closed with 6304 bytes remaining to read [11:58:09] so apparently when cp3008 queries cp3010, it can get the object and pass it successfully to curl [11:58:18] but when curl hits cp3010 directly, the transfer fails [11:58:20] wtf? [11:59:19] direct fetch to cp1061 or cp1045 :3128 does work [12:08:13] on those ones with "transfer closed with 6304 bytes remaining to read", it's definitely varnish not curl that closes, 5 seconds later [12:08:18] I've watched it in a packet capture [12:09:39] bblack: I really like that you take the time to write your thoughts on IRC. It let us mere mortals learn a lot in the process! 
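Concretely, the debugging change that went out above amounts to something like the following in the misc cluster's hit handling (the sub name and surrounding structure here are illustrative, not the actual puppet template):

    sub cluster_misc_hit {
        // never serve an object whose TTL has already expired as a hit;
        // fall through to a real miss/fetch instead (grace is deliberately
        // ignored for now while debugging)
        if (obj.ttl < 0s) {
            return (miss);
        }
    }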
[12:10:20] again: direct to cp3010:3128, I get: < X-Cache: cp1061 hit+miss(0), cp3010 hit+miss(0) + failed to transfer w/ 5sec timeout [12:10:54] (also, that failed transfer had < Age: 0) [12:11:09] (which makes sense, it was a miss/miss) [12:11:31] but shortly after, I hit cp3008 frontend and got: [12:11:31] < Age: 116 [12:11:33] < X-Cache: cp1061 hit+miss(0), cp3010 hit(2), cp3008 frontend hit+miss(0) [12:11:37] and successful transfer [12:12:04] so analyzing just that pair of test requests: [12:12:44] the first one is a full miss, no valid cache contents in the way. cp1061 and cp3010 both pulled in the content and created cache objects, but cp3010 failed to deliver it outbound [12:13:03] the second one is a miss in 3008 that hits in 3010, and does get the content out of it and delivers it to me [12:13:50] if I go back to 3010:3128 right after, once again it still stalls [12:14:02] so apparently 3008 can fetch the object from 3010, but curl cannot [12:15:04] another interesting point: if I request from cp3010:3128 with --compressed, I get < Content-Length: 0 and no delay and an empty file [12:15:29] whereas without --compressed, I get CL:6304 and 5 seconds of radio silence with no bytes output, then varnish closes connection [12:19:04] either way it's bad, but I guess this explains how we get two different kinds of failed outputs [12:20:16] tracing a 3010:3128 fetch which results in failed transfer + < X-Cache: cp1061 hit+miss(0), cp3010 hit+miss(0) [12:20:52] varnishlog on 3010 shows the expected hit->miss->fetch cycle, and then fetches from the backend with AE:gzip, and backend gzips for us [12:21:10] - RespHeader Content-Encoding: gzip [12:21:18] - RespHeader Via: 1.1 varnish-v4 [12:21:18] - RespHeader X-Cache: cp1061 hit+miss(0) [12:21:19] - RespHeader Content-Length: 1829 [12:21:24] - RespHeader Age: 0 [12:21:51] those are the initial respheader in the 3010 client-facing varnishlog, then: [12:21:54] - VCL_call DELIVER [12:22:08] - RespUnset X-Cache: cp1061 hit+miss(0) [12:22:08] - RespHeader X-Cache: cp1061 hit+miss(0), cp3010 hit+miss(0) [12:22:08] - VCL_return deliver [12:22:08] - Timestamp Process: 1463055557.951748 0.084348 0.000031 [12:22:08] - RespUnset Content-Encoding: gzip [12:22:11] - RespHeader Accept-Ranges: bytes [12:22:13] - RespUnset Content-Length: 1829 [12:22:16] - RespHeader Content-Length: 6304 [12:22:18] - Debug "RES_MODE 42" [12:22:21] - RespHeader Connection: keep-alive [12:22:23] - Gzip U D - 0 0 0 0 0 [12:22:26] - Timestamp Resp: 1463055557.951792 0.084392 0.000044 [12:22:28] - ReqAcct 205 0 205 427 0 427 [12:22:57] so.... 
it found an 1829 byte gzipped object (ok), it gunzipped it for the client at deliver-time (ok), rewrite content-length to 6304 (ok), the the Gzip line says gzip generated zero bytes of output (wtf) [12:24:02] if I do the same right after with --compressed: [12:24:04] - VCL_call DELIVER [12:24:04] - RespUnset X-Cache: cp1061 hit(1) [12:24:04] - RespHeader X-Cache: cp1061 hit(1), cp3010 hit+miss(0) [12:24:04] - VCL_return deliver [12:24:06] - Timestamp Process: 1463055807.020175 0.083673 0.000057 [12:24:09] - RespHeader Accept-Ranges: bytes [12:24:11] - RespUnset Content-Length: 1829 [12:24:13] - RespHeader Content-Length: 0 [12:24:16] - Debug "RES_MODE 2" [12:24:18] - RespHeader Connection: keep-alive [12:24:21] - Timestamp Resp: 1463055807.020196 0.083693 0.000021 [12:24:23] - ReqAcct 237 0 237 454 0 454 [12:24:26] there's no gzip transform at deliver time, yet we see it rewrite content-length to zero [12:25:59] trying again through cp3008 frontend while still tracing cp3010:3128 [12:26:18] it's a sucess again, but the 3008->3010 req gave a 304 not modified, it didn't have to transfer any bytes anyways. [12:27:29] because cp3008 had an expired object laying around, so it asked: [12:27:29] - ReqHeader If-Modified-Since: Sat, 07 May 2016 20:29:34 GMT [12:27:30] - ReqHeader If-None-Match: W/"572e502e-18a0" [12:29:38] ok so that explains why hit+miss tends to work better than a pure miss [12:29:57] and how it's working around 3010's lack of output. it doesn't need output, it gets a 304 [12:30:55] so I wiped cp3008's frontend (varnish-frontend restart), fetch from 3008 again while tracing 3010:3128's connection to its own backend [12:31:17] curl fails with CL:0, and it's < X-Cache: cp1061 hit+miss(0), cp3010 hit+miss(0), cp3008 frontend miss(0) [12:31:38] 3010's request to cp1061 had IMS/INM and got a 304 response [12:33:45] sooooo.... [12:34:35] I can't explain the gzip connection yet, there may not really be one other than gzip and ungzipped outputs are going to look different but both be broken [12:34:49] but it seems a lot like 304's are the problem here [12:36:50] client (no IMS/INM) asks varnishd for x -> varnishd asks other varnishd for x using IMS/INM headers based on (possibly expired) object -> gets 304 response which naturally has no content-length, which means it can use its cached content -> returns zero bytes to client (either because the object the storage should come from is expire, or more likely just passing through the zero-byte-ness of the 3 [12:36:56] 04 response) [12:41:35] this from 4.1 is possibly-related: https://github.com/varnishcache/varnish-cache/commit/d828a042b3fc2c2b4f1fea83021f0d5508649e50 [12:41:46] (from 4.1 branch I mean, after our version) [12:41:52] ok [12:42:11] perhaps the affected headers include Content-Length somehow [12:42:28] so the object gets created properly but varnish returns a 0-length response [12:42:38] well, not quite [12:42:50] what I'm seeing most-recently is more like this: [12:43:50] when varnish has an expired object in storage and has to fetch from backend (another varnish), it sends IMS/INM headers based on its expired object (ok), the other varnish replies with 304 Not Modified (ok), and then the requesting varnish replies to the client with 0-length 200 OK response [12:44:05] I don't think it's creating an object, I think it's re-using the expired one on 304 [12:44:26] which is fine if it will send the contents too :P [12:45:03] does this only happen between varnishes or also with backends? 
[12:45:09] probably the 0-length of the 304 from upstream is causing the 0-length when returning 200 OK to the client [12:45:29] I think only between varnishes, because when backend-most in eqiad also has an expired object, it still gives me full output [12:45:44] but maybe that's because the applayer backend doesn't do 304s in this case? I haven't checked. [12:47:08] somehow then it gets confused and returns a 0-length response because the 304 is 0-length, instead of returning the cached item [12:47:17] ? [12:47:30] yeah, I think [12:47:37] or something related is happening anyways [12:47:43] which gets back to looking at 304/IMS-ish patches [12:47:58] the expire object it's trying to reuse has a stored length [12:48:05] the 304 has a different (zero) length [12:48:08] https://varnish-cache.org/trac/ticket/1858 and https://github.com/varnishcache/varnish-cache/commit/d828a042b3fc2c2b4f1fea83021f0d5508649e50 [12:48:21] the patch linked about is about duplicate headers on 304/IMS stuff... [12:48:25] and how they're merged [12:48:34] yes, that one :) [12:49:35] I'm still trying to confirm varnish<->varnish vs varnish<->app heh [12:52:41] confirmed: for style.css test queries, even when handed a valid IMS/INM, applayer nginx still sends 200 OK [12:52:59] so it's varnish<->varnish [12:53:05] even though the ETag on the fresh 200 matches the ETag asked for in INM [12:53:17] well it's varnish<->varnish because varnish is supported 304 on INM correctly [12:53:30] nginx is ignoring its opportunity to respond with 304 and giving a pointless full 200 [12:53:49] which is not good but might be helpful to pinpoint the issue :) [12:53:58] yeah [12:54:07] and explains why I never see bad output from the backend-most varnish [12:54:17] only from the intermediate one and/or frontend [12:54:47] I wonder if there's something dumb in our shared VCL that can affect the 304 case, too [12:55:49] there is a beresp.was_304 VCL variable which could perhaps be useful for logging/debugging if needs be [12:55:52] https://github.com/varnishcache/varnish-cache/commit/c17c701b6ecf89a331d1c495991f6e2997d31a12 [12:56:00] FTR: our shared VCL does unconditionally: set beresp.grace = 60m; [12:56:27] TTL < 0 objects would never get purged before grace (and I assume they're only purged when convenient/necessary after grace) [12:56:40] so that's why expired objects keep hitting vcl_hit for an hour afterwards [12:58:07] there's nothing obviously-stupid in our shared VCL for 304s in varnish4, i don't think [12:58:19] I say try the patch and see what happens [12:59:28] Proper handling of duplicate headers on IMS headers merge? [13:01:06] yeah [13:01:31] the commitmsg on that patch seems to have logical consistency issues, I'm not even sure which direction they fixed things, but "sounds related" [13:01:44] " [13:01:45] This fixes a problem when a backend replies with 304 Not Modified [13:01:46] (after a http conditional request from varnish) and does not supply a [13:01:46] header that was duplicate in the cached object. [13:01:46] Before this patch, varnish would only supply (by copying from the [13:01:48] expired object) the first instance of a duplicate header. [13:01:50] " [13:02:30] it says it fixed a problem that happens when the 304 does not supply a copy of a header that was in the object. Then the next sentence says that before this patch, varnish would only supply the the object's version of a duplicated header in the 304... 
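The beresp.was_304 idea mentioned a little earlier could be as small as the following, assuming the running 4.1 build actually exposes the variable (the linked commit adds it) and that a response header is an acceptable place to surface it:

    sub vcl_backend_response {
        // tag responses that were refreshed via a backend 304, so truncated
        // client responses can be correlated with conditional fetches
        if (beresp.was_304) {
            set beresp.http.X-Was-304 = "true";
        }
    }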
[13:03:18] but either way, I could totally see this impacting obj.http.Content-Length: 6304 vs beresp.http.Contenth-Length: 0 (or missing) [13:03:43] (beresp being the 304) [13:04:54] switching gears a bit back to general object TTL issues. So varnish4, in addition to the grace timer, also has a keep timer [13:05:15] the keep timer is used for attempting 304s [13:05:37] so you can do things like obj.ttl = 300, obj.grace = 600, obj.keep = 9999999999 [13:06:23] when the object's inside normal TTL, it can be served without any check. within grace, you can serve it stale while busy refreshing it (or failing to refresh it due to dead backend). [13:06:59] and while within keep, it can't normally be served, but it can still be used for an IMS/INM check against a backend, hoping for a 304 to refresh its TTLs instead of a full 200. [13:07:40] you'd think on the surface of things, it would always be good to set keep extremely-large even in the fact of shorter TTL/grace policies. [13:08:12] but that assumes your applayer backends do ETag and IMS stuff sanely. a long keep time means we can't recover from applayer sending duplicate ETag for different content, etc.... [13:08:23] (which we could totally do with a code bug in PHP or whatever) [13:08:48] or applayer updating content but not updating the timestamp for IMS [13:10:39] my initial thoughts on grace (wrt https://phabricator.wikimedia.org/T124954 + Varnish 4) would be to go for something like: cap obj.ttl at 1d, set obj.grace at 7 or 14 days to cover operational scenarios. [13:12:02] but even when backends never fail, DCs are never offline, etc.... you can still have a very cold URL that usually only gets hit once every 3 days. And it would get served 3-days-stale all the time due to grace, even though you're thinking 1d is the TTL outside of exceptional circumstances. [13:12:31] so without some concept of healthcheck, there's no good answer to that problem [13:13:40] (and we don't probe applayer LVS backends for health) [13:14:22] (because doing so is problematic: there's only one host, but LVS round-robins a large count of machines. we don't want it to fail the whole backend because of transient issues on one or even a few) [13:15:52] with a healthcheck we can do logic like https://info.varnish-software.com/blog/grace-varnish-4-stale-while-revalidate-semantics-varnish and use a short grace when healthy and a long one when unhealthy [13:18:37] Surrogate-Control and Cache-Control both are capable specifying grace-time too [13:18:54] (but not keep time) [13:19:50] bblack: let's see https://gerrit.wikimedia.org/r/#/c/288387/ [13:21:59] building on copper [13:22:24] I think for us, we probably want exceptional-case-grace in operations control rather than app-header control [13:22:48] probably ideally we do something like this: [13:23:12] 1) app response has CC/expires, which sets obj.ttl through normal varnish cache logic [13:23:56] 2) in vcl_backend_response, we actually cut obj.ttl in half, and set obj.grace = 7d [13:25:24] 3) on vcl_hit, we do stale-while-revalidate if we're still within the doubled TTL (the original applayer TTL), and for unhealthy backends we use the full grace [13:25:42] I guess we'd have to save the initial obj.ttl setting in an object header somewhere, to use as a dynamic short-grace, too [13:26:48] basically we'd rather use the whole second half of the applayer TTL to cause a refresh-on-stale-fetch, not just a 10s window. should cut down a lot on users actually pausing to wait on a missfetch. 
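A rough sketch of that scheme, following the shape of the varnish-software grace post linked earlier — the fixed 120s stale window below stands in for "the second half of the applayer TTL" (the TTL-halving arithmetic and where to stash the original TTL are left out), and std.healthy() is only meaningful where a probe actually exists:

    import std;

    sub vcl_backend_response {
        // 2) long grace to cover operational emergencies; the "cut obj.ttl
        //    in half and remember the original" step is elided here
        set beresp.grace = 7d;
    }

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            // still within the (shortened) TTL: a plain hit
            return (deliver);
        }
        if (std.healthy(req.backend_hint)) {
            // 3) healthy backend: serve stale only within a short window
            //    past expiry, which also kicks off a background refresh
            if (obj.ttl + 120s > 0s) {
                return (deliver);
            }
            return (miss);
        }
        // unhealthy backend: fall back to the full grace window
        if (obj.ttl + obj.grace > 0s) {
            return (deliver);
        }
        return (miss);
    }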
[13:27:43] and we're still honoring applayer TTL regardless of grace. we don't have to tell them "oh we set grace to 5 minutes always, therefore your advertised cache TTL is actually +5m in practice) [13:28:02] the app-advertising TTL is always obeyed as a max limit, unless backends are down [13:29:56] come on copper! Slow fella [13:32:11] OK, 4.1.2-1wm4 available [13:32:40] bblack: should we upgrade all misc or do you want to focus on certain hosts/DCs? [13:32:56] I'd just upgrade them all [13:33:01] and wipe caches again [13:33:05] and then we can start testing again [13:33:09] alright! [13:44:07] upgrade+puppet done, I'm gonna restart the backends with -b 1 [13:48:38] eqiad done, now codfw [13:50:39] what is wrong with salt, also superslow [13:51:21] codfw done, on ulsfo+esams now [13:53:57] restarting the frontends with -b 2 [13:55:06] bblack: done [13:57:11] hey curious, yall are currently messing with misc, right? [13:57:27] yep [13:57:52] have been seeing fairly consistent loss on misc webrequest logs for the past day, not every hour, but fairly frequent [13:58:01] small loss being about 1% or less [13:58:37] also on maps, are you doing things with maps? [13:58:53] ottomata: we're not, but both misc and maps run varnish 4 [13:58:59] hm [14:00:02] hm, how long have they been on varnish4? [14:01:32] just this week, since sometime tuesday, it wasn't all at once [14:02:04] misc, maps has been v4 for a while now [14:02:11] ema hm, how long for maps? [14:02:16] but it was only two hosts at first [14:02:22] (am looking back farther to correlate [14:02:22] oh [14:02:23] hm [14:02:25] which 2 hosts? [14:02:32] don't pay attention to anything going wrong with misc this week, there's too many restarts and abusive debugging going on, etc [14:02:32] cp1043 and cp1044 IIRC [14:02:37] hehe, bblack, ok [14:02:50] ema: yep those were the ones [14:02:53] yeah maps has been on varnish4 for a long time [14:03:11] May 12 15:59:43 < Content-Length: 6304 [14:03:11] May 12 15:59:43 < X-Cache: cp1045 miss(0), cp3008 miss(0), cp3008 frontend hit(142) [14:03:11] cp1043 and 1044 were the only maps caches, until a couple of weeks ago, when they were replaced by 16 others [14:03:14] May 12 15:59:43 SIZE: 6304 [14:03:16] May 12 15:59:43 --- [14:03:19] May 12 15:59:44 < Content-Length: 6304 [14:03:21] May 12 15:59:44 < X-Cache: cp1045 hit+miss(0), cp3008 hit+miss(0), cp3008 frontend hit+miss(0) [14:03:24] May 12 15:59:44 SIZE: 6304 [14:03:27] May 12 15:59:44 --- [14:03:29] May 12 15:59:45 < Content-Length: 6304 [14:03:32] May 12 15:59:45 < X-Cache: cp1045 hit+miss(0), cp3008 hit+miss(0), cp3008 frontend hit(1) [14:03:35] May 12 15:59:45 SIZE: 6304 [14:03:37] can't repro so far [14:03:45] that looks promising [14:04:19] hmm, ok, i'm not going to delve into this for now, since you are moving lots of stuff around, but we should keep our ey on this [14:04:30] in my high level look at this right now [14:04:42] ottomata: we upgraded maps to v4 on April the 4th https://phabricator.wikimedia.org/T122880#2177437 (the two lonely systems) [14:04:49] i'm looking at all of may for percent loss [14:04:51] hourly [14:05:04] maps has the same pattern that misc does, but misc doesn't start til tuesday [14:05:31] ottomata: what's "percent loss" in this context anyways? [14:05:42] its calculated using the vk sequence number [14:05:43] so [14:05:55] max_seq - min_seq should == count(*) group by host [14:05:56] in an hour [14:06:00] so what you're saying is there used to be no gaps in vk seqnos, and now there is? 
[14:06:16] that may not be loss at all, may just be a change in behavior. I really don't know. [14:06:19] gehel: could you try to reproduce the issue again please? [14:06:20] possible [14:07:00] ottomata: how can I check the data that you are comparing? [14:07:32] elukey it sin hive [14:07:34] two tables [14:07:36] wmf_raw db [14:07:43] webrequest_sequence_stats [14:07:43] and [14:07:45] webrequest_sequence_stats_hourly [14:07:51] ahhh okok makes sense [14:07:52] webrequest_sequence_stats has it per host [14:08:02] hourly is aggregated across all hosts [14:08:40] I'm running some specific tests on the 304 issue too [14:10:56] ema: this works, it fixed the 304 thing [14:11:10] it looks like [14:11:22] I made the intermediate request from the backendmost, and backendmost returned 304, and intermediate re-used its object and did not screw up output length [14:11:29] \o/ [14:12:19] I'd say let's leave things exactly as they are for right now, and get more parties to re-verify that they can't reproduce it [14:12:30] sounds good [14:12:39] if this looks good, we can step back through generalizing the VCL solutions, trying to re-apply the CL-specific stuff, etc [14:14:14] the duplicate Age headers are also gone, it seems [14:15:23] on the CSS file at least [14:15:55] ema: elukey, looking a little more, i see few results for loss before april 4th, then between april 4th and the 26h, i see cp1043 and 1044 with what the looks like the same pattern of more frequent loss, then on april 26th other hosts start joining the pattern of loss [14:16:45] ema: test in progress... [14:17:24] ottomata: who joined the party on the 26th? [14:17:38] i see [14:18:54] cp1046.eqiad.wmnet, cp1047.eqiad.wmnet, cp1059.eqiad.wmnet, cp1060.eqiad.wmnet, cp3003.esams.wmnet, cp3004.esams.wmnet, cp3005.esams.wmnet, cp3006.esams.wmnet, cp2009.codfw.wmnet, cp4019.ulsfo.wmnet, cp4020.ulsfo.wmnet [14:19:06] that's hosts that showed loss on the 26th-29th [14:19:20] they also seem to be maps machines [14:19:27] oh i'm just looking at maps [14:19:28] right now [14:19:35] :) [14:19:43] since i was trying to correlate vk4 with loss [14:19:56] seeing if the loss matches up for when you upgraded maps machines [14:19:59] seems like it does :/ [14:20:31] ema: of course, it makes sense now. when we saw duplicate Age:, that was an indicator that the backend fetch in question was a 304... [14:20:44] and the IMS/304 header-merge fix fixed both that and CL issues [14:21:02] (want to note when I say 'loss' i mean reported loss, this could be some other weird issue) [14:21:34] what are vk's seqnos based on? [14:21:37] if vk sequences are off, this metric will be funky [14:21:53] will have to look at code, afaik its just an internal incrementing counter for each message it logs [14:22:20] yep this is what I remember [14:22:52] starts from 0 and the increments itself during the life of vk [14:23:15] ja, eatch time it calls render_match it increments sequence_number [14:23:24] so it should only happen for each message it sends [14:23:32] on phabricator we still get multiple Age headers [14:24:32] hmmm [14:24:58] ok, yeah [14:25:07] and earlier we had 3xAge on pass-only too, which can't be 304s [14:25:31] e.g. 
https://config-master.wikimedia.org/ has 3xAge: 0 on pass,pass,pass [14:25:51] I guess IMS header merge fix just incidentally merges those up in the 304 case [14:27:43] yes HTTP_Merge is called on if (http_IsStatus(bo->beresp, 304)) [14:28:49] I see varnish4 didn't bring about any change in varnish code conventions re: clarity :) [14:28:55] :) [14:29:02] so the issue was duplicate headers on 304 [14:29:47] yeah [14:29:50] well [14:30:07] if we're being really really pedantic: [14:30:38] 1) There was some 304resp vs staleobject duplicate/missing headers issue for someone, which that patch fixed by doing more-proper header merging in that case [14:31:09] 2) We had an issue where varnish<->varnish 304 on the backend fetched caused missing body content on the client response, in at least some cases [14:31:40] 3) The fix probably fixed it because there was a duplicate headers issue on the 304/object merge to do with content length and/or encoding [14:31:56] but it may just be hiding the issue (and maybe perfectly effectively) [14:32:06] that might not be the right real fix in the right place in the code [14:32:35] it's also possible this is an issue with varnish's header filter lists [14:33:06] in include/tbl/http_headers.h , there's a table like: [14:33:07] H("Cache-Control", H_Cache_Control, F ) // 2616 14.9 [14:33:07] H("Connection", H_Connection, P|F|I) // 2616 14.10 [14:33:07] H("Content-Encoding", H_Content_Encoding, 0 ) // 2616 14.11 [14:33:09] etc... [14:33:29] which filters which headers get preserved or filtered out in various scenarios: backend fetches, passes, responses, etc... [14:34:08] it could be that some header related to encoding/length is not correct in that table, or that code using that table isn't using it right, etc [14:34:33] and some encoding/length header was making it through a filter when it shouldn't, and then the patch made HTTP_Merge paper over the problem accidentally [14:35:08] ema, bblack: just one error on 100 GET: https://phabricator.wikimedia.org/P3054 [14:35:36] :( [14:35:47] hmmmm [14:35:55] Age: 120 is right on the TTL boundary too [14:35:56] yeah, even a single error is not really a good news [14:36:14] I'm assuming this was style.css? [14:36:23] yep [14:36:36] yes [14:37:00] it's a miss->200 rather than a hit+miss->304, and it hit an object right on the expiration boundary [14:37:43] I now have a log more errors. I might be hitting a cache of the bad one... [14:37:45] I'm pretty sure the current operational setup of grace in this case is ok too [14:37:49] gehel: probably [14:38:28] we should be getting grace=1h, and even if we checked grace in vcl_hit, we'd still be returning regular deliver for a while after initial ttl expiry, which is the same as we did just now for that 120s object [14:38:56] still, it has to be a hint that the first failure was on a fetch of a 120s object from the backend-most where ttl would run out at age=120 [14:39:48] it seems that I consistently have a failure now... [14:39:55] from the same cache hit? [14:40:25] yes, always cp3009, so make sense... [14:40:30] X-Cache reads right to left. the first one that says just "hit" (as opposed to e.g. hit+miss) is where the object came from [14:44:57] extracting just Age and X-Cache, seems the cache was refreshed, but still with no data [14:44:58] https://phabricator.wikimedia.org/P3055 [14:45:41] in that paste, which of those are fail/success? 
[14:45:48] only fails [14:46:27] so after your first failure, it kept failing up through the expiry near age=120, then refreshed everything and still refreshed it bad [14:46:38] as far as I can see, yes [14:47:16] cp3010 has an expired cache object now, and it's serving it with zero bytes [14:47:21] (confirmed with a direct fetch) [14:47:58] the underlying eqiad cache is still serving it correctly [14:49:20] even on hit+miss(0), backend-most gets it right [14:50:22] only cp3010 gets it wrong anymore, and that's because it has that expired stale object [14:50:40] it keeps doing 304 checks to backends and confirming them and sending out bad objects [14:51:21] I think this is a distinct sub-case and we still fixed it for some cases [14:51:51] the one bad response was cached into cp3010 at Age=120 in https://phabricator.wikimedia.org/P3054 [14:52:01] and now it's stuck there and getting re-used via 304 [15:12:48] I'm presently out of bright ideas [15:13:19] I think the 304/IMS patch did paper over some cases, but as said earlier probably isn't the "right" fix, and I guess doesn't even cover all cases. [15:14:00] it's interesting we havne't heard of this affecting maps [15:14:08] (which has been on multi-tier v4 longer) [15:14:20] you'd think people would at least occasionally complain of CL:0 tiles being served [15:14:48] I've looked at the varnish->app requests for that, and they do return ETag/LM, so 304s are applicable there too [15:15:52] maps VCL is extremely minimal. other than our cluster-shared VCL, all it does is set its applayer backend, and 403s on some referrers in frontend recv [15:16:48] misc VCL is similarly minimal on things that could affect all of this, though [15:24:33] https://phabricator.wikimedia.org/T135121 [15:24:40] ^ complaints about stats.wm.o [15:25:14] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2289372 (10BBlack) [15:27:22] merging into the main ticket [15:27:22] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10BBlack) In the merged ticket above, it's browser access to status.wm.o, and the browser's getting a 304 Not Modifie... [15:28:04] ema: if one of us doesn't come up with a better bright idea soon, we'll have to revert v4 on cache_misc till we figure out some better plan [15:30:24] mmh [15:32:05] if you want to take a real longshot stab.... [15:32:36] rebuild with varnish upstream master-branch (post-4.1 feature-work). it looks like they refactored a lot of related things. they may have incidentally fixed some things too :) [15:33:22] it seems very strange to me that nobody but us would notice 4.1 breaks inter-cache 304. varnish trying for and serving 304s is common, and multi-layer varnish isn't all that uncommon either [15:33:58] that's what I was thinking as well, perhaps there is something funny in our VCL that we haven't identified yet? 
[15:34:00] and again, maps doesn't seem broken [15:34:48] I'm afraid packaging master might break more stuff than it fixes :) [15:34:51] but then again maps' req/resp scenarios are more-limited, and tbh nobody might notice minor tile breakage there [15:35:20] I'm sure it would, but if we end up reverting v4 for being broken anyways, may as well take a moonshot and see if it works better than 4.1 :) [15:35:54] that idea aside, we could try cutting out all unecessary fat from our shared VCL config [15:36:01] just for misc, I mean [15:36:22] I'm not sure what a good way to do that is, other than manually-ish [15:37:35] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10hashar) [15:38:14] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10hashar) {T135086} does not have much details beside the layout of 15.wikipedia.org being broken. [15:39:16] ema: here's another thing you could try for debugging: [15:40:06] on some other random machine, set up 2x varnishd with just malloc caches. give them minimal VCL to backend to wdqs1001.eqiad.wmnet for all requests and do basically nothing non-default. and pass test requests through 2x varnish to wdqs for style.css, and see if you can repro [15:40:21] could steal cp1008 for something that's already in the prod network [15:40:34] put the test instances on alternate ports [15:40:57] (so we don't spam wdqs with pinkunicorn reqs) [15:41:23] oh cp1008 doesn't have v4 packages though [15:41:33] better idea: [15:41:39] steal one of the eqiad misc caches [15:41:41] depool it [15:41:49] stop confd on it with puppet disabled [15:42:01] hack directors.vcl so its frontend uses its own backend only [15:42:07] hack VCL to remove all cruft [15:42:13] and pass test reqs through two layers there [15:42:15] I like it [15:42:55] I've quickly tried to build master and we would have to forward-port/remove some of the patches (at least 0003-varnishd-nukelru.patch, perhaps others as well) [15:43:17] I'm really not sure the nukelru patch is all that useful anyways [15:44:13] the others are persistent storage mostly, hopefully it didn't change much since it's deprecated [15:44:20] heh there's something else to try too: [15:44:51] does the bug go away if we s/persistent/file/ ? maybe varnish 4.x is starting to break assumptions with that deprecated storage engine's code and they're not maintaining it, and cache objects through it are now broken in subtle ways [15:45:09] oh shit [15:45:26] ? [15:46:05] * ema was thinking of the subtle issues with broken persistent storage [15:46:12] yeah [15:46:24] it would be difficult to make that change right now via puppet without some heavy refactoring [15:46:40] but manually with puppet disabled, probably not hard at all with some systemd unit edits [15:46:53] how about disabling puppet in, say, eqiad and esams and applying the change by hand? [15:47:15] could, sure [15:47:34] basically replace: [15:47:35] -s main1=deprecated_persistent,/srv/sda3/varnish.main1,360G,0x500000000000 -s main2=deprecated_persistent,/srv/sdb3/varnish.main2,360G,0x540000000000 \ [15:47:42] with -s main1=file,.... 
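A minimal sketch of the stripped-down two-instance test VCL suggested a few lines up — two local varnishd instances with malloc storage, the backend-layer one pointing straight at wdqs and the frontend-layer one pointing at it; hostnames and ports here are assumptions:

    vcl 4.0;

    # backend-layer test instance: fetch straight from the wdqs app server
    backend wdqs {
        .host = "wdqs1001.eqiad.wmnet";
        .port = "80";                     # assumed applayer port
    }

    # the frontend-layer test instance would instead use something like:
    #   backend local_be { .host = "127.0.0.1"; .port = "3128"; }

    sub vcl_recv {
        set req.backend_hint = wdqs;
        # deliberately nothing else: everything falls through to the builtin
        # VCL, so a repro here would rule our shared VCL out
    }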
[15:48:14] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2289479 (10Kghbln) >>! In T134989#2289372, @BBlack wrote: > due to missing character encoding supposedly, but it's entirely li... [15:48:17] let's try [15:48:41] also noted while doing a related google search [15:48:56] apparently after they deprecated the old open source persistent engine, they've added a new commercial one :P [15:48:59] https://www.varnish-software.com/plus/massive-storage-engine [15:49:11] the "massive" storage engine, which can do big things, and optionally persists [15:49:36] yep they were talking about it at the summit [15:50:34] more open-core bullshit, IMHO [15:50:49] open-source is a toy/demo product, real features for real sites must be paid for :P [15:52:13] they should do what gwicke said: AGPL it instead of closing it up. Let profiteers buy alternate licensing + support in a package. [15:52:34] except of course they don't use GPL-like licensing [15:53:45] anyways, I'm taking a break for a few, will check back in a bit and see what happens! [15:54:01] OK, I'll test s/deprecated_persistent/file/ [15:54:08] ok [15:54:31] if that doesn't help, try gutting large swaths of the shared VCL on a depooled backend->self machine in ieqad [15:54:34] *eqiad [16:19:14] gehel: time for me to bug you again! [16:19:35] just deploying one last patch... brb [16:19:53] we're now testing a different storage engine in eqiad+esams [16:20:11] sure, no hurries [16:22:16] ema: so, want me to start my test script again? [16:22:21] yes please [16:22:36] assuming you always go through esams, (cp3*) which should happen if you're in EU [16:22:48] most probably... [16:23:12] so yeah please fire your test up [16:23:43] oh, got a fail right now [16:24:14] no error on my side... [16:25:00] for the record, I got: [16:25:01] May 12 18:23:20 < X-Cache: cp1061 hit(2), cp3010 hit(40), cp3010 frontend pass(0) [16:25:05] May 12 18:23:20 SIZE: 6304 [16:25:07] May 12 18:23:20 --- [16:25:10] May 12 18:23:21 < Content-Length: 6304 [16:25:12] May 12 18:23:21 < X-Cache: cp1061 hit(2), cp3010 hit(41), cp3010 frontend miss(0) [16:25:15] May 12 18:23:21 SIZE: 0 [16:25:43] with SIZE coming from curl -w 'SIZE: %{size_download}' [16:29:28] 100 requests, no errors. I'll start another 100 requests, just for fun... [16:34:16] 07HTTPS, 10Traffic, 06Operations: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#366296 (10faidon) Option (3) sounds like the easiest way forward to me and an acceptable option. My only concern would be whether it could handle a surge of traffic (the kind of traffic it'd... [16:34:20] perhaps the error I got was unrelated, it's looking good now [16:34:47] ema: I still can't reproduce any error [16:35:04] moderate optimism [16:35:18] 200 requests, no error... [16:35:28] hm, ema, were cp1043 and cp1044 the only maps hosts at all before april 4? [16:35:35] I'll keep running that test every now and then... [16:35:39] gehel: thanks! [16:35:53] * gehel is crossing fingers... 
[16:36:55] ottomata: yes [16:43:35] for the time being I haven't been able to reproduce in esams+eqiad (using file storage) [16:43:49] however I got a repro in ulsfo relatively quickly [16:44:25] ulsfo repro FTR [16:44:26] May 12 18:42:19 --- [16:44:26] May 12 18:42:21 < X-Cache: cp1045 hit+miss(0), cp2006 pass(0), cp4001 hit(22), cp4001 frontend pass(0) [16:44:29] May 12 18:42:21 SIZE: 6304 [16:44:32] May 12 18:42:21 --- [16:44:34] May 12 18:42:23 < Content-Length: 0 [16:44:37] May 12 18:42:23 < X-Cache: cp1061 hit(4), cp2006 miss(0), cp4004 miss(0), cp4001 frontend miss(0) [16:44:40] May 12 18:42:23 SIZE: 0 [16:44:53] ema: elukey, was gonna make a task about this loss, so i started diving deeper into historical sequence stats for maps [16:45:08] i think this is not vk4 related, it looks like maps just has a small amount of loss going back before vk4 [16:45:43] for misc, the loss is only showing up over the last day or so, which i'm going to assume is related to the work yall are doing now [16:45:58] sorry [16:46:02] dunno why i'm typing vk4 [16:46:05] but you know what I mean :) [16:46:08] yep! [16:46:23] ottomata: ok!! [16:46:42] yeah we are not treating misc well this week [16:50:57] bblack: gotta leave soon, file storage is behaving well in esams+eqiad [16:51:40] if you want to try it on other DCs, in /lib/systemd/system/varnish-frontend.service: main1=deprecated_persistent,/srv/sda3/varnish.main1,220G,0x500000000000 -> main1=file,/srv/sda3/varnish.main1,220G [16:52:13] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447#2289769 (10Dzahn) [16:52:24] then systemctl daemon-reload and the usual backend restart/flush [17:08:31] ema: ok, cya later [17:08:53] I was about to ask if you wiped caches on switch, but I guess switching to file wipes itself [17:09:05] but did frontends get wiped post-switch? [17:09:42] looks like it based on varnishd start time [18:23:18] so I did move the rest to file storage [18:23:24] and restarted appropriate frontends to wipe them [18:27:32] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2290254 (10BBlack) So we're currently have several experiments in play trying to figure this out: 1. We've got 2x upstream bu... [18:28:10] ^ ticket updated, need to find proof we're still broken again [18:54:22] I've been torture-testing style.css with loops of numeric query params, through backends, through esams frontends, through ulsfo frontends (which is worst-case, there's more cache layers) [18:54:32] and loop though them over the 120s expiry boundary, etc [18:54:41] all kinds of combinations of hit/miss/hit+miss/etc [18:54:51] everything is SIZE:6304 so far [18:56:10] which means, if nobody can reproduce now, persistent storage bugs is the issue [18:56:46] if that holds for a while, we could next try putting back the CL-sensitive VCL. It should've made the old problems easier to hit if anything, and would confirm that they're not part of it [18:57:17] going to go spelunking in the persistent and file storage backend commit logs... 
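Spelled out, the manual persistent-to-file switch described above looks roughly like this (done with puppet disabled on the host; the unit path and flag values are the ones quoted in the backscroll, the restart step is the usual backend restart plus frontend wipe):

    # in /lib/systemd/system/varnish-frontend.service change the storage flag
    #   from: -s main1=deprecated_persistent,/srv/sda3/varnish.main1,220G,0x500000000000
    #   to:   -s main1=file,/srv/sda3/varnish.main1,220G
    $EDITOR /lib/systemd/system/varnish-frontend.service
    systemctl daemon-reload
    # then restart the varnishes as usual, so the (now non-persistent) storage
    # starts empty and the frontend cache is wiped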
[19:19:29] I don't see anything super-obvious there [19:19:47] if anything, I'm surprised at the amount of work they continue to put into the deprecated persistent storage code [19:21:48] still, if we can't reproduce this anymore, there's clearly a persistent-storage-related bug [19:23:07] <_joe_> what's the difference between file storage and persistent storage? [19:23:29] <_joe_> that file storage doesn't survive restarts? [19:23:57] yeah [19:24:09] file storage uses a disk for space, but wipes on restart/crash of the daemon/host [19:24:32] persistent storage engine takes care to be able to recover most of the cache on crash/restart, like a database would with its binlogs, etc [19:25:24] <_joe_> I don't see what's the reason not to make your storage recoverable [19:25:30] <_joe_> to be honest [19:25:39] well, a few things: [19:25:52] <_joe_> recoverable, with the option to wipe it out [19:26:02] 1) it's way way harder and more-complicated to make disk storage recoverable [19:26:45] (they have to split up the storage into smaller bins and work through writing to them and sealing them, so that only the latest unsealed one is corrupted on crash and lost, that sort of thing) [19:26:46] <_joe_> if you want to make persistency completely reliable, yes [19:27:01] it's not completely reliable, but it's "all but the most-recent stuff is reliable" [19:27:27] <_joe_> yeah I would be ok with "doesn't get lost on clean restarts" [19:27:49] we wouldn't be, unfortunately [19:28:02] <_joe_> not for uploads, no [19:28:06] <_joe_> I get that [19:28:08] but in any case, that side-argument aside, in terms of code complexity by competent varnish developers [19:28:21] 600 SLOC for storage_file, 2K SLOC for storage_persist [19:28:26] <_joe_> ok [19:28:29] <_joe_> fair enough [19:28:45] <_joe_> it's still 3x, not two orders of magnitude [19:28:56] <_joe_> in terms of SLOC [19:29:13] SLOC is very rough anyways. the logic is *way* more complicated, and more bugs happen, etc... [19:29:42] it has a lot more tie-in with all other parts of the code, too [19:30:01] there are lots of things that are just "related" to cache storage, which with a file storage backend that doesn't persist, can just live in memory [19:30:16] but with a persistent store, all of those things have to be hacked into the persistence and stay in sync with it too [19:30:41] like the long-term cache-ban lists, and xkey indices (which don't persistent, that's one of our upcomign xkey pain points) [19:31:27] <_joe_> sigh [19:31:37] well from an implementor's point of view, I guess point 1 is about it... it's just way more complicated to implement and keep it bug-free [19:32:01] from a user's point of view: if storage dies when the daemon/host stops, things are easier to reason about [19:32:15] because you probably lost purges while it was down too, since they weren't received by the storage... [19:32:47] unless you ahve some external mechanism to queue and replay them (which we do for daemon down, but not for host down) [19:34:14] but then we get into the user's point of view on benefits of why you want persistence, and what kinds of persistence matter: [19:35:07] IMHO, persist on clean shutdown is nice-to-have, but not really critical. 
if it was a clean shutdown by choice, we could've chosen to take the operational hit of doing that much slower instead of a bunch of caches at once, and spread the effect of cache contents loss out so it's negligible [19:35:43] the case where it really matters more is unclean shutdown due to daemon crash, host crash, hardware malfunction, and/or DC power loss [19:36:00] because sometimes those happen unplanned, and can happen to more than one machine at a time... [19:36:15] and then you're left with a shitload of missing cache contents all at once by surprise [19:36:30] with no workaround [19:37:11] so that's been our argument for persistent storage with reliability on hard crash: if all caches in eqiad suddenly power off, when we're recovering from that event we really want them to still have a lot of their cache contents [19:37:48] my thinking currently on trying to attack persistence (trying to argue we don't need it) goes like this: [19:38:34] 1) In a single-machine event, it's tolerable to lose cache. We have enough caches (in terms of #machines per DC, #DCs) that it's not that bad a percentage, and more importantly we have layering to make up for it... [19:39:07] e.g. if we lose a middle-tier backend, the frontends directly in front of it are still absorbing most of the user-facing load, and the next tier of backends is re-filling it, not MediaWiki [19:39:28] if we lose a back-tier backend, we've got both frontends and middle-tier backends covering the bulk of the load while it refills from MW [19:39:42] if we lose a frontend, it's got all of the DC-local backends to refill from at near-zero latency [19:40:29] basically there's really no scenario in which loss of a single daemon or machine's cache contents is a big deal to us. the rest of the infra absorbs it fine. [19:41:42] 2) In a single-DC event (so, powerloss at a DC, or a crash of all machines/daemons at a DC from an exploit/bug that only hit one DC), things are trickier, but we can still manage. [19:42:39] we'd obviously depool the dead DC in DNS and have to deal with recovering, but we can offload users to other DCs, and we can refill the lost DC's caches from another DC too, no matter which DC it actually is (modulo some eqiad special-ness presently, but we're on-track to get rid of that specialness, and we can work around it today with extra work and have codfw refill things) [19:43:18] 3) So the scenario where it still really really matters is if we lose all caches globally, all in a short timeframe [19:43:47] it's unreasonable to try to plan for the kind of rare event it would take for that to be power loss [19:44:42] so we're mostly talking about a software bug in varnishd that causes them all to crash quickly from some normal request, a software bug someone figures out how to maliciously trigger and sends to all our global LB IPs repeatedly. in either case could be a varnishd bug or a kernel bug, etc... [19:44:57] or a major screwup by us in performing some operation [19:45:11] or a very urgent need for us to restart every varnishd in minutes to patch some exploit [19:45:52] so breaking those down further (after tossing out global power loss) [19:46:55] varnishd bug: even most varnishd bugs wouldn't do this. they'd kill the frontends but not reach the backends, or similar. 
it would have to be a very special varnishd bug where the triggering request actually traverses all the layers down to the application, the response makes it all the way back out to the user sanely, and it kills every varnishd along the *response* path *after* they've successfully sent their response upstream. [19:47:58] I think the odds of that and someone triggering it globally are pretty remote, but it can happen, and persistence would probably save us trouble when recovering from it. [19:48:44] kernel bug: probably more likely than the above: triggering traffic could hit all frontends, which is all cache machines, and crash them all through a kernel bug.... [19:49:01] us screwing up: no matter what storage engine we use, it's possible for us to screw up [19:49:38] urgent patch: I don't foresee a scenario in which we can't span that over a few hours, which is enough to make this not a critical issue in practice. [19:49:51] it's not like we can insta-patch code anyways [19:50:50] the kernel/varnishd bug scenarios are both the main concerns, and those are very remote possibilities. they can happen, but what's the cost/benefit on protecting ourselves from them? [19:51:10] have we ever witnessed such a bug hit the WMF since the varnish clusters went live years ago? [19:51:27] N amount of pain for persistent storage vs M odds of this ever being an issue [19:51:58] (and in any case, even with persistence, those will be hugely problematic events. we're just making one part of the recovery slightly less-painful) [19:52:46] also, hhvm and RB APIs and blah blah, we hope, have made the amount of uncached user traffic we can theoretically handle in a spike to refill caches higher than it was historically [19:53:48] so to me, that long rambling argument says that our rationale for wanting persistence is weak. it's there, but it's not much. [19:54:31] the downsides are: more-complex backend with more bugs, more local patches == less-stable. upstream maintainer has deprecated it going forward. xkey vmod doesn't natively work with it. [19:55:22] and apparently now we can add to the pile: in varnish4 testing on cache_misc, apparently persistent storage has novel bugs that the undeprecated backends don't... [20:06:58] I was just thinking further: without persistence, what would you do to solve the software-crasher threat? [20:07:19] probably break up each cluster in each DC into two sub-pools, A/B, with half the caches in each. [20:08:20] once every X (let's say every 12 hours?) we go through an automated hour-long process of having LVS shift traffic from pool A to B (or vice-versa). So at any given time, the public world is only hitting one half, and the other half is traffic-dormant, but still has lots of fresh objects cached when traffic comes back 12h later. [20:08:43] the 1h process smooths the spike in misses to negligible [20:09:11] and then if someone sends us a kernel crasher request or an all-varnishd-layers-crasher request, it only kills the currently-active side and we switch to the other. [20:09:22] (and half the usual switches). 
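A rough sketch of the A/B rotation idea just described, purely illustrative: the host lists, the pool/depool commands and the timings all stand in for whatever LVS pooling mechanism and schedule would actually be used.

    # hypothetical illustration of the A/B rotation; "pool"/"depool" are placeholders
    # for the real LVS pooling mechanism, and the ~1h ramp is a fixed per-host sleep
    POOL_A=(cp1001 cp1002 cp1003 cp1004 cp1005)   # placeholder host names, half the cluster
    POOL_B=(cp1006 cp1007 cp1008 cp1009 cp1010)   # the other half
    shift_traffic() {
        local -n from=$1 to=$2
        for h in "${to[@]}";   do pool "$h"; done               # dormant half comes in, cache still warm
        for h in "${from[@]}"; do depool "$h"; sleep 720; done  # drain the old half over ~1h
    }
    while true; do
        shift_traffic POOL_A POOL_B; sleep 12h    # B serves for ~12h (intervals could be randomized)
        shift_traffic POOL_B POOL_A; sleep 12h    # A serves for ~12h
    done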
[20:09:30] unless they spam us during the transition time I guess [20:10:05] I guess you could work on reducing the switchover window time and the frequency, and making them randomized so they can't be predicted [20:11:27] if a globally-spammed crasher comes through, it obviously still takes us down, but whoever responds can stop the switching process where it's at, find a way to mitigate the issue and/or block the traffic, then flip user traffic to the still-living side to recover. [20:12:13] sounds like a pain to engineer, but it's more general and reliable than the persistence solution anyways. [20:13:57] not a reasonable enough plan for us to put it in a ticket and work on it, but it's something to think interesting thoughts about. [20:26:24] <_joe_> bblack: uploads are the main culprit here [20:26:41] <_joe_> we're nowhere near being able to handle twice the traffic we get today [20:28:14] <_joe_> so even with HHVM, I guess we can handle 3 times the traffic we get today on the appserver layer [20:28:33] <_joe_> since we've been reducing the size of the applayer noticeably [20:29:51] <_joe_> and honestly, it's not like varnish doesn't have huge limitations already [20:30:07] <_joe_> (no ssl either to clients or to backends being the biggest one) [20:32:22] <_joe_> but yes, I guess we can live with a non-persistent storage for everything excl. maybe upload [20:36:44] well it would probably be a lot more than 3x initially... [20:37:03] but, every successful request creates a cache item that blocks some further requests [20:37:35] so, one would think the pattern would be a very sharp and very brief spike, which exponentially quickly ramps back down to normal levels, if the app can handle the initial burst. [20:37:53] if it can't, then we have to mitigate that by ratelimiting outbound from varnish, basically [20:38:14] we already have a (poorly-tuned) max_connections parameter per service backend, which we could use to put a cap on how hard varnish can hit backend servers. [20:38:39] and if requests time out waiting on those max_connections, they'll 503 the user and all the while the cache keeps filling in and things keep getting healthier [20:39:09] if things go to plan, then the percentage of otherwise cached backend requests will also grow in the next years [20:39:21] yeah, hopefully [20:39:22] which should help reduce the remaining load on in-memory cache miss [20:41:39] <_joe_> bblack: say we have to restart just one server [20:41:48] <_joe_> isn't that like when we depool one server now? [20:41:56] <_joe_> from the backends [20:42:03] <_joe_> all its storage is suddenly lost [20:42:09] <_joe_> and we do that all the time, right? 
[20:42:44] yeah [20:43:13] as long as we don't try to do some large fraction of all global servers in minutes, we should be ok in those scenarios though [20:44:36] the layering and multiple DCs tend to protect well against one, or even a few, sudden cache machines losing contents [20:45:45] <_joe_> as I said, my main concern is upload [20:45:56] <_joe_> we're way more tight on backend capacity there [20:46:25] the only scenarios in which backend capacity is significantly in play are the global-crash scenarios [20:46:26] <_joe_> (on the applayer, it has been shown repeatedly now the databases are what hit a performance wall, rather than appservers, after all) [20:46:40] (well, assuming other minor nits/plans get resolved in the near-ish term) [20:47:00] <_joe_> well in a global crash scenario we try to recover to the best of our abilities, probably [20:47:23] <_joe_> it would take some time to recover text, and a few hours to recover uploads, probably [20:47:23] basically the global crash scenario is the only one where persistent-storage vs non really matters [20:47:38] <_joe_> having both swift clusters available for reads would help [20:47:44] in all the other scenarios, we can (and in many cases, automatically) prevent impact on load to app backends [20:48:57] for anything else: we go slow enough on willful restarts (and I think even for upload, we can make that work over a few hours at most). if we lose a DC, we switch users in DNS, switch cache routing for refills in cache::route_table, etc. [20:50:19] theoretically even if we lose all eqiad caches today to power loss... by the time we sort out the other impact of that, we can fix the caching [20:50:45] we can update cache::route_table to send everyone in ulsfo/esams through codfw -> (whichever side appservers we're recovering on), and all those still have full hot caches [20:50:55] <_joe_> yes [20:51:08] and then with traffic blocked inbound to eqiad (after DNS depool), we can route eqiad's backend fill to codfw, and then let it re-warm itself from those caches, etc [20:51:18] <_joe_> honestly that's why I said at the start it seemed acceptable [20:51:45] well it's a big change, dropping the safety net of persistent cache storage [20:51:47] <_joe_> but then I thought about uploads [20:51:57] <_joe_> it is indeed [20:52:11] this is like the 3rd or 4th time I've rambled through the pros and cons on IRC, and I've not convinced myself enough to argue for it hard and formally and make the switch :) [20:52:50] <_joe_> eheh [20:53:00] <_joe_> well, it seems phk decided for you :P [20:53:21] bblack: are any of the alternatives like ATS still in the running to potentially replace Varnish? [20:53:53] gwicke: not in any kind of predictable or soonish timeframe [20:53:55] <_joe_> gwicke: that would be a larger jump into the unknown [20:54:26] <_joe_> so I'd think trying to keep varnish working is a better immediate option [20:54:30] I still want to explore a switch to ATS, but realistically we don't have spare engineering capacity to take away from other things to play with it on the side at this time [20:54:45] makes sense [20:54:47] and even if we were more-definite about long-term ATS plans, switching to v4 first at this point makes more sense [20:55:04] I'd still really like to fully evaluate ATS as a replacement [20:55:07] at some point [20:55:41] when I've looked before, it looks capable, and even better on some fronts. 
undoubtedly it may not be as good on others [20:55:52] something with native http/2 support would also be nice, but that has some implications for # of connections and thus best architecture [20:55:54] it requires some dramatic re-architecting of our stuff too [20:56:35] the biggest argument for ATS exploration is that it conceivably might work, conceivably would replace our nginx+varnish stack with one thing, and that one thing wouldn't be tied up in commercial open-core bullshit [20:56:46] their licensing and philosophy (ASF) is much closer to ours [20:57:06] both nginx and varnish do the open-core thing now, and it's a thorn in our side with both :P [20:57:36] yeah, but I am quite optimistic that the time window for that is closing [20:57:58] that nginx and varnish will switch course back to open-source-friendly? [20:57:59] solid http/2 support is becoming commoditized in Go & probably soon Rust [20:58:38] I would expect more hackable alternatives emerge in the medium timeframe as a result [20:58:54] eh I donno, we'll see [20:59:16] yup [20:59:19] that would be nice, but writing a solid, general-purpose caching revproxy with all the bells and whistles of ATS or Varnish is a really really hard problem [20:59:36] anything new has mountains of lessons from the past to re-learn, too... [20:59:55] and they're important ones, it's not the kind of thing where you can deploy first and sort those out as you hit them [21:00:08] the barrier to entry has been very high so far, as there hasn't been any clear framework to build on that handles the common low-level http stuff [21:00:27] (and has decent performance) [21:00:38] there still isn't one, for this kind of purpose [21:01:21] implementing an http/[12] server or client is hard. implementing a proxy between the two is harder. implementing a revproxy between the two is even harder. implementing any kind of cache is hard. implementing an HTTP cache is harder. [21:01:28] yeah, but I have hope that this will improve [21:01:36] it's all of those hardnesses put together, in one very inter-twined mess of difficulty-level [21:02:22] most of the time even with varnish having done it all "right" as an example, I'm constantly having to dig through standards documents and experimentally verify behavior to know wtf is right or wrong in thousands of corner cases. [21:02:23] it's already fairly easy in high-level languages, but of course the performance is lacking then [21:02:49] I'd say even if you throw perf out the window, writing something varnish-like in a high-level language is still very hard. [21:03:40] there is a long tail, sure [21:03:51] yeah but all of the long tail matters [21:04:12] just here at the WMF we hit a large chunk of the long-tail cases between different kinds of internal appservers and codebases and requests and clients and protocols and .... [21:05:11] if you're going to have the world use it, all of the long-tail matters [21:07:57] if I really thought I had a shot at writing a decent http caching revproxy of my own from scratch, in say a year or two timeline to a decent beta release, and doing better than varnish/ATS has done even ignoring performance completely... I'd probably be working on it on weekends now. [21:08:18] but when I think about that, I know I'm not capable of handling that project and doing it justice. [21:08:58] and I know more than a lot of people do about that task really, other than core developers of existing ones. 
[21:10:07] of course, there's always the argument that blind arrogance is necessary to do great things. maybe someone who's not aware of how hard it is, is the kind of person that can tackle it anyways and eventually get it done in the face of all evidence to the contrary :) [21:56:50] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2290928 (10BBlack) Has anyone been able to reproduce any of the problems in the tickets merged into here, since roughly the ti... [21:59:56] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10Smalyshev) I don't see any weirdness anymore on query.wikidata.org/ @Jonas ? [22:01:56] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2290947 (10Jonas) Problem seems to be fixed. Thank you! [22:16:00] bblack: at the very least, it's fun to ponder what it would take [22:19:09] the async http stuff in rust is getting more interesting recently; a toy server is doing about 150k req/s on my laptop, with the wrk client using 1/3 of cpu -- but, http/2 is still missing, and so is basic stuff like connection pooling for the client portion [22:32:08] 07HTTPS, 10Traffic, 06Operations: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2291063 (10BBlack) Yeah I tend to agree too. I think if we're concerned at all about status.wm.o perf during outages, we could probably also tack on a secondary task to extend the apache co... [22:32:58] 07HTTPS, 10Traffic, 06Operations: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2291073 (10BBlack) Also note: while in there, should convert wikitech-static to cron'd letsencrypt (using our prod script!), and then use that for the status.wm.o cert as well. [22:37:38] 07HTTPS, 10Traffic, 06Operations: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#366296 (10Krenair) Yes, we should. Unfortunately wikitech-static might be a pain since it does not use puppet (and for obvious reasons cannot reach the production puppetmaster). :/ [22:38:46] 07HTTPS, 10Traffic, 06Operations: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2291120 (10BBlack) it's ok, we can just copy down the acme-setup script as it exists today (well, and acme-tiny). for a 1-2 cert setup like this, it's not hard to use it puppet-free from cr... [23:10:55] bblack: so no issues with file storage? [23:14:53] ema: nobody's been able to reproduce with it, yeah [23:15:05] I puppetized the systemd unit file change and turned puppet back on [23:15:10] awesome [23:15:23] should I try to put back the CL-sensitive stuff tomorrow? [23:15:35] yeah I think so, after triple-checking nobody can reproduce [23:16:29] if the problem was indeed deprecated_storage that might explain why others didn't run into the issue, I was thinking [23:16:37] yeah [23:17:07] even then, it's probably specific to having two layers of varnishd talking to each other and deprecated_persistent at one or both of them. [23:18:25] this probably, in the net, just bumps the priority of making the switch away from it anyways. it's, I think, the logical course of action eventually anyways. 
[23:18:31] just maybe eventually's getting here sooner [23:19:01] we'll really need to fix the TTL-capping/surrogate-control/grace-mode/etc intertwined mess of issues better first though, before assuming we can go file-backend on text and/or upload varnish4 deploys. [23:20:46] (and the active/active cache routing and outbound-TLS problems become higher priority to solve faster if we don't have persistence too. not that they were low prio before) [23:21:44] but the upside is we can dump all of our custom debian/patches/ I think, and not have to worry about the ugly problem of xkey+persistence interactions [23:22:04] alright, I'm already quite happy we made some steps forward today :) [23:22:12] (well the legacy patches anyways. all we had left there was 3x persistence one and the questionable nukelru one) [23:23:52] time to sleep, see you tomorrow! [23:29:47] nite! [23:51:15] 10netops, 06Operations: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2291418 (10faidon) 05Open>03stalled The issue recovered before they had a chance to investigate. I have been asked whether we are interested in an formal RFO (reason for outage), which is...
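For the "triple-checking nobody can reproduce" step mentioned at 23:15, a repro check in the spirit of gehel's 100-request runs and the 18:54 torture tests might look like the sketch below. The URL is one of the known repro targets from T135038; the count, the sleep and the output format are just illustrative choices, not the script actually used.

    # rough repro-check sketch; curl exit code 18 is the "transfer closed with N bytes
    # remaining to read" failure seen in the tickets, and a 200 with a short
    # size_download would mean a truncated/empty cached body
    URL=https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz
    for i in $(seq 1 100); do
        out=$(curl -s -o /dev/null -w '%{http_code} %{size_download}' "$URL")
        rc=$?
        echo "request $i: rc=$rc $out"
        sleep 2    # spread the run across the 120s default-TTL boundary, as in the 18:54 tests
    done
    # (the 18:54 torture tests additionally varied a numeric query param, e.g. "$URL?$i",
    # to force cache misses faster)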