[02:00:37] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2788204 (10GWicke) The main benefit of encoding the original dimensions in the URL would be consistency across formats, and some amount of ease of use. Th... [02:20:13] ok I don't have a "good" fix yet, but I can confirm it's a buffering problem with nginx + OpenSSL-1.1.0 as discussed above [02:20:30] this is the old nginx ticket about the same basic thing from a few years back: https://trac.nginx.org/nginx/ticket/413 [02:20:59] which is when the code linked earlier was implemented as a hacky workaround. apparently OpenSSL-1.1.0 changes things enough that that workaround isn't right anymore. [02:21:41] I tried simple stabs at nginx patches to either always set the write buffer size there, or never set it (instead of current conditional), and neither one changed anything [02:22:21] but rebuilding openssl itself with the internal default buffer changed from 4K to 8K does fix it, so it is that same basic issue. [02:22:34] worst case, we can just do that to our openssl package, but I'd like to find a better answer with an nginx patch using the openssl APIs [02:23:16] (our cert + chain is small enough to not trip this, but the addition of the OCSP Stapling data is what sends us over the 4K buffer limit) [05:25:08] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2788286 (10Naveenpf) @CRoslof This is an enhancement request. If someone take wikipedia.in now it is redirecting to new URL. There is no point in... [08:44:08] <_joe_> ema: https://phabricator.wikimedia.org/T150503 might wanna take a look as well [10:38:01] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788466 (10ema) [10:40:23] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788473 (10ema) Log captured with `varnishlog -n frontend -g request -q 'RespStatus eq 503'` ``` * << Request >> 629660955 - Begin req 629660954 rxreq - Timest... [10:41:56] 10Traffic, 06Operations, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2788475 (10ema) p:05Triage>03Normal [10:42:59] 10Traffic, 06Operations: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2788477 (10ema) 05Open>03Resolved [11:09:27] trying param.set gethdr_extrachance 100 on cp3043 [11:14:46] no difference, I get a repro after a few minutes [11:15:48] random facts: Length is not the same on all repros, it seems to vary between 1553 and 1557 [11:19:16] all repros I've seen so far using H2 have this error in the backend part: [11:19:19] -- FetchError req.body read error: 11 (Resource temporarily unavailable) [11:19:22] -- FetchError backend write error: 11 (Resource temporarily unavailable) [11:20:26] other repros *not* using H2 (H2=0 in XCP) don't have the same FetchError [11:20:43] see for example cp3040.esams.wmnet:~ema/503.log [11:22:44] also with H2=1 -> X-Cache-Status: bug, H2=0 -> X-Cache-Status: pass [12:01:41] 11 (Resource temporarily unavailable) is EAGAIN, which is a fairly normal return code for network calls... 
[12:02:22] X-Cache-Status should depend whether the backend gave us any headers or not [12:06:40] bblack: I've left the gethdr_extrachance change deployed on cp3043 but it doesn't seem to help with T150503 at all [12:06:40] T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503 [12:07:06] ema: so, the last one in your log is because it happened at a different layer [12:07:35] the ruwiki POST to /w/api.php [12:08:01] it's getting a good response from the backend in esams. but said backend had the same problem you're having on the frontend, when talking to eqiad [12:08:27] so the buggy X-Cache interpretation just happened further down and had a chance to get converted into a more-legit-looking 503 [12:08:32] right [12:08:39] ok that makes sense [12:09:13] I'm pretty sure the CL variation is because the response content is our synthetic error page, and the bytes are differences in the reported info in the template [12:10:36] in the X-Cache-Status: bug case, the biggest question is why are we getting a local X-Cache value of "pass" instead of "int"? [12:10:48] that's clearly a misunderstanding in our VCL [12:11:16] we're doing a "pass", but it becomes an internal error, but when we go through that error path and locally-generate the 503, somehow we fail to fixup X-Cache [12:12:01] (which is why X-Cache-Status ends up "bug" - because it's considered illegal that the final user-facing X-Cache just has a frontend "pass" and nothing else) [12:14:54] bblack: don't we only set X-Cache to int if X-CDIS is not set? [12:15:44] here X-CDIS is set to pass [12:16:34] yeah [12:16:43] I guess that strategy doesn't work so well anymore :) [12:16:57] IIRC we tried to set it explicitly before and failed, though [12:17:09] something about no variables being available in the right places? [12:19:46] well the other piece of the puzzle is that vcl_synth responses don't use vcl_deliver, too [12:29:44] whereas vcl_backend_error's synthetic output does pass through vcl_deliver [12:37:40] bblack: I thought 'straight insufficient bytes' was set by the frontend when stuff like https://phabricator.wikimedia.org/T150503#2788473 happens, however that doesn't seem to be the case [12:38:31] trying to log v1f_pull_straight@http1/cache_http1_vfp.c:216, I don't get output from the probe when varnishlog catches that [12:38:49] it comes from the backend then I guess? [12:48:37] still staring at varnishlog: [12:48:38] - ReqAcct 1719 0 1719 417 1556 1973 [12:48:53] -- BereqAcct 1793 0 1793 0 0 0 [12:49:22] ema: https://gerrit.wikimedia.org/r/#/c/320975/ <- I think this fixes up X-Cache/X-Cache-Status, but I also feel like I tried something like that before and failed, but maybe that was before we came up with the resp/bebesp -based passing of XCDIS and such [12:49:56] it probably needs manual testing first somewhere [12:52:04] to recap what I think I know about the various error cases/paths (which may help with looking at the above): [12:52:33] (and this is all from the perspective of a varnishd): [12:53:16] 1. If the backend sends us a legitimate, specific error (e.g. returns HTTP/1.1 501) - handled by vcl_backend_response->vcl_deliver [12:54:02] 2. Internal synthetic errors (frontend side of daemon) - vcl_synth->client (using synthetic() object, no vcl_deliver is called) [12:55:15] 3. 
Internal synthetic errors (backend side of daemon) - vcl_backend_error->vcl_deliver (using synthetic() object, but vcl_deliver is called) [12:56:04] another minor point not immediately-relevant is the synthetic object from case (3) can be made cacheable if given a TTL, whereas the synth in (2) is never cacheable [12:58:31] I think case-2 was probably handled ok by the current X-CDIS "if not defined, set to int" [12:58:57] case-3 we necessarily pass through vcl_miss or vcl_pass to get there, so it's pre-set to one of those values and needs to be re-set to "int" when the backend fetch fails [12:59:49] hence we end up with "X-Cache: cpNNNN pass" + "X-Cache-Status: bug" + 503 synthetic response [13:00:13] (the "bug" in that case is if a frontend says "pass" rather than "int", we expected another layer of X-Cache data...) [13:01:31] other interesting properties of those 3 pathways: [13:01:52] 1. needs 503-retry-once, doesn't need X-CDIS="int" or synthetic [13:02:17] 2. does not need 503-retry-once, does need X-CDIS=int and synthetic object [13:02:33] 3. needs 503-retry-once, and after that needs X-CDIS=int and synthetic object [13:03:50] I also question whether (3) should have that ">400 && !413" exception, which I think was to pass-through real backend errors, not meant to apply to synthetic ones [13:05:22] (amended for the above) [13:11:55] regardless, all the above doesn't affect the real problem here, it just makes for confusing X-Cache/X-Cache-Status output on these errors [13:12:15] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788776 (10elukey) From the httpd point of view: There are a lot of 503s logged for GET requests for /w/api.php like the following: ``` 2016-11-11T12:07:44 59999926 10.64.0.1... [13:13:35] ema: I think you've only gotten "FetchError req.body read error: 11 (Resource temporarily unavailable)" when you've tried your extrachance patch, because the patch isn't right [13:13:47] (as opposed to the ones that don't say 11) [13:14:36] or if you're getting it without the new patch, then my extrachance patch does bad things by accident too [13:14:53] I don't think that would ever happen with the stock code [13:16:10] yeah, now that I've thought about it, both patches are bad [13:16:30] ema: ^ (not that the original behavior is good either, but both patches have unrelated bad fallout) [13:16:51] bblack: I see 'error: 11' on hosts without my patch too [13:17:02] eg: cp3030 [13:17:08] (~ema/503.log) [13:17:12] yeah but with mine, right? [13:17:34] right [13:17:38] 11 being a reportable error is definitely a bug, and I think it's one that both patches introduced [13:18:12] well, maybe [13:18:21] it's confusing in any case [13:20:40] I think the approach of both patches may be ignoring the impact of the call to V1F_FetchRespHdr() on "i" there [13:23:09] bblack: should we try downgrading one host to wm2 (no extrachance patch) and see what happens?
[13:31:49] well, we have a known case where that's broken too, and in worse ways [13:32:21] (when the backend times out completely, but is a heavily-used backend and thus most re-attempts will get reused connections) [13:32:47] infinite loops inside varnish on a backend req are way worse than 503s :) [13:32:50] (rare ones at that) [13:33:35] https://github.com/varnishcache/varnish-cache/issues/2126 [13:34:33] uh [13:34:51] (+ the tickets linked to it, not as-directly related but give insight into this general area) [13:34:53] interesting [13:35:53] we should really find a way to repro this ourselves (instead of catching it in a log) [13:38:02] some of those 503s in the log are probably "normal" (as in, it's ok for varnish to 503 in those cases, although maybe 400 is more-appropriate) [13:39:05] but if we're going off of varnishlog alone and can't explain the 503, we should try to find a way to induce a similar-looking 503 [13:39:29] if it ends up being that these 503s are because the client canceled a half-sent POST or the backend did something illegal, we kinda don't care [13:40:11] (vs a 503 that's strangely blocking a legit request that should've worked, like perhaps the edit stash POSTs) [13:41:16] (but even then, have we varnishlogged one of the stash edit failures yet? for all we know that one's not a varnish-internal issue. there's still at least some evidence it could be apache/hhvm) [13:42:06] (well - we know in that case we're synthesizing an error in the frontend, implying the varnish-backend goofed the connection, but that could be a varnish bug that's only triggered in a case that's bad from the applayer backend to begin with...) [13:42:31] (especially given default streaming) [13:43:35] without a way to repro it ourselves with a request (and/or app-response) we control and can manipulate, it's hard to say much though [13:44:05] or at least, a clear and convincing pattern to the observed 503 logs we're correlating up across [13:44:31] (a pattern in the client-req and/or app-resp attributes, not a pattern in the varnish's error output mode) [13:56:27] bblack: I think I've found a way to repro -> cp1008:~ema/T150503.py [13:58:15] <_joe_> ema: oh? [13:58:18] <_joe_> let me see [14:00:08] <_joe_> ema: so the content-length is wrong, amirite? [14:00:56] _joe_: I tried following the steps to repro here https://github.com/varnishcache/varnish-cache/issues/2126 [14:01:11] > Make POST request but close the connection (write side is enough) before writing full Content-Length body bytes. [14:01:38] <_joe_> ema: uhm that causes the apache reaction we expected? [14:02:15] _joe_: so far I've confirmed that it causes a 503 from varnish which looks suspiciously similar to what we've varnishlogged so far [14:02:34] <_joe_> ema: add the X-Wikimedia-Debug header [14:02:36] it could make sense, since the input filter chain will not receive all the data expected (hence APR_INCOMPLETE? not sure) [14:02:44] <_joe_> and we can see it on mw1099/mw1017 [14:02:48] _joe_: ok [14:03:07] <_joe_> I don't remember the details of how X-wm-debug works [14:03:36] * elukey tails mw1099 [14:03:44] * _joe_ tails elukey [14:04:43] * elukey runs away [14:06:08] I'm using X-Wikimedia-Debug: backend=mw1099.eqiad.wmnet [14:06:15] elukey: found anything interesting? [14:07:22] -- BackendOpen 24 boot.be_hassaleh_codfw_wmnet [14:07:27] wat?
[14:07:40] I don't see the request in the access logs [14:07:45] <_joe_> lol [14:07:49] ahhha [14:08:21] anyways, it's a clear repro on the varnish-be side (confirmed with varnishlog) [14:09:27] so X-Wikimedia-Debug is not working ? [14:10:01] <_joe_> elukey: well it should [14:10:17] <_joe_> ema: a semicolon at the end of the header might help [14:10:36] I tried curl -H 'X-Wikimedia-Debug: backend=mw1099.eqiad.wmnet; readonly' https://meta.wikimedia.org/wiki/Main_Page and it worked [14:11:03] <_joe_> ema: maybe it gets mangled on the frontend in some way? [14:12:29] _joe_: I don't think so, on varnish-be I see: [14:12:30] -- BereqHeader X-Wikimedia-Debug: backend=mw1099.eqiad.wmnet; [14:12:41] keeps on hitting hassaleh meh [14:13:20] <_joe_> ema: wat [14:13:30] <_joe_> what's hassaleh? [14:13:50] /etc/varnish/wikimedia-common_text-backend.inc.vcl:.host = "hassaleh.codfw.wmnet"; [14:14:30] hassaleh is a X-Wikimedia-Debug proxy (role::debug_proxy) [14:19:28] <_joe_> wtf? [14:19:36] <_joe_> when was that added? [14:20:13] <_joe_> this ofc ruins our debugging attempt [14:20:29] <_joe_> ema: well without x-wikimedia-debug [14:20:56] <_joe_> it's hard to pin which host was called [14:22:06] <_joe_> ema: what about we change that by hand to mw1099 on cp1008? [14:22:38] _joe_: sounds like a plan, let me do that [14:23:28] <_joe_> ema: that's because the nginx on hassaleh returns 400 [14:23:40] <_joe_> which is a sensible response ihmo [14:25:25] _joe_, elukey: anything interesting on mw1099? [14:25:36] <_joe_> lemme check [14:25:55] <_joe_> Nov 11 14:25:02 mw1099 apache2[27230]: [proxy_fcgi:error] [pid 27230:tid 140240119383808] (70008)Partial results are valid but processing is incomplete: [client 208.80.154.42:57964] AH01075: Error dispatching request to : (reading input brigade) [14:26:01] <_joe_> confirmed \o/ [14:26:04] ha \o/ [14:27:08] and 208.80.154.42 is indeed pinkunicorn [14:27:41] \o/ [14:27:51] yesssss [14:27:52] <_joe_> elukey: nothing in the access log though AFAICS [14:28:13] <_joe_> I'm impressed, we currectly interpreted the apache's source code elukey [14:29:30] my brain is melted now in fact :D [14:29:54] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788887 (10ema) We've been able to reproduce the bug on pinkunicorn by closing the connection before sending Content-Length bytes as follows: ``` #!/usr/bin/env python import... [14:32:42] <_joe_> ema: now we have to understand why varnish is getting those bogus requests [14:35:44] ema: can I play with the script on cp1008? [14:36:18] elukey: sure thing [14:38:05] _joe_ 503 in the logs \o/ [14:38:23] <_joe_> elukey: as a GET? [14:38:27] yes! [14:38:31] <_joe_> man, apache is fucked up [14:39:01] ahahah no come on, the bug is really sneaky [14:39:04] <_joe_> this is material for one of those WTF posts [14:39:07] I'll try to follow up with upstream [14:39:18] <_joe_> elukey: well, apache logging a GET 60 seconds later [14:39:28] <_joe_> and responding 503, not 400 [14:39:37] <_joe_> is fucked up [14:39:52] we are also using an ancient version of httpd [14:39:53] or is an easter egg _joe_ :-P [14:40:07] it might have been fixed in 2.4.23 [14:40:15] so, from the varnish-fe side of things we have this on https://phabricator.wikimedia.org/T150503#2788473 : [14:40:21] <_joe_> elukey: "ancient"? 
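A minimal sketch of the repro idea described in the Phabricator comment above (the actual cp1008:~ema/T150503.py isn't reproduced in this log, so the host, path and payload below are illustrative): declare a Content-Length larger than the body actually sent, then shut down the write side of the connection, like a client aborting a POST mid-upload.

```
#!/usr/bin/env python3
# Hedged sketch, not the original script: POST with a Content-Length that
# promises more bytes than we send, then close the write side.
import socket

HOST = 'localhost'          # assumption: run on a cache host (e.g. cp1008)
PORT = 80                   # hitting :80 keeps nginx out of the path

body = b'action=stashedit'  # illustrative payload
claimed = len(body) + 100   # promise more bytes than we will ever send

request = (
    'POST /w/api.php HTTP/1.1\r\n'
    'Host: en.wikipedia.org\r\n'
    'X-Forwarded-Proto: https\r\n'
    'Content-Type: application/x-www-form-urlencoded\r\n'
    f'Content-Length: {claimed}\r\n'
    '\r\n'
).encode('ascii') + body

sock = socket.create_connection((HOST, PORT), timeout=90)
sock.sendall(request)
sock.shutdown(socket.SHUT_WR)   # stop sending before the body is complete

# Expect a synthetic 503 after a delay; print just the status line.
response = b''
while b'\r\n' not in response:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
print(response.split(b'\r\n', 1)[0].decode('latin-1'))
sock.close()
```

As bblack notes further down, hitting :80 or :3127 with X-Forwarded-Proto set keeps nginx out of the path; pointing the same request at :443 instead exercises the full nginx -> varnish-fe chain.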
[14:40:23] <_joe_> come on [14:40:24] - ReqHeader Content-Length: 1688 [14:40:24] - ReqAcct 1719 0 1719 417 1556 1973 [14:40:27] <_joe_> not by any standard [14:40:34] ReqAcct: Header bytes received | Body bytes received | Total bytes received | Header bytes transmitted | Body bytes transmitted | Total bytes transmitted [14:41:05] so that would be a POST with empty body? [14:42:03] <_joe_> seems so [14:42:20] <_joe_> but with a wrong content-length [14:42:49] right [14:43:07] (or maybe the CL is correct and the body misteriously disappeared) [14:44:30] if we find out that nginx swallows the body in certain cases then we've got a bug involving pretty much every piece of software in our traffic stack [14:49:18] > when users had a particular extension enabled in Chrome (Alexa Traffic Rank), it would append a custom header containing illegal characters. Rather than ignore the individual header, nginx was stripping out valid headers including the POST body. [14:49:23] https://trac.nginx.org/nginx/ticket/831 [14:49:28] for example [14:50:49] ahhahaha [14:56:38] so I think I am able to repro with "curl -d "Hola!" --header "Content-Length: 120" --header "Host: en.wikipedia.org" localhost/w/api.php" [14:57:21] on httpd trunk this one seems to lead to a 408 [15:01:00] yes confirmed, I can repro with it [15:01:11] and I log a GET [15:02:43] and re-tested on httpd trunk - I find the same stuff on the error log but a 400 in the access log [15:06:46] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788905 (10elukey) Even simpler: ``` curl -d "Hola!" --header "Content-Length: 120" --header "Host: en.wikipedia.org" localhost/w/api.php ``` I checked the httpd trunk code an... [15:07:06] <_joe_> elukey: no -X POST ? [15:07:25] I think that -d does it automagically IIRC [15:07:58] now what is the patch that we are missing? :P [15:16:18] not really straightforward to find [15:16:31] <_joe_> elukey: whatever [15:16:36] <_joe_> it's not so important [15:16:44] <_joe_> the bug is not in apache [15:16:54] <_joe_> well, not the relevant one at least [15:17:08] <_joe_> "it's traffic's problem now" [15:17:10] :) [15:17:12] <_joe_> :D [15:17:22] ema: any lead from your side [15:17:22] ? [15:17:59] elukey: I'm still suspecting something funny happening in nginx along the lines of https://trac.nginx.org/nginx/ticket/831 [15:18:44] mostly because of the varnishlog pasted above [15:18:45] 15:40 < ema> - ReqHeader Content-Length: 1688 [15:18:45] 15:40 < ema> - ReqAcct 1719 0 1719 417 1556 1973 [15:20:33] the options I see are: 1) the client is buggy, sends CL: 1688 and empty body 2) nginx swallows the body under certain circumstances 3) varnish-fe swallows the body internally somehow [15:20:51] it's easy to remove nginx from your repro, though [15:20:56] just hit :80 or :3127 instead of :443 [15:21:24] (and set X-F-P of course) [15:21:34] bblack: my repro simply proves that if the body "disappears" we get the observed behavior [15:21:58] is it the same if you send only a partial body? [15:22:16] yes [15:23:24] and there's a 60s wait in there somewhere, right? [15:23:37] (on the apache end before the log appears) [15:23:49] is that wait happening before you get the frontend response as well? 
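Given the ReqAcct field order quoted above (and _joe_'s later question about logging Content-Length vs. body-size mismatches), a rough illustrative filter over the same varnishlog query used earlier could flag the "non-zero Content-Length, empty/short body" pattern directly; this is a sketch, not tooling that existed at the time:

```
#!/usr/bin/env python3
# Sketch: flag request transactions where the Content-Length request header
# does not match the "Body bytes received" field of ReqAcct.
import re
import subprocess
import sys

cmd = ['varnishlog', '-n', 'frontend', '-g', 'request', '-q', 'RespStatus eq 503']
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)

claimed = None
for line in proc.stdout:
    m = re.search(r'\sReqHeader\s+Content-Length:\s*(\d+)', line)
    if m:
        claimed = int(m.group(1))
        continue
    m = re.search(r'\sReqAcct\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)', line)
    if m and claimed is not None:
        body_rx = int(m.group(2))   # "Body bytes received"
        if body_rx != claimed:
            print('CL/body mismatch: Content-Length=%d, body received=%d'
                  % (claimed, body_rx), file=sys.stderr)
        claimed = None
```

On the example above (Content-Length: 1688 with ReqAcct body-received 0) this would report a 1688-vs-0 mismatch, i.e. the "POST with empty body" case being discussed.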
[15:24:41] varnishd backend_idle_timeout defaults to 60s, and https://github.com/varnishcache/varnish-cache/issues/2126 says: [15:24:44] When POST request is canceled by the browser (connection is closed) before content length of data is received by Varnish and the backend connection was reused Varnish will get stuck for backend_idle_timeout . [15:24:53] bblack: the frontend response arrives after ~5s [15:24:55] Varnish gets stuck for the backend_idle_timeout since the connection was put for reuse by previous request. When timeout runs out it closes backend connection and sends 503 to the client and closes client connection. [15:25:11] I wonder what the 5s is? [15:25:45] 5s is our connect_timeout for varnish->mw [15:25:53] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788064 (10Joe) So to be a bit more precise on what happens on apache: `mod_proxy_fcgi` reads the request body in a loop, when it gets to the end of input according to the cont... [15:26:17] <_joe_> ema: 5 seconds? [15:26:22] <_joe_> I thought 5 ms [15:26:28] <_joe_> did I read that wrong? [15:26:44] <_joe_> bblack: 60s is the request timeout of apache IIRC [15:26:58] $ time python T150503.py > /dev/null [15:26:58] real 0m5.022s [15:26:58] user 0m0.012s [15:26:58] sys 0m0.008s [15:26:58] T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503 [15:27:12] <_joe_> ema: ok so I read that wrong [15:27:45] I think something related to varnish bug #2126 (client or nginx is truncating input, and then varnish reacts poorly as well) [15:28:07] in the exact case in #2126, the client actually closed the connection before sending all the POST data. perhaps timeout behavior is different if it leaves it open [15:28:07] <_joe_> honestly, this started happening regularly on nov 8th [15:28:14] <_joe_> what did we change there? [15:28:24] varnish4 for more users [15:28:48] do you have an exact-ish time of "this never happened before X time?" [15:29:46] so something else to consider here is this: the only known case of a user complaining is with stashedit or editstash, whichever it is, right? [15:30:27] there may be other cases where a client acutally-misbehaves (stop button, error, bad bot scripts, who knows), but nobody cares because the failure is ignored or expected, and even if our stack's response behavior to that has changed, still nobody cares [15:30:42] but the important case is these stash edits where something actually got broken that wasn't broken before [15:31:05] <_joe_> bblack: nope [15:31:11] <_joe_> we can scavenge the logs [15:31:24] we can scavenge which logs for what? [15:31:36] <_joe_> 5xx.json I guess and then apache ones [15:31:45] <_joe_> elukey: did you find a date on logstash? [15:31:56] <_joe_> bblack: I would worry about saves more than stashedits [15:32:03] that we can unequivocally say was an error that shouldn't have happened, as opposed to the background rate of errors that always happen, is what I'm saying [15:32:05] <_joe_> I mean most errors are on stashedits [15:32:14] <_joe_> bblack: yes got it [15:32:21] 5xx.json comes from varnish [15:32:25] <_joe_> but that's because stashedits are very common [15:32:37] in https://goo.gl/Xv9YBc there is the partial etc.. 
error log trend [15:32:39] <_joe_> if we want the apache error let's look at logstash [15:32:51] if our reaction to a bad situation has changed, the graphs and error logs will show a change, but it's important we focus on a case where we think the client req was legitimate. [15:33:28] <_joe_> yeah I agree [15:33:29] which I think is the stash edit failures (ok saves too, but doesn't save use the stash edit data?) [15:33:44] <_joe_> bblack: look at one month of the graph elukey posted [15:33:49] <_joe_> something is definitely going on [15:34:00] yeah there is a clear trend [15:34:06] <_joe_> since nov 4th I'd say [15:34:09] if we can pin down a time, we can be more precise about what to look at [15:34:10] <_joe_> and ramping up [15:34:39] was it 0 before Oct 25th? [15:34:49] <_joe_> elukey: i guess that's retention [15:34:58] <_joe_> but we can check on the actual api servers [15:36:07] ok so first occurence oct 25th [15:36:34] might be a false lead since I can see in the sal "elukey: removed logstash filter for Apache " [15:36:37] seems like at least a small jump on Nov 2 [15:36:50] nov 4th goes up a lot more [15:36:54] <_joe_> bblack: the significant jump is on nov 4th [15:37:08] through the 6th, where it kind of levels and dips a bit [15:37:13] <_joe_> bblack: I'd consider what we have before nov 4th a background noise [15:37:24] before shooting up again 8th through now, basically [15:37:27] <_joe_> or legitimate cases where the client fucked up [15:37:36] the rate is zero before oct 25th [15:37:43] and we ramped in v4 very slowly in terms of users [15:37:44] so it's relevant [15:38:04] <_joe_> bblack: that's because elukey removed a filter on logstash that was preventing apache logs from being recorded [15:38:07] the first _text node upgraded to v4 was cp2023 on Nov 3rd [15:38:16] on oct 25th? [15:38:28] heh [15:38:31] ok [15:38:32] <_joe_> :) [15:38:45] <_joe_> ema: I think it kinda fits what we're seeing [15:38:55] anyways, we can try to correlate with the rest of our transition [15:39:06] the dates we first turn on codfw, then ulsfo, then esams, then eqiad [15:39:12] with the rise in that graph [15:39:24] <_joe_> bblack: any changes to nginx at the same time? [15:39:27] which day did we first deploy the extrachance workaround in codfw? [15:39:38] _joe_: sure, lots :) [15:39:58] <_joe_> bblack: you work too much, see? [15:40:08] <_joe_> we should institute ITIL [15:40:14] the rate of change around here is pretty constant, we just need to find the correlations [15:40:26] <_joe_> so that we can assure our change stream is easily correlatable [15:40:35] find the SAL stamps for the DCs switching text to v4 [15:40:48] and for when we pushed the initial extrachance fix to codfw [15:40:51] bblack: extrachance workaround first deployed on 2016-11-08 [15:41:30] ema: but then we also upgraded esams not longer after that, right? [15:41:49] the 9th is where it takes off like esams then eqiad [15:41:52] the day after, 9th [15:43:19] based on that graph, I'd say this is fallout from rolling out v4, not from the extrachance fix, and probably-not from nginx/openssl changes, but I'm still trying to look at those dates too [15:44:21] yesterday the 10th was the latest nginx update, which is unlikely to be causative (it's changing stapling and logging stuff, not core things) [15:45:33] Nov 3 - libssl updated for nginx, to fix the readahead issue (which was already mitigated at the varnish level earlier than that, and other than that there's no real change to libssl, but still...) 
[15:45:57] specifically, cache_text switched to the above new libssl at 14:08 UTC Nov 3 [15:47:20] the last nginx/libssl change before that was ~12:01 UTC Nov 1, for the +wmf13 bump, which was just silencing an unrelated warning, so unlikely to do anythign important. [15:47:58] nginx wmf12 had a fair number of changes in it, was oct 29th [15:48:19] then back to the 18th for our initial libssl1.1-based nginx deploy [15:49:27] err 19th, with a downgrade on the 24th, then I guess we didn't re-upgrade until the 29th rollout of wmf12 [15:49:39] so, to un-ramble that [15:51:07] Oct 19th: nginx wmf3 (first libssl1.1 attempt), 24th: nginx reverted, 29th: upgraded to libssl1.1 again with wmf12, Nov 3rd: upgraded libssl itself with an upstream bugfix (which could in theory be bad in new ways), Nov 10th - upgraded nginx for new things that are unlikely to cause something like this. [15:52:41] I'm double-checking the zoomed stats for nov3 just in case, but I think none of that correlates well [15:52:47] <_joe_> ok honestly varnish 4 seems the likeliest suspect [15:53:14] yes [15:54:00] but what it's likely to be doing is failing in new ways in response to bad requests, mostly. I mean, we know we can simulate it with bad requests. Those can't have been working before, either, even if they showed up differently in our logs and responses. [15:54:18] we still need to find out what related failure is happening on a supposedly-good request [15:54:27] yup [15:56:05] if nginx were mangling the input on a special case, either it was always doing it (not a good match, we have new reports from editors), or its version changes don't correlate well with the graphs. [15:56:39] (but that latter point is weak: the graphs could be tracking a much-larger rate of legitimately-bad requests, and we're hunting for the isolated ones within that shouldn't have failed) [15:57:19] what made me bring nginx into the picture here is that varnish-fe itself receives CL > 0 and empty body [15:57:34] right [15:57:43] the question is whether nginx got that or create it [15:58:16] I tend to think it received it, because it would line up with your earlier correlation that the errors mostly happened over HTTP/2 client conns [15:58:27] I think [15:58:51] (there's a lot of stretchy linksin there about whether a faulty client would close after the empty POST on HTTP/1, though) [15:59:35] just to broaden the focus a bit: [16:00:12] stashedits are an optimization. the idea with them is a JS event fires in the browser and saves the wikitext currently being edited, to pre-cache it server side and make saving at the end quicker. at least that's what I think I heard about it. [16:01:07] if those got silently-broken in a way that was hard to notice server-side and didn't blurt an error to the client, it would basically be a perf impact and not noticeable to the user, right? [16:01:27] <_joe_> bblack: that's my understanding as well [16:01:33] so it's possible they actually-are broken in some cases and have been independently of this, and the change in varnish's response to the situation makes a 503 bubble up to the user now. [16:02:02] <_joe_> bblack: if they were broken we would've noticed earlier? [16:02:05] most of the varnishlog captured have x-requested-with: XMLHttpRequest, for the record [16:02:10] <_joe_> also, it's not just stashedits [16:02:17] <_joe_> it's mostly stashedit [16:02:33] <_joe_> because they are overwhelmingly the majority of POST traffic [16:02:33] the others could be fallout of a broken->503 stashedit, though? 
[16:02:40] <_joe_> bblack: not really [16:02:43] but yeah, ok [16:02:52] <_joe_> bblack: actually most users don't notice [16:03:16] I wish we could induce this without a broken request [16:03:16] <_joe_> the guy that opened the ticket just happened to stumble upon it in a very visible scenario [16:04:01] or even better: get a real log of the http traffic on a broken request, not just a varnishlog [16:04:02] <_joe_> can we check in varnish if the request body and the content-length do not match and log something in that case? [16:04:10] as in, a tcpdump of the nginx->varnish TCP data both ways [16:04:15] <_joe_> heh [16:04:20] <_joe_> yeah that would be great [16:04:45] I think we're pretty sure varnish thinks CL+body don't match, the question is why [16:04:54] (and when) [16:05:17] as in, perhaps varnish is getting the right CL and body and it "looses" the body before varnishlogging [16:05:25] if we could at least confirm that it arrived broken at the front edge of varnish, then it changes things dramatically, vs looking for varnish fucking up internally [16:05:33] exactly [16:05:38] <_joe_> I would add that it happens randomly [16:05:45] right [16:05:54] <_joe_> I can see sequences of stashedit requests working correctly from one user [16:05:59] <_joe_> and then the failed one [16:06:03] so the other sort of crazy direction I was going to go earlier with stashedit specifically.... [16:06:33] was that stashedit probably fires on a timer in our JS, and it may line up exactly with when the browser or nginx was giving up on an idle connection. [16:06:45] oh man [16:06:53] or nearly-exactly, enough that it's a race and shows up sporadically [16:08:45] so, one thing we could do is identifying the earliest (in term of request processing) varnish code path executed and log CL!=body size [16:08:50] you might find interesting new evidence by taking a test cache machine you're logging these on, and altering the nginx and varnish timeouts that are "client"-facing in each [16:08:57] alter them enough to find where it affects something [16:10:02] nginx is set up to not buffer (including not buffering body). as nearly-well as it can, the sequence should be similar to: [16:10:26] client sends first bytes -> nginx opens conn to varnish and starts forwarding bytes [16:10:44] so timeouts on the front of either would have similar effects [16:12:32] ema: an easier idea to try. low chance of success, but easy to verify: edit our tlsproxy localssl.conf and turn proxy_request_buffering back on [16:12:46] I saw "low chance", but maybe more like middling-chance [16:13:34] https://github.com/wikimedia/operations-puppet/blob/production/modules/tlsproxy/templates/localssl.erb#L73 [16:14:07] it seems like functionality which is related to all things above. 
and it's in nginx and/or affects nginx->varnish boundary, and we only turn it on when varnish_version4 [16:14:40] we've had that throughout the transition I think, but maps, misc, and upload probably don't exhibit the same special cases with POST reqs (or nearly as many in general) [16:15:53] there was nginx bug traffic at points in the past about interaction between that and HTTP/2 as well, and you were saying some of these longs tend to be H2=1 [16:16:28] correct [16:16:54] may also interact with: [16:16:57] https://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_body_preread_size [16:17:11] ^ which we've left at the default of 64k, which is supposed to be a fine value all things considered [16:17:46] but maybe the bug is specific to client-CL > http2_body_preread_size + proxy_request_buffering=off [16:18:38] (which you could test with a legitimate >64k+ POST-data request into nginx over HTTP/2, although I donno if curl would trigger the same thing or you'd need to fake it with a real browser to get different HTTP/2 behaviors) [16:19:39] the body preread is only supposed to matter for an RTT or two, basically long enough for nginx to respond and inform the client of its buffer size (the client starts sending POST data immediately, even though nginx hasn't yet had a chance to use HTTP2 to signal its buffer size for that data) [16:19:51] but, all kinds of bugs could lurk there [16:20:34] still, the big change for varnish4 here is that it turns proxy_request_buffering=off in our nginx config regardless of our nginx package rollouts [16:20:59] right, I could try that on a single host and see if anything changes [16:21:17] it may not fix the repros with known-broken requests of course [16:21:24] but it may stop the organic 503s that seem related? [16:22:39] I still think this could be a change in hidden error behaviors, too [16:23:15] maybe these few POSTs that randomly fail *are* bad (for whatever reason), and if nginx is buffering bodies it returns 400, but now without buffering it becomes a 503 and presents to the user differently (or at all) [16:23:49] in any case, back to semi-vacation mode and good luck :) [16:24:29] oh are you on vacation? :) [16:24:32] bblack: enjoy! [16:25:08] well I'm on us holiday supposedly [16:25:14] it's not a very big holiday :) [16:25:30] (at least, in terms of practical impact on me!) [16:27:04] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2788993 (10Aklapper) Outsider comment: The task summary currently says "Point wikipedia.in to 205.147.101.160 instead of URL forward". If I curr... [16:37:05] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2789004 (10Naveenpf) Hi Aklapper, We are having multiple websites in same server. We are doing the same for all other Indic websites. [root@e2... [16:43:26] very interesting [16:43:52] I'm testing proxy_request_buffering=on on cp3042 and I didn't get any 'buggy' 503s yet [16:44:21] I did get a 503, but with - RespHeader X-Cache: cp3041 pass, cp3042 pass [16:52:20] ema: didn't get the "but with" [16:52:29] (I am curious :) [16:53:02] elukey: I got only one 503, and it's a different issue compared to what we're debugging [16:53:20] ah nice! 
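A sketch of the ">64k POST over HTTP/2" test suggested above. It assumes the Python httpx library (nothing in this log uses it) because, as noted, plain curl may not behave like a real browser's HTTP/2 client; the idea is to push a legitimate body past nginx's default http2_body_preread_size while proxy_request_buffering is off and see what comes back:

```
# Assumption: httpx with HTTP/2 support installed (pip install 'httpx[http2]').
import httpx

body = b'x' * (70 * 1024)   # just over the 64k preread window

with httpx.Client(http2=True, timeout=90) as client:
    r = client.post(
        'https://en.wikipedia.org/w/api.php',   # illustrative endpoint
        content=body,
        headers={'Content-Type': 'application/x-www-form-urlencoded'},
    )
    print(r.http_version, r.status_code,
          r.headers.get('x-cache', ''), r.headers.get('x-cache-status', ''))
```

Running the same request with http2=False, and with proxy_request_buffering toggled on the target host, would help narrow down whether the client-CL > preread + unbuffered-proxying combination is the trigger.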
[16:53:27] sorry I thought it was a mutation of the issue [16:54:07] I am reading http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_request_buffering [16:54:10] interesting [16:54:54] so nginx would issue a 400 rather than proxying stuff? [16:57:06] mmm not sure [16:57:06] elukey: that's my understanding yes [16:57:38] anyhow, good finding! [16:57:58] puppet-merged, let's see if the rate goes down [17:22:36] so the 503s going away with buffering back on has at least 3 explanations I can think of: [17:22:54] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2789087 (10ema) We've set nginx's proxy_request_buffering back on: https://gerrit.wikimedia.org/r/#/c/321000/ and that seems to help. [17:23:08] 1. The client requests are actually-bad, and whether or not varnish or nginx could've handled them better or not, what we've done is converted 503s into invisible 400s [17:24:04] 2. The client requests are legit, and there's an nginx bug triggered by unbuffered post-proxying (maybe in combination with HTTP/2 and/or POST over a certain size), and this squished the bug [17:24:47] 3. The client requests are legit, nginx messes them up no matter what, and we've converted that failure from a visible 503 into an invisible (to our logging, but probably not to users) 400. [17:25:14] in any case, it's progress, so \o/ :) [17:25:22] yep! [17:25:45] I think (2) is the most-likely, but we should keep an eye out for users reports that might support (3) [17:29:42] looking good https://goo.gl/jIiiG3 [19:17:22] bblack: as I'm checking out for the week, should I file a task for that lftp issue or do you think that it's not worth investigating further? [20:30:35] nice work guys, https://goo.gl/jIiiG3 looks great! [23:04:19] paravoid: I don't know, it might be worth posting something in phab just so we don't forget to check back later. it would be interesting to understand why, or to see if any of the 10 other things we fix in the near future happen to fix it. but it may just be that lftp's implementation in that version is uniquely-bad in a way that reacts uniquely-badly with our stuff. [23:04:53] I didn't file one for the openssl buffering issue either heh [23:47:03] 10Traffic, 06Operations: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789604 (10BBlack) [23:52:07] 10Traffic, 06Operations: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789618 (10BBlack)
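T150561 above circles back to the OpenSSL-1.1.0 write-buffer issue from the start of this log (02:20-02:23). A rough, illustrative client-side way to watch for the extra round trip is to time the TCP connect and the TLS handshake separately and compare the ratio before/after a server-side change:

```
# Sketch only: the handshake costs a small fixed number of round trips, so a
# jump in the handshake/connect ratio (e.g. after OCSP stapling data pushes
# the server's first flight past a 4K write buffer) hints at an extra RTT.
import socket
import ssl
import time

host = 'en.wikipedia.org'   # illustrative target

t0 = time.monotonic()
sock = socket.create_connection((host, 443), timeout=10)
t_connect = time.monotonic() - t0            # roughly one RTT

ctx = ssl.create_default_context()
t1 = time.monotonic()
tls = ctx.wrap_socket(sock, server_hostname=host)
t_handshake = time.monotonic() - t1

print('TCP connect %.1f ms, TLS handshake %.1f ms, ratio %.1f'
      % (t_connect * 1e3, t_handshake * 1e3, t_handshake / t_connect))
tls.close()
```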