[00:53:32] netops, Cloud-Services, Operations, cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (ayounsi) Open→Resolved 208.80.155.128/25 has been cleaned up from Netbox and DNS (see above).
[07:18:05] netops, Operations: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (elukey)
[07:43:40] Traffic, Operations: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez)
[07:50:47] Traffic, Operations, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4024.ulsfo.wmnet'] ` The log can be...
[08:06:58] Traffic, Operations: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez)
[08:33:05] Traffic, Operations, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4024.ulsfo.wmnet'] ` and were **ALL** successful.
[09:55:16] Traffic, Operations, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4025.ulsfo.wmnet'] ` The log can be...
[10:37:39] Traffic, Operations, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4025.ulsfo.wmnet'] ` and were **ALL** successful.
[10:57:03] Traffic, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (fgiunchedi) >>! In T196066#5153205, @Ottomata wrote: > Hm but also, whatever we replace varnishkafka with...
[13:06:36] so all the text-dcname alerts last night, that was all from 1-2 eqiad backends having issues?
[13:07:25] ema: I remember you had some cron-shuffling issues and then patched it away, but maybe in the interim time some of them missed (jumped over) their cron restart times for a span?
[13:08:04] oh right your issues would only be within cache_upload, and this was text
[13:08:58] correct, the cron-shuffling affects upload only
[13:09:13] it just seems odd we had more than one node go out in series
[13:09:27] I get that varnish backends have some misbehaviors that pile up and a restart fixes it
[13:09:51] but the 3.5-day cron is meant to adequately cover it
[13:10:23] I suspect in at least some of these recent cases, it's that some other anomaly (of external requests, or internal behavior) exacerbates and triggers the varnish misbehavior
[13:11:04] that is possible, yes. Also the FetchErrors were different from the usual "Could not get storage" this time around
[13:11:19] ends up being the same solution, assuming the trigger was temporary, but it would be nice to investigate the triggers as well.
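To make "investigate the triggers" concrete, the FetchError reasons can be tallied straight out of varnishlog. The sketch below is illustrative only (not an existing tool); it assumes the plain-text output of `varnishlog -b -g request`, where FetchError records appear as lines like "--  FetchError      HTC status -1", and simply counts the detail strings:

```
#!/usr/bin/env python3
# Rough sketch (not an existing tool): tally the FetchError reasons a varnish
# backend is seeing, to find out which anomaly dominates ("Could not get
# storage", "HTC status -1", "http format error", ...).
import collections
import subprocess

def tally_fetch_errors(duration="300"):
    # Bounded capture window; varnishlog and timeout exist on the cache hosts,
    # but the exact invocation here is illustrative only.
    cmd = ["timeout", duration, "varnishlog", "-b", "-g", "request",
           "-q", "FetchError"]
    counts = collections.Counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        parts = line.split(None, 2)   # e.g. ["--", "FetchError", "HTC status -1"]
        if len(parts) == 3 and parts[1] == "FetchError":
            counts[parts[2].strip()] += 1
    return counts

if __name__ == "__main__":
    for reason, n in tally_fetch_errors().most_common():
        print("%6d  %s" % (n, reason))
```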
[13:12:02] yeah one of the other recent ones I looked at, mbox lag was a symptom, but the main issue was that it ran out of threads/conns to the backends, implying the underlying cause was high response latencies (up to outright indefinite stall for all practical purposes)
[13:12:24] but the "could not get storage" can also cause threads/conns to pile up, cause/effect is not always obvious :/
[13:13:01] this EU morning we got mostly "HTC status -1", and occasionally (seldom) "http format error"
[13:13:21] the latter in particular looked weird (eg: https://phabricator.wikimedia.org/P8469)
[13:13:52] the former happens when HTC_RxStuff returns HTC_S_EOF, AFAIU
[13:14:05] heh
[13:14:07] -- BogoHeader 1st header has white space:
[13:14:07] -- HttpGarbage "cted"> -- BerespStatus 503
[13:14:11] (from your paste)
[13:14:26] so it never saw a boundary between http headers and body
[13:14:37] (when fetching from the backend!)
[13:15:01] and that was appservers.svc, not some random misc service
[13:15:47] but a far more likely explanation for that (than php/hhvm missing the breakpoint between headers and body of a legit response)
[13:16:21] is that it's a connection reuse issue, where a gzip error or other kind of content-length error / 500-ish issue happened in the request before that one on the same reused connection.
[13:16:38] and the backend was still spewing HTML from the broken request varnish gave up on, while varnish sent it a new request and tried to read a response
[13:17:39] we've had various issues of this sort in the past, and all were kinda tricky. wouldn't be shocking if something critically-related has changed with the php transition...
[13:18:09] (but since we're not getting response headers at all in that scenario, we can't actually tell which of php/hhvm was doing things on that conn before)
[13:18:38] it is also entirely possible that the "http format error"s are unrelated to this morning's issues and simply happened at the same time (they were happening significantly less often than "HTC status -1")
[13:19:20] yeah, but still, it doesn't take that many random garbage 503s like that causing the recycling of otherwise reusable connections to start piling up connection counts and latency either.
[13:19:46] (I assume after the garbage one you pasted, varnish probably did disconnect the conn at that point)
[13:21:36] yup: https://phabricator.wikimedia.org/P8469$40
[13:25:03] probably not the gzip explanation, since the garbage looks like uncompressed html
[13:25:24] but it could be some other kind of length issue, which usually happens on some kind of error
[13:25:46] (e.g. backend emits a 500 with a bad CL-vs-body in the request before?)
[13:26:12] would seem to conveniently-similar to the unrelated recent swift issue though, so that's probably just my brain bias
[13:26:19] *too
[13:26:48] I don't know how we'd find those though
[13:27:44] if it were a 500 with no CL at all and no chunking, it would imply close-delimited... but then I think varnish would wait on the close or close for itself after some timeout
[13:27:55] (not reuse and send it a new req)
[13:28:33] it could of course also be some rare case of a 200 or 204 with a body/CL mismatch
[13:29:00] (well not 204 I guess, by definition)
[13:29:06] hmmmm
[13:29:41] anyways
[13:30:04] maybe backend fetches causing errors like BogoHeader / HttpGarbage is something we should have stats or logs on
[13:30:16] then at least we could see the patterns in them and experiment
[13:30:48] (I would like to say just let it ride until we replace it with ATS, but for text that's still quite a long way out)
[13:36:27] indeed. I've gotta go afk for a while now, talk later!
[13:36:51] ok
[13:37:22] so 5/6 ulsfo upload backends on ATS, nice
[13:39:34] the caching graphs all look pretty sane and expected too, "remote" cache stats almost-gone
[13:39:49] total hitrate is in reasonable-range considering all the churn
[13:40:37] hmmm T222452 might be possibly-related too
[13:40:38] T222452: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452
[13:41:09] if PHP is having some kind of hard fatal like, and it's happening mid-response-stream... maybe?
[13:41:40] (if apache is doing the connection reuse part on MW's behalf)
[13:42:10] who knows, grasping at straws at this point
[13:43:05] you'd think that would just truncate/timeout though, unless the fatal also spews excess garbage beyond the originally-intended CL
[13:43:15] (excess HTML-looking garbage)
[14:32:27] Traffic, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (Ottomata) I don't think Magnus would build it into librdkafka itself. But, a 3rd party C library could b...
[14:56:49] <_joe_> bblack: related to what?
[14:56:57] <_joe_> that happened on a single server for 20 minutes tops
[14:57:33] <_joe_> but I can see it happening because of some applayer fart indeed
[14:58:12] <_joe_> if we got a "bad response" to include the actual origin server
[14:58:17] <_joe_> so that I can go inspect
[14:58:24] <_joe_> we could see if it comes from php-fpm
[15:10:29] _joe_: the scenario I'm guessing about is a tricky one to debug, but to recap it succinctly:
[15:11:32] https://phabricator.wikimedia.org/P8469 has ema's log of one of the 503s. What happens from varnish pov here is that it sends a request to appservers.svc, and instead of getting http headers back, it gets some HTML content, which it interprets as garbled headers and converts to a 503....
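The "stats or logs on BogoHeader / HttpGarbage" idea floated at 13:30 could be prototyped along these lines: a long-running process on each cache host that follows varnishlog and exports per-record counters, so the background rate becomes visible over time. This is a sketch under assumptions, not an agreed design — the metric name, labels and port are invented, and it presumes the python prometheus_client library plus the same varnishlog text format as in the tally sketch above:

```
#!/usr/bin/env python3
# Sketch of a tiny exporter that counts BogoHeader / HttpGarbage / FetchError
# records seen by the local varnish backend.  Names and port are illustrative.
import subprocess
from prometheus_client import Counter, start_http_server

WATCHED = ("BogoHeader", "HttpGarbage", "FetchError")

fetch_anomalies = Counter(
    "varnish_backend_fetch_anomalies_total",
    "Backend fetch anomalies seen in varnishlog",
    ["record", "detail"],
)

def follow_varnishlog():
    cmd = ["varnishlog", "-b", "-g", "request",
           "-q", "BogoHeader or HttpGarbage or FetchError"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] in WATCHED:
            # keep only a short prefix of the detail to bound label cardinality
            fetch_anomalies.labels(parts[1], parts[2].strip()[:40]).inc()

if __name__ == "__main__":
    start_http_server(9131)   # arbitrary port for the sketch
    follow_varnishlog()
```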
[15:12:08] <_joe_> heh I said if I had the origin server name
[15:12:12] <_joe_> but that's not available
[15:12:14] right
[15:12:18] <_joe_> as you don't get headers back
[15:12:29] <_joe_> now we could go match time and url
[15:12:47] so my working theory is that this is a re-used connection to appservers.svc, something went horribly wrong with the previous req->resp on that connection, and it was still spewing HTML that varnish wasn't expecting, when it sent the next req
[15:13:21] I think something in that varnishlog paste can tell us whether it was actually a reused or fresh connection, but I can't recall offhand
[15:13:45] L32 says "BackendOpen", but I don't think that means it had to be a fresh conn
[15:14:17] <_joe_> there is just one issue with this theory
[15:14:31] <_joe_> 1 - it's happening on specific varnish backends at a time
[15:14:37] <_joe_> sorry, two issues
[15:14:48] <_joe_> 2 - it seemed to happen consistently both to hhvm and php users
[15:14:57] <_joe_> now I can imagine one appserver having bugs
[15:14:59] <_joe_> but both?
[15:15:06] <_joe_> and apache - why just now?
[15:17:02] the whole "specific varnish backends" thing is unrelated: either this was a causative trigger that exacerbated varnish-internal problems by piling up some excess 503s/reconnects, or it's completely unrelated and just a fault we noticed while digging. But either way I'd expect to see these BogoHeader things anywhere, not varnish-specific.
[15:18:00] anyways, you can imagine a simpler version of this theoretical scenario unfolding like this (but I don't think this is the actual case):
[15:18:55] 1. Varnish sends Req1 to appservers.svc over Conn1, and the response headers have "Content-Length: 1024", followed by >1024 bytes of body
[15:19:16] 2. Varnish forwards and/or caches the (truncated to 1024) response, puts Conn1 back on the reusable list.
[15:19:19] <_joe_> ok I have some additional info
[15:19:27] <_joe_> that url in P8469
[15:19:37] 3. Varnish sends Req2 to appservers.svc over Conn1, gets the rest of the HTML when expecting response headers
[15:19:43] <_joe_> it was forwarded to two appservers
[15:19:55] <_joe_> mw1321.eqiad.wmnet and mw1271.eqiad.wmnet
[15:20:12] <_joe_> and the content-length was different
[15:20:13] but I doubt there's so simple a bug as "excess content body in a normal response"
[15:20:27] <_joe_> 126495 vs 126707
[15:20:34] hmmm ok heh
[15:20:43] <_joe_> that's a bit much for normal caching diffs
[15:21:41] so possibly-related to "forwarded to two appservers": if the first attempt results in a 5xx (for any reason, even varnish-internal crap), varnish-frontend is going to restart the request for a once-only retry (back through the varnish backends and applayer) before giving up and presenting 5xx to the user.
[15:22:05] err sorry, that's not for all 5xx, that's only specifically for 503
[15:22:14] but we do have a 503 in this case
[15:22:45] anyways, are you saying mw1321 and mw1271 got different copies of the exact same request, or just two independent external requests for that URI?
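The three numbered steps above can be reproduced in miniature. The toy below is not Wikimedia or Varnish code, just a local socket pair showing how a response body that exceeds its declared Content-Length leaves leftover bytes on a reused connection, right where the next status line is expected:

```
#!/usr/bin/env python3
# Toy illustration of the reused-connection desync: an origin whose response
# carries more body bytes than its declared Content-Length, and a client that
# trusts the Content-Length and reuses the connection.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 8099          # arbitrary local port for the demo
EXCESS = b"\t\t\t\t\t</div>"            # HTML-ish junk beyond the declared length

def broken_origin():
    srv = socket.create_server((HOST, PORT))
    conn, _ = srv.accept()
    with conn, conn.makefile("rb") as rf:
        for _ in range(2):                           # two requests, one connection
            while rf.readline() not in (b"\r\n", b""):
                pass                                 # skip the request headers
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n"
                         b"hello" + EXCESS)          # 5 declared, 5 + junk sent

def trusting_client():
    with socket.create_connection((HOST, PORT)) as s, s.makefile("rb") as rf:
        for i in (1, 2):
            s.sendall(b"GET /req%d HTTP/1.1\r\nHost: demo\r\n\r\n" % i)
            print("resp %d status line: %r" % (i, rf.readline()))
            headers = {}
            for line in iter(rf.readline, b"\r\n"):  # read headers up to blank line
                if b":" in line:
                    k, v = line.split(b":", 1)
                    headers[k.strip().lower()] = v.strip()
            # read exactly Content-Length body bytes, like varnish would;
            # the excess stays buffered and pollutes the *next* response
            rf.read(int(headers.get(b"content-length", 0)))

threading.Thread(target=broken_origin, daemon=True).start()
time.sleep(0.2)                                      # crude startup sync for the demo
trusting_client()
```

When run, the second printed "status line" begins with the five tabs and </div> rather than HTTP/1.1, which is essentially the condition the BogoHeader / HttpGarbage records describe.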
[15:23:14] <_joe_> yeah
[15:23:36] <_joe_> I'm saying there are always small diffs between two requests
[15:23:46] <_joe_> mostly in the metadata that goes in html comments
[15:23:57] <_joe_> this seemed a bit much
[15:24:05] <_joe_> but that page uses a few templates AIUI
[15:24:13] yeah in general it should be ok though, if we restart a request we get the new CL too
[15:24:23] <_joe_> yes
[15:24:32] <_joe_> no my main point was anyways
[15:24:40] <_joe_> two appservers both returned 200 ok
[15:24:46] <_joe_> as far as they are concerned
[15:24:53] <_joe_> but clearly not according to varnish
[15:24:59] <_joe_> at least the first request failed
[15:25:44] <_joe_> sorry
[15:25:47] <_joe_> I'm an idiot
[15:25:50] <_joe_> I looked at the timing
[15:25:53] I just hit that same URI with curl from the outside like a normal user, and got a cache hit from < X-Cache: cp1089 hit/4, cp2012 hit/3, cp2004 hit/2
[15:25:55] <_joe_> not the content-length
[15:26:02] <_joe_> that's the same to the byte
[15:26:03] and its CL and actual size are 50739
[15:26:15] <_joe_> it was 51384 back then
[15:26:30] it does end with </html>, so it's not truncated
[15:26:34] <_joe_> same as in a request at 2019-05-02T23:34:12
[15:26:49] <_joe_> that didn't generate an error if I can trust the logs on weblog
[15:27:35] but that 50739 CL was an older cache entry, if I cachebust I now get 51495 bytes of data
[15:27:43] <_joe_> yeah
[15:27:48] <_joe_> caching changes a lot
[15:28:11] ?
[15:28:28] it shouldn't change it a lot
[15:28:34] <_joe_> no I mean
[15:28:41] <_joe_> the inner applayer caching state
[15:28:50] <_joe_> and well, template transclusions
[15:28:50] oh right
[15:29:11] anyways, like I was saying earlier, I doubt it's a straight up "send the wrong CL for the body" situation on a normal response
[15:29:18] <_joe_> heh
[15:30:01] in similar scenarios we've seen in the past, it's been more like "there was some kind of horrible error in one req->resp cycle, which then led to the trailing garbage output seen on the next req->resp on that conn"
[15:30:28] <_joe_> it looks like some memory leaking over
[15:30:44] (strange error cases tend to bleed over from one req to the next with conn reuse)
[15:31:09] e.g. if output was halted mid-chunk or before CL was exhausted, due to crash
[15:31:18] but the connection stayed alive because apache was mediating that
[15:31:56] <_joe_> that would make sense if the user finally received a "good" response from the second attempt
[15:32:01] <_joe_> which went to another server
[15:32:07] hmmm
[15:32:15] <_joe_> and indeed
[15:32:29] <_joe_> I can't find Historical_present in the archived 5xx.json
[15:32:45] right
[15:32:50] <_joe_> but so many errors? I'd assume they'd be all over the place
[15:33:16] if my scenario is even approximately correct, we'd expect after the ema-paste scenario, varnish tore down that connection, and the varnish-frontend retried for the user and got a different random success on a new conn
[15:34:01] in which case you'll never find the bug looking for an erroneous output to a request for Historical_present. Its output is always ok in isolation.
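The manual curl comparison above (declared Content-Length vs bytes actually received, on both a cache hit and a cache-busted fetch) can be scripted roughly as follows; the cachebust parameter name is arbitrary and the URL is simply the article already being discussed:

```
#!/usr/bin/env python3
# Sketch: fetch the page once (likely a cache hit) and once with a
# cache-busting query string (forcing a miss/pass to the applayer), then
# compare declared Content-Length with what was actually received.
import time
import urllib.request

URL = "https://en.wikipedia.org/wiki/Historical_present"

def check(url):
    req = urllib.request.Request(url, headers={"User-Agent": "cl-check-sketch"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        print("%s\n  X-Cache: %s\n  declared CL: %s  received: %d  ends with </html>: %s"
              % (url, resp.headers.get("X-Cache"), resp.headers.get("Content-Length"),
                 len(body), body.rstrip().endswith(b"</html>")))

check(URL)
check(URL + "?cachebust=%d" % time.time())
```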
[15:34:24] <_joe_> yep
[15:34:36] some other rare event happened as the request just before that one on the same reused connection, and messed up the peers' idea of the connection state (varnish thinks the connection is idle and ready for another connection, but in fact the applayer side is still sending unexpected html)
[15:34:55] <_joe_> but that happened on multiple varnishes the same morning
[15:34:57] oops s/ready for another connection/ready for another request/
[15:34:58] <_joe_> it's a bit strange
[15:35:17] it probably happened on all of them randomly, may still happen at some low incidence rate
[15:35:47] just some of them were already borderline on falling apart due to their own internal issues, and the excess 503s/reconnects caused escalating connection-count and knocked them over
[15:37:58] (and/or they happened to be a chash-target for whatever unknown case messes up the connection before the one with the observed 503)
[15:38:26] anyways, we need better logs/stats on the BogoHeader / HttpGarbage log entries from varnish
[15:38:50] so we can see if they're happening at some constant background rate
[15:39:14] (constant-ish anyways)
[15:43:40] I picked 3 eqiad text varnish-backends, and ran a quick manual varnishlog to catch 503 responses from applayer (or internal)
[15:44:11] and already picked up a few and they're all similar examples
[15:44:13] -- BogoHeader 1st header has white space:
[15:44:13] -- HttpGarbage "special"%00"
[15:44:58] ah interesting!
[15:45:03] some are slightly different, they have instead the unique marker:
[15:45:06] -- FetchError HTC status -1
[15:45:10] which ema mentioned earlier
[15:46:29] reading varnish source code, apparently "HTC status -1" is some sub-case of the condition where varnish was fetching a response and it was incomplete
[15:46:56] basically after it's "done" fetching the response, there are several possible error cases with abnormal reasons for being done
[15:47:25] "Received junk", "backend closed", "timeout", "overflow"
[15:47:39] and then this is the default in the switch statement for other errors that didn't fit above heh
[15:47:46] by quickly reading the code earlier today it seemed to me that -1 is returned when EOF is reached
[15:48:06] well EOF would mean close right?
[15:48:10] or rather, when HTC_RxStuff returns HTC_S_EOF
[15:48:17] oh I see
[15:48:57] in my tiny sample, they seemed almost like matching pairs, the "HTC status -1" and the other bogoheaders
[15:49:17] it may be that HTC status -1 happens first, and the conn is messed up and gets reused causing a bogoheaders case
[15:49:23] I have no idea
[15:49:56] or it could be that the real original cause happens first and goes unnoticed by most things (because it's a 200 output to the user, just happens to get truncated by a shortened CL value)
[15:50:13] and then the bogoheaders thing happens on the next request of that conn (seeing the trailing body content)
[15:50:25] in the varnishlog I've captured today, "http format error" happened first, then the HTC ones
[15:50:31] and then the HTC status -1 happens later when the connection finally tears down
[15:50:34] see cp1089.eqiad.wmnet:~ema/503.log
[15:50:43] yeah
[15:51:32] varnish is obfuscating the underlying cause (of course!)
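For the matching-pairs / ordering question, a capture like ~ema/503.log can be skimmed with a small script that prints only the interesting records per transaction, in file order, so the sequence of "http format error", BogoHeader/HttpGarbage and "HTC status -1" events is easy to eyeball. A sketch, assuming plain `varnishlog -g request` text output:

```
#!/usr/bin/env python3
# Sketch: print each transaction id with its Timestamp / FetchError /
# BogoHeader / HttpGarbage / BerespStatus records, in file order.
import sys

INTERESTING = ("Timestamp", "FetchError", "BogoHeader", "HttpGarbage", "BerespStatus")

def scan(path):
    vxid = None
    for raw in open(path, errors="replace"):
        parts = raw.split()
        if not parts:
            continue
        if parts[0].startswith("*"):      # transaction header, e.g. "**  << BeReq >> 12345"
            vxid = parts[-1]
            continue
        if len(parts) > 2 and parts[1] in INTERESTING:
            print("%s  %-12s %s" % (vxid, parts[1], raw.split(None, 2)[2].rstrip()))

if __name__ == "__main__":
    scan(sys.argv[1])
```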
[15:52:04] a wide net to fish some of those is:
[15:52:06] sudo cumin -o txt 'A:cp-text' 'timeout 30 varnishlog -g request -q "FetchError ne \"Pass delivery abandoned\""'
[15:52:18] "HTC status -1" comes from HTC_RxStuff returning HTC_S_EOF, which it does in the case that VTCP_Read returns either of 0 or -1
[15:52:40] and VTCP_Read's retval is basically the read() retval, with all negatives normalized to -1
[15:52:58] so "HTC status -1" could either mean read returned zero for EOF, or it returned -1 and an errno value that we'll never see :P
[15:53:44] ah that second part I've missed
[15:55:31] anyways, probably not super-helpful anyways
[15:56:25] all we can really say for sure without better digging is probably "Something goes wrong with a req->resp cycle between varnish-be and appservers.svc in a way we're not seeing/detecting yet, and it messed up the connection for the next request too"
[15:57:26] at least, that seems more-likely than just "the connection was clean and fine on all sides, and appservers.svc emits junk garbage memory/html in response to a simple request"
[15:59:39] I guess that seems backwards from a fresh pov. knowing nothing else, the second scenario seems simpler and more-plausible
[16:00:03] but based on history of past incidents and guessed likelihoods of various things, the first seems more-likely to me
[16:00:47] on the other hand, though... those tiny fragments of HTML that we see in the BogoHeader / HttpGarbage log lines...
[16:00:58] they're not clean, they have random bytes in them that don't seem like correct html...
[16:01:39] looking back, they all seem to have an embedded NUL byte somewhere shown as "%00"
[16:02:29] and I wouldn't expect our content to ever have embedded literal NUL bytes. IIRC even a utf-8 stream never has a single-byte NUL value in the midst of a multi-byte encoding of anything.
[16:02:40] NUL always means NUL in the world of ascii and utf-8
[16:03:05] hmmm
[16:05:27] ema: I was just logging 503s in general, and didn't quickly find any that weren't one of these cases (HTC or BogoHeader)
[16:05:36] varnishlog -b -g request -q 'BerespStatus == 503'
[16:06:19] two completely unrelated URLs, going through two different varnishes:
[16:07:12] cp1087, user requested ar.wikipedia.org/wiki/%D8%AF%D9%8A%D9%81%D8%A7_(%D8%B1%D9%88%D9%85%D8%A7%D9%86%D9%8A%D8%A7)
[16:07:38] cp1085, user requested en.wikipedia.org/wiki/Special:Contributions/Leiem
[16:07:43] both had:
[16:07:53] -- BogoHeader 1st header has white space:
[16:07:53] -- HttpGarbage "special"%00"
[16:08:14] (exactly down to the byte, those two lines were the same in both cases)
[16:08:57] another more-recent one I saw, was a wikidata.org special entity data query with cache-busting, pass pass pass
[16:09:04] turns into a 503 with:
[16:09:06] -- BogoHeader 1st header has white space:
[16:09:06] -- HttpGarbage "cted"> err sorry wrong paste, this one is:
[16:09:23] -- BogoHeader 1st header has white space:
[16:09:23] -- HttpGarbage "cted"> heh, I had it right the first time
[16:09:43] that's the same garbage I've seen
[16:09:50] "cted"> it's interesting that the 5-tabs seems to be common
[16:11:05] there are several examples of such lines in "normal" article outputs, but not too many
[16:12:02] well
[16:12:28] in one normal article example I'm looking at, anyways, all the cases of lines with just 5 tabs followed by </div>, are all in one place
[16:12:52] which is inside of ...
[16:13:14] (the 5-tabbers close the divs within mw-head, such as
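Whether those five-tabs-then-</div> fragments really all cluster under the mw-head part of the skin markup could be eyeballed with a quick scan of a normally rendered page, along these lines; the URL and the "last id= seen" heuristic are illustrative only, based on the fragments quoted above:

```
#!/usr/bin/env python3
# Sketch: find every "newline + five tabs + </div>" occurrence in one rendered
# article and report a crude positional hint (the nearest preceding id=).
import re
import urllib.request

URL = "https://en.wikipedia.org/wiki/Historical_present"

req = urllib.request.Request(URL, headers={"User-Agent": "five-tab-check-sketch"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")

for m in re.finditer(r"\n\t{5}</div>", html):
    ids = re.findall(r'id="([^"]+)"', html[:m.start()])   # crude positional context
    print("offset %7d  last id seen: %s" % (m.start(), ids[-1] if ids else "?"))
```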