[10:14:58] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718772 (10ema) >>! In T179156#3717895, @BBlack wrote: > My best hypothesis for the "unreasonable" behavior that would break un...
[11:29:05] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718957 (10Lucas_Werkmeister_WMDE) > The only live polling feature I can think of that was recently introduced is for the live...
[11:48:51] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3719057 (10BBlack) >>! In T179156#3718772, @ema wrote: > There's a timeout limiting the total amount of time varnish is allowed...
[11:51:41] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3719064 (10hoo)
[11:55:24] ema: re fitting this in with the !CL problem...
[11:55:39] 1) We don't have a !CL problem on upload, so IMHO we could still try the exp policy there
[11:56:00] +1
[11:56:44] 2) I think we still ultimately want the "do_stream=false if !CL" behavior. But we need to find a better way to introduce some real limitations against bad backend behavior here (if we can even pin down exactly what that behavior is)
[11:57:05] reducing the 3x backend-side timeouts was a huge improvement in the situation
[11:57:50] I mean, ideally we don't have backends behaving like this anyways, but even in an ideal world it's sometimes going to happen due to some performance regression or whatever, and the cache layer should protect itself from carnage via some kind of tuning...
[11:58:01] yes
[11:58:33] timeouts-wise, I was thinking that varnish defaults for client-related timeouts are probably pretty wrong in our case on varnish-be
[11:58:47] very true :)
[11:59:02] also, the default backend-facing ones seem off as well, but it's hard to pick good ones
[11:59:17] 180s ttfb for the applayer though, that's nuts
[11:59:51] I'm pretty sure the outcome of the timeouts/retries RFC discussed briefly somewhere over the weekend was MW having some reasonable cap on request execution time, something like 60s
[12:03:09] https://phabricator.wikimedia.org/T97204 + some related discussions that may be linked into there somewhere
[12:04:41] (but I would ignore the Retry-After parts of that... I still think those don't make sense. A service that's unresponsive can't predict when it will be responsive again, neither can it predict that the next query to the same URI under different conditions will stall similarly or not (could depend on client request attributes exactly matching)
[12:14:16] not directly following from the above, but related: I think Varnish 4+'s new (vs 3) model of divorcing fe+be operations almost-completely probably makes this stuff harder.
[12:14:54] it's probably the case that aborting the fe side of a transaction flowing through a varnishd doesn't necessarily abort the be side, which might still try to complete in an effort to fill a cache entry for future use.
[12:15:59] what I wish varnish had here was some equivalent of "send_timeout on the client side", but as "receive_timeout on the backend side" - a total limit on the time it takes to slurp one response from a backend server.
[12:16:45] the closest thing I know of that it has is between_bytes_timeout, but for any value you configure there, it's possible for a backend server to dribble out bytes at just the right rate indefinitely and evade hitting that timer while taking minutes or hours to complete a single response.
[12:23:49] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3719145 (10Lucas_Werkmeister_WMDE) >>! In T179156#3719057, @BBlack wrote: >could other services on text-lb be making these kind...
[12:28:55] https://gerrit.wikimedia.org/r/#/c/231197/3
[12:30:41] (I just checked and that value is still there present-day)
[12:32:46] given hhvm max execution time is supposedly-capped at 60s, I think ttfb=65s should be reasonable
[12:33:06] (although I'd rather ignore ttfb and just cap the total here, but whatever, we have what we have for now)
[12:49:54] ema: have you already looked at the client-side timeouts for backends thing, thoughts?
[12:50:21] my initial take is it's tricky to set those lower for inter-cache if they're proxying/streaming for the front edge real clients too.
[12:50:28] but I donno, need to think
[12:51:19] bblack: re: that and https://gerrit.wikimedia.org/r/#/c/387225/, it would be good to spend some time observing the current behavior of different clusters/layers where it makes sense
[12:51:49] some of those should be relatively easy to observe by looking at varnishlog?
[12:52:15] * ema needs food, bbl
[12:52:18] yeah
[12:52:31] we know there are cases that trip it, I logged some of them over the weekend
[12:52:53] occasionally, spurious random timeouts on random URLs (e.g. slow response from MW on a relatively-simple page output)
[12:53:37] but more-consistently, certain APIs like flow-parsoid would often timeout at ~80s (presumably from eventually hitting between_bytes_timeout, but it could also be they hit some timeout behavior down in the application layers and then errored the connection at that point)
[12:54:44] so, client-side timeout parameters for the daemon itself:
[12:55:40] (that matter for what we're talking about here, anyways): idle_send_timeout, send_timeout, timeout_idle
[12:56:33] https://info.varnish-software.com/blog/understanding-timeouts-varnish-cache documents a timeout_req too, but I didn't catch it in the varnishd parameter docs
[12:59:29] we recently raised timeout_idle across the board on the backends, which I think I still like for various reasons
[13:00:33] I think send_timeout and idle_send_timeout are probably due for some evaluation on both backends and frontends, since nginx is fronting anyways (and very-slightly-buffering)
[13:03:24] that varnishlog entries don't show up until transaction-end is kind of ugly in this case
[13:03:51] if we have a trickling backend that manages to consistently evade between_bytes_timeout, it could be a very long time before it eventually shows up in the shm log
[13:14:06] I'm staring at the varnish code looking for an easy way to inject an overall "receive whole response from backend" timeout, but it's clearly not structured very well for it...
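A rough way to put numbers on the dribbling-backend worst case described above. This is a standalone toy calculation, not Varnish code; the timeout value, safety margin and body size are made-up assumptions:

```
/* Worst-case time a single response can take while never tripping a
 * per-gap (between-bytes) timeout: the backend sends one byte just
 * before each gap timeout would expire.  All numbers are hypothetical. */
#include <stdio.h>

int main(void)
{
    const double between_bytes_timeout = 60.0;  /* seconds, example value */
    const double margin = 1.0;                  /* byte arrives 1s before the deadline */
    const double body_bytes = 1024.0 * 1024.0;  /* a 1 MiB response body */

    double worst_case_s = body_bytes * (between_bytes_timeout - margin);
    printf("worst case: %.0f s (~%.0f days) for a 1 MiB body\n",
           worst_case_s, worst_case_s / 86400.0);
    return 0;
}
```

However you pick the per-gap value, it never yields a usable bound on total response time, which is what the "receive_timeout on the backend side" idea above is after.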
[13:14:27] the code is structured around per-read timeouts
[13:14:53] (since it's thread-per-conn anyways, they can just block on tcp-level receive and set the between- or first-bytes timeout as the TCP level timeout)
[13:16:32] hmmm no, they actually block in poll() then do an immediate read, rather than set the socket-level timeout. makes sense, but still basically the same.
[13:17:16] maybe can hook in a loose overall timeout that only gets checked once per received chunk of data (so it would have accuracy around ~between_bytes_timeout)
[13:54:10] bblack: timeout_req has been removed IIRC
[13:54:40] yeah, merged with timeout_idle in 4.1
[13:58:14] oh, there's been an icinga critical for cp4024 being down last night, I've ack'ed it assuming that the previous ack had expired
[13:58:56] yeah or downtime expired
[13:59:15] that machine is hardware-dead afaik, it didn't even make it through its last install attempt without crashing during install
[14:02:30] can't we just send it back and get a new one? :)
[14:03:32] yes, I think that's the general idea :)
[14:03:51] the issue is convincing Dell, since apparently we get no SEL entry and Dell's diagnostics don't fail
[14:09:05] ema: https://gerrit.wikimedia.org/r/#/c/387236/ it compiles! :)
[14:09:42] ship it then!
[14:10:31] while skimming through the code this morning I've sadly noticed that there are often no counters for when the timeouts kick in
[14:11:29] which is sad, but we could probably do something like varnishncsa|mtail to get the info without patching varnish everywhere
[14:11:35] yeah apparently some of my git-foo went wrong, the commit includes a bunch of made-header outputs it shouldn't
[14:11:58] that aside, I think this patch only applies the transaction_timeout to dribbling headers, not response bodies. still some work to go
[14:21:58] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3719622 (10Cmjohnson) 05Open>03Resolved this has been completed.
[14:22:17] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-eqiad: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3719635 (10Cmjohnson) 05Open>03Resolved Completed.
[14:27:53] interestingly, when transaction_timeout is applied to dribbling headers, this sets an overall timeout limit on the conn-close-race extrachance loops we mitigated before, too.
[14:30:41] heh, now that I'm looking for where to stab transaction_timeout handling into response body handling, I'm not sure that between_bytes_timeout actually applies to response bodies either :P
[14:30:51] it may only apply to timing gaps in sent headers
[14:34:23] bblack: it applies to the response body too
[14:35:04] (see ./bin/varnishtest/tests/b00022.vtc, I've tried bumping the delay between the two `send "Baba\n"` and it times out)
[14:35:43] hmmm ok
[14:36:11] I'm still working my way through all the crazy and poorly-documented abstractions
[14:36:27] but the one place I found between_bytes_timeout doing something useful, it sure sounds like it's only for resp headers, not bodies
[14:41:09] ugh, I get it now
[14:41:45] ok so, during header fetching, we can actually observe timeout_first_byte + timeout_between_bytes being used "manually", and it's easy to inject a transaction_timeout check there.
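A toy sketch of the "loose overall timeout checked once per received chunk" idea discussed above. This is standalone C reading from stdin, not the actual Varnish fetch path; the function name, the deadline handling and the numbers in main() are illustrative assumptions only:

```
#include <poll.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double now_mono(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Returns total bytes read (stops at EOF or a full buffer), or -1 on a
 * per-chunk or overall timeout.  The overall deadline is only checked
 * once per received chunk, so its accuracy is roughly one
 * between-bytes interval -- matching the "loose" idea above. */
static ssize_t read_with_deadline(int fd, char *buf, size_t len,
                                  double between_bytes_timeout,
                                  double deadline)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    ssize_t total = 0;

    while ((size_t)total < len) {
        if (now_mono() > deadline)      /* overall "transaction" limit */
            return -1;
        int r = poll(&pfd, 1, (int)(between_bytes_timeout * 1000));
        if (r <= 0)                     /* per-chunk (between-bytes) limit */
            return -1;
        ssize_t n = read(fd, buf + total, len - (size_t)total);
        if (n < 0)
            return -1;
        if (n == 0)                     /* EOF */
            return total;
        total += n;
    }
    return total;
}

int main(void)
{
    char buf[4096];
    double deadline = now_mono() + 30.0;            /* 30s overall budget */
    ssize_t n = read_with_deadline(0, buf, sizeof buf, 5.0, deadline);
    fprintf(stderr, "read %zd bytes\n", n);
    return n < 0;
}
```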
[14:42:14] but once the header-fetching is done, it does literally use between_bytes_timeout to do a setsockopt() read timeout on the TCP connection, and that's how it's enforced from there out for the body read.
[14:42:59] (so none of the crazy chain of token-pasted callback function names/pointers involved really have any notion of time/timeout parameters during body ready, it just relies on TCP timing out on a read() call)
[14:43:11] s/body ready/body read/
[14:43:34] ok, and I think the similar approach might have been followed for write()s
[14:43:54] VTCP_set_read_timeout() etc
[14:44:41] it's probably still possible to inject the checking of the transaction timeout somewhere, but probably tricky...
[14:49:11] v1f_read() is where the body read() call happens, ultimately
[14:49:37] and it does have a reference to htc, which has the timeout data
[14:50:19] well, it has some notion of the timeout data, maybe not the right notion if it's been customized per-req via VCL beresp.timeout_blah =X
[14:50:46] it doesn't have a ref to the busyobj (there probably isn't one in all callers)
[14:52:14] hmmmm
[14:52:37] VTIM_real() can be gotten there, we just also need data from bo->transaction_timeout, basically.
[14:53:46] it's tricky to pass down new params there though, it's the token-pasted state machine functions that go on stacks
[14:56:57] maybe that's silly anyways, maybe it would be easier to check it further up the abstraction, e.g. where VFP_Suck or similar are called?
[14:58:59] no, it's even sillier. I think htc's timeout is the right one, thanks to the FIND_TMO() macro earlier
[14:59:19] (I think that sets the timeout values on the htc to be whatever's needed for the transaction (busyobj) currently using the connection)
[15:03:22] I've shot a depooled varnish-frontend child in misc, the icinga check for child_start == 1 works
[15:05:14] mmh the grafana URL in the description has been messed up by puppet bleah
[15:20:34] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715032 (10daniel) @BBlack wrote: > something that's doing a legitimate request->response cycle, but trickling out the bytes o...
[15:30:07] ok I think I'm getting somewhere
[15:34:47] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3719914 (10BBlack) Trickled-in POST on the client side would be something else. Varnish's timeout_idle, which is set to 5s on...
[15:39:16] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3719928 (10daniel) > In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into...
[15:51:57] cool that you can use timestamps for VSL queries:
[15:51:58] varnishncsa -n frontend -q 'Timestamp:Resp[2] > 0.5' -F '%{VSL:Timestamp:Resp}x %r'
[15:54:40] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3719995 (10BBlack) >>! In T179156#3719928, @daniel wrote: >> In any case, this would consume front-edge client connections, bu...
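For illustration, a standalone toy of the socket-level enforcement described earlier in this stretch of the discussion: a SO_RCVTIMEO receive timeout bounds the gap between individual read()s, roughly the way between_bytes_timeout ends up applied to the body read, while the total transfer time stays unbounded. Not Varnish code; the 1 s timeout, 0.9 s dribble rate and 20-byte cap are assumed numbers:

```
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };   /* per-read limit: 1s */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    if (fork() == 0) {                      /* "dribbling backend": 1 byte / 0.9s */
        close(sv[0]);
        for (;;) {
            if (write(sv[1], "x", 1) != 1)
                _exit(0);
            usleep(900 * 1000);             /* just under the receive timeout */
        }
    }

    close(sv[1]);
    /* Kernel-enforced per-read timeout: only the gap between read()s is
     * bounded, never the total time for the whole "response". */
    setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

    time_t start = time(NULL);
    char c;
    int got = 0;
    while (got < 20 && read(sv[0], &c, 1) == 1)   /* the timeout never fires */
        got++;
    printf("read %d bytes in ~%ld s; the 1s receive timeout never fired\n",
           got, (long)(time(NULL) - start));
    return 0;
}
```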
[15:55:38] oh nice, I had been regex-matching those heh
[15:55:52] e.g.:
[15:55:56] varnishlog -Cb -q 'BerespStatus ~ "503" and Timestamp ~ "Beresp: [0-9.]+ ([1-9][0-9]|[4-9])"'
[16:05:49] ema: what was that recent ticket about caching ignoring request bodies on GETs? :)
[16:06:16] bblack: ah the logstash one, let me see
[16:06:46] bblack: https://phabricator.wikimedia.org/T174960
[16:07:08] ah ok so cache_misc
[16:28:35] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3720106 (10BBlack) p:05Unbreak!>03High Reducing this from UBN->High, because current best-working-theory is this problem is...
[16:35:31] heh I just saw the "copper shutdown today" further down etherpad, is there a known replacement? I was still using it heh
[16:36:03] bblack: boron
[16:42:46] ah!
[16:50:50] bblack: and it's stretch FYI
[17:02:30] ema: Still staring and working on this, but I think it's much closer. Going to try to add some relevant VTC in a bit: https://gerrit.wikimedia.org/r/#/c/387236/
[17:10:07] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3720290 (10daniel) > Because they're POST they'd be handled as an immediate pass through the varnish layers, so I don't think t...
[17:14:35] bblack: nice! I don't think I understand what https://gerrit.wikimedia.org/r/#/c/387236/4/bin/varnishd/http1/cache_http1_fsm.c is doing
[17:15:41] oh, it sets ta to 0
[17:24:16] bblack: we might want to add a counter of timed out transactions? vc->wrk->stats->transactions_timed_out++ ?
[17:25:04] or is that another type of wrk? :)
[17:33:35] yeah would be nice to have counters on all of these things, instead of having to parse through shmlog to count them :P
[17:40:02] 10Traffic, 10ORES, 10Operations, 10Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3720392 (10BBlack) >>! In T179156#3719995, @BBlack wrote: > We have an obvious case of normal slow chunked uploads of large fil...
[20:31:58] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3720997 (10ayounsi) Switch ports activated, some interfaces still show as down, most likely not activated on the server side: ``` ge-0/0/3...