[09:38:42] ema: morning sir [09:39:01] vgutierrez: hello! [09:43:18] vgutierrez: let's start with the easy things :) [09:43:47] I see a couple of icinga warnings for pybal-test2001: long-running tmux and puppet disabled [09:47:01] hmmm [09:47:15] puppet it's my fault [09:48:18] and the long living screen too [09:48:31] fixed :D [09:54:38] we can also add pybal-test to the exception list for thee screen/tmux test, these are development systems after all [09:56:53] +1 [09:58:08] FYI, I disabled puppet the other day to validate for moritzm how pybal handles changes on /etc/resolv.conf [10:01:43] vgutierrez: https://gerrit.wikimedia.org/r/#/c/419705/ updated (pcc here https://puppet-compiler.wmflabs.org/compiler02/10504/) [10:03:04] the idea is to begin testing the upgrade to 5.1.3-1wm4 on canary/misc and then move to upload/text [10:05:06] nice [10:05:28] 5? we're not upgrading to 6? :-P [10:05:29] * volans hides [10:06:14] minor upgrade within varnish 5 [10:06:14] volans: I hope the next upgrade will be to ATS :) [10:06:42] that would be nice seeing how varnish 5 its behaving... [10:06:58] but I guess that ATS would come with its own set of "hidden features" [10:12:16] should be totally easy, after all the release notes for 6.0 clearly state "Usually when we do dot-zero releases in Varnish, it means that users are in for a bit of work to upgrade to the new version, but 6.0 is actually not that scary" [10:12:37] right [10:17:13] https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+branch:production+topic:varnish4-upgrade [10:17:16] https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+branch:production+topic:varnish5-upgrade [10:17:22] "a bit of work" ^ [10:19:48] * elukey thinks about a gerrit dashboard that adds "swearing time" values for each code review submitted [10:20:10] amount of hair lost / greyed [10:20:22] years to be spent in hell [10:20:41] sacrifices to be done to FLOSS gods... 
[10:22:08] * ema also always gets slightly offended when people call him [10:22:11] "a user" [10:22:34] embrace your layer 8 membership ema :* [10:22:41] :) [10:23:03] I have seen users [10:23:14] I know what they do [10:23:27] we don't look like them, we don't do those things [10:23:55] honestly, we don't know how they think to end doing those things [10:26:59] -- [10:27:20] cp3008 upgraded, it's now running upstream's fixes for extrachance instead of our own patch [10:27:28] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-6h&to=now&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=misc&var-server=All&var-layer=backend [10:27:55] there's a clear difference in how varnish-be accepted sessions look like in esams_misc ^ [10:28:28] (at least now, right after the upgrade) [10:28:39] let's keep an eye on that [10:29:25] sure [10:32:28] ema: vgutierrez: when you have a couple minutes you can review my latest pybal patches *g* [10:33:09] mark: sure, I wanted to give a proper review to the attributes unit tests, I've found some issues when I added some tests for BGP parsing with them [10:33:26] maybe it was a problem on my side, but let's be sure [10:33:38] this is just a code move [10:33:43] let's do fixes in separate commits [10:33:50] all i want to make sure is that the move doesn't break anything new [10:33:52] ack [10:34:10] i know there are all kinds of issues with the attributes code itself, it's pretty rudimentary code [10:34:37] so i added ~100% statement coverage in unit tests, so at least everything runs once and doesn't refer to some identifier that has moved [10:37:29] btw, what we discussed the other day: https://gerrit.wikimedia.org/r/#/c/420299/ [10:38:11] the summary logline looks like this: [dns_rec6_53] INFO: Added 2 server(s): 2 enabled server(s) and 0 disabled server(s) [10:38:47] sigh.. commit message needs amending to satisfy ema's OCD [10:39:53] ok i will review later [10:40:39] ema: btw, dunno if you checked my comment from yesterday.. but your Friday reboot on esams text cluster also impacted on the amount of TCP time-wait sockets [10:41:20] vgutierrez: yes I've seen that [10:42:10] the issue is a pain in the ass.. but it's pretty interesting debugging it [10:53:10] ema: re:https://phabricator.wikimedia.org/T184942 I've seen that we have several varnish dashboards in grafana still using the legacy (graphite?) backend instead of prometheus, shall I migrate every dashboard or is there any that we could get rid of? [10:59:48] vgutierrez: the idea would be: go through every python-based statsd script, figure out which dashboards use those metrics, port the ones that are in use, get rid of the unused ones [11:00:19] ack [11:00:22] varnishxcps seems like a good starting point as we shouldn't have many dashboards based on the metrics it generates [11:01:08] fwiw there's a script from krinkle to audit metrics in dashboards, in T188749 [11:01:08] T188749: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749 [11:01:33] nice one, I didn't know [11:04:57] godog: hmm it's your free day! go go go [11:05:12] and thx :* [11:05:51] vgutierrez: haha you are indeed correct! I'm working tho, puppet calls [11:07:14] devotion to duty: https://xkcd.com/705/ [11:07:44] lol [11:10:48] oh god.. the script is node based... god bless containers... 
[11:17:55] cp3008 is behaving properly, upgrading the rest of misc@esams [13:01:56] ema: when you get back: so the current data suggests the connection-related issues may be resolved by upstream's extrachance patch? [13:02:31] one thing that crossed my mind reading through all that backscroll, re: the pileup of timewaits vs the relatively-stable counts elsewhere. [13:02:45] timewaits take a fixed amount of time to clear out of the hosts' tcp connection tables [13:03:14] so if we're seeing a scenario where timewaits are building up higher and higher infinitely, it doesn't necessarily mean anything is leaking real connections [13:03:49] what it means is the rate of overall connections established->closed is exceeding the rate at which timewaits can expire. [13:04:11] I don't have the numbers offhand, but let's say it takes 2 minutes to clear a timewait-state socket from the kernel [13:04:39] if you're cycling connections through open->close and they're staying alive on average <2 minutes each in the middle, then timewait count would build indefinitely. [13:04:45] (assuming each close creates a timewait) [13:07:51] and then of course, if we have a timewait pileup of that nature, we'll eventually run out of sockets (timewaits still occupy a port-slot) [13:08:11] morning bblack [13:08:16] morning [13:08:45] when I say "run out of sockets", I don't mean the ulimit or configured max_connections, but rather available TCP port numbers on both ends. [13:09:05] (or one end I guess, since it's a single fixed port on the backend side) [13:10:25] BTW, dunno if this could be attributed to day to day variations.. but established connections changed significantly also... 90k VS 117k (same time today and a week ago) [13:11:05] basically for a given fe->be pair (say, all connections from cp3030-fe to cp3031-be), there's only ~50K or whatever possible client port numbers. the timewaits consume those in addition to the active established ones. [13:11:45] so, yes, if the average lifetime of an fe->be conn drops below the time required to clear a timewait, it's inevitable that you'll eventually run out of port numbers and connections will fail. [13:12:40] the underlying question is why our supposedly-persistent fe->be conns aren't living as long as (we thought? they used to?) [13:13:00] the extrachance stuff is plausibly related to that [13:13:18] but so is the connection:close behaviors at various levels/places [13:13:53] 5xx rates are also a factor, as most servers close a connection after a 5xx.. [13:15:23] we've band-aided similar portcount issues at the nginx<->varnish-fe level by having 8 separate listening ports for varnish-fe that are configured as a round-robin pool on the nginx side, to cut port-count limits by a healthy margin [13:16:18] that was a solution for fairly steady-state problem, though. for this problem, it would only buy more time-to-failure, not eliminate failure. [13:17:24] bblack: so.. do you think that varnish loses the ability of recycling the connections cause they're dying/being closed "unexpectedly"? [13:17:42] I don't know [13:18:34] to recap the extrachance problem: [13:19:08] varnish has a pool of backend connections for reuse. when it needs to make a fresh request to a backend, if there's available connections idling in the pool it reuses one. 
[13:19:28] however, there's always a possible race that's difficult to resolve: [13:19:59] the server behind varnish (another varnish in some cases!), on the other end of the persistent connection, could also timeout the idle connection and close it. [13:20:15] and it may close it in a race *while* varnish is just starting to send it data for the new request [13:21:10] in these cases, resolution of the issue is tricky. It's also possible it wasn't a race-condition, and that varnish's request data was received and the server abruptly closed because the request caused a crash or whatever and failed (thus should be a 503). [13:21:52] but for non-idempotent methods this is especially problematic, as varnish must assume that latter case and thus cannot re-send on a fresh connection without violating protocol about duplicate non-idempotent methods [13:23:10] our "extrachance" patch and the new upstream fix that looks better, are all about changing the behavior of varnish during these close-race cases [13:24:41] the problem we ran into before is that their "extrachance" method of dealing with it had a flaw, where it affected other non-race cases with long timeouts. [13:25:22] (if the backend took forever to respond, and then abruptly terminated as a response, varnish would internally retry this forever) [13:26:17] we patched over that inelegantly, but it "worked". The new code with the setting 0 basically changed the behavior to "don't try to worry about this race. if we hit this race condition, just fail fast and call it a 503" [13:26:52] (and then our frontend singular-503-retry should handle the case that it actually was a rare timing race that resolves itself with a single retry shortly afterwards) [13:29:50] at the heart of this particular sub-issue: at the tcp level, it actually is possible to tighten up the race case by looking at sequence numbers. if the server's FIN indicates they hadn't received any data from us since the connection last went idle, then it's a true race and the new request did nothing, close up and retry the request on a new/different conn. [13:30:13] whereas if the FIN also shows seqnos indicating they did receive our new request, treat it like a 503/server-crash case. [13:30:37] but the standard socket interfaces don't expose this information easily, and I don't think varnish tries to use other APIs to look. [13:38:28] anyways, re: the tcp seqno solution, for linux the API is to call getsockopt(TCP_INFO) on the connection, which returns a copy of the kernel's struct tcp_info [13:39:45] one of the fields is: " __u32 tcpi_unacked", which tells you how much data is outstanding the other side didn't ack. in a true race, this would be non-zero on receipt of the FIN. [13:46:03] so, extrachance-related issues aside, what else could be causing less connection reuse? [13:46:32] it could be some varnish timeouts have changed their defaults with v5 and we weren't setting them explicitly. [13:46:57] it could be a change in whether/how varnish handles connection:close pass-through. [13:48:47] on the nginx end of things, we do set an explicit "Connection: close" on client requests forwarded from nginx to varnish-fe. [13:48:57] the idea being that nginx<->varnish-fe are one connection per request. [13:49:25] the behavior in times past has been that this doesn't affect varnish-fe<->varnish-be, but perhaps now it does? [13:49:30] why are we enforcing that behaviour? 
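
A minimal sketch of the getsockopt(TCP_INFO) idea at 13:38-13:39 above: when the peer closes a pooled connection right after we wrote a new request on it, ask the kernel whether any of the bytes we sent were ever acknowledged, to tell a harmless idle-close race apart from a request the server really received. This only illustrates the kernel API; Varnish does not do this, and connection setup, error handling and the actual retry decision are left out.

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>    /* TCP_INFO and struct tcp_info on Linux */

    /*
     * Call this after a read() on a pooled backend connection returned 0
     * (peer closed) right after a new request was written on it.
     *
     * Returns 1 if none of our request data was acked ("true race": the
     * peer closed an idle connection before seeing the request, so a
     * retry on a fresh connection is safe), 0 if the peer did ack our
     * data (treat as a real failure), -1 if the kernel can't tell us.
     */
    static int
    looks_like_idle_close_race(int fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        memset(&ti, 0, sizeof(ti));
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) != 0)
            return -1;

        /* tcpi_unacked: segments sent but not yet acknowledged by the peer */
        return ti.tcpi_unacked > 0 ? 1 : 0;
    }
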
[13:50:03] usually TLS terminators use persistent connections against the origin servers [13:50:14] I'd have to dig to tell you for sure, but I know it was very intentional and related to real problems [13:51:44] what we saw on Friday is that apparently both varnish (fe & be) lose the ability of reusing connections over time.. thus the rate of new sessions per second on varnish-be increases over time [13:52:34] (something to do with error cases letting problems bleed between unrelated clients. e.g. client1 of nginx uses connection1 to varnish, gets some kind of awful error case where output just ceases mid-stream and becomes a 50x, but then nginx would try to forward client2's unrelated request over the same connection while the connection was in this bad state, so it would fail too) [13:53:29] I think some of these mid-stream awful things had to do with gzip cases, too [13:53:44] /o\ [13:54:29] e.g. I think we've had a case before where the application layer (MW or RB or whatever) would first emit response headers like everything was ok (200 OK, blah blah) and that it intended to gzip (CE:gzip), but then something failed internally after sending those headers, before it could send a body. [13:54:47] and so in spite of 200 OK + CE:gzip, the applayer would then send a non-gzip error text output with the wrong length. [13:55:15] what's escaping my memory at the moment is why this would bubble up all the way through the varnish layers and affect varnish-fe<->nginx [13:56:42] but I know there was a point in the past where we spent months trying to enable nginx<->varnish-fe connection reuse, but it always resulted in increased inexplicable error rates, and that our theory was about these bleed-over errors where certain error cases hosed a connection and nginx didn't realize it and forwarded another request over the now-insane open connection. [13:58:55] anyways, nginx<->varnish-fe is always to the localhost only, which helps tw_reuse be more-effective, and it has its 8x ports to help. basically at that layer the solution has been "working" [13:59:22] but it's possible varnish-fe is now forwarding the "connection: close" and making it request-per-connection further down the stack, which we didn't intend? [14:00:10] so, what we know for sure is that the rate of accepted sessions goes up upon varnish-backend-restart, and stays up [14:00:31] that's probably due to vcl reloads and old vcls not being discarded [14:01:14] we know that VCLs not in use keep on probing, so that does make some sense [14:01:52] true [14:01:56] this is today's backend restart of cp3042 [14:01:57] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1521460106273&to=1521464592345&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [14:02:04] bblack: maybe we could leverage https://github.com/varnishcache/varnish-cache/pull/2564 for nginx<-->varnish-fe communication [14:03:03] so, yeah, unix sockets is something we've wanted for that case forever. We never had time to try to implement it ourselves though. [14:03:07] I didn't know upstream finally had! [14:03:19] yeah they did in 6.0 [14:04:47] now, what's not clear to me is why did the rate of accepted sessions look sane prior to the upgrade to v5 (it's not that we were discarding old VCLs with v4) [14:05:17] maybe probes continuing in old VCLs is new? 
[14:05:37] perhaps [14:07:05] if not, something else has changed probes-wise between v4 and v5 which makes fe<->be probes not use persistent connections? [14:07:16] possibly [14:07:47] honestly, for either version if you asked me "do probes make new connections or re-use existing ones" I think my answer would be "I don't know" [14:08:49] maybe with v5 they moved from using separate connections to re-using from the pool, and our probe traffic sends connection:close, thus is closing off random persisntent conns as it probes... [14:10:16] hmmm [14:10:34] well we don't control the connection:close part for v<->v probes, not sure what v behavior is there [14:10:40] pybal is different, but fe-only [14:11:11] oh ignore what I said above, we do set it [14:11:22] probe varnish { [14:11:22] .request = [14:11:22] "GET /check HTTP/1.1" [14:11:22] "Host: varnishcheck" [14:11:22] "User-agent: Varnish backend check" [14:11:23] we do set Connection: close on the client side of the probe, yes [14:11:24] "Connection: close"; [14:12:00] bblack: any reason for not passing --autodiscard to reload-vcl? [14:12:06] hmmm [14:12:14] that connection: close looks a little bit redundant [14:12:16] https://github.com/varnishcache/varnish-cache/blob/5.1/bin/varnishd/cache/cache_backend_probe.c#L530 [14:13:57] it's not redundant, you can either specify host+url, or specify the whole request yourself [14:14:10] but in the case that we were doing just host+url, it would also be conn:close [14:14:40] ack [14:15:22] ema: may want to do another round of manual tests first, just to be sure. (another round of manual discards via cumin, maybe site/cluster at-a-time, and verify none of them crash out as a result?) [14:15:39] I did one such round where I discarded everything that can discard globally, a couple weeks ago [14:15:49] (no crashes) [14:15:53] nice [14:15:56] yeah let's give it a go [14:16:22] I'll try a few manual text@esams discards [14:18:17] !log cp3040: discard old VCL T189892 [14:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:23] T189892: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892 [14:20:14] so cache_backend_probe.c does use a connection pool of some kind [14:20:24] it's unclear to me whether this is the same pool used for regular traffic, so far [14:20:45] it doesn't seem, at a glance, that this part has changed from v3->v4->v5, though, at least not in that file [14:23:46] so, I've ran `/usr/share/varnish/reload-vcl -n frontend -f /etc/varnish/wikimedia_text-frontend.vcl -d 2 --autodiscard` on cp3040 [14:24:01] accepted session rate is actually going up [14:24:23] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-1h&to=now&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [14:24:23] meaning the situation improved from the discard? [14:24:36] or worsened? [14:24:36] the contrary [14:24:37] it worsened from the reload I think [14:24:50] that seems... unlikely? [14:25:16] unless the issue is not in cold VCL but rather in the reload process itself somehow [14:25:30] no, I think we're just interpreting results differently [14:26:20] it could be that some limitation is in play on total active sessions (connections) from fe->be (e.g. 
port# + time_wait), and stopping the excess probes freed up connection slots, allowing it the active session count to rise [14:26:48] in other words, the limit is actively under pressure (and requests are being artificially delayed waiting on connection slots) [14:28:04] ema: let's try removing conn:close from the probe? afaics the response side doesn't have it, just the request side [14:28:55] so, whenever we do restart a backend (and trigger VCL reloads elsewhere) there's a spike in accepted sessions rate, which then goes down (although never back to less than before the restart) [14:29:50] it seems at least plausible (still trying to confirm) that the probes use the same connection pool as live requests, and that they're causing them to be routinely closed-off early. [14:29:54] probably because there's other VCLs still being loaded and probing (in the case of cp3040 now, 5 busy VCLs) [14:31:00] let's give it a bit more time and wait for those to be unused and thus actually discarded [14:31:08] assuming the above is true, it's probably always been true, but there could be other changes in v5 that exacerbate the fallout. [14:31:23] bblack: re:conn:close, yes let's try that out on a handful of hosts first? [14:32:10] text@ulsfo for instance? [14:32:10] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-24h&to=now&edit&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [14:32:33] confirmed in the source, probes and live reqs do share the same connection pool [14:33:21] (the way it's structured, actually, anything in an entire varnishd that connects to the same destination shares a global singleton pool, even if e.g. we configured two separate backend labels that happened to have the same dest ip:port) [14:37:17] hmmm also our probe definition itself looks suspicious, in that it lacks the double-newline at the end [14:37:23] but obviously it must work, curiously [14:37:58] uhm [14:38:27] \r\n is added automagically between lines, perhaps the last one is automagically \r\n\r\n [14:39:43] ok [14:39:53] seems so on the wire [14:41:07] the default one (if you don't override it I guess) does: [14:41:07] VSB_printf(vsb, "Connection: close\r\n"); [14:41:14] VSB_printf(vsb, "\r\n"); [14:42:53] so here's the result of reload-vcl --discard: [14:42:54] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1521467602499&to=1521470553231&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [14:44:40] of discarding on one or all? [14:46:45] cp4027, I manually removed conn:close from the frontend probes, and reloaded the frontend with discards. (puppet-disabled) [14:46:51] just now [14:48:10] so many "busy" discards :P [14:48:29] in general that's a problem we just have to live with, though [14:48:34] we have many idle threads [14:49:27] bblack: discarding just on cp3040 [14:49:39] hang on [14:51:12] re-added conn:close to probes, reloaded + discard again, on cp4027 [14:56:19] all the 50x coming from cp4027 according to https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X [14:56:24] (as expected) [14:58:53] (interesting consequence of removing the connection: close on the probes) [14:59:21] yeah... 
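
For reference, this is roughly what one probe costs on the wire, following the probe definition and the cache_backend_probe.c construction quoted above: a brand-new TCP connection, the fixed request with its lines joined by \r\n and a terminating blank line, one response, then a close. A stripped-down stand-alone sketch — the port (3128) and request text come from the snippets above, while timeouts, error handling and the actual health evaluation are omitted:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int
    main(void)
    {
        /* One probe == one short-lived TCP connection to the backend. */
        const char *req =
            "GET /check HTTP/1.1\r\n"
            "Host: varnishcheck\r\n"
            "User-agent: Varnish backend check\r\n"
            "Connection: close\r\n"
            "\r\n";
        char buf[4096];
        ssize_t n;
        struct sockaddr_in sa;
        int fd;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return 1;

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(3128);
        inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);

        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0)
            return 1;

        /* Send the whole probe request... */
        if (write(fd, req, strlen(req)) < 0)
            return 1;

        /* ...read until the server closes (Connection: close), then close
         * our side too. Either way the connection is gone: nothing returns
         * to a reuse pool, and one side holds the TIME_WAIT for it. */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(fd);
        return 0;
    }
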
[14:59:58] very [15:00:03] mailbox lag also went up :P [15:00:15] there are a number of plausible ways that could've happened I guess, but all seem fishy [15:00:21] and fe<->be request rate went to 0 [15:00:23] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4027&var-datasource=ulsfo%20prometheus%2Fops&from=1521463292181&to=1521471567405 [15:00:47] perhaps parts of varnish assume that manual requests in probes use conn:close or else they misbehave [15:01:39] or, perhaps everything with the probe change was fine, for probes, but we're relying on a constant spam of conn:close coming from them to prevent some other fallout that's otherwise impending. [15:02:33] let's validate further on cp1008 [15:02:42] I'll try removing connection:close from fe probes there [15:03:46] ok [15:07:27] vcl-d0854643-a0d9-44fc-bf2e-0552fb0c5067.be_cp1008_wikimedia_org probe Sick 0/5 [15:07:30] Current states good: 0 threshold: 3 window: 5 [15:07:45] so they all fail without conn:close? [15:08:59] yup [15:09:16] I'm trying to see the response with httpry but failing [15:09:27] sudo httpry -m GET -i lo 'tcp port 3128' shows PURGE responses too, grr [15:09:48] oh, but wait [15:09:54] 2018-03-19 15:09:04 208.80.154.42 208.80.154.42 > GET varnishcheck /check HTTP/1.1 - - [15:10:00] 2018-03-19 15:09:04 208.80.154.42 208.80.154.42 < - - - HTTP/1.1 200 OK [15:10:44] and indeed `varnishncsa -q 'RespStatus eq 200'` confirms [15:10:49] not a layer7 issue.. looks like the probe expects an specific behaviour in L4 [15:12:50] heh [15:13:22] so, probes only work with conn-close? [15:13:37] looks like it [15:13:42] or is it that our synth response fails to have a Content-length and thus is implicitly connection-delimited? [15:13:53] hmmm [15:14:20] you'd think synthetic() would add CL [15:14:40] sub wm_common_synth { [15:14:41] if (resp.reason == "healthcheck") { [15:14:41] set resp.reason = "OK"; [15:14:41] synthetic("Varnish <%= @inst %> running on <%= @hostname %> is up"); [15:18:10] 'Connection: keep-alive' in the request does not help [15:18:53] and of course adding back 'Connection: close' fixes it [15:19:06] does the response have a CL? 
[15:19:21] it does [15:19:22] the function responsible for running the probe actually closes the TCP socket used after performing the request [15:19:52] - RespHeader Content-Length: 39 [15:19:54] hmm ok [15:20:06] well, that's brilliant [15:20:07] also see `curl -v -H 'Host: varnishcheck' http://localhost:3128/check` [15:20:34] so, basically the way probes are implemented, you can't use them without having them close persistent pool conns pointlessly :P [15:21:40] another interesting experiment would be to back off the interval=100ms to interval=1s for a whole DC worth of frontends, and see if other related graphs improve substantially from the order-of-magnitude reduction in probe-induced connection closings [15:22:30] bblack: I've tried that experiment earlier on today on one single frontend, accepted sessions rate did begin going down (as expected) [15:22:48] of course that will screw with the initial-use delay on vcl reload [15:22:59] we'd need to add +4s to delay or so [15:23:39] there's also no depool-threshold, which would have helped with the conn:close experiment on cp4027 :P [15:23:52] right [15:24:22] I hate it when monitoring systems assume their own infallibility [15:24:38] "all the things are down, so don't even try to do anything live anymore, just fail everything" :P [15:25:15] even lacking a threshold setting, any sane implementation of such a thing should assume all-down means monitoring is broken and treat as all-up while complaining loudly elsewhere [15:29:37] so: [15:29:57] even if probe-closes are a part of the problem, I think it's fair to assume this hasn't changed since v3, based on what I can see. [15:30:07] something else changed that's acting as a catalyst here [15:30:56] reloading a frontend VCL causes an increase in /check rps being received by backends [15:31:30] that should be expected, since we've added a new set of probes and there's still some undead busy discarded ones running too [15:32:01] right, this wasn't the case with v4 though [15:32:37] or rather, we did have brief increases, which then went down back to the normal rate [15:32:47] see eg: [15:32:48] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1519616009717&to=1519709918414&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [15:33:50] in general, there could've been actions-at-a-distance in the v4->v5 transition that makes idle threads stay more-idle [15:34:06] regarding probes closing connections... some git blaming got me here: https://lists.gt.net/varnish/bugs/2342?search_string=Don%27t%20half-close%20probes;#2342 [15:34:13] so yeah.. they close the connection since 2009 xD [15:34:56] the bottom line is a given thread will hold a VCL reference until it processes a new request. 
there are probably lots of obvious perf-improvement patches one would make in unrelated places that would result in the most-idle threads getting even idle-er, thus exacerbating stale VCL [15:35:36] (in other words, better thread/cache perf by re-using the most-recently-used threads before long-idle ones) [15:36:32] so, we currently have 65 busy+warm loaded VCLs on text@esams frontends [15:36:39] one request every 100ms [15:36:50] 650 rps [15:37:31] which is what we get: [15:37:33] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1521473100231&to=1521473830563&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [15:37:33] 65 on each? [15:37:38] 65 total [15:37:40] 65 in total [15:37:40] ok [15:38:09] so basically this confirms the probes are the overwhelming majority of our conn:close cases then, right? [15:38:15] indeed [15:38:45] back on the timewait angle of this [15:39:06] whatever the timewait time is, it's all about whether our *average* connection lifetime is shorter than that, I think. [15:39:39] which is 2msl [15:40:28] we have msl set to 3 [15:40:35] so time_wait states should last ~6s [15:41:03] err wait maybe the fin timeout is the 2msl value? [15:42:35] oh wait, we did already discuss the "vcl stays busy" thing, didn't we [15:42:36] https://github.com/varnishcache/varnish-cache/issues/2228 [15:43:44] > One solution could be an expiry-like thread dedicated to releasing references in the background [15:45:23] yeah because the last time you made an expiry-like thread that worked out so well :P [15:45:32] right? [15:46:01] anyways, I've lost my place now on the whole 2msl thing [15:46:36] timewait lasts 2msl. socket consumption (from the OS pov of used-up-port-numbers) is total of the established conns + the timewait conns. [15:47:39] if your average connections lasted about 2msl of time-connected before they close up, you'd expect roughly #established==#timewait, and so while the port-based limit might be 50K sockets, effectively you'll be limited to 25K live sockets and can sustain like this. [15:48:25] if your average connections last longer, then that number goes up (timewait is less of a problem, it's using up a smaller percentage of the overall lifetime of a socket including its timewait state) [15:49:10] but if your average connection lifetime in established is shorter than the 2msl of time_wait, then the timewaits just keep growing and growing, even if you have a steady state of #established. [15:49:28] which will exhaust your ports and cause connection failures, eventually [15:50:55] because we own both sides over the loopback in the nginx<->fe case, tw_reuse can be more aggressive I think, which is maybe why the above doesn't cause issue there [15:51:06] but fe<->be cross machines, and tw_reuse can only fix one side I think [15:53:11] 10Traffic, 10Operations, 10Patch-For-Review: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4061600 (10ema) Further investigation today showed that the cause of this is VCL staying in the `auto/busy` state. All those VCLs' probes keep on running. At a ce... 
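
Pulling the arithmetic above together in one toy calculation: the 65 busy/warm VCLs probing every 100ms, the ~50K usable client ports per fe->be pair, and TIME_WAIT lasting 2*MSL (~6s with msl=3, or ~2 minutes in the earlier example; see the fin_timeout doubt at 15:41). The per-pair close rate and established count below are made-up placeholders, not measurements:

    #include <stdio.h>

    int
    main(void)
    {
        /* Figures quoted in the discussion above -- assumptions, not measurements. */
        const double busy_vcls      = 65.0;    /* busy+warm VCLs across text@esams frontends */
        const double probe_interval = 0.100;   /* seconds between probes, per VCL */
        const double time_wait      = 6.0;     /* 2*MSL with msl=3; try 120.0 for the "2 minutes" case */
        const double port_budget    = 50000.0; /* usable client ports for one fe->be pair */

        /* Cluster-wide probe rate: every probe is a fresh connection that gets closed. */
        double probe_closes_per_sec = busy_vcls / probe_interval;
        printf("probe-driven closes: %.0f/s across the cluster\n", probe_closes_per_sec);

        /* Little's law: sockets parked in TIME_WAIT ~= close rate * TIME_WAIT duration.
         * Ports in use for one fe->be pair = established + TIME_WAIT. */
        double closes_per_sec_per_pair = 200.0;  /* made-up per-pair close rate to play with */
        double established             = 5000.0; /* made-up steady-state live connections */
        double tw_sockets   = closes_per_sec_per_pair * time_wait;
        double ports_in_use = established + tw_sockets;

        printf("TIME_WAIT sockets: %.0f, ports in use: %.0f of %.0f\n",
               tw_sockets, ports_in_use, port_budget);
        if (ports_in_use > port_budget)
            printf("-> port exhaustion: new fe->be connections fail or queue\n");

        /* The condition from the discussion: once the average established
         * lifetime drops below the TIME_WAIT duration, TIME_WAIT dominates. */
        double avg_lifetime = established / closes_per_sec_per_pair;
        printf("avg established lifetime: %.1fs (%s TIME_WAIT of %.0fs)\n",
               avg_lifetime, avg_lifetime < time_wait ? "below" : "above", time_wait);
        return 0;
    }
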
[15:54:15] in general there are two design-level issues excerbating this state above in the ticket update: [15:54:32] 1) That varnish probes require conn:close yet re-use the same pool of persistent conns as the live traffic [15:55:31] 2) That varnish threads do not release vcl references while idle, they won't notice the switch and release their reference to an old VCL until they next receive a request. If you have 500 threads and typically only 20 of them are active to handle the parallelism of your traffic, you can see how thread#500 may almost never get a request and never release. [15:56:24] if the use of the threads was pure round-robin, you'd get there in at most 500 requests which isn't as bad. but it's clearly not optimal to do so for thread performance. probably thread performance improvements in general move in the direction of minimizing the hot set of threads. [15:56:45] (and thus maximizing the stale vcl problems) [16:08:50] the more stale VCLs pile-up from (2), the more our conn:close rate from (1) increases, which reduces the average lifetime of connections in our pool of desirably-persistent connections. If the average lifetime falls below 2msl, timewait states will dominate and grow indefinitely, leading to port# exhaustion and failure to connect / queueing, which could explain a lot of recent misbehaviors. [16:12:18] right [16:22:00] for that matter, the status of upstream tickets about (2) are unclear. it seems possible the problem may run deeper than just long-idle threads. there might be actual bugs with leaked references. [16:23:57] I'm now looking at the v4-v5 debdiff, because clearly this issue was not as prominent with v4 [16:24:07] see boron:~ema/v4v5-deb.diff [16:25:01] in hsh_rush(), there's a while loop that was removed, and did call VCL_Rel(&req->vcl); [16:34:08] it also killed threads/requests though, I think it makes sense if the whole loop was removed [16:35:36] https://github.com/varnishcache/varnish-cache/commit/867743f0257413304c5922b77513201da0b9242b#diff-297987e0ac0bac69be6ee6599bfae0de [16:36:04] it looks like that VCL_Rel() is now done inside hsh_rush2() [16:36:20] yeah I was looking into the commit too [16:37:19] however, there's no VCL_Rel in bin/varnishd/cache/cache_hash.c on 5.1 [16:38:01] see https://github.com/varnishcache/varnish-cache/blob/5.1/bin/varnishd/cache/cache_hash.c [16:38:21] in general VCL_Ref/Rel don't easily match up, something's odd there [16:38:40] VCL_Ref only happens in one place, VCL_Rel in like 13 [16:42:40] I think the worker thread's vcl attribute (wrk->vcl) is just copied from the latest request the vcl handled [16:42:52] and most vcl_ref/rel are about request objects not worker threads [16:44:29] yeah that stuff is a mess :/ [16:44:49] apparently ownership of the reference passes between request and worker-thread in places [16:45:09] the refcnt only increases in the context of a busyobject in a request [16:45:30] but is decremented in all kinds of places, sometimes via the request/busyobj, sometimes via the worker thread [16:46:11] maybe I should add a (3) to my list above: [16:47:33] 3) None of this may become critical in a case where VCL reloads == real configuration changes, but we've in the past assumed VCL reloads are relatively-harmless, and we now operate such that vcl reloads happen very frequently as a result of pool/depool [16:47:43] (and pool/depool in turn happens routinely because of weekly (now semiweekly) restarts) [16:48:39] on that note: we could explore setting backends as sick/healthy as a depooling 
strategy instead of reloading VCL [16:48:55] varnishadm backend.set_health vcl-af4294fd-5726-4c42-b416-53b6e3646874.be_cp2007_codfw_wmnet sick [16:48:58] varnishadm backend.set_health vcl-af4294fd-5726-4c42-b416-53b6e3646874.be_cp2007_codfw_wmnet healthy [16:50:03] yeah I guess, modulo the vcl-naming races/issues [16:50:41] right: puppet disable ; list backends ; set sick/healthy ; puppet enable ? [16:50:56] well [16:51:00] a few issues there: [16:51:02] list backends to get the vcl-name [16:51:18] 1) puppet disable can take a while (wait on in-progress run), so this slows down the depool response [16:51:49] 2) the sickness won't persist when a puppet-driven config update happens immediately afterwards. [16:52:45] (also regardless of the above: some ? on whether doing it only to the latest active VCL is enough, or we need to care about marking sick in other busy/warm/discarded cases too) [16:52:52] can we influence the result of our probe endpoint? [16:53:11] without reloading VCLs of course [16:53:12] vgutierrez: only when the depool is voluntary [16:53:32] (if the server's crashed, we can't tell it to change its healthcheck output to look faily) [16:53:43] I mean, one approach is mark manually the backend as sick [16:53:56] the other approach is make that backend answer a non 200 to be marked as sick [16:54:06] see above :) [16:54:59] right now conceptually the depooled state covers both cases [16:55:19] more in general, I think it's not great that we are using a version of varnish which does not get updated upstream (5.1.x) [16:55:45] I'll try tomorrow and see if 5.2/6.0 are affected too [16:56:10] affected by what? [16:56:23] vcl leaks? [16:56:31] by https://phabricator.wikimedia.org/T189892#4061600 [16:57:07] the probe part is pretty easy to understand, either they fixed it in 5.2/6.0 or they didn't [16:57:54] but the VCL_Rel() part is the question [16:58:33] I'm not pleased about 5.1's lack of updates either [16:58:50] but they've painted us into a pretty nasty corner here, I'm not sure further upgrades will actually improve our net situation, either. [16:59:31] right, and the work required to update to 5.2 is non-negligible (varnishkafka) [16:59:51] and there may be other random perf/behavioral changes that are worse rather than better [17:00:39] in general the pattern for us from v3 -> v4.1 -> v5.1 has been in each case we gained a set of expected features we needed, but also gained new issues that have been difficult to resolve [17:02:12] what really concerns me now is I don't even understand their release / support-lifecycle standards at all [17:02:28] will 5.2.x get long-term support like 4.1.x? [17:02:44] or will 6.x be the one [17:03:27] moving to 6.0.0 seems foolish, it's a double dot-zero and it was released 4 days ago. there's doubtless many low-hanging fruits of bugs left to squish there [17:04:17] ema: > the work required to update to 5.2 is non-negligible (varnishkafka) <3 [17:04:45] the implicit pattern seems to be (using semver field naming of major.minor.patch): when a new Major is released, no new Minors will be made in previous majors, and the last Minor of previous majors will be the LTS-like release for quite a while. 
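
A much-simplified illustration of the reference pattern discussed above (16:42-16:55, and point 2 of the 15:53 ticket summary) — this is not Varnish's code and all names are made up: a worker keeps a reference to the VCL it last used and only trades it for the active one when it next handles a request, so a long-idle worker pins an old VCL in the busy state, probes and all.

    #include <stdio.h>

    struct vcl {
        const char *name;
        int refcount;        /* "busy" as long as this is > 0 */
    };

    struct worker {
        struct vcl *vcl;     /* reference kept from the last request handled */
    };

    static void
    vcl_ref(struct vcl *v) { v->refcount++; }

    static void
    vcl_rel(struct vcl *v)
    {
        if (--v->refcount == 0)
            printf("%s: no references left, can actually be discarded\n", v->name);
    }

    /* A worker only swaps its reference when it handles a request; a worker
     * that never gets another request keeps its old VCL busy forever. */
    static void
    handle_request(struct worker *w, struct vcl *active)
    {
        if (w->vcl != active) {
            if (w->vcl != NULL)
                vcl_rel(w->vcl);
            vcl_ref(active);
            w->vcl = active;
        }
        /* ... process the request ... */
    }

    int
    main(void)
    {
        struct vcl old_vcl = { "vcl-old", 0 }, new_vcl = { "vcl-new", 0 };
        struct worker busy_w = { 0 }, idle_w = { 0 };

        handle_request(&busy_w, &old_vcl);  /* both workers used the old VCL once */
        handle_request(&idle_w, &old_vcl);

        /* "reload": a new VCL becomes active, the old one should cool down... */
        handle_request(&busy_w, &new_vcl);  /* busy worker releases old_vcl */
        /* idle_w never gets another request: old_vcl stays referenced */
        printf("old VCL refcount after reload: %d\n", old_vcl.refcount);
        return 0;
    }
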
[17:05:15] then we should upgrade to 5.2 [17:05:34] yeah but it's not documented [17:05:41] we're guessing [17:05:47] yup [17:05:47] and upgrading to 5.2 is a serious schedule problem [17:09:29] we have a long-running history now going back to when we were on varnish-3.x [17:10:08] of "upstream varnish bugs / lack-of-bugfix-releases / etc are forcing that it's wiser to spend time on a varnish upgrade to maintain our health and defer working on ATS" [17:10:31] the upgrade treadmill keeps preventing work on the long-term cure [17:11:04] and each step on the treadmill seems to make our lives more complex with new issues, too [17:12:34] yeah [17:13:03] at any rate: I'm happy that we went from "OMG what's going on" on Friday to "vcl keeps on staying busy" on Monday [17:13:24] I've gotta go now :) [17:13:24] o/ [17:13:27] ok [17:13:36] hmmm [17:13:42] another thing we can do here is re-examine all the hows and whys of our probes in general [17:14:04] given the design constraints here, what are fast probes (or probes at all) worth to us vs relying on manual depools [17:14:55] we know why we don't abstract varnish<->varnish via LVS: because we want chashing, and LVS can't look inside the requests to chash them. [17:15:56] for all the varnish->applayer cases, it's abstracted via LVS and pybal does the healthchecking and depooling stuff [17:16:42] but v<->v we handle differently because chashing, and thus we need varnish to have an explicit list of backend hosts and its own depool-control and healthcheck [17:17:37] FWIW: diffing 5.2 and 5.1 I found this commit (https://github.com/varnishcache/varnish-cache/commit/cc5df0ef06caa2fca78ba471a6126fcad03fd6a2) fixing a memory leak inside vcl_cli_discard, got merged into varnish 6.0 and 5.2, not 5.1 though [17:17:47] the probes clearly have value. varnish won't depool a backend ever if there's no healthcheck to fail (it doesn't look at 5xx/timeout patterns of live requests and infer) [17:19:04] in maintenance/restart cases we depool healthy machines ahead of the work, so the probe is doing nothing [17:19:29] but in cases that are more crash-like, we're relying on the probes to work-around until someone handles the issue and does a depool to persist it [17:20:18] (well, to be fair, probes are also what's saving us if depooling for maintenance doesn't work right because of some etcd/confd-level issue) [17:20:53] but still, etcd-failure and node-crashes are rare. maintenance depools happen at a fairly brisk pace routinely. [17:21:44] we could also slow down the probe interval, which wouldn't necessarily resolve the issue, but might make enough of a pragmatic difference. [17:22:39] s/100ms/1s/ would mean stopping the 5xx carnage when a node suddenly crashes would take a few seconds longer, but the rate of probe checks drops an order of magnitude. [17:23:11] the load->use delay on reload_vcl would need raising as well, but that's tolerable [17:25:52] in general, machine crashes seem more-likely than varnishd crashes? I can't remember when we last had a legitimate inexplicable varnishd crash. [17:26:21] we could have the conn:close probes hitting nginx instead of varnishd just to confirm the machine is alive [17:26:32] well.. 
that's not fair, we don't restart the machines semi-weekly to avoid crashes [17:27:17] well, we're restarting semiweekly because "some things that are hard to understand make varnish perform poorly to the point of virtual failure if we leave them alive too long" [17:27:33] more than one things, I'm sure [17:27:49] but the point is, it's not a varnishd crash [17:28:17] the original weekly restarts were insurance against what seems to be bad storage performance from -sfile over time [17:28:38] I think this new problem is something different [17:28:59] and if the driver is vcl reload rate (given stale vcl + probe-conn:close issues discussed above) [17:29:22] then upping our restart rate from weekly to semi-weekly may have been a net negative, as we've now doubled the pace of vcl reloads as well. [17:31:53] while I think the probes are valuable, and do save us when the unexpected happens, and *should* be there.... [17:32:23] in practice, the rate of probes saving us from unexpected things is low in terms of events/month. the rate of problems caused indirectly by our probes is high and happening pretty constantly. [17:35:35] I don't know if we can reliably solve the vcl ref problem with a code patch [17:36:00] I also don't know how hard it would be to make probes not require conn:close (maybe easy, maybe fraught with edge-cases) [17:36:36] but if we killed all probes, a lot of problems might vanish. if we even slowed them down by an order of magnitude, it might be enough while still preserving some of their utility. [17:38:02] https://github.com/varnishcache/varnish-cache/issues/2228 seems to indicate that even for varnish developers, the vcl_ref() problem is rather tricky [17:39:37] currently clusters that matter have anywhre from 6-12 frontends probing local backends [17:39:49] core backends have additionally all the remote backends hitting them [17:39:57] multiply that by the stale VCLs [17:40:05] it's a pretty high rate of probes -> conn:close [17:40:23] yup.. you could bring down some sites only with our probes traffic [17:40:56] probably in eqiad it's on the order of ~1k/sec probes? 
[17:41:00] (into the backends) [17:41:38] well that's not right, they don't multiply because they're single-layer [17:42:06] but it may still add up, depending on VCL staleness in various places [17:42:10] to at least hundreds/sec [17:43:54] checking a random eqiad text node [17:44:06] (upload would be slightly worse, as it has more nodes in some DCs) [17:44:56] average rate of 1420 incoming probes/sec in a 30-second window I checked on cp1065 [17:45:17] not bad [17:46:38] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4062000 (10ayounsi) [17:47:06] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4062003 (10ayounsi) [17:55:13] anyways, I think cutting their rate is worth at least experimenting with, so that we can confirm how much this is contributing to other problems [17:55:16] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:slower-varnish-probes [17:58:47] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4062039 (10ayounsi) [17:59:02] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4062040 (10ayounsi) [18:10:55] I've backported https://github.com/varnishcache/varnish-cache/commit/cc5df0ef06caa2fca78ba471a6126fcad03fd6a2 here: https://gerrit.wikimedia.org/r/#/c/420395/, I thought it could be interesting [18:11:34] yeah [18:12:29] and yes.. it's a shame that stuff like this doesn't get fixed in 5.1 by upstream :( [18:13:10] well it's not unreasonable for them to drop older branches from maintenance if there's a viable upgrade path and reasonable deprecation->eol cycles... [18:13:37] 5.1 to 5.2 upgrade breaks several APIs... [18:13:46] but (a) they're breaking usual semver expectations in making some minor-release updates (like 5.1->5.2) more difficult than they should be [18:13:52] so as ema said mentioning varnishkafka, it can be painful [18:14:05] + (b) they haven't clearly published what exactly their standards for maintenance of branches/releases is. [18:14:47] by inference, I'm guessing their policy at this point is there will never be a 5.3, and that 5.2 is their 5.x LTS release like 4.1 was. [18:14:57] but when 4.1 support ends, or when 5.2 support ends, who knows [18:16:33] it may be that they consider all of 5.x to be a special-case and the plan is only for 4.1.x and some future 6.x to keep getting long-term support/fixes, for all I know. [18:17:09] anyways [18:17:39] I think the 1s probes (the 3x patches uploaded earlier) is worth a shot. We may or may not stick with that, but it will at least provide some valuable data about the situation. [18:18:00] but I need to at least eat lunch first, I may try it later in the US day [18:18:49] (+ should also turn on auto-discard as our default reloading policy, after another round of sanity-check) [18:20:32] the varnishadm route for setting sick/healthy without a reload is worth pursuing as well, but lots to think about there [18:20:52] maybe we can have confd write the directors.vcl update, not reload, and also do the varnishadm command against all live VCLs. [18:21:11] that would solve the problem if there's a true puppet-driven config update afterwards, of maintaining the depooled-ness. [18:21:47] (that plan has a hole too, needs more thinking than that!) 
[20:07:24] 10Traffic, 10Operations, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4062444 (10Krinkle) [20:47:11] 10Traffic, 10netops, 10Operations: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4062522 (10ayounsi) p:05Triage>03Normal [21:12:10] ema: recap for tomorrow: I deployed all of these things: https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/puppet+branch:production+topic:slower-varnish-probes [21:13:00] ema: which means auto-discard is in effect, probe intervals are now 1s, and other stuff has had its delay/sleep values updated to cope with the probe intervals affecting the time-to-mark-healthy on vcl loading [21:14:02] ema: but we still have a lot of stuck-warm old VCLs, especially in frontends. One possible solution (ugh) is to weekly-restart the frontends so they don't collect them so long. [21:14:31] ema: another is... I think we could have the reloader set them manually-sick, which stops probes. As in: [21:14:34] # varnishadm -n frontend vcl.list [21:14:36] [...] [21:14:38] discarded auto/busy 16 vcl-1bdfd3f0-cfbc-4098-8a01-ff431b79bfa3 [21:14:48] # varnishadm -n frontend backend.set_health vcl-1bdfd3f0-cfbc-4098-8a01-ff431b79bfa3.* sick [21:15:06] there's a risk we do that too fast, while that VCL has legit connections flowing through it [21:15:54] so we'd have to have some logic run down the vcl.list, figure out the load-time of each (maybe fs timestamp on compiled vcl?), and use that as the effective "end-time" of the one before it on the list, and only set *->sick on ones with end times >1h ago, or something like that. [21:16:33] at least that's my current theory, that this would stop the probes from the old ones. [21:18:35] the other related thing, is thinking more about having confd pool-actions not use a full reload [21:19:16] having it execute varnishadm commands works in-the-present. the hard part is that while a node is depooled (for minutes, hours, days?), there will be other vcl reloads for true code changes, and these would lose the temporary sick state. [21:19:47] so we'd need it to also update a vcl file to persist the state (and not do a reload, just leave it there for future reloads to pick up). [21:20:06] but having it simply remove the backend from that vcl file like today wouldn't work, because then it couldn't add it back with a reload when it's repooled. [21:21:08] we need to find some vcl command like: vcl_init { set be_cp1024.healthy = false; } or something. In other words, leave it defined, but inject some vcl that forces it into an unhealthy state that can be reverted by later set_health back to auto. [21:23:34] (because all of those frontend vcls are still alive and kicking with probes, the connection-close/probe rate hasn't gone down much in eqiad, yet) [21:38:53] ugh, also: [21:39:06] https://varnish-cache.org/docs/5.1/users-guide/vcl-backends.html#health-checks [21:39:20] "Please note that Varnish will keep health probes running for all loaded VCLs. Varnish will coalesce probes that seem identical - so be careful not to change the probe config if you do a lot of VCL loading" [21:39:29] ^ so it's possible this coalescing is broken? [21:39:48] or it's possible that the fact that we remove then re-add a backend breaks coalescing just for that one? [21:40:54] I wonder where that code is.... 
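
A rough sketch of the reloader idea above, assuming the vcl.list columns look like the sample quoted at 21:14 (status, state, busy count, name). It only prints the varnishadm commands it would run, and the "only if loaded more than ~1h ago" safety check is left as a TODO since that's the unresolved part:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* Assumed vcl.list layout: <status> <state> <busy-count> <name> */
        char line[256], status[64], state[64], name[128], cmd[512];
        int busy;
        FILE *fp = popen("varnishadm -n frontend vcl.list", "r");

        if (fp == NULL)
            return 1;

        while (fgets(line, sizeof(line), fp) != NULL) {
            if (sscanf(line, "%63s %63s %d %127s", status, state, &busy, name) != 4)
                continue;
            if (strcmp(status, "active") == 0)
                continue;               /* never touch the live VCL */
            /* TODO: skip VCLs loaded recently (load time / fs timestamp check) */
            snprintf(cmd, sizeof(cmd),
                     "varnishadm -n frontend backend.set_health %s.* sick", name);
            printf("would run: %s\n", cmd);
            /* system(cmd); -- not actually executed in this sketch */
        }
        pclose(fp);
        return 0;
    }
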
[21:49:49] 10netops, 10Operations: Config discrepencies on network devices - https://phabricator.wikimedia.org/T189588#4062787 (10ayounsi) [22:30:50] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4062972 (10Dzahn) a:03Dzahn [22:31:09] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Dzahn) p:05Normal>03High [22:31:14] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Dzahn) 05stalled>03Open [23:29:25] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063154 (10Dzahn) 05Open>03Resolved https://design.wikimedia.org/ https://design.wikimedia.org/style-guide/ [23:45:26] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Volker_E) Thanks @Dzahn for the quick and thorough work on this! [23:58:28] 10Wikimedia-Apache-configuration, 10Operations, 10Performance-Team: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063273 (10Krinkle)