[02:04:43] netops, Operations, ops-eqiad: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4055495 (ayounsi)
[02:09:05] netops, Operations, ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4055498 (ayounsi)
[02:15:46] netops, Operations, ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4055513 (ayounsi) @Cmjohnson some ports show up as absent, some as up but with no neighbors, could you please have a look? ``` ayounsi@asw2-c-eqiad> show virtual-chassis vc-port |...
[08:03:10] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4055687 (Tgr) Quick overview of the open technical tasks: * {T186965}: ** On sites using Tidy, styles will split paragraphs they are em...
[08:48:09] ema: oh cool
[08:52:48] the rate of incoming backend connections seems to be increasing non-stop, which is either very bad or a problem with the graph
[08:52:54] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-30d&to=now
[08:55:14] bbiab
[09:14:17] uh
[09:14:33] that's pretty weird
[09:15:00] varnish fe incoming connection rate still makes sense: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&panelId=7&fullscreen&from=now-30d&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=frontend
[09:16:06] but the backend layer has been behaving like crazy since February 27th
[09:17:20] hmmm
[09:18:17] ema: on feb 27th, cache_text @ esams was updated to varnish 5
[09:18:23] coincidence?
[09:53:08] vgutierrez: on an unrelated note, I was thinking of our plan to deprecate the current python stats daemons on cache hosts T184942
[09:53:08] T184942: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942
[09:53:31] we should have most of the important stuff in prometheus by now
[09:53:47] in particular regarding TLS, see eg https://grafana.wikimedia.org/dashboard/db/prometheus-tls-ciphersuite-explorer?orgId=1
[09:54:34] aham
[09:55:11] if/when you have some time to spare, could you check if we can drop/replace the non-prometheus dashboards and get rid of varnishxcps?
[09:55:32] of course
[09:57:08] ema: more ways to get in depth in our varnish infra
[09:57:11] <3
[09:57:17] :)
[09:58:36] ema: regarding the graph you sent at 08:52 UTC / 09:52 CET, for varnish-fe it looks normal/as expected, and for varnish-be it looks really scary considering that it matches when the cluster was updated to varnish 5
[09:59:46] vgutierrez: yeah, if you look at text_ulsfo instead, you'll see that on March the 8th the rate drops (and then starts climbing up again)
[09:59:56] that's the day when I rebooted text_ulsfo for kernel upgrades
[10:00:14] well all of cp-ulsfo really, including text
[10:00:20] yup, I saw that drop as well
[10:00:24] :/
[10:00:44] would it be crazy to downgrade one node to varnish 4 and see how it behaves?
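(Editor's note: the fe-vs-be comparison above is done by eyeballing two Grafana panels. Below is a rough sketch of pulling the same per-layer rates straight from the Prometheus HTTP API; the base URL is an assumption about where the esams "ops" Prometheus instance answers, while the metric and label names are the ones quoted later in this log.)

```
# Assumed base URL for the esams Prometheus/ops instance -- adjust as needed.
PROM="http://prometheus.svc.esams.wmnet/ops"
# Per-layer rate of accepted sessions on cache_text, same expression the panel uses.
for layer in frontend backend; do
  curl -sG "${PROM}/api/v1/query" \
    --data-urlencode "query=sum(rate(varnish_main_sessions{type=\"conn\",layer=\"${layer}\",job=\"varnish-text\"}[5m]))" |
    jq -r --arg l "$layer" '.data.result[] | $l + ": " + .value[1]'
done
```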
[10:05:09] vgutierrez: I'm currently more inclined to think that the graph is wrong and we're not actually getting a continuously increasing rate of backend requests
[10:05:19] see for example https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3043&var-datasource=esams%20prometheus%2Fops&from=now-30d&to=now
[10:06:32] the only one that stands out is "Varnish Backend Sessions", which plots the same metric as the crazy graph does
[10:06:55] varnish_main_sessions being the matric
[10:06:57] *metric
[10:07:07] <_joe_> ema: meaning you're plotting the number of new backend sessions from daemon startup, and not the current number of open sessions, probably
[10:07:49] _joe_: perhaps, which would explain why reboots make the graph drop
[10:08:18] hmmm
[10:08:32] if it's daemon related
[10:08:35] <_joe_> well varnish restarts would too
[10:08:36] the varnish restart would fix it
[10:08:41] <_joe_> yup
[10:08:45] and we restart them a lot
[10:08:47] "a lot"
[10:08:50] ema: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3043&var-datasource=esams%20prometheus%2Fops&from=now-90d&to=now&panelId=25&fullscreen
[10:09:30] <_joe_> yeah, that seems to track TIME_WAITs
[10:09:38] so that increase in March from 250k to 300k is also a coincidence?
[10:11:19] also if the metric is wrong, it should be wrong in the same way for varnish-fe and varnish-be, right?
[10:11:50] <_joe_> oblivian@cp3043:~$ sudo ss -tuna state TIME-WAIT | wc -l
[10:11:50] <_joe_> 297607
[10:11:53] <_joe_> wow
[10:12:07] I love the localization of -tuna
[10:12:19] for me it's -atun, that means tuna in spanish
[10:12:33] <_joe_> vgutierrez: my preferred one was '-polenta'
[10:12:41] lol
[10:12:55] in netstat for me it's -puta, but I won't translate that
[10:12:59] O:)
[10:13:14] <_joe_> no need for a translation I guess
[10:13:38] so.. coming back to varnish
[10:13:54] <_joe_> yeah 300k connections in TIME-WAIT is a lot
[10:13:59] yup
[10:14:25] and as ema suspects, the graph is wrong, I guess that it should be wrong for varnish-fe and varnish-be, not only for varnish-be
[10:14:34] *if the graph is wrong
[10:14:50] <_joe_> and they seem to be local-be / any varnish fe
[10:15:26] <_joe_> 274539 of those are connections to the varnish be
[10:16:02] poor guy
[10:17:06] so if indeed the amount of incoming backend connections is going crazy like that, wouldn't you expect to see some changes in resource usage on the host?
[10:17:27] mmm at least memory usage
[10:17:43] or at least *some* change
[10:17:57] yup
[10:18:14] or the number of TCP connections in state established to increase
[10:18:28] even though
[10:18:30] <_joe_> so ftr, I'm filtering like this
[10:18:33] <_joe_> sudo ss -tuna state TIME-WAIT | awk 'BEGIN {count=0} {if ($4 ~ /10.20.0.178:312/) { count += 1} } END { print count }'
[10:19:06] <_joe_> cp3043:~$ sudo ss -tuna state ESTABLISHED | awk 'BEGIN {count=0} {if ($4 ~ /10.20.0.178:312/) { count += 1} } END { print count }'
[10:19:09] <_joe_> 266
[10:19:31] ema: hmmm not if they are fast enough?
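(Editor's note: the same count can be taken for both TCP states in one go, reusing ema's shorter awk below; the 10.20.0.178:312x prefix is cp3043's local varnish-be listen address taken from the commands above, so substitute the address of whichever host is being inspected.)

```
# Count connections towards the local varnish-be, by TCP state.
for state in TIME-WAIT ESTABLISHED; do
  n=$(sudo ss -tuna state "$state" | awk '$4 ~ /10\.20\.0\.178:312/ { count++ } END { print count+0 }')
  printf '%s: %s\n' "$state" "$n"
done
```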
[10:20:10] <_joe_> it looks like some connections between the varnish frontends in esams and the backends there are not closed cleanly
[10:20:12] _joe_: that can be written as awk '$4 ~ /10.20.0.178:312/ { count++ } END { print count }'
[10:20:19] (at least one joy this morning)
[10:20:51] <_joe_> ema: I learned awk when such craziness wasn't possible (not initializing a counter)
[10:21:16] hahaha
[10:21:17] <_joe_> and ofc, I wrote python there, as it's the thing I coded in most recently
[10:21:30] ema: I see on the 5-minute graph in grafana that we get TCP stats data every minute
[10:22:26] so I suspect that anything happening fast enough won't be rendered in that graph
[10:23:05] the metric we're plotting is MAIN.sess_conn
[10:23:14] "Sessions accepted"
[10:25:40] calling it a connection, as we do in the graph (and as the metric name kinda does too) is thus wrong
[10:28:02] <_joe_> so it's more properly a counter
[10:28:22] note that MAIN.sess_closed also does increase as sess_conn does
[10:28:30] <_joe_> but the TIME-WAITs are a bit worrying
[10:28:48] <_joe_> ema: so you might want to plot the rate of MAIN.sess_conn
[10:29:56] _joe_: a counter?
[10:30:18] that's what he's plotting AFAIK: rate(varnish_main_sessions{type=\"conn\",layer=\"$layer\", job=\"varnish-$cache_type\",instance=~\"($server):.*\"}[5m])",
[10:30:32] <_joe_> oh, ok
[10:32:20] yeah there's very little difference between the rate of accepted and closed sessions
[10:32:24] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-24h&to=now&edit&tab=general&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[10:33:52] now of course the question is: where do they come from? There must be a matching increase in frontend requests
[10:34:20] not necessarily client<->fe, just fe<->be
[10:34:59] (requests, not connections)
[10:35:31] the distinction is important, otherwise we'll keep on looking at tcp established connections as I did before :)
[10:39:03] varnish_main_s_req (number of requests seen) does not go up
[10:39:22] so what is this sessions accepted thing really
[10:39:29] I was checking the source code
[10:39:48] good luck
[10:39:54] https://github.com/varnishcache/varnish-cache/blob/5.1/bin/varnishd/cache/cache_acceptor.c#L367
[10:40:31] wrk->stats->sess_drop++;// XXX Better counter ?
[10:40:32] grr
[10:40:52] /* Turn accepted socket into a session */
[10:41:11] it roughly seems that connection =~ session
[10:41:36] or at least, we should expect a 1:1 ratio between connections and sessions
[10:41:43] if the server has enough resources to handle them
[10:42:58] mmh
[10:47:41] so if those are accepted connections, we should have frontends opening them
[10:49:54] always a fe?
[10:50:12] FTR my official statement is that there's something fucked going on, I don't believe the "wrong graph" theory anymore
[10:50:26] :)
[10:50:40] ema: how do PURGE requests end up on a varnish-be instance?
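(Editor's note: the sess_conn vs s_req point above can also be checked directly on a cache host with varnishstat, bypassing Prometheus. The instance naming is an assumption: the backend as the default instance and the frontend reachable via "-n frontend".)

```
# One-shot dump of the counters discussed above (accepted vs closed sessions,
# and requests seen), for the backend and the frontend varnish respectively.
varnishstat -1 -f MAIN.sess_conn -f MAIN.sess_closed -f MAIN.s_req
varnishstat -n frontend -1 -f MAIN.sess_conn -f MAIN.sess_closed -f MAIN.s_req
```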
[10:51:04] vgutierrez: vhtcpd.service
[10:51:26] https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging
[10:51:26] so vhtcpd is also a source of incoming connections for a varnish-be
[10:51:40] yes it is
[10:51:40] <_joe_> ohhh, https://trac.nginx.org/nginx/roadmap includes [in progress] gRPC proxy support which would be *very* useful to me
[10:52:13] <_joe_> also [in progress] TLS 1.3 support
[10:52:22] <_joe_> which I guess is interesting to you :P
[10:53:24] yup
[10:53:25] vgutierrez: vhtcpd is written by brandon though, so it does the right thing and establishes one TCP connection only
[10:53:48] --bblack-is-always-right
[10:53:59] you're learning fast
[10:54:07] I've the best teacher
[10:54:09] :P
[10:54:41] ema: re: 'wrong graph' https://twitter.com/dakami/status/962710535824289792
[10:59:43] :)
[10:59:44] --
[11:00:22] the rate of connections accepted by varnish backends seems to keep on increasing
[11:01:40] the number of established tcp connections does not
[11:06:07] the number of connections in time-wait does increase, if you look over a long enough period of time (eg 30 days) https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=25&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams%20prometheus%2Fops&from=1518606309551&to=1521198309551
[11:06:57] <_joe_> yeah you have a huge baseline that's increasing constantly
[11:07:12] <_joe_> the daily slope draws on top of it
[11:07:18] _joe_: re:time-wait?
[11:07:25] <_joe_> ema: yes
[11:07:28] yes
[11:07:48] <_joe_> if you take daily averages, it would be very clear, IMHO
[11:15:48] the rate of fe<->be connections https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-30d&to=now&panelId=5&fullscreen&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=frontend does not match the rate of accepted backend connections
[11:15:54] https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-30d&to=now
[11:56:38] ok what happens with reboots for kernel upgrades is that the *frontends* get restarted as well
[11:57:26] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&edit&orgId=1&from=1518619511870&to=1521180975343&tab=metrics&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[11:58:01] so I think we need further digging at the frontend side
[11:58:41] because they must be the source of the backend connection rate increase, right?
[11:58:54] makes sense
[12:01:07] pasta o'clock
[12:14:37] so while the pasta cooks:
[12:15:46] this is a time bomb, we cannot have an ever-increasing rate of incoming backend connections (for millions of reasons including the fact that we know for sure that the number of connections in TIME_WAIT keeps on increasing and we'll run out of file descriptors eventually)
[12:16:15] to buy us some time, I'll reboot text-esams for kernel upgrades after lunch
[12:16:23] which is something I had to do anyways and it will please moritzm
[12:17:27] but we definitely need to stop this from happening asap. Coming up with a graph showing frontend behavior that could lead to this would be a great first step
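(Editor's note: one way to start "digging at the frontend side" is to see which peers actually hold connections to the local varnish-be. This sketch assumes the backend listens on :3128, which the :312x filter earlier suggests, and that peers are IPv4.)

```
# Breakdown of which hosts hold established connections to the local varnish-be.
# With a state filter ss drops the State column, so $4 is the peer address:port.
sudo ss -tn state ESTABLISHED '( sport = :3128 )' |
  awk 'NR > 1 { split($4, peer, ":"); print peer[1] }' | sort | uniq -c | sort -rn | head
```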
[12:18:55] * moritzm is easy to please
[12:31:22] I'd expect backend_reuse to decrease over time on varnish-fe, as apparently it's opening more and more connections to the backend
[12:38:59] and surely an increase in connections opened (or connection attempts: we have not seen an increase in TCP_ESTAB)
[12:40:09] all this reminds me that we really really need to start using cadvisor
[12:40:33] <_joe_> so, I was checking our parameters; we set tcp_fin_timeout to 3
[12:40:37] with it, we would be able to plot connections in the various tcp states on a per-unit basis
[12:43:39] <_joe_> and we already have tw_reuse=1
[12:43:44] <_joe_> ahem
[12:43:46] <_joe_> wow
[12:47:34] I'll begin with the reboots
[12:47:53] go go go :)
[12:48:05] help welcome from whoever has time to join the investigation (and file tasks, and so on)
[12:49:30] eqiad and codfw are also affected, but I'm gonna start with esams which is the one looking worst
[12:57:43] as another data point, cache_text seems to be more affected than cache_upload
[12:58:38] compare cache_upload:
[12:58:39] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-90d&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=backend
[12:58:43] with cache_text:
[12:58:50] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=now-90d&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[13:01:29] cp3030 rebooted
[13:05:24] correlating time-wise with the reboot of cp3030, we had a spike of 504s in eqiad (uh?)
[13:05:39] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&from=1521201324976&to=1521205529456&var-site=eqiad&var-cache_type=All&var-status_type=5
[13:07:15] see how the reboot of cp3030 also made the accepted sessions rate on codfw and eqiad decrease
[13:07:36] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1521203477808&to=1521205643273&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[13:09:00] cp3030 depooled at 12:55:20 and repooled at 13:01:29
[13:11:36] would it perhaps be more graceful to depool by setting the origin servers (backends) as sick instead of removing them from the VCL?
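(Editor's note: a minimal sketch of what the set_health-based depool floated above could look like, flipping every loaded VCL's copy of a given backend rather than removing it from the VCL. The backend naming follows the vcl-<uuid>.be_<host>_<site>_wmnet form in the commands just below, and the shard-director implications ema raises further down are left open, so treat this as exploratory only.)

```
# Hypothetical helper: mark one origin backend sick (or healthy again) in every
# loaded VCL via varnishadm, instead of dropping it from the VCL.
target="be_cp2007_codfw_wmnet"   # backend name fragment, per the commands in the log
state="${1:-sick}"               # "sick" to depool, "healthy" (or "auto") to restore
varnishadm backend.list | awk -v t="$target" '$1 ~ t { print $1 }' |
  while read -r be; do
    varnishadm backend.set_health "$be" "$state"
  done
```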
[13:11:45] varnishadm backend.set_health vcl-af4294fd-5726-4c42-b416-53b6e3646874.be_cp2007_codfw_wmnet sick
[13:11:53] varnishadm backend.set_health vcl-af4294fd-5726-4c42-b416-53b6e3646874.be_cp2007_codfw_wmnet healthy
[13:12:34] <_joe_> ema: that could be done by confd too
[13:13:38] I'm not sure of the implications of doing that for the vslp director, but it might be an option worth exploring
[13:14:48] or rather the shard director, I should say nowadays
[13:17:45] carrying on with cp3031
[13:20:08] cp3031 depooled at 13:18:22
[13:25:14] cp3031 repooled at 13:24:58
[13:41:27] rebooting cp3032 now
[13:41:52] cp3032 depooled at 13:41:36
[13:44:36] hmmm re: decrease of backend_reuse over time on varnish-fe --> https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-server=cp3031&var-layer=frontend&panelId=11&fullscreen
[13:45:04] after rebooting cp3030 and cp3031, figures get significantly better
[13:45:05] ah there you go
[13:46:26] ok so, the issue seems to look like this: fe<->be connection reuse decreases, with a relatively constant incoming fe request rate
[13:48:28] cp3032 repooled at 13:48:13
[13:52:15] nice, cp3032 shows the same trend
[13:52:17] eqiad happy about the reboots too
[13:52:20] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1521197645521&to=1521208314887&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[13:53:00] hmmm IIRC, fe<-->be communication always happens inside the same DC
[13:53:30] so that graph means that be also gets worse at reusing connections
[13:54:23] correct, fe<->be happens in the same DC
[13:55:04] vgutierrez: which graph? The last one I posted is about newly accepted sessions
[13:55:32] connection reuse: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521205012271&to=now&panelId=11&fullscreen&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-server=cp3031&var-server=cp3032&var-layer=frontend
[13:56:19] both layers get better after the reboot
[13:56:20] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521205012271&to=now&panelId=11&fullscreen&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-server=cp3031&var-server=cp3032&var-layer=backend
[13:57:06] that's why eqiad is happier as well with the esams reboots if I'm understanding these graphs correctly
[13:59:16] suggesting that both varnish-fe and varnish-be are affected by the same problem you mean?
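(Editor's note: the reuse panels above plot Prometheus-side rates; the raw counters behind them can be read on a single cache host directly from the frontend varnish. "-n frontend" is an assumption about how the fe instance is named on cp hosts.)

```
# Connections opened, reused and recycled by the frontend towards its backends.
varnishstat -n frontend -1 -f MAIN.backend_conn -f MAIN.backend_reuse -f MAIN.backend_recycle
```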
[13:59:20] indeed
[13:59:50] if you zoom out a little bit on the left of the graph you'll see it even better
[13:59:57] fe: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521201412000&to=now&panelId=11&fullscreen&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-server=cp3031&var-server=cp3032&var-layer=frontend
[14:00:03] be: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521201412000&to=now&panelId=11&fullscreen&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-server=cp3031&var-server=cp3032&var-layer=backend
[14:01:20] yeah
[14:02:35] what I don't understand is that the increase in backend accepted sessions rate does not seem proportional to the decrease in reused frontend sessions
[14:03:13] keep in mind though that we should sum the frontend values across multiple caches in the same DC
[14:03:50] (because a single backend gets connections from all frontends)
[14:06:47] no, we should not sum them; rather we should compare the increase in backend accepted sessions on one host against the decrease in connection reuse from all frontends to that particular host
[14:11:09] proceeding with the next one in line, cp3033
[14:12:29] cp3033 depooled at 14:12:07
[14:22:28] cp3033 repooled at 14:21:55
[14:22:50] 50% of text_esams rebooted
[14:32:26] hmm I should get lunch or something /o\
[14:32:40] haha yes you should!
[14:33:07] one of the problems of working from home... I got trapped in grafana
[15:12:20] next in line is cp3040
[15:12:51] cp3040 depooled at 15:12:34
[15:20:00] cp3040 repooled at 15:19:11
[15:51:29] cp3041 depooled at 15:51:12
[15:58:13] cp3041 repooled at 15:56:46
[16:14:11] cp3042 depooled at 16:13:56
[16:20:56] cp3042 repooled at 16:20:33
[17:05:35] Traffic, Operations: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057061 (ema)
[17:05:41] Traffic, Operations: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057076 (ema) p: Triage → High
[17:06:14] cp3043 depooled at 17:05:58
[17:06:20] (last one
[17:06:22] )
[17:09:01] Traffic, Operations: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057078 (ema)
[17:09:18] Traffic, Operations: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057061 (ema)
[17:16:10] cp3043 repooled at 17:15:56
[17:16:21] for a second I thought of trying 5.1.3-1wm4 on it
[17:16:33] then I noticed it's 18:16 on a Friday
[17:19:19] Traffic, Operations: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057116 (ema)
[17:34:25] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4057153 (Anomie) >>! In T133410#4055687, @Tgr wrote: > ** On Parsoid (currently this means VE, Flow, the mobile apps and whatever third... [18:19:08] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4057309 (Tgr) That depends. Try something like this: ```