[00:45:15] <wikibugs>	 Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3795140 (faidon)
[10:16:15] <ema>	 there have been a few failed fetches in esams (text, upload and misc) earlier this morning
[10:16:28] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1511945134326&to=1511947947399&panelId=3&fullscreen&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All
[10:17:22] <ema>	 first spike between 09:04 and 09:08, second one ten minutes later between 09:14 and 09:18 
[10:17:48] <ema>	 network hiccups perhaps?
[10:35:15] <ema>	 I do see some 'Inbound interface errors' on cr2-{esams,eqiad} a few minutes after those timeframes on https://librenms.wikimedia.org/alert-log/
[10:35:21] <ema>	 XioNoX: ^ 
[14:35:29] <gilles>	 bblack: are we 100% sure that Resp[3]>60 has to be the client's fault?
[15:04:45] <bblack>	 no, not at all
[15:07:06] <bblack>	 but the original impetus for doing the logging was to find slow requests on the backend side
[15:08:06] <bblack>	 and logging all Resp[3]>60 is orthogonal to that purpose, and even logging on Resp[2]>60 is mostly catching just Resp[3]>60 cases where the other timestamps are relatively-acceptable
[15:08:15] <bblack>	 we'd find our needles on the backend side better if those weren't in the set
[15:11:04] <bblack>	 and unlike other long-timeout conditions, we know that it's at least possible Resp[3]>60 is the fault of the client/network rather than some issue in our stack, whereas the other cases definitely are an issue on our side.
[15:12:20] <bblack>	 (we can't always send data to the client at maximum speed, even from Varnish's limited perspective.  For a very short transfer we can, because Varnish would be done with it as soon as it's buffered up in the TCP window, but for small windows and/or longer transfers, the client actually has to ACK the bytes to make progress and avoid timing out for long periods in Resp[3])
[15:13:16] <bblack>	 but anything we speculate about all such things is difficult to nail down without careful analysis through all the layers, so take nothing as 100% certainty
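
The distinction bblack draws above can be sketched as a small filter. This is only an illustration, assuming the Varnish `Timestamp` record layout of `<label>: <absolute> <since-start> <since-last>`, so that Resp[2] is the time since the request started and Resp[3] is the time spent delivering the response. The function names, the category labels, and the sample records are hypothetical; only the 60-second threshold comes from the discussion.

```python
def parse_resp_timestamp(line):
    """Parse a 'Timestamp Resp: <abs> <since_start> <since_last>' record.

    The record layout is an assumption based on Varnish's Timestamp log tag.
    """
    label, _, rest = line.partition(":")
    if not label.strip().endswith("Resp"):
        raise ValueError("not a Resp timestamp record: %r" % line)
    absolute, since_start, since_last = (float(x) for x in rest.split())
    return absolute, since_start, since_last


def classify(line, threshold=60.0):
    """Separate time spent before delivery from time spent in delivery.

    'slow-before-delivery': more than `threshold` seconds elapsed before the
        response started going out, so the delay is on our side.
    'slow-delivery': only Resp[3] exceeds the threshold, which may be the
        client or the network rather than our stack.
    'ok': neither condition holds.
    """
    _, since_start, since_last = parse_resp_timestamp(line)
    if since_start - since_last > threshold:
        return "slow-before-delivery"
    if since_last > threshold:
        return "slow-delivery"
    return "ok"


if __name__ == "__main__":
    # Hypothetical sample records, not taken from real logs.
    samples = [
        "Timestamp Resp: 1511945134.326 75.0 70.0",
        "Timestamp Resp: 1511945134.326 65.0 0.5",
        "Timestamp Resp: 1511945134.326 0.3 0.1",
    ]
    for record in samples:
        print(classify(record), record)
```

Dropping the "slow-delivery" records from the logged set is one way to keep only the cases that point at the backend side, at the cost of hiding any delivery stalls that really are caused by our own stack.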
[19:29:25] <XioNoX>	 ema: indeed, that circuit flapped many times. It seems to be quite a rare event, and there is not much we can do. One option would be to enable damping, but the risk there is to lose all transport to/from esams.