[00:45:15] Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3795140 (faidon)
[10:16:15] there's been a few failed fetches in esams (text, upload and misc) earlier this morning
[10:16:28] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1511945134326&to=1511947947399&panelId=3&fullscreen&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All
[10:17:22] first spike between 09:04 and 09:08, second one ten minutes later between 09:14 and 09:18
[10:17:48] network hiccups perhaps?
[10:35:15] I do see some 'Inbound interface errors' on cr2-{esams,eqiad} a few minutes after those timeframes on https://librenms.wikimedia.org/alert-log/
[10:35:21] XioNoX: ^
[14:35:29] bblack: are we 100% sure that Resp[3]>60 has to be the client's fault?
[15:04:45] no, not at all
[15:07:06] but the original impetus for doing the logging was to find slow requests on the backend side
[15:08:06] and logging all Resp[3]>60 is orthogonal to that purpose, and even logging on Resp[2]>60 is mostly catching just Resp[3]>60 cases where the other timestamps are relatively acceptable
[15:08:15] we'd find our needles on the backend side better if those weren't in the set
[15:11:04] and unlike other long-timeout conditions, we know that it's at least possible Resp[3]>60 is the fault of the client/network rather than some issue in our stack, whereas the other cases definitely are an issue on our side.
[15:12:20] (we can't always send data to the client at maximum speed, even from Varnish's limited perspective. For a very short transfer we can, because Varnish would be done with it as soon as it's buffered up in the TCP window, but for small windows and/or longer transfers, the client actually has to ACK the bytes to make progress and avoid timing out for long periods in Resp[3])
[15:13:16] but anything we speculate about all such things is difficult to nail down without careful analysis through all the layers, so take nothing as 100% certainty
[19:29:25] ema: indeed, that circuit flapped many times. It seems to be quite a rare event, and there is not much we can do. One option would be to enable damping, but the risk there is to lose all transport to/from esams.
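A rough sketch of the Resp[2] / Resp[3] distinction discussed at 15:08 through 15:12, for anyone reading the log later. It assumes Resp[2] is the time since the start of the request and Resp[3] is the time since the previous timestamp, as in Varnish's Timestamp log records; the function name `classify_slow_request`, the category labels and the way the 60-second cutoff is applied are made up for illustration and are not the actual logging rules used in production.

```python
THRESHOLD = 60.0  # seconds; the cutoff mentioned in the discussion


def classify_slow_request(resp_since_start, resp_since_last, threshold=THRESHOLD):
    """Classify one request from its Resp timestamp fields.

    resp_since_start -- assumed to be Resp[2], seconds since the request started
    resp_since_last  -- assumed to be Resp[3], seconds spent delivering the response
    """
    # Time spent before delivery began (fetch, processing, queueing, ...).
    before_delivery = resp_since_start - resp_since_last

    if before_delivery > threshold:
        # Slow before any bytes went to the client: an issue on our side.
        return "stack"
    if resp_since_last > threshold:
        # Slow only while sending: possibly the client or the network.
        return "delivery"
    if resp_since_start > threshold:
        # Neither phase alone crossed the cutoff, but the total did.
        return "combined"
    return None


if __name__ == "__main__":
    # Hypothetical values, not taken from real logs.
    print(classify_slow_request(75.5, 75.0))  # slow delivery only
    print(classify_slow_request(90.2, 0.2))   # slow before delivery
    print(classify_slow_request(1.0, 0.2))    # not slow
```

Filtering out the "delivery" bucket is one way to get the effect described at 15:08:15: the remaining slow requests are the ones where the time was spent inside the stack.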