[08:17:12] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3145896 (10ema) This has started happening on cache_text as well. First occurrence I'm aware of on text was on 2017-03-30 between [[https://grafana.wikimedia.or...
[09:33:13] bblack: misc-codfw has had puppet disabled for 12h, can we re-enable it?
[10:42:12] ema: yes
[11:29:22] ema: re the cache_text mailbox stuff, is it just that they were building some backlog before but narrowly missing triggering alerts, or is the existence of any backlog at all new?
[11:31:25] bblack: so, none of the 503 spikes from today and yesterday actually triggered any alert AFAIK
[11:31:50] let me check on varnish-machine-stats whether they've had backlog in the past too
[11:34:33] looks like it didn't happen before on cp3032 and cp3040
[11:36:08] the mailbox spikes seem very brief
[11:36:13] they don't build up or persist like on upload
[11:36:43] yep
[11:36:45] if I had to guess, I'd bet on the small mailbox spike being a symptom (alongside the 503 spike) of something else going wrong
[11:37:11] there have been other spikes in the past on cp3041 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=21&fullscreen&orgId=1&from=1489849939540&to=1490960185818&var-server=cp3041&var-datasource=esams%20prometheus%2Fops
[11:37:34] yeah, but single digits aren't much of a spike
[11:38:00] mailbox lag in and of itself technically isn't a problem, especially if it's stable or manageable
[11:38:20] it's just when it runs away off towards infinity that things go crazy
[11:38:38] but even the bigger short spikes we've seen on esams text the past few days only reach ~300K
[11:38:53] that shouldn't be enough to represent locking up a significant fraction of text's storage or anything
[11:40:56] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1490934537304&to=1490947475709&var-server=cp3032&var-datasource=esams%20prometheus%2Fops
[11:41:12] notice the 'fetch failed' graph
[11:44:31] backend conns spike too
[11:44:37] on upload I think we start seeing fetch errors only after significant lag has been growing for a while; here, lag and fetch errors seem to correlate directly timewise
[11:45:24] yeah, I still think the lag isn't really a cause
[11:46:09] we're seeing basically a ton of different varnish stats all spike around the same time as a visible 503 spike. mailbox lag is just one of the random impacts of whatever else is happening.
[11:46:39] yep
[11:47:01] and BTW we need the response status codes in prometheus too!
[11:47:36] I've started working on that today, which should do the trick: https://phabricator.wikimedia.org/P5171
[11:49:42] any chance it correlates with eqiad varnish backend restarts?
[11:49:52] maybe the depooling isn't working for some reason
[11:50:09] (since we're seeing fetch failures, which from cp3 would be against eqiad)
[11:50:25] let's see
[11:51:04] nice paste! I guess the goal there is a framework to replace all the varnishxcache/varnishrls/etc/etc scripts?
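(The P5171 paste itself isn't reproduced in this log. Purely as an illustration of the idea under discussion, exporting per-status-code response counts to Prometheus by parsing varnishncsa output, a minimal Python sketch might look like the following; the metric name, port, and overall structure are invented for this sketch and are not the actual contents of the paste.)

    import subprocess
    from prometheus_client import Counter, start_http_server

    # Hypothetical metric name; the real implementation (P5171) may differ.
    RESPONSES = Counter('varnish_responses_total',
                        'Varnish responses by HTTP status code',
                        ['status'])

    def main():
        start_http_server(9131)  # port chosen arbitrarily for this sketch
        # '%s' is the response status code in varnishncsa's
        # Apache-style format-string syntax.
        proc = subprocess.Popen(['varnishncsa', '-F', '%s'],
                                stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            RESPONSES.labels(status=line.strip()).inc()

    if __name__ == '__main__':
        main()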
[11:51:38] it'd be nice to unify some of that where appropriate anyway, at least the globally-applicable ones
[11:52:17] yeah, that's the long-term goal indeed (replace the various scripts and unify those that can be unified)
[11:52:28] no correlation with the eqiad-text backend restarts
[11:52:31] reqstats, xcache, xcps, and statsd are all global to all clusters and re-parsing the same outputs :)
[11:52:35] the last restart happened 16 hours ago
[11:53:03] hmmm
[11:53:13] another obvious candidate would be network blips of some kind
[11:53:20] or virtual network blips in the form of ipsec blips
[11:53:37] or it could be triggered by some kind of anomalous user traffic too
[11:55:11] lunch needed, see you in a bit :)
[11:56:54] ok
[11:57:01] the storage stats are interesting over the broad range too
[11:59:56] on cp3032, a couple of hours before the actual spike, something on the order of ~3GB of storage suddenly freed up
[12:00:45] no, that's per-disk; more like ~6.4GB total
[12:54:20] ema: we should try to finish up the linux-4.9 deploy early next week, so we can get a shot at BBR testing ahead of the dc-switch stuff (which will confuse results)
[12:55:03] there's the ttl_cap=1d change pending too, which is also kind of a confusing overlap to test, given how long it takes to really see the ttl_cap changes
[12:55:31] (I think we're just beginning to see the real impact of the 3d one today; maybe the pattern will be more apparent by early next week)
[13:16:12] bblack: agreed, I'll start the 4.9 upgrades on Monday
[13:17:42] moritzm: ^
[13:22:52] great, let me know if I can help with anything. the kernel is running on 20 hosts without problems so far
[13:57:10] ema: if you have time - https://gerrit.wikimedia.org/r/#/c/345117/
[13:57:26] (decommissioning of analytics1027, reworked after the recent changes)
[13:59:51] elukey: the changes to misc.yaml look fine, but there are still references to the host in modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
[14:00:10] not sure if you've left them there on purpose :)
[14:02:43] ah yes, those (IIRC) should be removed as the very last step of decomm
[14:02:58] I'm assigning role spare to the host for the moment
[14:03:41] oh ok
[14:10:26] thanks!
[15:15:08] bblack: on cp3040 there was a drop in N struct smf a few minutes before the spike
[15:15:11] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=37&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1490831601308&to=1490880178315
[15:15:24] spike around 07:07
[15:15:54] so that would explain the type of disk activity you observed earlier (on cp3032)
[15:49:59] maybe we should extend the upload storage "experiment" to text?
[15:55:08] ema: the problem is, I don't think all text responses include Content-Length. some might, but there are definitely some that don't
[15:55:23] whereas with upload we're lucky to have one simplistic backend that always does
[16:00:13] bblack: you're right, it seems we're getting CL only on responses to POSTs
[16:00:48] see:
[16:00:50] varnishncsa -q 'ReqMethod ne "PURGE"' -F '%{Content-Length}i %r' | grep '^[0-9]'
[16:02:38] hehe, no, I got the format string wrong
[16:02:49] it should be -F '%{Content-Length}o %r'
[16:04:29] 10Traffic, 06Operations: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3146894 (10BBlack) We've been stalling on this a bit too long now. I'd like to start kicking off this process and getting in touch with Community as well. I've kinda bac...
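(For context on the format-string fix above: in varnishncsa's Apache-style format syntax, %{Content-Length}i reads the request header, which is why only POSTs seemed to carry it, while %{Content-Length}o reads the response header. The corrected spot-check described in the next message could also be scripted; a minimal Python sketch of the same tally follows, relying only on varnishncsa printing '-' for an absent header.)

    import subprocess
    from collections import Counter

    # Count backend responses with and without a Content-Length header.
    # varnishncsa emits '-' when the header is absent; run this for a few
    # seconds and interrupt with Ctrl-C to print the tally.
    proc = subprocess.Popen(
        ['varnishncsa', '-b', '-F', '%{Content-Length}o'],
        stdout=subprocess.PIPE, text=True)
    counts = Counter()
    try:
        for line in proc.stdout:
            counts['with CL' if line.strip() != '-' else 'without CL'] += 1
    except KeyboardInterrupt:
        proc.terminate()
    print(dict(counts))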
[16:10:17] yeah, I tried `varnishncsa -b -F '%{Content-Length}o %r'` for about 5 seconds and got ~1600 responses without CL and ~1000 with
[20:11:34] 10netops, 06Operations, 10hardware-requests: MX480 & QFX5100 for esams (April 2017) - https://phabricator.wikimedia.org/T161930#3147522 (10faidon)
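(As a footnote to the morning's mailbox-lag discussion: the lag value on the dashboards is the gap between objects mailed to Varnish's expiry thread and objects that thread has processed. A minimal way to check it by hand on a cache host, assuming the Varnish 4.x counter names and the flat JSON layout that varnishstat -j produced in that version:)

    import json
    import subprocess

    # Mailbox lag = objects mailed to the expiry thread minus objects the
    # expiry thread has processed. Counter names are from Varnish 4.x,
    # where `varnishstat -j` emits a flat JSON object keyed by counter name.
    stats = json.loads(subprocess.check_output(['varnishstat', '-j']))
    lag = (stats['MAIN.exp_mailed']['value']
           - stats['MAIN.exp_received']['value'])
    print('mailbox lag:', lag)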