[08:17:12] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3145896 (10ema) This has started happening on cache_text as well. First occurrence I'm aware of on text was on 2017-03-30 between [[https://grafana.wikimedia.or...
[09:33:13] bblack: misc-codfw has had puppet disabled for 12h, can we re-enable it?
[10:42:12] ema: yes
[11:29:22] ema: re the cache_text mailbox stuff, is it just that they were building some backlog before but narrowly missing triggering alerts, or is the existence of any backlog at all new?
[11:31:25] bblack: so, none of the 503 spikes from today and yesterday actually triggered any alert AFAIK
[11:31:50] let me check on varnish-machine-stats whether they've had backlog in the past too
[11:34:33] looks like it didn't happen before on cp3032 and cp3040
[11:36:08] the mailbox spikes seem very brief
[11:36:13] they don't build up or persist like on upload
[11:36:43] yep
[11:36:45] if I had to guess, I'd bet on the small mailbox spike being a symptom (alongside the 503 spike) of something else going wrong
[11:37:11] there have been other spikes in the past on cp3041 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=21&fullscreen&orgId=1&from=1489849939540&to=1490960185818&var-server=cp3041&var-datasource=esams%20prometheus%2Fops
[11:37:34] yeah, but single digits aren't much of a spike
[11:38:00] mailbox lag in and of itself technically isn't a problem, especially if it's stable or manageable
[11:38:20] it's just when it runs away off towards infinity that things go crazy
[11:38:38] but even the bigger short spikes we've seen on esams text the past few days only reach ~300K
[11:38:53] that shouldn't be enough to represent locking up a significant fraction of text's storage or anything
[11:40:56] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1490934537304&to=1490947475709&var-server=cp3032&var-datasource=esams%20prometheus%2Fops
[11:41:12] notice the 'fetch failed' graph
[11:44:31] backend conns spike too
[11:44:37] on upload I think we start seeing fetch errors only after significant lag has been growing for a while; here, lag and fetch errors seem to correlate directly timewise
[11:45:24] yeah, I still think the lag isn't really a cause
[11:46:09] we're seeing basically a ton of different varnish stats all spike around the same time as a visible 503 spike. mailbox lag is just one of the random impacts of whatever else is happening.
[11:46:39] yep
[11:47:01] and BTW we need the response status codes in prometheus too!
[11:47:36] I've started working on that today, which should do the trick: https://phabricator.wikimedia.org/P5171
[11:49:42] any chance it correlates with eqiad varnish backend restarts?
[11:49:52] maybe the depooling isn't working for some reason
[11:50:09] (since we're seeing fetch failures, which from cp3 would be against eqiad)
[11:50:25] let's see
[11:51:04] nice paste! I guess the goal there is a framework to replace all the varnishxcache/varnishrls/etc/etc scripts?
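(The P5171 paste itself isn't reproduced in this log. Purely as an illustration of the idea under discussion, exporting per-status-code response counts to Prometheus by parsing varnishncsa output, a minimal Python sketch might look like the following; the metric name, port, and overall structure are invented for this sketch and are not the actual contents of the paste.)

    import subprocess
    from prometheus_client import Counter, start_http_server

    # Hypothetical metric name; the real implementation (P5171) may differ.
    RESPONSES = Counter('varnish_responses_total',
                        'Varnish responses by HTTP status code',
                        ['status'])

    def main():
        start_http_server(9131)  # port chosen arbitrarily for this sketch
        # '%s' is the response status code in varnishncsa's
        # Apache-style format-string syntax.
        proc = subprocess.Popen(['varnishncsa', '-F', '%s'],
                                stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            RESPONSES.labels(status=line.strip()).inc()

    if __name__ == '__main__':
        main()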
[11:51:38] it'd be nice to unify some of that where appropriate anyway, at least the globally-applicable ones
[11:52:17] yeah, that's the long-term goal indeed (replace the various scripts and unify those that can be unified)
[11:52:28] no correlation with the eqiad-text backend restarts
[11:52:31] reqstats, xcache, xcps, and statsd are all global to all clusters and re-parsing the same outputs :)
[11:52:35] the last restart happened 16 hours ago
[11:53:03] hmmm
[11:53:13] another obvious candidate would be network blips of some kind
[11:53:20] or virtual network blips in the form of ipsec blips
[11:53:37] or it could be triggered by some kind of anomalous user traffic too
[11:55:11] lunch needed, see you in a bit :)
[11:56:54] ok
[11:57:01] the storage stats are interesting over the broad range too
[11:59:56] on cp3032, a couple of hours before the actual spike, something on the order of ~3GB of storage suddenly freed up
[12:00:45] no, that's per-disk; more like ~6.4GB total
[12:54:20] ema: we should try to finish up the linux-4.9 deploy early next week, so we can get a shot at BBR testing ahead of the dc-switch stuff (which will confuse results)
[12:55:03] there's the ttl_cap=1d change pending too, which is also kind of a confusing overlap to test, given how long it takes to really see the ttl_cap changes
[12:55:31] (I think we're just beginning to see the real impact of the 3d one today; maybe the pattern will be more apparent by early next week)
[13:16:12] bblack: agreed, I'll start the 4.9 upgrades on Monday
[13:17:42] moritzm: ^
[13:22:52] great, let me know if I can help with anything. the kernel is running on 20 hosts without problems so far
[13:57:10] ema: if you have time - https://gerrit.wikimedia.org/r/#/c/345117/
[13:57:26] (decommissioning of analytics1027, reworked after the recent changes)
[13:59:51] elukey: the changes to misc.yaml look fine, but there are still references to the host in modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
[14:00:10] not sure if you've left them there on purpose :)
[14:02:43] ah yes, those (IIRC) should be removed as the very last step of decomm
[14:02:58] I'm assigning role spare to the host for the moment
[14:03:41] oh ok
[14:10:26] thanks!
[15:15:08] bblack: on cp3040 there was a drop in N struct smf a few minutes before the spike
[15:15:11] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=37&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1490831601308&to=1490880178315
[15:15:24] spike around 07:07
[15:15:54] so that would explain the type of disk activity you observed earlier (on cp3032)
[15:49:59] maybe we should extend the upload storage "experiment" to text?
[15:55:08] ema: the problem is, I don't think all text responses include Content-Length. some might, but there are definitely some that don't
[15:55:23] whereas with upload we're lucky to have one simplistic backend that always does
[16:00:13] bblack: you're right, it seems we're getting CL only on responses to POSTs
[16:00:48] see:
[16:00:50] varnishncsa -q 'ReqMethod ne "PURGE"' -F '%{Content-Length}i %r' | grep '^[0-9]'
[16:02:38] hehe, no, I got the format string wrong
[16:02:49] it should be -F '%{Content-Length}o %r'
[16:04:29] 10Traffic, 06Operations: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3146894 (10BBlack) We've been stalling on this a bit too long now. I'd like to start kicking off this process and getting in touch with Community as well. I've kinda bac...
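(For context on the format-string fix above: in varnishncsa's Apache-style format syntax, %{Content-Length}i reads the request header, which is why only POSTs seemed to carry it, while %{Content-Length}o reads the response header. The corrected spot-check described in the next message could also be scripted; a minimal Python sketch of the same tally follows, relying only on varnishncsa printing '-' for an absent header.)

    import subprocess
    from collections import Counter

    # Count backend responses with and without a Content-Length header.
    # varnishncsa emits '-' when the header is absent; run this for a few
    # seconds and interrupt with Ctrl-C to print the tally.
    proc = subprocess.Popen(
        ['varnishncsa', '-b', '-F', '%{Content-Length}o'],
        stdout=subprocess.PIPE, text=True)
    counts = Counter()
    try:
        for line in proc.stdout:
            counts['with CL' if line.strip() != '-' else 'without CL'] += 1
    except KeyboardInterrupt:
        proc.terminate()
    print(dict(counts))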
[16:10:17] yeah, I tried `varnishncsa -b -F '%{Content-Length}o %r'` for about 5 seconds and got ~1600 responses without CL and ~1000 with
[20:11:34] 10netops, 06Operations, 10hardware-requests: MX480 & QFX5100 for esams (April 2017) - https://phabricator.wikimedia.org/T161930#3147522 (10faidon)
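(As a footnote to the morning's mailbox-lag discussion: the lag value on the dashboards is the gap between objects mailed to Varnish's expiry thread and objects that thread has processed. A minimal way to check it by hand on a cache host, assuming the Varnish 4.x counter names and the flat JSON layout that varnishstat -j produced in that version:)

    import json
    import subprocess

    # Mailbox lag = objects mailed to the expiry thread minus objects the
    # expiry thread has processed. Counter names are from Varnish 4.x,
    # where `varnishstat -j` emits a flat JSON object keyed by counter name.
    stats = json.loads(subprocess.check_output(['varnishstat', '-j']))
    lag = (stats['MAIN.exp_mailed']['value']
           - stats['MAIN.exp_received']['value'])
    print('mailbox lag:', lag)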