[01:03:53] upload esams upgrade process started, cp3034 upgraded and pooled
[01:04:04] all good so far
[01:23:00] I'm semi-here, but not completely yet :)
[01:23:21] nice :)
[01:23:26] librenms impact seems pretty minimal when cascading through them one at a time
[01:23:32] yep
[01:23:32] looking at cr2-eqiad.wikimedia.org :: Port xe-4/1/3
[01:23:45] bbiab
[01:27:45] carrying on with cp3035
[02:26:46] 3034, 3035 and 3036 done, we're looking good. I'm off to bed :)
[02:30:08] ok
[02:34:51] starting off on 3037
[02:46:18] I'm gonna speed it up a bit from here, in part because it's going so smoothly already, and in part to minimize the bad vslp-vs-chash window when they're half-upgraded :)
[02:46:50] (also, getting it done a bit faster before the traffic ramps back up is probably better than running it longer during the rampup)
[04:47:14] all done
[04:47:45] peaked at ~1.92Gbps on the esams link (5 min avgs in librenms). the peak was pretty sharp and brief, when the last remaining v3 went away
[04:49:08] so... next step is basically (1) roll through v4 upgrades in eqiad then (2) flip routing back to normal (codfw->eqiad, esams->eqiad)
[04:50:36] based on how esams went, eqiad timing should be a breeze at ~15 min spacing. same basic loadings and cache miss fallout, etc, but no long backhaul link involved for the misses
[04:51:28] ideal start time would maybe be around 07 or 08 UTC, but even starting at 10 UTC would work fine
[05:21:02] <_joe_> bblack: if you guys are tired and need someone to help, I'm around
[06:36:56] nice!
[09:48:00] we had an increase in the 503 rate in upload ulsfo, looks like cp4006 is broken somehow
[09:48:09] I've just depooled it
[09:49:01] -- ExpKill LRU_Fail
[09:49:01] -- FetchError Could not get storage
[10:50:23] started upgrading eqiad with cp1048
[12:33:57] nice, hmmm
[12:34:03] looking at 4006
[12:35:38] bblack: varnishlog in ~ema/y.log
[12:36:01] that's the backend varnishlog, obtained with varnishlog -g request -q 'RespStatus ~ 50'
[12:37:32] hmmm
[12:37:41] I see the LRU_Fail, on both main1+main2
[12:39:46] so the simplest explanation is there's actually an object that can't fit in main disk storage
[12:39:56] but that seems unlikely, and would've just spread to another node too
[12:40:03] (and hit codfw on the way)
[12:40:33] more likely the object can fit, it just took the eviction of too many other objects to make space and varnish gave up
[12:41:20] which storage backend is this?
[12:41:30] 4006 is one of our longest-running (at present) v4 nodes right?
[12:41:32] mark: file
[12:42:13] the main thing I'm worried about here is whether we're due for them all to eventually fall to this
[12:43:00] mmh cp4005's varnishes started roughly at the same time as cp4006's
[12:43:33] the request being processed when the LRU fail happens is a miss fetching a small object from codfw (e.g. ~16KB object)
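A slightly expanded variant of that varnishlog query, narrowed to the tags relevant to the storage failure, looks roughly like this (a sketch; run against the backend varnishd instance on the affected host):

```bash
# Group by request, show only the 5xx transactions, and print just the
# tags that matter here (backend URL, fetch error, expiry-kill messages):
varnishlog -g request -q 'RespStatus >= 500' \
    -i BereqURL,RespStatus,FetchError,ExpKill
```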
[12:43:43] you'd think it could always find a way to evict room for a ~16KB object :P
[12:43:47] right
[12:44:29] in looking through other related bugs, I think our current varnish code has a minor bug that it ignores the nuke limit argument
[12:44:43] which we have set to a non-default (higher) value in v3
[12:44:52] oh
[12:45:02] it could be that with how large and diverse our object store is, we need that higher nuke limit to avoid corner cases
[12:45:30] https://github.com/varnishcache/varnish-cache/issues/1764
[12:48:59] mmh we are getting some 503s in esams as well :(
[12:49:33] different pattern though, they're spikes, not plateaus
[12:51:40] -- FetchError no backend connection
[12:51:51] from the FE?
[12:51:52] see cp3034:~ema/503-backend.log
[12:51:53] backend
[12:52:08] we might be hitting the connection limit there
[12:52:29] it's rather low, and default-streaming might cause more connections
[12:54:43] ema: https://gerrit.wikimedia.org/r/#/c/309982/1/modules/role/manifests/cache/upload.pp
[12:55:35] bblack: lgtm
[12:56:57] salting puppet for it
[12:58:19] cp1074 and cp1099 are the last v3 upload hosts, for the record
[12:59:12] yeah....
[13:01:02] well those 503 spikes in esams were relatively low-rate and hopefully that patch was it
[13:01:23] I'm more worried about the 4006 issue, because it seems unlikely 4006 is special
[13:02:04] so it had been ~4 days
[13:02:48] if this starts spreading, one potential way to mitigate might be to drop our upload TTL cap from 7d to 3d and wipe through the storages (relatively slowly as long as the problem isn't huge)
[13:03:07] at least in ulsfo, which has a smaller total storage set size than the rest, by a lot
[13:03:17] (6 nodes there, vs 10-12 for the rest)
[13:03:27] it would be interesting to see if it happens elsewhere though. So far it looks like a ulsfo-only issue
[13:03:36] yeah
[13:04:00] codfw's been online longer, and with ulsfo+esams going through it should have a fairly-complete warm set
[13:04:22] bblack: OK to carry on with 1074 and 1099 in the meantime? Done with salt?
[13:04:28] so you'd think it would get hit too if it wasn't ulsfo-specific (and thus likely related to total storage overflowing by a lot)
[13:04:33] ema: yes
[13:05:33] I wonder what the largest single object is in swift?
[13:05:58] I know there's upload size limits, but those get bypassed by people doing special side-loads of content with special privs and such
[13:06:19] godog: any easy way to find out what the largest swift upload object is?
[13:08:23] bblack: good question, not easy afaik besides looking inside the containers, though it should all be inside the images table in mw too
[13:13:50] ema: in normal success logs we see this from VSLP
[13:13:52] - Debug "VSLP picked preferred backend 0 for key c98f9e48"
[13:14:20] but in the 503 cases with the "no backend connection" issue:
[13:14:22] - Debug "VSLP picked preferred backend 4 for key fc282ea"
[13:14:25] - Debug "VSLP found no alternative backend in unhealthy"
[13:16:12] looking at the code that emits that, it seems to indicate our backends (cp2 in this case) were "unhealthy" from cp3's perspective...
[13:17:37] the message seems to suggest that it didn't find any unhealth backend...
[13:17:52] *unhealthy
[13:17:55] but our probes are just the /varnishcheck thing for health, they shouldn't be affected by anything complex, other than network
[13:17:58] the message is odd
[13:18:25] https://github.com/wikimedia/operations-software-varnish-libvmod-vslp/blob/5e913b62fb16b187c7ba76c75a56e148c9cdb0d6/src/vslp_dir.c#L487
[13:18:42] ^ but it comes from within a clause where we get false for if (be->healthy(be, ctx->bo, NULL))
[13:19:50] right, and then vslp_choose_next_healthy returns something < 0
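For reference, the quickest way to see whether the backend varnish actually considers any of its peers unhealthy, independent of what the VSLP debug line claims, is something like this (a sketch; run on the suspect host against the default/backend varnishd instance):

```bash
# The Probe column shows the current health state per backend,
# e.g. "Healthy 8/8" vs "Sick 0/8":
varnishadm backend.list
```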
[13:22:08] only somewhat related, but the peak network I saw during the esams conversion wasn't the "final" fallout. even after all the conversions were done, with the user load ramping in the link BW kept ramping in too. it's probably artificially higher and relative to user load for a day at least.
[13:22:34] so far it peaked at ~2.6Gbps back around 08:40-ish
[13:22:53] but it's already declining back away from that number, not getting worse
[13:22:58] good
[13:23:02] so I doubt it's a pure network problem
[13:24:30] done, all upload hosts upgraded
[13:27:12] \o/
[13:28:29] the larger earlier 503 from cp4006 tends to mask it on the graph, but ulsfo has a similar pattern now to esams on that stuff
[13:29:04] and the #conns hack didn't fix it
[13:29:56] great work :)
[13:30:43] nice! yeah swift is taking a bit more requests than usual too of course, in waves
[13:33:37] bblack: this is common to all frontend 503s in esams:
[13:33:38] Fetch_Body 3 length stream
[13:35:23] I'll take a break, bbiab
[13:35:30] well the FE 503s in esams are from the backend saying "Fetch Failed" explicitly, which it's doing because of failed conn to cp2?
[13:35:33] ok
[13:49:29] possibly not directly relevant
[13:49:56] but I was trying to see what our backend connection concurrency really was, and apparently somewhere between v3->v4 they broke the concurrency stat. it always-increments now
[13:50:06] there's a fixed bug with backport to 4.1 branch, but not in our 4.1.3
[13:50:09] https://github.com/varnishcache/varnish-cache/commit/36ce6871100424c18cda1f7903c06aae50824c5a
[13:50:46] it seems at first glance that this just affects stats though, not max_conns stuff
[13:58:52] potentially-relevant from 4.1 branch (unreleased):
[13:58:54] https://github.com/varnishcache/varnish-cache/commit/c3374b185e7db43db673590fdefc4424ca8de610
[14:00:19] in the case reported that caused a panic, but it's in the right area of code, and perhaps the real fallout is variable, especially depending on alternate director (vslp)
[14:08:35] https://gerrit.wikimedia.org/r/#/c/309993/
[14:10:10] another thing we could experiment with on the VSLP-unhealthy thing is removing health probes for vbe->vbe completely (which makes it always-ok, like our connections to applayer through LVS)
[14:10:21] probably not an awesome long term idea, but it could narrow down what's going on
[14:11:08] FWIW, I haven't actually observed an unhealthy backend or a healthcheck history showing any failure, but I've done limited looking
[14:26:26] cp104[89] seems to have logged "Assert error in child_sigsegv_handler(), mgt/mgt_child.c line 282:" for varnish-frontend
[14:26:30] vk didn't like it
[14:26:56] but from the sal these hosts have been done hours ago right?
[14:34:02] Sep 12 14:16:19 cp1049 varnishd[9699]: Child (9701) said varnishd: vmod_netmapper.c:175: vmod_map: Assertion `ip_string' failed.
[14:34:05] Sep 12 14:16:24 cp1049 varnishd[9699]: Child (9701) not responding to CLI, killing it.
[14:34:08] Sep 12 14:16:24 cp1049 varnishd[9699]: Child (9701) not responding to CLI, killing it.
[14:34:11] Sep 12 14:16:24 cp1049 varnishd[9699]: Child (9701) died signal=6
[14:34:13] Sep 12 14:16:24 cp1049 varnishd[9699]: Child (9701) Last panic at: Mon, 12 Sep 2016 14:16:24 GMT
[14:34:16] "Assert error in child_sigsegv_handler(), mgt/mgt_child.c line 282:
[14:34:19] Condition(Segmentation fault by instruction at 0x77000025e5) not true....
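Where to look after a child death like the one pasted above, roughly (a sketch; the "frontend" instance name and the systemd unit name are assumptions based on the varnish-frontend mention earlier):

```bash
# The manager keeps the last panic around; journald has the assert/kill trail:
varnishadm -n frontend panic.show
journalctl -u varnish-frontend --since '2 hours ago' | grep -iE 'panic|assert|died'
```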
[14:47:41] now it's cp4015 behaving like cp4006 this morning
[14:49:36] as in: returning lots of 503s, the reason is not 'Could not get storage'
[14:51:05] will depool cp4015, restart and repool cp4006's backend
[14:54:48] I sense there must be something pretty weird with mod_vslp
[15:00:53] it could be a specific object that ends up rehashing elsewhere on depool, too
[15:01:20] so, netmapper failing in eqiad, that's interesting
[15:03:42] I'm gonna guess that ip_string can now be NULL, whereas it couldn't before
[15:05:38] still seems odd
[15:05:44] it implies there was no client.ip
[15:06:14] maybe when auto-convert of client.ip to a string for VCL fails, it ends up not setting the header, which ends up being NULL
[15:06:37] either way, it's not hard to fix vmod_netmapper to accept NULL ip_string and treat it as non-matching
[15:09:21] re: cp4015, it started failing fetches around 14:20 https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20caches%20ulsfo&h=cp4015.ulsfo.wmnet&r=hour&z=default&jr=&js=&st=1473692766&event=hide&ts=0&v=1094202&m=varnish.MAIN.fetch_failed&vl=N%2Fs&ti=Fetch%20failed%20all%20causes&z=large
[15:09:52] which is also when the number of waitinglist structs made reached ~2.2k https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20caches%20ulsfo&h=cp4015.ulsfo.wmnet&r=hour&z=default&jr=&js=&st=1473692766&event=hide&ts=0&v=515&m=varnish.MAIN.n_waitinglist&vl=N&ti=waitinglist%20structs%20made&z=large
[15:10:09] which is also basically the peak number of threads created https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Upload%20caches%20ulsfo&h=cp4015.ulsfo.wmnet&r=hour&z=default&jr=&js=&st=1473692766&event=hide&ts=0&v=500&m=varnish.MAIN.threads&vl=N&ti=Total%20number%20of%20threads&z=large
[15:10:22] uh sorry for the endless URLs :)
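The counters behind those three ganglia graphs can also be read straight off the host, which is handy when the graphs lag (a sketch):

```bash
# fetch_failed, the waitinglist (coalescing) structs, and the thread count:
varnishstat -1 -f MAIN.fetch_failed -f MAIN.n_waitinglist -f MAIN.threads
```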
[15:14:44] https://gerrit.wikimedia.org/r/#/c/310019/ <- netmapper fixup for master branch, new version number, etc
[15:15:00] you want to debianize->package, etc?
[15:15:31] 10Traffic, 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2628884 (10hashar)
[15:20:52] so waiting list stuff is basically coalesce
[15:21:04] it could make sense that threads would stack up if we have some kind of coalescing problem
[15:21:56] we could have some kind of hit-for-pass related issue there
[15:24:12] normally when it comes to pass-mode stuff, we'd try to either unconditionally pass on the recv side (attributes of the request), or hit-for-pass on the response side, because otherwise we'll suffer coalesce stalls on pass traffic
[15:24:35] (it will effectively serialize the pass-traffic to a given URL to hit the backend one req at a time)
[15:25:30] we did recently add that CL:0+200 hack, which does a response-based uncacheable+TTL=0, which would suffer from coalesce problems if we ran through that path repeatedly and rapidly
[15:26:02] and of course we know from our logging on that issue that there *are* legitimate/unavoidable/acceptable CL:0+200 responses
[15:26:24] maybe we should remove that hack?
[15:26:43] we think that problem's gone in v4-only installs anyways
[15:27:14] true, let's give it a try
[15:27:40] I'm prepping reverts
[15:27:48] re: netmapper, how urgent do you think it is? Because as we know we cannot just install the new vmod, we also need to restart the varnishds
[15:28:49] we should at least have the package ready to go I think
[15:28:55] ok
[15:29:43] we can maybe workaround in VCL for now, too
[15:36:06] salting out those 2x VCL fixups (revert CL:0+200 hack, probably-prevent NULL ip_string argument)
[15:36:35] alright, I'm gonna package netmapper 1.4
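"Salting out" a VCL fixup here means forcing a puppet run across the cluster rather than waiting for the normal agent interval; roughly along these lines (a sketch; the targeting mirrors the salt cmd.run usage later in this log, but the exact puppet invocation is an assumption):

```bash
# Push the merged puppet/VCL change to all upload caches in small batches:
salt -v -t 120 -b 20 -G cluster:cache_upload cmd.run 'puppet agent --test'
```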
[15:40:58] those esams mini-503-spikes seem to have died off a while back
[15:41:18] I thought my 10K conns thing didn't do it, when I was looking shortly after, hmmmmm
[15:42:19] it merged and deployed closer to like 13:05, and then spikes fell off around 13:25-ish
[15:42:33] maybe it took a while for all threads to leave the old VCL behind completely?
[15:43:29] ulsfo's similar mini-503-spikes died off around the same timeframe, they're again just harder to see with the slightly-larger plateaus in play
[15:46:31] deploying the routing stuff back into eqiad now too
[15:48:22] actually, a more-interesting timing correlation is that the mini-503-spikes in ulsfo+esams died off very shortly after you started upgrading the final v3->v4 in eqiad heh
[15:48:31] that's a very strange correlation to explain, though
[15:48:35] yeah
[15:55:53] uh it's almost meeting time already
[15:56:44] yeah
[15:56:53] time goes fast when having fun
[15:57:00] :)
[16:07:16] I haven't thought through it all yet, but so I don't forget later: there may be some interaction between upload's vcl_config pass_random settings and the new pass-all-range stuff...
[16:07:28] (maybe we need to rethink pass_random there)
[16:21:37] ema: another ulsfo 503 plateau appears...
[16:22:43] bblack: yep, let's see which machine is responsible this time
[16:22:55] is 4015 still depooled from before?
[16:23:17] yeah it is
[16:24:07] we're getting 503s from all the ulsfo hosts except for cp4015
[16:24:35] well, it's depooled :)
[16:24:44] right :)
[16:25:53] - Debug "VSLP picked alternative backend 0 for key 5b91ad1c in healthy"
[16:26:38] I still don't see health issues on be->be in varnishstat
[16:27:06] the "backend fail" metrics are 0 on all hosts yeah
[16:27:28] MAIN.backend_fail, that is
[16:28:22] VSLP thing seems to be a red herring, yeah
[16:28:31] should we in the meanwhile restart the failing varnish backends?
[16:28:36] we get the same 503s with no strange VSLP message
[16:28:42] all of them? :)
[16:28:52] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2629309 (10chasemp) a:05chasemp>03yuvipanda
[16:28:52] a couple? :)
[16:29:19] let's wipe and then repool 4015, and wipe/restart 4005?
[16:29:38] OK, I'll do cp4015
[16:30:13] hmmm
[16:30:18] mmmh codfw?
[16:30:18] I think cp2 is failing it now
[16:30:21] yeah
[16:30:34] after changing routing
[16:30:40] hold on the ulsfo stuff, other than repooling a fresh 4015
[16:30:47] ok
[16:31:00] I see in the ulsfo 503s, they're getting an actual literal 503 response from codfw
[16:31:12] it's codfw itself that's now failing on be->be to eqiad
[16:31:37] should I also restart cp4015's frontend?
[16:32:30] sure
[16:35:27] the cp2 backend is synthesizing a local 503 when it tries to fetch from eqiad
[16:37:23] so codfw has never had issues before, it really seems there's something odd when hitting another cache as opposed to going straight to the applayer
[16:37:29] yes
[16:37:40] it's definitely a varnish-be -> varnish-be only issue
[16:37:50] we could revert the routing change and depool ulsfo perhaps?
[16:37:57] https://gerrit.wikimedia.org/r/#/c/310037/
[16:38:03] let's try this first? ^
[16:38:33] why not
[16:39:18] RECOVERY - Ulsfo
[16:39:55] from 4015 coming online?
[16:40:22] or that alert is just very screwy
[16:40:36] yeah, the graph is still not pretty
[16:42:08] I'm assuming so far that LRU_Fail and this are separate, even though the graph fallout looks so similar
[16:42:17] it's kind of odd in that sense...
[16:42:48] so, -probe VCL reloads done globally, but it may take a while to have "real" effect?
[16:44:50] mmh maybe
[16:45:51] no impact on codfw->eqiad 503 so far
[16:46:23] is there some way to get mega-debug-output from varnishlog on these?
[16:46:39] not afaik
[16:46:48] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=large&h=cp4005.ulsfo.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Upload+caches+ulsfo
[16:47:05] looks like there's been an Out net drop on cp4005?
[16:47:11] oh, these are LRU_Fail on codfw
[16:47:36] -- 718353566 ExpKill b LRU_Fail
[16:47:36] -- 718353566 FetchError b Could not get storage
[16:47:45] that's on cp2 503 trying to fetch from cp1
[16:49:25] I'm looking at 2002, is it all?
[16:49:57] yeah I think it's all from 2002
[16:50:05] I'm gonna depool 2002 backend and restart->repool
[16:51:40] that worked
[16:51:45] I'm not seeing other 503s on codfw backends
[16:52:43] and now in ulsfo it's only cp4013
[16:52:59] heh
[16:53:04] is that new?
[16:53:20] a few minutes ago it was all of them :)
[16:53:33] I guess the 2002 depool helped
[16:54:25] well all the ones I was observing were based on 2002 LRU_Failed
[16:54:38] but now 4013's backend 503 is synthetic there, without LRU_Failed AFAICS
[16:55:27] no, I think it *is* LRU_Failed on cp4013
[16:55:32] or at least I just logged one
[16:55:37] no, it is LRU_Failed
[16:55:43] heh
[16:55:47] yeah my logging was messed up
[16:56:12] depool->restart->repool, just the backend?
[16:56:36] yeah
[16:56:54] so, I don't think this is a vbe->vbe comms problem. it's just a general vbe LRU_failed problem, and it multiplies and gets more apparent when bes are fetching from other bes
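The counters worth watching on a backend suspected of this are the LRU/eviction ones; something like this per host (a sketch):

```bash
# n_lru_nuked climbing is healthy eviction pressure; fetch_failed climbing
# alongside LRU_Fail log entries is the bad pattern:
varnishstat -1 | grep -E '^MAIN\.(n_lru_nuked|n_lru_moved|n_expired|fetch_failed)'
```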
[16:58:19] done
[16:58:54] the crazy thing is even when it's happening it's sub-1%
[16:58:57] looking better in ulsfo
[16:59:07] well
[16:59:10] maybe anyways
[16:59:41] it might be all miss-fetch for the LRU_Failed-ing backend, but the stats work out to sub-1% after all the hits the caches in front absorb, and chashing
[16:59:56] will be interesting to see if an LRU_Failed node ever does succeed at fetching misses
[17:00:18] since lack of probe doesn't help, will revert that
[17:00:35] ok
[17:02:46] mmh, esams has not been affected for > 1 hour
[17:04:20] cp4014 is freaking out
[17:04:29] LRU_Fail
[17:05:18] confctl --quiet select name=`hostname -f`,service='varnish-be.*' set/pooled=no; service varnish restart; confctl --quiet select name=`hostname -f`,service='varnish-be.*' set/pooled=yes
[17:05:25] ^ standard copypasta quickfix
[17:05:41] can you get it?
[17:06:36] bblack: want me to do that on cp4014?
[17:06:59] yeah
[17:07:06] sure
[17:07:24] I think it may still ask y/n, I didn't try since I last edited that commandline heh
[17:07:50] so I'm looking at the LRU_Fail code
[17:08:14] it does ask y/N :)
[17:08:15] bin/varnishd/cache/cache_expire.c -> EXP_NukeOne
[17:08:49] one interesting mystery there is this: we're getting the varnishlogged "LRU_Fail" message from:
[17:08:52] VSLb(wrk->vsl, SLT_ExpKill, "LRU_Fail");
[17:09:19] but above that when it's supposed to be looping through the LRU list looking for candidates, it does this per candidate:
[17:09:22] VSLb(wrk->vsl, SLT_ExpKill, "LRU_Cand p=%p f=0x%x r=%d",
[17:09:25] oc, oc->flags, oc->refcnt);
[17:09:31] and I don't think we're observing any LRU_Cand messages at all...
[17:09:37] and we don't see that in the logs, right
[17:09:50] does that imply the LRU list is empty? seems odd
[17:11:13] well but the LRU_Fail debug statement happens when oc == NULL
[17:12:01] yeah ignore me
[17:14:37] OC_EF_OFFLRU is a flag I don't understand
[17:15:05] apparently objects can exist in storage and not be in the LRU? that seems strange to me. I would've assumed the LRU held refs to all objects, just a question of ordering...
[17:15:24] documentation of those flags' meanings would be awesome :P
[17:20:50] https://www.varnish-cache.org/trac/wiki/ArchitectureLRU (very old)
[17:21:10] so we did get a LRU_Cand, but not followed by an LRU_Fail
[17:21:30] that one failed with FetchError straight insufficient bytes
[17:21:43] see ema@cp4015:z.log
[17:23:17] bblack: https://www.varnish-cache.org/forum/topic/15499
[17:23:29] > It seeems you're running with file storage. This can cause problems due to fragmentation of the storage space, which can result in allocation failures
[17:24:28] lol
[17:24:35] "I suppose the question now is wether this is normal behaviour when the cache is full? I wouldn't expect Varnish to blow up and start returning 503s"
[17:24:40] yeah
[17:24:47] -> "This is a fundamental issue with the file backend under certain workloads. This is why malloc is the default now."
[17:24:53] fucking ridiculous
[17:25:25] I've been operating under the assumption that -sfile is a lot like -smalloc, just that it can map in a whole lot more memory, backed by an FS
[17:25:42] since it has no persistence, there's little fundamental reason it should differ from smalloc behavior
[17:26:29] but apparently, it just falls over and dies once the storage fills up?
[17:26:41] that sounds lame
[17:26:42] at least "under certain workloads"
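The file stevedore exports its own counter group, so how full (and roughly how fragmented) each -sfile store is can be eyeballed per host with something like this (a sketch; the interesting gauges are g_bytes vs g_space, plus the g_smf_* free-segment counts where present):

```bash
# One SMF.<storename>.* block per -sfile storage configured on the host:
varnishstat -1 | grep '^SMF\.'
```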
[17:27:04] so, one obvious thing we can do here is go back to -spersist
[17:27:11] err -sdeprecated_persistent
[17:27:22] which was exploding in different ways though, wasn't it?
[17:27:35] was it?
[17:27:50] I think we removed that one patch that was panicking in a debug message. or did we not yet?
[17:27:58] we did
[17:28:17] is that updated code actually-deployed everywhere? at least in upload?
[17:28:25] I think that's the only real blowup we saw with persist
[17:28:31] yeah we dropped the patch in 4.1.3-1wm2
[17:28:39] which is installed in upload
[17:30:19] we could also try playing with lru_interval and/or nuke_limit (which might be ineffective, that earlier bugreport...)
[17:30:38] nuke_limit sounds promising (if it is effective)
[17:30:50] lru_interval seems to basically be: "if an object was bumped to the top of the LRU list less than X ago, don't bump it again on use until that time window passed"
[17:30:54] and the default is 2s
[17:31:07] we access a shitload of objects every 2s, which could keep LRU churn pretty high
[17:31:57] if it's only happening sporadically now (the LRU_Fail), and if that's a possible mitigation, even setting it to e.g. 20s would make an order of magnitude diff and should make it go away
[17:32:34] I think it can be dynamically set from varnishadm as a test anyways
[17:33:15] and it's an easier experiment than rolling through wiping them all back to persist
[17:33:23] yep, let's try with 20s lru_interval
[17:33:50] ok, will salt to all cache_upload backends
[17:34:39] done
[17:34:58] I guess now we wait for another LRU_Fail spike
[17:35:23] sounds good. I've gotta go rest a bit, it's been a long day :)
[17:35:29] next one we get to see (if there is one), I'd really like to try to inverse-match on what we're observing there: are there any miss->fetch on the LRU_Fail-ing backend that don't LRU_Fail?
[17:35:35] ema: cya :)
[17:35:38] when are you traveling?
[17:36:25] I'll be leaving for the airport about 24h from now, ish
[17:36:29] maybe slightly earlier than that
[17:37:07] alright! safe trip then :)
[17:37:12] thanks!
[17:38:24] so I recall seeing in some other related thread once, some advice about replacing -sfile with a giant -smalloc and sufficient swapspace
[17:38:35] we'd have to turn swapping back on in general heh, and it sounds ugly
[17:39:46] another thing, even though I view this whole -sfile behavior as buggy and stupid, if they're making workload/space-appropriateness arguments...
[17:40:21] upload only gets ~70% frontend hit, which means a lot more reqs hit that first backend layer than should be reasonable
[17:40:43] putting in place the 2-hit-wonder and/or lower frontend size limits might up that 70% dramatically, thus reduce request-rate to -sfile backend dramatically, etc...
[17:41:09] it's something we want to explore anyways, but maybe it also helps this situation
[17:44:41] if we end up going back to persistent just because file is awful in unique ways, we're kind of in an unpalatable dead-end situation there in the long term
[17:45:08] but that's just further argument to pursue ATS as a replacement for varnish backends while leaving v4+smalloc in the front :P
[17:46:32] I wonder if there's some crazy linux hack we can do with the SSDs to use -smalloc on them
[17:47:13] e.g. some way to map those SSDs' space into memory at some fixed address space, and make a small patch to varnish that makes -smalloc use that address space exclusively, or something.
[17:48:18] well, something sorta like that but different. smalloc really does just use "malloc", not mmap. it's not really address-space-aware at all in its default form.
[17:48:24] but still
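The dynamic lru_interval experiment mentioned above is just a runtime parameter change on each backend varnishd; roughly this per host (a sketch; it takes effect immediately but does not survive a restart, which is why it later gets puppetized):

```bash
# Bump the LRU re-ordering window from the 2s default; nuke_limit is the
# other candidate knob mentioned above (it gets raised later in this log):
varnishadm param.set lru_interval 20
varnishadm param.show lru_interval
```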
[18:00:52] 10Traffic, 06Operations: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2629774 (10BBlack) My current thinking on this at the moment is that it's probably not worth the more-complex prefhacks. We should probably stall on seeing a little more broad adoption...
[18:35:50] so far the theory of lru_interval=20 as a workaround is holding, but it's too early to really call it conclusive
[19:18:55] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630146 (10yuvipanda) I've moved over tools-static and tools, and they're good now. Need to do novaproxy next.
[19:27:29] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630182 (10yuvipanda) I had to do this for the following set of hosts on tools: 1. tools-proxy-* 2. tools-static-* I've done these...
[20:00:14] 3 hours straight so far of no 503 issues
[20:00:50] if the lru_interval thing does work, we might want to experiment to ensure no adverse fallout at some even-higher setting closer to a minute, and then puppetize that (to make sure we leave more headroom and never see this again)
[20:55:53] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630676 (10yuvipanda) ok, done on novaproxy-01 and -02 as well! I've also documened the tools ssl certs in https://wikitech.wikime...
[21:04:17] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630718 (10yuvipanda) 05Open>03Resolved
[22:05:27] ema: so, another 503 plateau, but it self-resolved after hammering away with a presumable LRU_Fail thing for a while
[22:05:51] and a report from a user of a CL:0 image during the window, but it's gone now
[22:06:39] we had a bit over 3 hours of peace before that one, so I'm going to try upping lru_interval further before resorting to persist
[22:08:35] lru_interval = 73 and nuke_limit = 1000
[22:09:07] for anyone else following along, the way to check all caches to confirm such a case is:
[22:09:23] salt -v -t 5 -b 100 -G cluster:cache_upload cmd.run 'curl -v https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Fossilised_frog.jpg/800px-Fossilised_frog.jpg --resolve upload.wikimedia.org:443:127.0.01 2>&1|grep -i content-length:'
[22:09:48] ^ fetches the actual image (with GET not HEAD) on every cache from itself and prints out the CL
[22:18:49] heh, new one already starting
[22:18:50] ...
[22:21:26] I can see cp2017 doing it, but I'm not seeing LRU_Fail in log there
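To watch a single suspect backend for LRU_Fail (or LRU_Cand) events as they happen, rather than grepping whole request groups, something like this works (a sketch; ExpKill records come from the expiry thread, hence the raw grouping):

```bash
# Stream only the expiry-kill records and filter for the failure cases:
varnishlog -g raw -i ExpKill | grep -e LRU_Fail -e LRU_Cand
```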
[22:28:47] tgr: does it work with GET instead of HEAD?
[22:29:32] the known correlated issue here started back up again at ~22:12 UTC (as in 16 mins ago), after being dormant for an hour
[22:29:47] and not long after I showed up here and started talking about it
[22:29:59] which makes me suspicious that someone's test requests actually trigger the broader problem
[22:30:23] bblack: same result with GET
[22:30:45] tgr: you get this every time you test?
[22:30:52] yes
[22:30:54] (also, can you get the full header output?)
[22:31:16] I'm repeatedly testing the same fossilised frog URL against every cache we have, and always getting the correct non-zero CL via GET
[22:31:43] https://phabricator.wikimedia.org/P4031
[22:32:16] the other two DCs work fine; esams just hangs for me
[22:32:24] ah not the frog, the flag
[22:32:41] the flag also doesn't repro for me, though...
[22:32:44] I tried both images a couple times
[22:33:12] machines within a frontend varnish cluster are IP-hashed, right?
[22:33:22] yes, but I'm querying them all directly
[22:33:50] "esams just hangs for me" seems way worse, btw
[22:34:06] I can repro with the same command from tgr's paste. I added a comment with the output (ipv6 for me)
[22:35:45] I can too, from the outside
[22:35:48] esams is https://phabricator.wikimedia.org/P4032 ; I might be doing something wrong, not familiar with lb URLs
[22:35:49] why not from within?
[22:36:30] tgr: I'm guessing that's an unrelated problem. can you even ping e.g. text-lb.esams.wikimedia.org ?
[22:37:06] oh, my failure to repro is due to a typo, I found it now :(
[22:37:11] I can't
[22:40:57] I'm going to reapply a VCL workaround that's not ideal for now
[22:41:05] the reason it became easy to reproduce is it's a cache hit
[22:41:30] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2631114 (10Tgr)
[22:42:34] jasper@tarkus ~/Source/eos-knowledge-lib $ curl -I https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Fossilised_frog.jpg/800px-Fossilised_frog.jpg
[22:42:34] HTTP/1.1 200 OK
[22:42:34] Date: Mon, 12 Sep 2016 22:41:52 GMT
[22:42:34] Content-Type: image/jpeg
[22:42:34] Content-Length: 0
[22:42:37] bblack, ^
[22:43:07] Jasper: I think the problem is intermittent, but it's possible for it to get cached and become not-so-intermittent for a given URL
[22:43:24] but I'm putting a workaround back in place (which is unideal for other reasons) that will at least prevent them from caching
[22:43:47] and then after that I can ban out the existing objects
[22:44:04] this is all related to Varnish 4 bugs, basically
[22:44:19] but there's a couple of different bug-investigation threads going and they're pretty confusing so far :(
[22:47:32] it should be "fixed" now
[22:47:53] (in that CL:0 is probably possible rarely, but is no longer cached or cacheable again)
[22:49:12] the intermittent CL:0 responses tend to only happen when we're also concurrently showing a higher-than-normal (but still small) rate of disturbing 503 responses. those come and go in small chunks that may last an hour and are easy to track in graphing. and at least sometimes are related to storage LRU eviction failure.
[22:50:06] I can see one just starting up again, which still makes me wonder if test queries are somehow a triggering cause
[22:53:30] -- 839991492 ExpKill b LRU_Fail
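The "ban out the existing objects" step, for a URL known to have a cached CL:0 copy, is a one-liner per cache host; something along these lines (a sketch; the broader cleanup here ends up being the storage wipe described below, so a URL-pattern ban like this is only a stopgap):

```bash
# Invalidate any cached variants of the affected thumbnail path:
varnishadm "ban req.url ~ ^/wikipedia/commons/thumb/1/17/Fossilised_frog"
```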
[22:56:27] bblack, confirmed that the CL:0 is gone on my end
[22:56:38] bblack, is there a Phabricator ticket I can subscribe to for this?
[22:56:40] ema: even with the more-increased lru_interval and nuke_limit, it's still happening
[22:57:17] Jasper: https://phabricator.wikimedia.org/T144257 for reports of CL:0 I think
[22:57:40] the deeper problem, it's just too confusing and the experimentation and investigation is too fluid to bother with phab at this point, basically
[22:58:49] ema: I'm going to revert puppetization back to deprecated_persistent and begin restarting some backends at this point, starting with eqiad and working outwards (since hitting this in a deeper layer has a more-pronounced effect)
[23:00:14] ema: https://gerrit.wikimedia.org/r/#/c/310161/1
[23:02:31] bblack, OK, I guess I'll just watch status updates.
[23:02:36] salting out that change now, after it's everywhere I'll script up backend-depool->stop->wipe->start->repool stuff and start hitting eqiad
[23:02:44] bblack, I'll let you know if I see this again in the future
[23:03:15] Jasper: sorry, we've been debugging this for a week or two now and it just gets to the point where documenting the fix attempts fully is cumbersome while doing everything in realtime
[23:03:39] Jasper: but thanks for reporting :)
[23:04:00] bblack, understood.
[23:04:33] bblack, any clues about the application/x-www-urlencoded issue as well?
[23:06:21] bblack, I filed it as https://phabricator.wikimedia.org/T123806 -- might be completely unrelated
[23:09:24] Jasper: no idea at the moment, but it definitely predates the current issues
[23:11:06] OK.
[23:33:35] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2631303 (10BBlack) To recap the current state of affairs and recent investigation/experimentation: 1. We finished upgrading the remaining DCs to Varnish4 earlier today, a...
[23:33:54] Jasper: recap in ticket update above, for the deeper bits
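For completeness, the per-host backend-depool->stop->wipe->start->repool sequence described above would look roughly like this, reusing the confctl copypasta from earlier in the log (a sketch; the actual storage-wipe step depends on where the file/persistent storage lives on these hosts and is deliberately left out):

```bash
# Depool just the backend service, take varnish down, wipe its storage,
# then bring it back and repool:
confctl --quiet select name=$(hostname -f),service='varnish-be.*' set/pooled=no
service varnish stop
# ... remove/reinitialize the backend storage files here (site-specific) ...
service varnish start
confctl --quiet select name=$(hostname -f),service='varnish-be.*' set/pooled=yes
```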