[00:50:32] <Krenair>	 ema, ^
[00:53:34] <Krenair>	 Jasper, hey
[00:54:13] <Krenair>	 I can't reproduce this the way I could the last two times this occurred
[00:55:42] <Krenair>	 I also can't do it by connecting directly to either of the two hosts listed as hits in that X-Cache line
[00:56:16] <Krenair>	 Jasper, got any other examples?
[01:24:43] <Jasper>	 Krenair, I think it might be fixed now.
[01:24:52] <Jasper>	 It seemed to hit a large portion of the images we tried before.
[08:31:09] <ema>	 we did get a 503 plateau in ulsfo between 21:06 and 22:29
[08:32:40] <ema>	 interesting that 4007's backend was involved in Jasper's repro
[08:36:22] <ema>	 Jasper: it would be great to have more examples like the one you posted here, possibly with the exact date and time 
[08:44:20] <ema>	 Krenair: we now explicitly avoid caching those buggy 200 responses with Content-Length 0. That's probably why this time it was harder to get a repro
[09:10:49] <ema>	 the host responsible for the 503 plateau, and likely for the empty 200 responses, is cp4007
[09:11:05] <ema>	 unsurprisingly, given that it's the one with highest uptime 
[09:11:08] <ema>	 http://bit.ly/2cLafoQ
[09:29:08] <wikibugs>	 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2639881 (10ema) @JasperStPierre reported another occurrence of this issue on IRC (2016-09-14 22:18 UTC):    https://upload.wikimedia.org/wikipedia/commons/thumb/...
[10:09:52] <wikibugs>	 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639951 (10Gehel)
[10:11:07] <wikibugs>	 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639971 (10Gehel)
[10:19:23] <_joe_>	 ema: I'll run pcc to test for sanity, but can I get your green light on https://gerrit.wikimedia.org/r/#/c/310774/ ?
[10:19:34] <_joe_>	 i'll merge when it is tested to be a noop
[10:22:08] <ema>	 _joe_: LGTM, feel free to merge after confirming it's a noop
[10:24:41] <_joe_>	 ema: do you have a list of hosts you test with pcc when such a change happens?
[10:27:36] <ema>	 _joe_: to test all possible combos of DCs/roles:
[10:27:42] <ema>	 misc-esams cp3007.esams.wmnet
[10:27:43] <ema>	 misc-eqiad cp1045.eqiad.wmnet
[10:27:43] <ema>	 misc-ulsfo cp4004.ulsfo.wmnet
[10:27:43] <ema>	 misc-codfw cp2025.codfw.wmnet
[10:27:43] <ema>	 maps-esams cp3005.esams.wmnet
[10:27:45] <ema>	 maps-eqiad cp1046.eqiad.wmnet
[10:27:48] <ema>	 maps-ulsfo cp4011.ulsfo.wmnet
[10:27:50] <ema>	 maps-codfw cp2015.codfw.wmnet
[10:27:52] <ema>	 text-esams cp3042.esams.wmnet
[10:27:55] <ema>	 text-eqiad cp1055.eqiad.wmnet
[10:27:57] <ema>	 text-ulsfo cp4016.ulsfo.wmnet
[10:28:00] <ema>	 text-codfw cp2016.codfw.wmnet
[10:28:03] <ema>	 upload-esams cp3039.esams.wmnet
[10:28:05] <ema>	 upload-eqiad cp1049.eqiad.wmnet
[10:28:08] <ema>	 upload-ulsfo cp4013.ulsfo.wmnet
[10:28:10] <ema>	 upload-codfw cp2026.codfw.wmnet
[10:28:37] <volans>	 we should have an easy way to get those ;)
[10:29:09] <ema>	 volans: ema@neodymium.eqiad.wmnet:clusterssh
[10:29:47] <_joe_>	 cool
[10:30:01] <ema>	 would be great to get them through some sort of http API of course 
[10:30:02] <_joe_>	 volans: see cluster_nodes() function in puppet :P
[10:30:15] <_joe_>	 we can from puppetdb
[10:30:20] <_joe_>	 once it's operating correctly
[10:30:25] <ema>	 wonderful
[10:32:44] <volans>	 I meant also a list of test hosts, not only the whole clusters
[10:32:58] <volans>	 or just one random from each is enough?
[10:33:11] <_joe_>	 volans: that's typically enough, but we'll get that too
[10:33:16] <_joe_>	 I have ideas(TM)
[10:33:26] <ema>	 :)
[10:33:34] <volans>	 puppet side or DNS side?
[10:33:34] <_joe_>	 basically, create a fact that includes: site, cluster, if the server is a canary or not
[10:33:51] <_joe_>	 and be able to report it to puppetdb and this to select nodes via that
[10:39:49] <_joe_>	 ema: https://puppet-compiler.wmflabs.org/4086/
[10:41:28] <_joe_>	 I'm going to merge that change
[10:41:34] <ema>	 _joe_: go for it
[10:42:42] <_joe_>	 ema: done, should I run puppet around?
[10:44:41] <ema>	 _joe_: any reason for doing that instead of waiting for puppet to run on its own?
[10:45:01] <_joe_>	 ema: not really
[10:45:26] <ema>	 let's wait then
[12:35:39] <elukey>	 ema: o/ anything happened ~11 UTC in cache upload?
[12:36:43] <elukey>	 might be the hour after or before
[12:37:00] <elukey>	 our data consistency checks complain once every day for small loss 
[12:37:07] <elukey>	 just wanted to double check
[12:37:20] <elukey>	 (loss == dt:'-')
[12:37:37] <elukey>	 these might be as well shm log timeouts
[12:37:41] <elukey>	 or overflows
[12:37:53] <elukey>	 but with the value that I've set I find it hard to believe -.-
[12:46:12] <ema>	 elukey: indeed, cp4007 got hit by T145661 between 11 UTC and 11:40ish
[12:46:13] <stashbot>	 T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661
[12:46:51] <ema>	 so the state of cp4007 is interesting now
[12:47:39] <ema>	 varnish-be has been running for 6 days and 23 hours  
[12:48:48] <ema>	 had issues between 21:06 and 22:29 yesterday
[12:49:01] <ema>	 and again between 11 and 11:40 today
[12:49:23] <ema>	 the problem seems to "fix" itself after a while, which is interesting
[12:53:35] <elukey>	 yesterday we saw some data issue around 10ish UTC
[12:53:40] <elukey>	 mmmmm
[12:53:52] <ema>	 10:14, that was cp4005
[12:54:22] <elukey>	 all right correlated, vk looks good then
[12:54:23] <elukey>	 :)
[12:54:25] <elukey>	 thanks!
[12:54:33] <ema>	 it should have lasted less yesterday though, I was online when it happened and I depooled the machine after a short while
[12:57:58] <ema>	 I've logged lots of 200s with CL:0, but unfortunately there are 416s among those (still haven't managed to convince varnishlog to skip them) and our old friend /wikipedia/id/8/8f/Tingkatan.jpg
[13:04:28] <elukey>	 ah but this time it was only a warning, something like "hey, a percentage less than 5% of your data looks weird, double check!"
[13:04:51] <ema>	 but this time, after a bit of digging, we have a nice varnishlog of a CL:0 200
[13:04:55] <ema>	 https://phabricator.wikimedia.org/P4053
[13:05:08] <ema>	 -- FetchError straight insufficient bytes
[13:06:29] <ema>	 10:54 UTC, right at the very beginning of the latest 503 plateau, and indeed the backend serving this was cp4007
[13:08:32] <wikibugs>	 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2640411 (10ema) The varnish error triggering this seems to be:  ```-- FetchError straight insufficient bytes```    Full varnishlog here: https://phabricator.wiki...
[13:19:35] <wikibugs>	 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2640435 (10ema) cp4007 has been affected by this issue yesterday 2016-09-14 between ~21:05 and ~22:30 and again today 2016-09-15 between ~10:55 and ~11:46.  {F44...
[13:42:33] <ema>	 I guess the stap probe might be as simple as:
[13:42:34] <ema>	 probe process("/usr/sbin/varnishd").statement("EXP_NukeOne@cache/cache_expire.c:369") { printf("LRU_Fail n_objcore=%u flags=%u\n", $lru->n_objcore, $lru->flags);
[13:42:37] <ema>	 }
[13:42:49] <ema>	 with slightly improved indentation :)
[13:45:25] <ema>	 running it now on cp4007
[13:49:45] <ema>	 the other thing we could probe is 'straight insufficient bytes' for the CL:0 200s
[13:50:43] <ema>	 however, the message comes from the frontends fetching from affected backends, so what we really should find out is why sick backends return a 200 with less bytes than CL
[13:54:27] <ema>	 the other question is: why do frontends return 200 if it's a FetchError?
[14:03:14] <ema>	 now I only hope we don't need to wait 12h for some debugging output...
[14:03:22] <ema>	 although that's entirely possible
[14:18:59] <ema>	 http://varnish.org/docs/5.0/whats-new/changes-5.0.html#whatsnew-changes-5-0
[14:19:28] <ema>	 v5 is out
[14:20:04] <volans>	 ema: so start upgrading again :-P
[14:20:24] <ema>	 what can go wrong
[15:48:50] <bblack>	 ema: the reason they return 200 on a fetcherror is because we're streaming; it's already sent the 200+headers and part of the content when the backend xfer fails
[15:49:08] <ema>	 right!
[15:50:04] <ema>	 ok that makes sense then
[15:51:00] <bblack>	 there's a ticket somewhere about whether/how to handle that better.  there was a suggestion of sending a bad chunk so that the client also sees it as a failed transfer.  or RST.
[15:51:19] <bblack>	 (varnish ticket I mean)
[15:52:00] <bblack>	 so.... the most-recent one, that was with the do_stream=true stuff out of the picture, right?
[15:52:13] <ema>	 yes
[15:53:06] <bblack>	 it might be worth focusing hard on the insufficient bytes xfer, on the backend side
[15:53:25] <bblack>	 the LRU_Fail looks like a bug rather than tuning (stap may confirm)
[15:53:55] <bblack>	 maybe there's an unrelated bug which at first glance breaks one transfer, but then has the fallout of failing at LRU management, leading to the LRU_Fails
[15:54:15] <bblack>	 e.g. some resources is acquired and then not released-on-failure, etc...
[15:54:28] <ema>	 it is possible, the broken transfer I caught happened a few seconds before the 503 spike
[15:54:36] <bblack>	 if it's really right at the beginning, yeah
[15:55:20] <bblack>	 the LRU flags are interesting too.  it could be very telling if they're nonzero in stap output
[15:56:23] <bblack>	 (they should always be zero though, unless there's a really bad bug)
[15:57:31] <ema>	 https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/http1/cache_http1_vfp.c#L168
[15:57:44] <ema>	 this 'straight insufficient bytes' might be in the wrong place perhaps
[15:58:17] <ema>	 it's in function v1f_pull_chunked instead of v1f_pull_straight, intuitively that doesn't look right
[16:01:18] <ema>	 although the CL:0 200 transfer we logged isn't chunked so whatever
[16:01:33] <bblack>	 also, tuneables fetch_chunksize and fetch_maxchunksize could affer some of the outer callers of NukeOne, but it's a tenuous connection
[16:02:06] <bblack>	 are you sure it's not chunked?
[16:02:27] <ema>	 I guess we should see some 'chunked stream' in varnishlog if it was?
[16:02:35] <bblack>	 I don't know for sure
[16:02:55] <bblack>	 but streaming is the default, and I wouldn't be surprised if HTTP/1.1 + streaming was always chunked
[16:03:28] <bblack>	 I believe it's possible for varnish to send an initial and correct CL at the start of a chunked, too, even though CL+chunked isn't the norm elsewhere.
[16:03:32] <bblack>	 but don't quote me on that, either
[16:04:43] <ema>	 would be nice to get two distinct error messages for straight/chunked insufficient bytes but alright, let's see what we can log with systemtap in both cases 
[16:07:30] <wikibugs>	 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2641055 (10Yurik) I suspect that both databases / tilesets are fairly similar. Then again, we had some job scheduling issue recently, so maybe we s...
[16:13:34] <ema>	 heh I was trying to find the varnishlog for that failed transfer but it was a 200 as far as cp4007's backend is concerned, so no log
[16:20:49] <bblack>	 bleh I'm running late to get into the office.  good luck :)
[16:20:58] <ema>	 cheers!
[16:37:20] <Jasper>	 ema, we spent a bit far too long trying to repro it -- it was intermittent.
[16:37:43] <Jasper>	 ema, the only other clue I can give you is that sometimes it worked fine, sometimes it return CL:0, and sometimes the Content-Length field was omitted *entirely*
[16:37:45] <Jasper>	 Even though Content-Type was there
[16:46:16] <wikibugs>	 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2641321 (10ema) I found something quite interesting while staring at ganglia. Look at cp4005's `fetch_304` before the ramp-up in `fetch_failed`, which is when 50...
[17:16:45] <ema>	 bblack: funny correlation of the day https://phabricator.wikimedia.org/T145661#2641321
[17:19:42] <ema>	 I'll leave the stap probe running on cp4007, logs in ~ema/lrufail.log
[17:21:35] <ema>	 Jasper: thanks, please do report those issues again if you get a repro including curl -v output and the exact time
[17:21:45] <Jasper>	 ema, will do.
[17:22:22] <ema>	 see ya o/
[17:22:28] <Jasper>	 Thanks!
[17:30:28] <legoktm>	 ema: bblack: I'm seeing content-length 0 errors, where/how should I report it?
[17:32:08] <legoktm>	 and...now its working
[17:32:27] <legoktm>	 https://paste.fedoraproject.org/428546/60740147/raw/ fwiw
[17:37:42] <MaxSem>	 https://upload.wikimedia.org/wikipedia/commons/thumb/8/8d/A04_1705-deriv-former-member-s.png/600px-A04_1705-deriv-former-member-s.png
[17:38:49] <ema>	 looks like they start happening every time I go afk :P
[17:39:06] <MaxSem>	 then don't do that! :P
[17:41:37] <_joe_>	 MaxSem: I see the same btw
[17:42:13] <ema>	 they're happening in all DCs
[17:42:33] <_joe_>	 sigh
[17:42:36] <ema>	 and they're not LRU_Fail
[17:42:40] <ema>	 even in eqiad
[17:42:40] <_joe_>	 so probably an eqiad node?
[17:43:06] <_joe_>	 new images don't show that
[17:43:41] <_joe_>	 ema: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors is pretty alarming in fact
[17:43:44] <ema>	 trying to depool and restart varnish-fe on cp1049 to see if they stop there
[17:43:47] <ema>	 _joe_: yeah
[17:43:49] <_joe_>	 anything I can help with?
[17:44:57] <ema>	 that worked on cp1049
[17:45:46] <ema>	 so definitely not a problem with swift 
[17:46:08] <ema>	 _joe_: if it keeps going like this I'm afraid a rolling restart of varnish-be is the only solution
[17:46:16] <ema>	 bblack: around?
[17:46:45] <ema>	 _joe_: they seem to be going down now on https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
[17:46:48] <_joe_>	 ema: seems to be going down
[17:46:54] <_joe_>	 yes :)
[17:47:14] <ema>	 this is very strange, this time there were no LRU_Fail 
[17:48:01] <ema>	 I've logged a few on cp1049:/home/ema/503.log before restarting varnish there
[17:51:02] <ema>	 plus all ulsfo machines are logging 503s in ~ema/2016-09-13-backend-503.log
[17:56:11] <ema>	 ok yeah the problem must be eqiad, almost all hosts there have a pretty high backend_nlru
[17:59:49] <ema>	 I'm logging all 503s on eqiad nodes, will run the stap probe if they start LRU_Failing
[18:06:50] <ema>	 it's happening on cp1074
[18:07:22] <ema>	 and it stopped already
[18:09:13] <ema>	 next in line is probably going to be cp1099 if MAIN.n_lru_nuked is actually a valid indicator (and it seems so)
[18:09:59] <ema>	 but again, no LRU_Fail
[18:10:22] <ema>	 also this is the first time we're getting 503s in a direct DC I think
[18:11:59] <_joe_>	 sigh
[18:12:15] <ema>	 running the stap probe on all eqiad hosts although it's probably not going to catch anything 
[18:13:12] <ema>	 _joe_: a certain way to avoid issues would be restarting varnish backends after a certain MAIN.n_lru_nuked value, but that also means we're not gonna be able to debug this 
[18:13:37] <paravoid>	 do you want me to fetch brandon? :)
[18:14:11] <ema>	 paravoid: there is no imminent danger now, but please stick around if you can :)
[18:15:09] <_joe_>	 yeah I'm going to detach from the computer now, I am incredibly tired (the whole week plus sleeping less than 4 hours last night) and slightly feverish 
[18:15:23] <_joe_>	 sorry :(
[18:15:34] <ema>	 thanks _joe_ have a good evening :)
[18:23:12] <ema>	 yes! we got logs from lrufail
[18:23:17] <ema>	 on cp1074
[18:23:23] <ema>	 LRU_Fail n_objcore=14274950 flags=0
[18:23:38] <ema>	 I'm going to depool it now
[18:24:44] <ema>	 restarting varnish-be
[18:27:17] <ema>	 ok this is getting very interesting, although the stap probe caught a few LRU_Fail they haven't been logged by varnishlog
[18:27:34] <ema>	 s/a few/9758/
[18:28:05] <ema>	 varnishlog running with: varnishlog -q RespStatus ~ 503
[18:28:33] <ema>	 paravoid: is the offer of fetching Brandon still valid?
[18:30:20] <paravoid>	 yes :)
[18:30:30] <paravoid>	 I already pinged him before
[18:30:40] <ema>	 great, thanks
[18:30:48] <paravoid>	 what do you need?
[18:31:24] <ema>	 a suggestion for how to proceed now, we've got eqiad-upload in a not-so-stable situation
[18:31:53] <ema>	 I've managed to log some LRU_Fails with the systemtap probe, but none with varnishlog which is really odd
[18:32:37] <ema>	 a proably safe approach at this point would be to slowly restart eqiad upload backends which haven't been restarted yet
[18:32:51] <ema>	 but I'd like to make sure we don't miss out on any debugging opportunity
[18:34:17] <bblack>	 hey
[18:34:22] <ema>	 hey there :)
[18:34:31] <bblack>	 so yeah, I would start restarting eqiad backends
[18:35:01] <bblack>	 when I last did that (in and out of persist), I went pretty aggressive and it worked out ok
[18:35:13] <bblack>	 let me check cmdline history and I can quantify aggressiveness
[18:35:18] <ema>	 I'd wait quite some time between the restarts so that next time they don't all crash together
[18:36:02] <bblack>	 salt -v -t 5 -b 1 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'confctl --quiet select name=`hostname -f`,service='varnish-be' set/pooled=no; confctl --quiet select name=`hostname -f`,service='varnish-be-rand' set/pooled=no; sleep 15; service varnish stop; rm -f /srv/sd*/varnish*; service varnish start; sleep 10; pool; sleep 91'
[18:36:09] <bblack>	 ^ that was how aggressive, and we survived it ok
[18:36:55] <bblack>	 ema: ideally yes, but if it's getting out of control, you can restart them all quickly now, and then do a slower stagger afterwards to space out the next failures
[18:37:05] <ema>	 sounds good
[18:37:15] <bblack>	 codfw won't be far behind most likely
[18:37:20] <ema>	 any logging you think might be worthwhile?
[18:37:34] <bblack>	 the stap stuff is going to be the most useful thing at this point
[18:37:35] <ema>	 I'm varnishlogging all 503s and the stap probe is running
[18:37:40] <bblack>	 oh you have that
[18:38:12] <bblack>	 ok so, the 14M n_objcore on cp1074.  do we have an idea how that compared to the normal varnishstat idea of #objects?
[18:38:38] <bblack>	 (is it close to looking like that's all objects? or is it significant smaller?)
[18:39:01] <bblack>	 another thing you could push as a VCL change ahead of these eqiad (and then codfw) restarts with fresh storage
[18:39:13] <ema>	 well on cp1074 the number of objects is now really small because I restarted it
[18:39:40] <bblack>	 is drop the upload TTL cap from 7d to say 3d, on the theory that maybe having them TTL expire faster than storage fills may help this scenario.
[18:40:05] <bblack>	 if 3d is even short enough
[18:40:13] <_joe_>	 bblack: confctl --quiet select name=`hostname -f`,service=varnish-be.*' should work on both pools
[18:40:19] <_joe_>	 just fyi
[18:40:25] <bblack>	 _joe_: yeah but it asks y/n
[18:40:30] <_joe_>	 ahah right
[18:40:31] <ema>	 we do have ~7M backend objects on cp1099 at the moment 
[18:40:39] <_joe_>	 the --bblack switch :)
[18:41:03] <bblack>	 ema: does ganglia track that for 1074 before restart?
[18:41:22] <ema>	 bblack: yes it should track all varnishstat metrics
[18:42:01] <bblack>	 it could be interesting data for analysis of the code later, to know how total storage objects and that lru n_objcore compared at LRU_Fail time
[18:42:22] <ema>	 what I really can't understand is why varnishlog didn't log any LRU_Fail and the stap probe did
[18:42:48] <bblack>	 varnishlog only does that with certain flags
[18:43:04] <bblack>	 I think "-v" and possibly also "-g request" are required?
[18:43:11] <ema>	 oh shit I forgot -g request
[18:43:49] <bblack>	 in any case, we've got two problems here: one is debugging the root cause, the other is keeping prod afloat
[18:44:00] <ema>	 yes
[18:44:17] <bblack>	 the basic plan for the latter is keep restart LRU_Fail backends, that's about all we have.  but if staggered the rate shouldn't be awful, just hard to keep people awake at the right times
[18:44:42] <ema>	 I'd start slowly restarting the backends except perhaps one machine where we can debug further
[18:44:55] <bblack>	 we could also automate the restarts
[18:45:21] <bblack>	 e.g. have puppet auto-swizzle the hour/min args so they're random, but basically cron varnish backend restart everywhere once a day.
[18:45:46] <bblack>	 then if we want to pick one to debug, we can disable puppet and the cronjob
[18:46:09] <bblack>	 but maybe ttl_cap=3d is enough, too
[18:47:52] <bblack>	 (I don't know if faster natural expiry helps this problem or not, but it's possible)
[18:48:01] <ema>	 +1 for ttl_cap=3d, but it's probably wise to restart eqiad backends now by hand I think 
[18:48:11] <bblack>	 yeah
[18:48:17] <ema>	 to avoid surprises while I'm asleep :)
[18:48:35] <bblack>	 codfw maybe too, but after eqiad and slower if they're not yet a problem
[18:49:10] <ema>	 I think with codfw we should be safe, so far MAIN.n_lru_nuked has been a pretty reliable indicator of when hosts will start crashing
[18:49:28] <ema>	 esams might happen first
[18:50:29] <bblack>	 ok
[18:51:32] <bblack>	 thinking more about the LRU_Fail code issues.... so we know the basic issue is it can't find 1/millions of objects that isn't busy, which is bullshit
[18:52:12] <bblack>	 probably a bug with reference counting, possibly the preceding different 503 transfer failure has a bug in error handling that leads it to screw up a refcnt was one idea
[18:52:41] <bblack>	 another thing to consider is how it self-resolves after a while.  why does it self-resolve ever?
[18:52:57] <bblack>	 there must be some kind of "maintenance" task/thread that works on the LRU sometimes....
[18:53:30] <ema>	 yeah
[18:53:32] <bblack>	 and maybe that's the whole nature of it.  something wakes up and refcnt++ the whole list doing some kind of ideal/periodic maintenance, and when it gets done doing whatever it's doing it drops the refcnts and the problem goes away
[18:53:54] <bblack>	 and maybe with a smaller storage and less traffic a temporary lockup of the LRU isn't bad, but it causes our LRU_Fail long window here
[18:54:15] <bblack>	 s/ideal/idle/ above heh
[18:57:03] <bblack>	 I've seen some hints about things like that... exp_mail_it and related...
[18:57:48] <bblack>	 "Post an objcore to the exp_thread's inbox."
[18:58:05] <ema>	 bblack: I'll start restarting eqiad backends in MAIN.n_lru_nuked order if you agree
[18:58:07] <bblack>	 is the description of exp_mail_it(), which implies there's an expiry management thread, which must operate on the LRU, blah blah...
[18:58:19] <bblack>	 ema: sounds ok to me, yeah
[19:01:11] <ema>	 bblack: also have you seen the weird correlation with fetch_304? 
[19:01:31] <ema>	 https://phabricator.wikimedia.org/T145661#2641321
[19:01:31] <bblack>	 exp_thread() too
[19:02:52] <bblack>	 fascinating....
[19:03:27] <bblack>	 so, we could prevent 304 responses on the backend
[19:04:12] <bblack>	 in theory, nothing has to support conditional req->resp.  e.g. if in the backends only, we stripped If-Modified-Since and similar on recv, they'd always be 200 responses instead of 304s and it's all legal if inefficient.
[19:04:48] <bblack>	 which might isolate the possibility that the 304 response code leaves LRU objects busy when it's done with them, or something like that (leaking refcnt on them)
[19:05:08] <bblack>	 but even then it's not a complete explanation for the downramp in 304s
[19:12:35] <ema>	 uh we don't have ganglia metrics for eqiad, the permission thing...
[19:12:39] <ema>	 chmod 644 /var/lib/varnish/*/*.vsm ; service ganglia-monitor restart
[19:19:03] <ema>	 I've just salted that on all cache_upload hosts 
[19:39:46] <bblack>	 I'm generally not watching IRC, but just text me if you need me to come back and poke at something
[19:40:56] <ema>	 bblack: thanks! I'll finish eqiad and do esams next
[21:05:31] <ema>	 so yeah today we learned it's not a varnish-be<->varnish-be only problem
[21:18:46] <Jasper>	 ema, haven't seen any more CL:0 requests today yet FYI.
[21:18:55] <Jasper>	 So good job whatever you fixed :)