[00:50:32] ema, ^
[00:53:34] Jasper, hey
[00:54:13] I can't reproduce this the way I could the last two times this occurred
[00:55:42] I also can't do it by connecting directly to either of the two hosts listed as hits in that X-Cache line
[00:56:16] Jasper, got any other examples?
[01:24:43] Krenair, I think it might be fixed now.
[01:24:52] It seemed to hit a large portion of the images we tried before.
[08:31:09] we did get a 503 plateau in ulsfo between 21:06 and 22:29
[08:32:40] interesting that 4007's backend was involved in Jasper's repro
[08:36:22] Jasper: it would be great to have more examples like the one you posted here, possibly with the exact date and time
[08:44:20] Krenair: we now explicitly avoid caching those buggy 200 responses with Content-Length 0. That's probably why this time it was harder to get a repro
[09:10:49] the host responsible for the 503 plateau, and likely for the empty 200 responses, is cp4007
[09:11:05] unsurprisingly, given that it's the one with the highest uptime
[09:11:08] http://bit.ly/2cLafoQ
[09:29:08] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2639881 (10ema) @JasperStPierre reported another occurrence of this issue on IRC (2016-09-14 22:18 UTC): https://upload.wikimedia.org/wikipedia/commons/thumb/...
[10:09:52] 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639951 (10Gehel)
[10:11:07] 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639971 (10Gehel)
[10:19:23] <_joe_> ema: I'll run pcc to test for sanity, but can I get your green light on https://gerrit.wikimedia.org/r/#/c/310774/ ?
[10:19:34] <_joe_> I'll merge when it is tested to be a noop
[10:22:08] _joe_: LGTM, feel free to merge after confirming it's a noop
[10:24:41] <_joe_> ema: do you have a list of hosts you test with pcc when such a change happens?
[10:27:36] _joe_: to test all possible combos of DCs/roles:
[10:27:42] misc-esams cp3007.esams.wmnet
[10:27:43] misc-eqiad cp1045.eqiad.wmnet
[10:27:43] misc-ulsfo cp4004.ulsfo.wmnet
[10:27:43] misc-codfw cp2025.codfw.wmnet
[10:27:43] maps-esams cp3005.esams.wmnet
[10:27:45] maps-eqiad cp1046.eqiad.wmnet
[10:27:48] maps-ulsfo cp4011.ulsfo.wmnet
[10:27:50] maps-codfw cp2015.codfw.wmnet
[10:27:52] text-esams cp3042.esams.wmnet
[10:27:55] text-eqiad cp1055.eqiad.wmnet
[10:27:57] text-ulsfo cp4016.ulsfo.wmnet
[10:28:00] text-codfw cp2016.codfw.wmnet
[10:28:03] upload-esams cp3039.esams.wmnet
[10:28:05] upload-eqiad cp1049.eqiad.wmnet
[10:28:08] upload-ulsfo cp4013.ulsfo.wmnet
[10:28:10] upload-codfw cp2026.codfw.wmnet
[10:28:37] we should have an easy way to get those ;)
[10:29:09] volans: ema@neodymium.eqiad.wmnet:clusterssh
[10:29:47] <_joe_> cool
[10:30:01] would be great to get them through some sort of HTTP API of course
[10:30:02] <_joe_> volans: see the cluster_nodes() function in puppet :P
[10:30:15] <_joe_> we can from puppetdb
[10:30:20] <_joe_> once it's operating correctly
[10:30:25] wonderful
[10:32:44] I also meant a list of test hosts, not only the whole clusters
[10:32:58] or is just one random host from each enough?
[10:33:11] <_joe_> volans: that's typically enough, but we'll get that too
[10:33:16] <_joe_> I have ideas(TM)
[10:33:26] :)
[10:33:34] puppet side or DNS side?
[10:33:34] <_joe_> basically, create a fact that includes: site, cluster, and whether the server is a canary or not
[10:33:51] <_joe_> and be able to report it to puppetdb, then use that to select nodes
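A sketch of the node selection _joe_ describes above: once facts for site, cluster and canary exist and are reported to PuppetDB, its v4 query API could return the canary host per cluster directly. The PuppetDB hostname, port and fact names here are assumptions, not the real setup.

```
# Hypothetical PuppetDB v4 query: list the canary nodes of cache_upload in eqiad.
curl -sG 'http://puppetdb.example.wmnet:8080/pdb/query/v4/nodes' \
  --data-urlencode 'query=["and",
                      ["=", ["fact", "cluster"], "cache_upload"],
                      ["=", ["fact", "site"],    "eqiad"],
                      ["=", ["fact", "canary"],  true]]' \
  | jq -r '.[].certname'
```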
[10:39:49] <_joe_> ema: https://puppet-compiler.wmflabs.org/4086/
[10:41:28] <_joe_> I'm going to merge that change
[10:41:34] _joe_: go for it
[10:42:42] <_joe_> ema: done, should I run puppet around?
[10:44:41] _joe_: any reason for doing that instead of waiting for puppet to run on its own?
[10:45:01] <_joe_> ema: not really
[10:45:26] let's wait then
[12:35:39] ema: o/ did anything happen ~11 UTC in cache upload?
[12:36:43] might be the hour after or before
[12:37:00] our data consistency checks complain once every day about a small loss
[12:37:07] just wanted to double check
[12:37:20] (loss == dt:'-')
[12:37:37] these might also be shm log timeouts
[12:37:41] or overflows
[12:37:53] but with the value that I've set I find it hard to believe -.-
[12:46:12] elukey: indeed, cp4007 got hit by T145661 between 11 UTC and 11:40ish
[12:46:13] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661
[12:46:51] so the state of cp4007 is interesting now
[12:47:39] varnish-be has been running for 6 days and 23 hours
[12:48:48] had issues between 21:06 and 22:29 yesterday
[12:49:01] and again between 11 and 11:40 today
[12:49:23] the problem seems to "fix" itself after a while, which is interesting
[12:53:35] yesterday we saw some data issue around 10ish UTC
[12:53:40] mmmmm
[12:53:52] 10:14, that was cp4005
[12:54:22] all right, correlated, vk looks good then
[12:54:23] :)
[12:54:25] thanks!
[12:54:33] it should have lasted less time yesterday though, I was online when it happened and I depooled the machine after a short while
[12:57:58] I've logged lots of 200s with CL:0, but unfortunately there are 416s among those (still haven't managed to convince varnishlog to skip them) and our old friend /wikipedia/id/8/8f/Tingkatan.jpg
[13:04:28] ah but this time it was only a warning, something like "hey, a percentage less than 5% of your data looks weird, double check!"
[13:04:51] but this time, after a bit of digging, we have a nice varnishlog of a CL:0 200
[13:04:55] https://phabricator.wikimedia.org/P4053
[13:05:08] -- FetchError straight insufficient bytes
[13:06:29] 10:54 UTC, right at the very beginning of the latest 503 plateau, and indeed the backend serving this was cp4007
[13:08:32] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2640411 (10ema) The varnish error triggering this seems to be: ```-- FetchError straight insufficient bytes``` Full varnishlog here: https://phabricator.wiki...
[13:19:35] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2640435 (10ema) cp4007 was affected by this issue yesterday, 2016-09-14, between ~21:05 and ~22:30, and again today, 2016-09-15, between ~10:55 and ~11:46. {F44...
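To catch more of the CL:0 200s mentioned above without the 416 noise, a VSL query along these lines might work. This is a sketch: the instance name ("frontend") and writing the capture to a file are assumptions about how it would be run on these hosts.

```
# Log complete transactions whose response is a 200 with Content-Length: 0.
# Drop "-n frontend" to watch the backend instance instead of the frontend.
varnishlog -n frontend -g request \
  -q 'RespStatus == 200 and RespHeader:Content-Length == 0' \
  -w /var/tmp/cl0-200s.vsl
```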
[13:42:33] I guess the stap probe might be as simple as:
[13:42:34] probe process("/usr/sbin/varnishd").statement("EXP_NukeOne@cache/cache_expire.c:369") { printf("LRU_Fail n_objcore=%u flags=%u\n", $lru->n_objcore, $lru->flags);
[13:42:37] }
[13:42:49] with slightly improved indentation :)
[13:45:25] running it now on cp4007
[13:49:45] the other thing we could probe is 'straight insufficient bytes' for the CL:0 200s
[13:50:43] however, the message comes from the frontends fetching from affected backends, so what we really should find out is why sick backends return a 200 with fewer bytes than CL
[13:54:27] the other question is: why do frontends return 200 if it's a FetchError?
[14:03:14] now I only hope we don't need to wait 12h for some debugging output...
[14:03:22] although that's entirely possible
[14:18:59] http://varnish.org/docs/5.0/whats-new/changes-5.0.html#whatsnew-changes-5-0
[14:19:28] v5 is out
[14:20:04] ema: so start upgrading again :-P
[14:20:24] what can go wrong
[15:48:50] ema: the reason they return 200 on a fetcherror is because we're streaming; it's already sent the 200+headers and part of the content when the backend xfer fails
[15:49:08] right!
[15:50:04] ok that makes sense then
[15:51:00] there's a ticket somewhere about whether/how to handle that better. there was a suggestion of sending a bad chunk so that the client also sees it as a failed transfer. or RST.
[15:51:19] (varnish ticket I mean)
[15:52:00] so.... the most recent one, that was with the do_stream=true stuff out of the picture, right?
[15:52:13] yes
[15:53:06] it might be worth focusing hard on the insufficient bytes xfer, on the backend side
[15:53:25] the LRU_Fail looks like a bug rather than tuning (stap may confirm)
[15:53:55] maybe there's an unrelated bug which at first glance breaks one transfer, but then has the fallout of failing at LRU management, leading to the LRU_Fails
[15:54:15] e.g. some resource is acquired and then not released on failure, etc...
[15:54:28] it is possible, the broken transfer I caught happened a few seconds before the 503 spike
[15:54:36] if it's really right at the beginning, yeah
[15:55:20] the LRU flags are interesting too. it could be very telling if they're nonzero in stap output
[15:56:23] (they should always be zero though, unless there's a really bad bug)
[15:57:31] https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/http1/cache_http1_vfp.c#L168
[15:57:44] this 'straight insufficient bytes' might be in the wrong place perhaps
[15:58:17] it's in function v1f_pull_chunked instead of v1f_pull_straight, intuitively that doesn't look right
[16:01:18] although the CL:0 200 transfer we logged isn't chunked, so whatever
[16:01:33] also, the tunables fetch_chunksize and fetch_maxchunksize could affect some of the outer callers of NukeOne, but it's a tenuous connection
[16:02:06] are you sure it's not chunked?
[16:02:27] I guess we should see some 'chunked stream' in varnishlog if it was?
[16:02:35] I don't know for sure
[16:02:55] but streaming is the default, and I wouldn't be surprised if HTTP/1.1 + streaming was always chunked
[16:03:28] I believe it's possible for varnish to send an initial and correct CL at the start of a chunked response, too, even though CL+chunked isn't the norm elsewhere.
[16:03:32] but don't quote me on that, either
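The probe from [13:42:34], reformatted and wrapped in the kind of stap invocation presumably used on cp4007. The exact command line is an assumption; the statement location (cache_expire.c:369) is tied to the specific varnishd build and requires its debug symbols to be installed.

```
# Print the LRU's objcore count and flags every time varnishd reaches the
# LRU_Fail branch in EXP_NukeOne(), writing output to the log file.
stap -v -o /home/ema/lrufail.log -e '
probe process("/usr/sbin/varnishd").statement("EXP_NukeOne@cache/cache_expire.c:369")
{
    printf("LRU_Fail n_objcore=%u flags=%u\n", $lru->n_objcore, $lru->flags);
}'
```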
[16:04:43] would be nice to get two distinct error messages for straight/chunked insufficient bytes, but alright, let's see what we can log with systemtap in both cases
[16:07:30] 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2641055 (10Yurik) I suspect that both databases / tilesets are fairly similar. Then again, we had some job scheduling issue recently, so maybe we s...
[16:13:34] heh, I was trying to find the varnishlog for that failed transfer, but it was a 200 as far as cp4007's backend is concerned, so no log
[16:20:49] bleh I'm running late to get into the office. good luck :)
[16:20:58] cheers!
[16:37:20] ema, we spent far too long trying to repro it -- it was intermittent.
[16:37:43] ema, the only other clue I can give you is that sometimes it worked fine, sometimes it returned CL:0, and sometimes the Content-Length field was omitted *entirely*
[16:37:45] Even though Content-Type was there
[16:46:16] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2641321 (10ema) I found something quite interesting while staring at ganglia. Look at cp4005's `fetch_304` before the ramp-up in `fetch_failed`, which is when 50...
[17:16:45] bblack: funny correlation of the day https://phabricator.wikimedia.org/T145661#2641321
[17:19:42] I'll leave the stap probe running on cp4007, logs in ~ema/lrufail.log
[17:21:35] Jasper: thanks, please do report those issues again if you get a repro, including curl -v output and the exact time
[17:21:45] ema, will do.
[17:22:22] see ya o/
[17:22:28] Thanks!
[17:30:28] ema: bblack: I'm seeing content-length 0 errors, where/how should I report them?
[17:32:08] and... now it's working
[17:32:27] https://paste.fedoraproject.org/428546/60740147/raw/ fwiw
[17:37:42] https://upload.wikimedia.org/wikipedia/commons/thumb/8/8d/A04_1705-deriv-former-member-s.png/600px-A04_1705-deriv-former-member-s.png
[17:38:49] looks like they start happening every time I go afk :P
[17:39:06] then don't do that! :P
[17:41:37] <_joe_> MaxSem: I see the same btw
[17:42:13] they're happening in all DCs
[17:42:33] <_joe_> sigh
[17:42:36] and they're not LRU_Fail
[17:42:40] even in eqiad
[17:42:40] <_joe_> so probably an eqiad node?
[17:43:06] <_joe_> new images don't show that
[17:43:41] <_joe_> ema: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors is pretty alarming in fact
[17:43:44] trying to depool and restart varnish-fe on cp1049 to see if they stop there
[17:43:47] _joe_: yeah
[17:43:49] <_joe_> anything I can help with?
[17:44:57] that worked on cp1049
[17:45:46] so definitely not a problem with swift
[17:46:08] _joe_: if it keeps going like this I'm afraid a rolling restart of varnish-be is the only solution
[17:46:16] bblack: around?
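The single-host mitigation applied to cp1049 above probably looked something like this. It's a sketch: `depool`/`pool` are the confctl wrapper scripts on the cache hosts, and the frontend service name is an assumption about this setup.

```
# Take the host out of rotation, bounce only the frontend varnish instance,
# then put it back in. Run on the affected cache host itself.
depool                              # remove the host from the load balancer via confctl
sleep 15                            # let in-flight client requests drain
service varnish-frontend restart    # frontend only; the backend keeps its storage
sleep 10
pool                                # repool once the frontend is back up
```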
[17:46:45] _joe_: they seem to be going down now on https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
[17:46:48] <_joe_> ema: seems to be going down
[17:46:54] <_joe_> yes :)
[17:47:14] this is very strange, this time there were no LRU_Fail
[17:48:01] I've logged a few on cp1049:/home/ema/503.log before restarting varnish there
[17:51:02] plus all ulsfo machines are logging 503s in ~ema/2016-09-13-backend-503.log
[17:56:11] ok yeah the problem must be eqiad, almost all hosts there have a pretty high backend_nlru
[17:59:49] I'm logging all 503s on eqiad nodes, will run the stap probe if they start LRU_Failing
[18:06:50] it's happening on cp1074
[18:07:22] and it stopped already
[18:09:13] next in line is probably going to be cp1099, if MAIN.n_lru_nuked is actually a valid indicator (and it seems so)
[18:09:59] but again, no LRU_Fail
[18:10:22] also this is the first time we're getting 503s in a direct DC I think
[18:11:59] <_joe_> sigh
[18:12:15] running the stap probe on all eqiad hosts although it's probably not going to catch anything
[18:13:12] _joe_: a sure way to avoid issues would be restarting varnish backends after a certain MAIN.n_lru_nuked value, but that also means we're not gonna be able to debug this
[18:13:37] do you want me to fetch brandon? :)
[18:14:11] paravoid: there is no imminent danger now, but please stick around if you can :)
[18:15:09] <_joe_> yeah I'm going to detach from the computer now, I am incredibly tired (the whole week plus sleeping less than 4 hours last night) and slightly feverish
[18:15:23] <_joe_> sorry :(
[18:15:34] thanks _joe_ have a good evening :)
[18:23:12] yes! we got logs from lrufail
[18:23:17] on cp1074
[18:23:23] LRU_Fail n_objcore=14274950 flags=0
[18:23:38] I'm going to depool it now
[18:24:44] restarting varnish-be
[18:27:17] ok this is getting very interesting, although the stap probe caught a few LRU_Fails, they haven't been logged by varnishlog
[18:27:34] s/a few/9758/
[18:28:05] varnishlog running with: varnishlog -q 'RespStatus ~ 503'
[18:28:33] paravoid: is the offer of fetching Brandon still valid?
[18:30:20] yes :)
[18:30:30] I already pinged him before
[18:30:40] great, thanks
[18:30:48] what do you need?
[18:31:24] a suggestion for how to proceed now: we've got eqiad-upload in a not-so-stable situation
[18:31:53] I've managed to log some LRU_Fails with the systemtap probe, but none with varnishlog, which is really odd
[18:32:37] a probably safe approach at this point would be to slowly restart the eqiad upload backends which haven't been restarted yet
[18:32:51] but I'd like to make sure we don't miss out on any debugging opportunity
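A quick way to rank hosts by MAIN.n_lru_nuked, the indicator being used above to guess which backend will fail next. This is a sketch: the salt targeting copies the grains used elsewhere in this log, and the counter name is the Varnish 4 one.

```
# Dump the LRU nuke counter of every eqiad upload host; the backend with the
# largest value is the most likely candidate for the next LRU_Fail episode.
salt -v -t 10 -C 'G@cluster:cache_upload and G@site:eqiad' \
  cmd.run 'varnishstat -1 -f MAIN.n_lru_nuked'
```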
[18:34:17] hey
[18:34:22] hey there :)
[18:34:31] so yeah, I would start restarting eqiad backends
[18:35:01] when I last did that (in and out of persist), I went pretty aggressive and it worked out ok
[18:35:13] let me check cmdline history and I can quantify aggressiveness
[18:35:18] I'd wait quite some time between the restarts so that next time they don't all crash together
[18:36:02] salt -v -t 5 -b 1 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'confctl --quiet select name=`hostname -f`,service='varnish-be' set/pooled=no; confctl --quiet select name=`hostname -f`,service='varnish-be-rand' set/pooled=no; sleep 15; service varnish stop; rm -f /srv/sd*/varnish*; service varnish start; sleep 10; pool; sleep 91'
[18:36:09] ^ that was how aggressive, and we survived it ok
[18:36:55] ema: ideally yes, but if it's getting out of control, you can restart them all quickly now, and then do a slower stagger afterwards to space out the next failures
[18:37:05] sounds good
[18:37:15] codfw won't be far behind most likely
[18:37:20] any logging you think might be worthwhile?
[18:37:34] the stap stuff is going to be the most useful thing at this point
[18:37:35] I'm varnishlogging all 503s and the stap probe is running
[18:37:40] oh you have that
[18:38:12] ok so, the 14M n_objcore on cp1074. do we have an idea how that compared to the normal varnishstat idea of #objects?
[18:38:38] (is it close to looking like that's all objects? or is it significantly smaller?)
[18:39:01] another thing you could push as a VCL change ahead of these eqiad (and then codfw) restarts with fresh storage
[18:39:13] well on cp1074 the number of objects is now really small because I restarted it
[18:39:40] is to drop the upload TTL cap from 7d to, say, 3d, on the theory that maybe having them TTL-expire faster than storage fills may help this scenario.
[18:40:05] if 3d is even short enough
[18:40:13] <_joe_> bblack: confctl --quiet select name=`hostname -f`,service='varnish-be.*' should work on both pools
[18:40:19] <_joe_> just fyi
[18:40:25] _joe_: yeah but it asks y/n
[18:40:30] <_joe_> ahah right
[18:40:31] we do have ~7M backend objects on cp1099 at the moment
[18:40:39] <_joe_> the --bblack switch :)
[18:41:03] ema: does ganglia track that for 1074 before restart?
[18:41:22] bblack: yes, it should track all varnishstat metrics
[18:42:01] it could be interesting data for analysis of the code later, to know how total storage objects and that lru n_objcore compared at LRU_Fail time
[18:42:22] what I really can't understand is why varnishlog didn't log any LRU_Fail and the stap probe did
[18:42:48] varnishlog only does that with certain flags
[18:43:04] I think "-v" and possibly also "-g request" are required?
[18:43:11] oh shit I forgot -g request
[18:43:49] in any case, we've got two problems here: one is debugging the root cause, the other is keeping prod afloat
[18:44:00] yes
[18:44:17] the basic plan for the latter is to keep restarting LRU_Fail backends, that's about all we have. But if staggered, the rate shouldn't be awful, just hard to keep people awake at the right times
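For the question at [18:38:12], the relevant numbers can be read straight from varnishstat on the affected backend. A sketch, with counter names as they exist in Varnish 4:

```
# Compare the LRU's n_objcore reported by the stap probe with varnishstat's own
# object/objcore counts and the LRU nuking activity on the same host.
varnishstat -1 -f MAIN.n_object -f MAIN.n_objectcore \
            -f MAIN.n_lru_nuked -f MAIN.n_lru_moved
```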
[18:44:42] I'd start slowly restarting the backends, except perhaps one machine where we can debug further
[18:44:55] we could also automate the restarts
[18:45:21] e.g. have puppet auto-swizzle the hour/min args so they're random, but basically cron a varnish backend restart everywhere once a day.
[18:45:46] then if we want to pick one to debug, we can disable puppet and the cronjob
[18:46:09] but maybe ttl_cap=3d is enough, too
[18:47:52] (I don't know if faster natural expiry helps this problem or not, but it's possible)
[18:48:01] +1 for ttl_cap=3d, but it's probably wise to restart eqiad backends now by hand I think
[18:48:11] yeah
[18:48:17] to avoid surprises while I'm asleep :)
[18:48:35] codfw maybe too, but after eqiad and slower if they're not yet a problem
[18:49:10] I think with codfw we should be safe, so far MAIN.n_lru_nuked has been a pretty reliable indicator of when hosts will start crashing
[18:49:28] esams might happen first
[18:50:29] ok
[18:51:32] thinking more about the LRU_Fail code issues.... so we know the basic issue is it can't find one out of millions of objects that isn't busy, which is bullshit
[18:52:12] probably a bug with reference counting; one idea was that the preceding, different 503 transfer failure has a bug in error handling that leads it to screw up a refcnt
[18:52:41] another thing to consider is how it self-resolves after a while. why does it self-resolve ever?
[18:52:57] there must be some kind of "maintenance" task/thread that works on the LRU sometimes....
[18:53:30] yeah
[18:53:32] and maybe that's the whole nature of it. something wakes up and refcnt++ the whole list doing some kind of ideal/periodic maintenance, and when it gets done doing whatever it's doing it drops the refcnts and the problem goes away
[18:53:54] and maybe with smaller storage and less traffic a temporary lockup of the LRU isn't bad, but it causes our long LRU_Fail window here
[18:54:15] s/ideal/idle/ above heh
[18:57:03] I've seen some hints about things like that... exp_mail_it and related...
[18:57:48] "Post an objcore to the exp_thread's inbox."
[18:58:05] is the description of exp_mail_it(), which implies there's an expiry management thread, which must operate on the LRU, blah blah...
[18:58:19] ema: sounds ok to me, yeah
[19:01:11] bblack: also, have you seen the weird correlation with fetch_304?
[19:01:31] https://phabricator.wikimedia.org/T145661#2641321
[19:01:31] exp_thread() too
[19:02:52] fascinating....
[19:03:27] so, we could prevent 304 responses on the backend
[19:04:12] in theory, nothing has to support conditional req->resp. e.g. if, in the backends only, we stripped If-Modified-Since and similar on recv, they'd always be 200 responses instead of 304s, and it's all legal if inefficient.
[19:04:48] which might isolate the possibility that the 304 response code leaves LRU objects busy when it's done with them, or something like that (leaking refcnt on them)
[19:05:08] but even then it's not a complete explanation for the downramp in 304s
[19:12:35] uh, we don't have ganglia metrics for eqiad, the permission thing...
[19:12:39] chmod 644 /var/lib/varnish/*/*.vsm ; service ganglia-monitor restart
[19:19:03] I've just salted that on all cache_upload hosts
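Salting the ganglia permission fix across the fleet, as mentioned just above, presumably looked something like this (a sketch: the chmod/restart part is verbatim from the log, the salt targeting mirrors the earlier commands):

```
# Make the varnish shared-memory segments readable again and restart the ganglia
# monitor so it can resume collecting varnishstat metrics after the restarts.
salt -v -t 10 -C 'G@cluster:cache_upload' \
  cmd.run 'chmod 644 /var/lib/varnish/*/*.vsm; service ganglia-monitor restart'
```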
[19:39:46] I'm generally not watching IRC, but just text me if you need me to come back and poke at something
[19:40:56] bblack: thanks! I'll finish eqiad and do esams next
[21:05:31] so yeah today we learned it's not a varnish-be<->varnish-be only problem
[21:18:46] ema, haven't seen any more CL:0 requests today yet FYI.
[21:18:55] So good job whatever you fixed :)
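For reference, the automated, staggered restart discussed at [18:45:21] could be sketched roughly like this; the helper script name and the hostname-hash scheduling are purely illustrative, not what was actually deployed:

```
# Hypothetical cron.d fragment: derive a per-host pseudo-random hour and minute
# from the hostname so every backend restarts once a day, but not all at once.
SEED=$(hostname -f | cksum | cut -d' ' -f1)
printf '%d %d * * * root /usr/local/sbin/varnish-backend-restart\n' \
  "$((SEED % 60))" "$((SEED % 24))" > /etc/cron.d/varnish-backend-restart
```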