[09:34:14] <_joe_> I am restarting a few varnish backends in upload eqiad, progressively [09:34:24] <_joe_> I'm letting 10 minutes between restarts [09:43:27] _joe_: so, how does it look? [09:43:56] <_joe_> ema: there are 4 servers in eqiad with nearly 100% cpu usage [09:44:23] <_joe_> I restarted the backend on one, I can do all, I just don't know which metrics to look at to know it's safe to go on with the next [09:44:45] <_joe_> as I kinda got that you guys changed the persistency model of the uploads cache [09:44:49] <_joe_> am I right? [09:45:52] yeah, the hitrate on https://grafana.wikimedia.org/dashboard/db/varnish-caching is something to keep an eye on [09:46:37] but with the daemons failing like this it's not any better, let's depool at least two of those [09:46:53] <_joe_> can't we just restart a second one? [09:46:57] <_joe_> 15 minutes passed [09:47:01] oh, yeah [09:47:08] <_joe_> I'm sorry I usually can handle these matters [09:47:08] you can use varnish-backend-restart [09:47:13] <_joe_> I used it [09:47:17] <_joe_> :) [09:47:31] ok, 5 minutes are more than enough between restarts I'd say [09:47:36] <_joe_> ok [09:47:39] <_joe_> I had no idea [09:49:06] _joe_: I'm restarting cp1099 [09:50:23] <_joe_> it's quite interesting, cp1048 had high iowait [09:50:44] <_joe_> while all others had high sys cpu usage [09:51:05] high sys cpu usage is a common theme with these failures [09:51:19] <_joe_> the high iowait? [09:51:51] I don't remember seeing that [09:51:51] <_joe_> wow cp1063 is basically maxing out its connection [09:52:01] <_joe_> let me look at swift [09:52:07] yes please [09:52:26] load is going down on cp1063 [09:52:54] <_joe_> Current Load Avg (15, 5, 1m): [09:52:55] <_joe_> 90%, 96%, 100% [09:53:00] <_joe_> this is swift [09:53:02] <_joe_> jeez [09:53:12] <_joe_> yeah let's stop with restarts for now [09:53:20] yep [09:53:53] the 503s should be over soon anyways [09:54:38] <_joe_> maybe depool cp1073? [09:55:19] cp1073 is not returning 503s, but it doesn't look good at all [09:55:39] 76 loadavg [09:56:24] _joe_: how does swift look now? [09:56:30] <_joe_> bad [09:56:47] then let's wait a bit before depooling cp1073 I'd say [09:57:01] <_joe_> its network activity went from 1 Gbps to 1.4 over the last few minutes [09:57:29] how many varnishes did we restart in eqiad in total? I did 1 [09:59:56] <_joe_> I did 2 [10:00:00] OK [10:00:05] <_joe_> see the SAL for timings [10:00:17] Codfw 503s are OK in icinga now [10:00:56] <_joe_> yeah let's just wait for the storm to pass [10:01:07] yep [10:01:33] <_joe_> acting on large, high-traffic distributed systems is like steering an aircraft carrier [10:01:56] <_joe_> it takes patience and some guts :P [10:02:39] :) [10:03:29] cp3044 and cp3037 are also returning some 503s [10:03:49] <_joe_> yeah but let's maybe wait for the swift storm to pass [10:03:57] loadavg 5015 [10:04:00] sorry 6015 [10:04:11] (that's cp3044) [10:04:30] wtf [10:05:27] _joe_: yes we can wait a bit before restarting them [10:06:38] thank you for noticing this and for calling me BTW [10:06:54] <_joe_> yeah if I was more sure of the consequences [10:06:59] <_joe_> I wouldn't have needed it [10:07:06] <_joe_> but this is all quite new [10:07:12] yep [10:08:43] <_joe_> ok swift load is slowly recovering [10:08:54] <_joe_> man for being such a crude tech, it's pretty impressive [10:09:01] <_joe_> how resilient swift is [10:11:25] _joe_: I'll depool cp3044 if you agree then [10:11:50] load average is now still around 5k...
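A rough sketch of the kind of depool/restart/repool wrapper that varnish-backend-restart (used above, and later run from cron) presumably is; the real script is not shown in the log, so the confctl arguments, the varnish-be service tag, the service name, and the sleep are assumptions. The comments also note how the combination of set -e, cron's minimal PATH, and a bare "service" call (discussed below) can leave a host depooled when the restart step fails:

    #!/bin/bash
    # Hypothetical sketch only -- not the actual /usr/local/sbin/varnish-backend-restart.
    set -e
    HOST="$(hostname --fqdn)"

    # Take this backend out of rotation before touching it
    # (pool state lives in etcd and is managed via confctl).
    confctl select "name=${HOST},service=varnish-be" set/pooled=no
    sleep 30   # let in-flight requests drain

    # Cron's default PATH has no /sbin or /usr/sbin, so a bare "service varnish
    # restart" dies with "service: command not found"; with set -e the script
    # then aborts before the repool step below, leaving the host depooled.
    /usr/sbin/service varnish restart   # or: /bin/systemctl restart varnish

    # Put it back into rotation once the backend is up again.
    confctl select "name=${HOST},service=varnish-be" set/pooled=yes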
[10:13:06] and now cp1073 is failing, sigh [10:13:12] <_joe_> sigh [10:13:28] <_joe_> we really have to wait, IMO [10:13:37] <_joe_> swift is on the verge of collapsing [10:14:12] <_joe_> maybe another 30 minutes [10:16:40] ok the issue on cp1073 stopped [10:17:02] (which is something we've seen often, after a while the problem "solves" itself for a bit) [10:18:14] uh I'm now reading the backlog [10:20:43] so we're running with automated restarts, I didn't know the patch was merged [10:20:57] <_joe_> ema: let's wait until the swift load goes down below 100% before doing anything else [10:21:21] <_joe_> it's settling very very very slowly down [10:21:35] _joe_: sure, 503s seem under control at the moment [10:21:40] <_joe_> cool [10:23:34] _joe_: root@cp1073:~# cat /etc/cron.d/varnish-backend-restart [10:23:34] 30 10 * * * root /usr/local/sbin/varnish-backend-restart > /dev/null [10:24:05] it's going to restart automatically in 7 minutes, we can comment it out though if you think swift needs more time [10:25:13] ema: not sure if relevant but I can see "/usr/local/sbin/varnish-backend-restart: line 13: service: command not found" in the cronspam [10:25:37] elukey: very relevant, sigh [10:25:43] <_joe_> nah [10:25:51] <_joe_> it's getting better [10:25:59] <_joe_> elukey: hah [10:26:02] well I guess we don't need to comment out the cron job anyways given what elukey said :) [10:26:06] <_joe_> /sbin/service :) [10:26:38] ok this explains lots of things, such as the fact that some daemons have been running for 2+ days [10:26:41] let me fix the cron job [10:26:45] <_joe_> heh, yes [10:26:46] elukey: thank you [10:26:50] <_joe_> elukey: thanks :) [10:31:23] elukey, _joe_: https://gerrit.wikimedia.org/r/311326 [10:32:44] very good, +1 :D [10:33:09] the number of times I've been bitten by cron's default path is not even funny [10:34:17] <_joe_> yeah, right [10:34:21] :) [10:36:04] (could have used systemctl which is in /bin) [10:36:05] <_joe_> btw, swift has settled down a bit but it's still churning at 30-40% more than its normal rate at this time of the day [10:44:52] <_joe_> did you restart more caches, ema ? [10:44:58] nope [10:45:07] but I did fix the restart script [10:45:13] <_joe_> because swift is now doing 60% more throughput than usual [10:45:28] <_joe_> and, we're clearly not prepared [10:46:16] so cp1073, cp1074, cp1072 and cp1071 are not serving varnish-backend requests [10:46:23] that's weird [10:48:02] I see lots of pooled: no in etcd [10:48:09] confctl select dc=eqiad,cluster=cache_upload get|grep '"no"' [10:48:46] I'm gonna repool cp1074 [10:49:31] and the others [10:50:01] that should help with swift's load [10:51:57] _joe_: how's swift doing? [10:55:24] aaand now esams [10:55:26] ema: the depooled hosts might be the ones that got the wrong cron script executed? [10:55:50] elukey: yes [10:56:26] I only saw the list now, it matches :/ [10:56:59] <_joe_> ok now this all makes more sense [10:57:04] <_joe_> set -e , heh [10:57:07] ok now cp3044 and cp3037 are really going crazy [10:57:30] <_joe_> swift is now ok [10:57:41] _joe_: ok, restarting varnish on cp3044 and cp3037 now then [10:57:45] <_joe_> did you repool a few machines?
[10:57:50] _joe_: yes I did [10:57:52] see SAL [10:58:01] <_joe_> that kinda explains everything I guess [10:58:39] <_joe_> we had fewer varnish backends active [10:58:44] <_joe_> so fewer cache hits [10:58:49] <_joe_> and higher swift usage [10:58:52] absolutely, in particular in eqiad it was really bad [11:13:08] <_joe_> wow swift traffic is down to 0.6 Gbps from 1.6 [11:13:18] <_joe_> like 25 minutes ago [11:14:01] yeah that's when I repooled eqiad [11:14:09] <_joe_> yes [11:14:20] <_joe_> well, we seriously risked losing uploads for good [11:14:37] <_joe_> I mean if the cron ran on every server [11:16:46] I'm double-checking the pooling status everywhere [11:21:54] now this also explains the crazy loadavgs that we haven't seen before [11:22:06] we were hammering the few pooled varnish backends [11:22:56] I've confirmed that everything is pooled now [11:34:58] _joe_: looks like we made it [11:35:04] thank you :) [11:36:54] hey ema...I asked the analytics guys but nobody seems to be there. Is there a good way to download the grafana statistics to use in my thesis as pictures? They would be better than anything I can create right now^^ [11:37:53] Snorri: I don't think so [11:38:11] usually I take screenshots [11:39:01] too bad. Screenshots it is I guess. Thanks [11:40:59] the situation looks stable, I'm going afk [11:51:30] _joe_: I understand that fewer varnishes → more load into swift, but… losing uploads? [12:55:49] yeah... [12:56:15] so probably what led to this condition was the restart crons, because they depool then fail to restart, and don't repool [12:56:39] that's probably why things were imbalanced at the start of this when _joe_ noticed [13:08:21] ema: did you have a varnishstat output somewhere from a backend that had just gone into LRU_Failed? [13:10:31] found one [13:14:04] so, the cp4013 varnishstat you have from lru_failed time, exp_mailed - exp_received is in the ballpark of 4M objects [13:15:08] while main n_object is ~7M and the lrufail output shows the LRU n_objcore at ~13M [13:15:40] I'm not entirely sure they're all using the same units, because I think large files are broken into distinct chunks to insert into storage [13:16:08] but surely being 4 million objects behind on the mailbox to the expiry thread is an important link in the logical chain here [13:16:43] everything that's pending in the mailbox has a refcnt just for being pending in the mailbox, which explains why LRU_Fail tries to iterate the LRU and doesn't find any unbusy objects to free. [13:18:47] I don't think I entirely understand how/why varnish uses that mailbox... [13:19:10] I'm starting to guess, though, that this is what the waitinglist we saw skyrocketing was about [13:19:49] (that when a thread needs to nuke storage to make room for an incoming object, it has to ask the expiry thread to make the room for it through a mailbox request, and then the thread waits on the waitinglist for the expiry thread to finish the request?) [13:24:16] ah, hmmmmm [13:25:00] so, when a worker thread does EXP_NukeOne and succeeds, it has to mail the expiry thread for it to manage the actual object, but it does get to immediately call ObjSlim, which frees most of the object's storage (for e.g. file data), just not the object structure itself in various tracking [13:27:17] the mailbox is still puzzling.
or I guess a better way to state it: exactly which things the expiry thread exclusively owns and must manage itself is puzzling [14:14:58] hmmm 503s ramping up again, cp1073 [14:16:05] restarting varnish there [14:16:39] I guess since the cronjob was ineffective the past ~24h, we're still "behind" on the actual restarts, and we're going to hit more issues before the cronjob cycle really comes into full play [14:18:26] root@cp1073:~# egrep 'exp_|n_obj' vstat1 [14:18:26] MAIN.n_object 16498539 . object structs made [14:18:26] MAIN.n_objectcore 16500377 . objectcore structs made [14:18:29] MAIN.n_objecthead 5567400 . objecthead structs made [14:18:32] MAIN.n_obj_purged 5184 . Number of purged objects [14:18:35] MAIN.exp_mailed 61482675 256.81 Number of objects mailed to expiry thread [14:18:38] MAIN.exp_received 55882897 233.42 Number of objects received by expiry thread [14:19:28] exp_mailed - exp_received is 5599778 [14:19:40] which is close to (slightly greater than) n_objecthead [14:20:31] which again supports the theory that LRU_Fail is because all objects in the LRU are "busy" (have a refcnt) because they're backlogged in the expiry thread mailbox (which holds a refcnt between mailed -> received) [14:20:40] <_joe_> Platonides: well if the crojob kept depooling varnishes [14:21:10] <_joe_> we would've ended up with too few of those, and probably end up overwhelming swift [14:21:44] I was going to say "pybal won't depool more than half", but that's not the problem here heh [14:21:52] I don't think we have any depool threshold for the varnish<->varnish stuff [14:22:12] it would eventually go down to 1x backend with much reduced storage and much higher miss [14:22:24] <_joe_> and yes, I was thinking pybal [14:22:25] I think it would probably fail VCL recompilation trying to remove the final backend [14:22:36] <_joe_> but yeah we don't have any safety measure there [14:22:43] <_joe_> we might want to add it somehow [14:27:32] I missed that the depool *did* work :/ [14:28:02] but an overwhelmed swift shouldn't lose pictures anyway… [14:31:53] well it wouldn't lose the source images, no [14:32:14] but we'd "lose upload" in the sense that we'd stop reliably serving images to users until things got repooled and recached [14:32:55] bblack: so we've got a bunch of varnish-be processes with 5+ days uptime [14:33:18] perhaps we could restart those preemptively [14:34:01] yeah [14:34:07] and not step on the impending crons heh [14:34:31] cp2022 cp2005 cp2017 cp2011 cp2026 cp2020 cp2008 [14:34:48] are those the only ones globally, or just codfw? 
[14:35:01] globally [14:35:17] they're the top hosts sorting by MAIN.uptime [14:35:22] ok [14:35:22] the next one after them is cp1049 [14:35:55] 239405 was the uptime on cp1073 when it failed [14:36:16] a little under 3 days [14:36:18] cp1049 has uptime 247626 [14:36:37] ok [14:37:24] eqiad being at the bottom of the stack probably gets more raw total requests and thus gets there faster [14:37:57] (also being the only one which currently gets FE traffic not filtered by another BE first) [14:39:28] on the FE size limit thing, it looks like while 1MB->512KB change did very little (hard to tell from noise), going from 512KB->2MB seems to have dropped the hitrate slightly [14:39:41] so 1MB may be luckily near optimal [14:40:00] at least for now [14:40:22] once the two-hit-wonder thing lands and we get more hittable objects into FE cache, the optimum size probably shifts a bit upwards, too [14:41:51] ema: I'll go through those codfw backends and cp1049 for now, checking their cron time first in case it's close anyways [14:42:12] ok [14:46:08] so another factor here on upload is how our storage is laid out and used [14:46:16] we have 4x files: 2x bigobj and 2x main [14:46:35] files >= 100MB size go randomly into the bigobj ones, and smaller ones randomly between the two main ones [14:47:15] bigobj files are 2x60GB and main are 2x300GB [14:47:38] there could be something up with that in general that exacerbates the problems with LRU nuking and mailbox, etc [14:47:38] nice, purge went from ~150k req/s to ~60k (because of the filter I guess) [14:47:48] globally you mean [14:47:54] yeah [14:48:03] uh hey :) [14:48:09] I didn't filter for upload :) [14:48:23] for upload it drops by a factor of ~1000 :) [14:48:35] crazy [14:48:38] basically almost all of the PURGE is not really for upload [14:49:06] even before our PURGE traffic got crazy last december and kept getting crazier afterwards, upload was the smaller of the two sets (the other being text) [14:49:20] but all of the crazy things multiplying the text purge rate don't really apply to upload [14:50:10] still, the regex filter in vhtcpd is just a minor optimization. VCL has the same filter and rejects them with "Domain not served here" without doing an actual purge operation. [14:50:34] so I don't think this affects any LRU problems directly. but it does dramatically cut the raw HTTP request rate flowing into varnishd, which has to be worth something. [14:52:55] so the bigobj/sizes thing, a few things to think about: [14:53:21] 1) in text or other normal caches, we don't explicitly pick a storage backend at all. They have 2x360GB normal "main" files and varnish choses between them internally. [14:53:37] whereas on upload, we explicitly pick them using std.random in both the bigobj and main cases [14:53:55] is it possible varnish internal selection is smarter, and we're making things grossly inefficient by picking them randomly and explicitly? [14:54:08] why do we do that? 
[14:54:50] well, the main functional thing we're doing there is ensuring that 100MB+ files don't evict tons of smaller objects, by splitting storage on size [14:55:22] in all cases we have 2x SSDs and thus 2x files, but at least under v3 if we want to pick specific files based on size, we have to pick the exact file, not just "use this pair of files instead of that pair of files" [14:56:34] so because of that, we're picking between the two equal files in each class, explcitly in VCL with std.random(), instead of varnish making some internal choice, which might be based on say fullness or lock contention or who knows what (never looked) [14:57:01] moving on though [14:57:33] 2) The split there, like so much of our ancient tuning/setup, is probably arbitrary and suboptimal. Should it be 60/300 as the split? should 100MB be the cutoff? [14:57:55] these things probably have effects on how efficient nuking is [14:58:29] right [14:58:56] 3) In light of all of this, should we be more-worried about extreme edge cases sizes? as in, should we have VCL explicitly pass and refuse to even try to backend cache files larger than X, where X might be 1GB or so? [14:59:44] maybe the trigger that occasionally really locks up the nuking/expiry process is varnish trying and failing to cache a 1.5GB file through the bigobj storage and trying to evict for it and failing, or whatever. [15:00:13] is there any way to get the object size in exp_nuke? [15:00:17] or on the other end of things, it could be that 100MB is too big for the main storage when there's also known to be e.g. 1K-sized objects in there commonly (SVGs) [15:00:38] that could lead to a situation where one new object has to nuke 100K existing objects at the extreme [15:01:31] MSE internally solves this problem by size-classes subsets of storage, much like jemalloc does for the general malloc and smalloc cases [15:02:00] we could hackily solve the same problem with smarter size classing of storage, possibly into a larger count of distinct chunks than we do today [15:04:04] so in two very different directions we have things to explore there [15:04:20] maybe we need to extend this scheme and break it up into 3-5 classes instead of just 2 [15:04:37] and maybe random selection between the 2x equivalent files in each class is suboptimal [15:04:53] but with more classes we could avoid ever having 2x equivalent files, perhaps [15:05:20] /dev/sda could have the 3 smallest classes and /dev/sdb could have the 2 biggest classes, or whatever [15:05:25] one file per size-class [15:06:14] there are so many variables there, and the time to stabilize after a change so long, it would be hard to ever get it tuned right just from random experiment [15:06:44] but if we took a peek at the dataset's size statistics (both on a per-request basis and on the basis of what's in swift), it could give us some better rough ideas [15:07:31] http://bit.ly/2cQxMWM <- why would n_lru_nuked decrease? [15:07:59] because the nuking process locks up [15:08:21] under normal conditions with filled storage we expect a constant high rate of nuking [15:08:42] but it's a counter of nuke operations since startup AFAIK [15:08:46] when we hit LRU_Fail, the nature of the issue is we can't nuke anymore because every object in the LRU is "busy". 
So all nuking stops, and requests that should've nuked something return 503 instead [15:08:46] it shouldn't decrease [15:09:03] it goes back to zero on restart [15:09:09] sure [15:09:15] the step might be artificial, too [15:09:36] they're 5 minute stat intervals, I think that's just 1 stat interval averaging over the restart or whatever [15:10:12] also we don't know for sure how the LRUs work [15:10:26] it would make sense that there would be an LRU per storage file [15:10:36] so maybe one locks up first, then the next, etc [15:11:16] mmh [15:11:30] vk crashed on cp1071 and cp1072 [15:11:41] Sep 18 14:52:58 cp1072 varnishkafka[86743]: VSLQ_Dispatch: Varnish Log abandoned or overrun. [15:11:52] nice [15:12:42] but varnish-be wasn't restarted on either of them [15:13:00] restarting vk [15:14:09] I think if we start from the point of "hey, the expiry thread is getting behind on processing its mailbox", which seems like its performance sucks because of raw io and/or storage fragmentation and/or locking... [15:14:48] splitting up storage into more bins of smaller size, that are more constrained in their size ranges and thus suffer less from fragmentation, might be an answer [15:15:33] seems reasonable [15:15:47] like I said, stats would help a lot picking the right magic numbers [15:17:10] but I could easily see us doing, say, 5x classes using 0-15K, 16K-511K, 512K->10MB, 10MB->100MB, 100MB->1GB, and then skipping straight to swift for 1GB+ [15:18:06] the problem is picking the cutoffs, picking the total size of each bin, and then balancing the files between the SSDs to level out io [15:18:52] <_joe_> swift is under some pressure again, btw [15:19:19] <_joe_> but this seems like mostly organic traffic-driven pressure [15:20:13] we do have some backend restarts going [15:20:28] and the automated ones [15:46:24] bblack: I'm going afk, text me if needed [15:47:19] ok [15:48:06] digging in code on the stevedore stuff: varnish by default rotates through the available stevedores (storage backend files), which is slightly different than our "random" choice which probably doesn't rotate very perfectly [15:48:41] there is a case in which it will skip over the first-picked stevedore and try a different one, which I don't yet understand [15:49:03] once it gets past that, it only tries to nuke with the nuke limit (if not enough space) within the one chosen stevedore [15:50:49] ah that's also space allocation [15:51:06] so basically the default algorithm, lacking us setting a specific storage hint (like we always do on upload) is: [15:51:49] rotate to the "next" stevedore after the last one used. if allocation of space succeeds there, done. [15:52:17] if allocation initially fails there, rotate through them all until you find one with space for the allocation. [15:52:56] if none of them have space, go back to the first selection (the "next" one in the list after the last one used), and try to nuke some free space to make room [15:53:26] when we explicitly and correctly set a hint, we basically bypass all of that. it picks the one we tell it, and nukes space there if necessary. 
[15:54:07] <_joe_> if the hint is not optimal, that would mean more nukes, though [15:54:20] right [15:54:35] <_joe_> It's very similar with what happens with memcached and slabs [15:54:50] in the default case, though, until they get close to full it will be rotating between the two, so they should approximately fill at the same rate [15:55:35] <_joe_> my point being: say one of your storage file backends is full, while the other is not at 50% usage [15:55:36] once they're both close to full, we get into the kind of behavior where it's picking whichever one already has enough space, and if neither does then it's rotating the choice on which to nuke from (which may free more space than needed, leading to that stevedore getting chosen again quickly as well) [15:55:54] it would be extremely unlikely to have one full and one not-near-full [15:56:01] except in the case of extremely large objects relative to storage size [15:56:58] but again we do have at least one size-class distinction already in place. the larger main storage that sees most of the action doesn't ever have objects >=100MB size [15:57:29] and the storage size there is 300GB per file, 600GB total between the two [15:57:53] <_joe_> ok sorry I got it [15:58:02] so ~0.03% of storage is the most a single object can take there [15:58:04] <_joe_> when one is full, the other one gets tried anyways [15:58:21] I think g.dog said swift max object size was only a little over 1GB too, so the size spread in the bigobj storage isn't too bad. [15:58:45] _joe_: in the default stuff, yes. with our explicit setting of storage, no. [15:59:04] if somehow one file was 50% full and the other 100% full, we'd still randomly pick between the two with our VCL, and it might pick the 100% one and nuke for space there. [15:59:16] <_joe_> oh ok [15:59:22] <_joe_> that's the part I didn't get [15:59:27] it just seems unlikely we'd arrive at such a mismatched condition in the first place [15:59:42] <_joe_> I have no idea, do we have metrics for that? [16:00:05] <_joe_> I would expect that the vast majority of files we serve are under 10MB [16:00:15] <_joe_> like 99.5% or something [16:01:08] <_joe_> my experience with memcached showed it is easy to have in cache a ton of objects within a very narrow size range [16:01:30] <_joe_> and have those take up 90% of all the storage [16:01:46] <_joe_> something we didn't really expect, btw [16:04:15] yeah [16:04:38] the vast majority are small. we've looked imperfectly before and I think the average object size is well under 1MB [16:04:45] I don't remember the exact value [16:05:14] <_joe_> ok, wouldn't that fill up the file cache for that size pretty quickly compared to the other slots? [16:05:20] <_joe_> we have two slots, right? [16:05:35] <_joe_> sorry, I should stop asking dumb questions and just look at the VCL [16:06:03] so we have 4 total "slots" (storage files) in a cache_upload backend [16:06:18] 2x60GB files for >= 100MB size class [16:06:26] and 2x300GB files for <100MB size class [16:06:35] <_joe_> ok [16:06:45] on a normal setup like cache_text, we just have 2x360GB files, and varnish choses internally with its smarter algorithm. 
[16:06:54] (there's no size classing in VCL) [16:07:05] we have 2x files at all because we have 2x SSDs to use [16:07:13] <_joe_> ok [16:07:37] so on upload sometime long ago, we decided to size class, probably to work around or optimize for similar nuke/fragmentation/hitrate issues [16:07:52] so we split each of the SSDs to have a 60GB and 300GB file [16:08:05] <_joe_> yeah the difference should be small to any practical effect as long as we have 2 files [16:08:19] varnish's internal choice algorithm makes no size distinction, so if we want to size-split we have to pick 1/4 explicitly in VCL, not just 1/2 size classes. [16:08:26] <_joe_> if we had more, it would make this way worse than the varnish original choice [16:08:39] so we're using std.random() choice between the two files in a class, vs their space-checking and rotation [16:09:08] yes [16:09:17] <_joe_> bblack: just to be sure: your working hypothesis is that somehow the thread that does LRU nukes is too slow to keep up? [16:09:44] something like that, but it's complicated and I don't understand it all [16:09:51] <_joe_> I have skimmed casually through what you and ema were saying [16:09:53] there's a special, singular expiry thread [16:10:03] and then there's ~500+ worker threads handling normal traffic [16:10:08] <_joe_> sigh [16:10:22] <_joe_> sounds like the mysql replication thread issue :P [16:10:36] when a worker thread needs to store a new object and there's no space in the storage, it nukes to make space [16:10:51] <_joe_> and does that communicating with that singular thread [16:10:58] in order to nuke an object from the LRU list and storage, the object has to not be "busy" (which is refcounted) from some other thread [16:11:32] within its own thread, without relying on the expiry thread, the worker can free up some space. It can free basically the actual object contents, it just can't free it from various metadata lists on its own [16:12:26] so it frees some objects' content space to make room during the local part of LRU nuking, and creates a new object and moves on. it also "mails" a request to the expiry thread to finish the job and remove all the nuked object's metadata as well. [16:12:42] <_joe_> so the high system cpu we see in those situation is from context-switching and message queues of 500+ workers trying to get attention from that singular expiry thread? [16:12:48] there are similar book-keeping "mailed" requests to the expiry thread for object creation too. it's the central book-keeper. [16:13:27] <_joe_> so now I understand the rationale behind your last VCL change [16:13:28] the expiry thread isn't *synchronously* doing anything for the 500 worker threads. it's asynchronously processing metadata updates for them. they don't have to wait on it to free up basic allocation space and allocation new objects. [16:13:41] (my last VCL change is completely unrelated!) [16:14:16] <_joe_> oh right that's for frontend caching [16:14:24] so what I've said so far about the expiry thread, it makes sense. it's not horribly inefficient, and nothing should really block on it, and it has time to catch up on mailed requests within reason, etc... 
[16:14:39] the catch is this: [16:15:10] in order for a worker thread to be able to free up some allocation space at all, it has to find an un-busy object to operate on (meaning no other thread is actively manipulating it or serving it, by inter-thread refcounting) [16:15:55] when the worker threads "mail" (enqueue) a request to the expiry thread to handle metadata updates, the object being present in this mailbox queue gives it a refcnt++, which then refcnt-- when the expiry thread pulls the request off the queue and deals with it [16:16:32] and we're seeing exp_mailed (to expiry thread) much larger than exp_received (expiry thread picks up the request and acts), meaning the expiry thread is falling behind processing its queue [16:17:01] if it falls behind by millions of mailed requests, eventually every object in the LRU is "busy" because it's in the mailbox, so now workers can't free any space because they can't find a single unbusy object to free from. [16:17:17] <_joe_> sigh [16:17:21] <_joe_> ok [16:17:27] the last one I just looked at, the backlog of mailed requests was as large as n_objhead for the storage in total... [16:17:47] <_joe_> and this doesn't seem like something we can really solve [16:17:50] <_joe_> easily [16:18:14] well there's a few paths to go down here. we don't know conclusively that there isn't an easy way to stop the expiry thread getting so far behind on its mailbox [16:18:30] <_joe_> unless we horribly misconfigured something... [16:18:35] well... [16:19:12] <_joe_> I know varnish is complex and even the most innocent-appearing VCL change can trigger absurd edge cases :) [16:19:22] so from a simplistic view you'd say: probably the expiry thread is efficient, and probably io is the limiting factor in processing those mailbox requests, and we're just screwed with a single expiry thread and this storage size, io capacity, request rate, etc... [16:19:33] <_joe_> esp at our size/volume [16:19:54] <_joe_> so run multiple varnish backend instances per host :P [16:19:58] but, there's a few escape valves there where we could maybe dramatically ease the "falling behind on the mailbox" problem: [16:20:35] <_joe_> if it's just IO that wouldn't help too much unless the expiry thread is more efficient for smaller files, though [16:20:44] 1) It's possible some simple changes to the expiry thread code can avoid some stupid delays. There are some conditional waits on the mailbox that have time constants which could be reduced/capped. I don't think that's the issue, but it could be on further code examination. [16:21:47] 2) It's possible that one of the drivers of the io capacity problems for the expiry thread (well for everything) is storage fragmentation and object size range within a storage unit (which are related) [16:23:05] as we constantly evict/nuke and write new objects, slowly over time the storage space becomes very fragmented. especially you can imagine this kind of scenario: a 1MB file comes in, and the first 100 items to nuke on the LRU are 10KB objects that are scattered all over storage. even after it nukes all those it has to nuke more and more until it finds a contiguous 1MB, or fails... [16:23:51] over time fragmentation just keeps getting worse on average, until the book-keeping work gets insane and maybe that drives up the io rate until the expiry thread can't keep up anymore [16:24:29] (it also nukes out a lot of objects pointlessly, since it doesn't look at the fragmentation problem at all.
it just keeps nuking from the LRU list until a contiguous region large enough magically appears) [16:25:25] so that's the idea with increasing our size-classing magic [16:25:40] instead of 2x60GB files for >=100MB, and 2x300GB files for <100MB. [16:26:09] we might split up storage across these two SSDs so we have 1x file for size class 0-64K, 1x file for size class 64K-1MB, ..... [16:26:33] split into say 4-6 distinct size classes, so that nuking gets much easier from the reduction in spread of object sizes within one storage file [16:26:38] and fragmentation should also reduce [16:26:54] <_joe_> all this left me wondering [16:26:57] also each one has its own LRU list, so we've split up into more total, smaller LRUs, and less total space per file [16:27:09] <_joe_> how this will work with the crazy purge rates we have in text [16:27:24] :) [16:27:31] <_joe_> but that's for later I guess :P [16:27:39] the size spread in text is smaller than it is in upload, luckily [16:27:48] but yeah, we may have to apply a similar approach there regardless [16:27:53] <_joe_> and yes, smaller files and more slots might help but also bite you in the ass [16:27:56] if it even works here [16:28:13] well, it's all about tuning the size-ranges and the %space per size class we allocate [16:28:45] we can make it "right" and actually increase our overall bytes and/or objects hitrate vs the more-naive situation we have today, regardless of the expiry thread mess [16:28:59] but it's a lot of magic tuning, so we need better stats on the size distribution of objects and requests [16:30:23] obviously if swift-storage stats or per-request stats say 90% of objects and/or requests are <64K and we only allocate 25% of our space to <64K objects, that would be awful [16:30:38] but with the right data in hand it can be an improvement rather than a problem [16:31:24] it's notable that with relatively very-limited storage space, the frontends get their peak hitrate if they refuse to cache objects >1MB [16:31:33] <_joe_> heh [16:31:50] and relatedly-notable that we know the total set of objects includes rare hits on files >1GB in size [16:33:01] frontends have ~100GB of total space. the equivalent in the backends for the "main" storage (<100GB size class) is roughly 6TB after chashing. [16:33:47] sorry <100MB size class above [16:34:29] <_joe_> that's a big difference, yes [16:36:48] ~80% of all requests are cache hits in a 100GB cache space that refuses objects >1MB [16:37:34] or in backend terms, if we sized a <1MB sizeclass as our bottom end storage and gave it only ~10GB per node out of the ~720GB total, it would also have fairly high efficiency. [16:38:08] we can do better than that, but it's a rough measure of the fact that there's room to do this right [16:38:42] <_joe_> actually a lot depends on what we're optimizing for [16:38:51] it does :) [16:39:00] in the frontends we're optimizing for object hitrate [16:39:01] <_joe_> if we optimize for reducing nukes at a constant cache hit ratio [16:39:07] in the backends, we really want to optimize for bytes hitrate [16:39:24] (focus on large objects more) [16:39:29] <_joe_> makes sense [16:39:46] <_joe_> sorry, gtg, family calls :) [16:40:08] the optimization for reducing nuke/lru related thrashing is "split up more size classes so we have a larger count of storage files, and a smaller spread of smallest->largest objects within each" [16:40:13] cya!
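Purely illustrative: one way the "one storage file per size class" idea above could be expressed as varnishd storage arguments, with the three smallest classes on one SSD and the two biggest on the other, as floated earlier. The class boundaries follow the 0-15K / 16K-511K / 512K-10MB / 10MB-100MB / 100MB-1GB strawman from earlier in the log; the file names, paths, and per-class sizes (summing to roughly the same ~720GB used today, split evenly across the two SSDs) are placeholders to be tuned from the request/size statistics, not actual proposed values:

    # Storage arguments only; the rest of the varnishd command line is elided.
    # One "file" stevedore per size class, named by its upper size bound:
    -s sc_16K=file,/srv/sda3/varnish.sc_16K,40G
    -s sc_512K=file,/srv/sda3/varnish.sc_512K,120G
    -s sc_10M=file,/srv/sda3/varnish.sc_10M,200G
    -s sc_100M=file,/srv/sdb3/varnish.sc_100M,240G
    -s sc_1G=file,/srv/sdb3/varnish.sc_1G,120G

VCL would then set a storage hint per response size instead of picking between two equivalent files with std.random(), and each file would carry its own, smaller LRU, as discussed above; anything over the top class would be passed straight to swift.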
[16:46:18] after reading the backlog [16:46:28] note that the case that you mention above [16:46:40] of an almost-full varnish and another 50% full [16:46:53] would tend to self-heal [16:47:17] in addition to being odd to reach to begin with [16:55:39] the varnish answer to all of this mess, of course, is that they realized long ago that "file" had these issues [16:56:04] persistent doesn't have these issues, for unique and complex reasons of its own [16:56:22] and malloc doesn't because it's fast and small, basically, unless you try to use it with swapspace [16:57:10] and their answer moving towards varnish4 was to deprecate persistent for mostly-unrelated reasons (so now it's broken for us too, it panics under workloads v3 persistent handled fine) [16:57:31] and say file is still there but doesn't work for some uploads and sizes, so just use malloc if you're using open source [16:57:49] and then they fix the LRU/nuking/size-class problems algorithmically in their commercial-only MSE storage backend [16:58:23] s/uploads/workloads/ two sentences up heh [16:58:53] so they don't have a storage backend in open source that can handle large storage size / set size and high rates, etc [16:59:30] at least not in any simple way. but maybe we can gather stats and do a custom crazy config with 6 different size-classes file storage backends and work around it [17:09:31] :( [17:46:23] hmmmm apparently we're still getting netmapper crashes on NULL ipstring, the VCL workaround in place isn't enough for whatever reason [17:48:23] I tried to logic through how that's possible. it doesn't seem like it should be, but clearly it is [18:09:06] root@cp1050:~# egrep 'exp_|n_obj' vstat1-lrufail [18:09:06] MAIN.n_object 15875375 . object structs made [18:09:06] MAIN.n_objectcore 15875517 . objectcore structs made [18:09:09] MAIN.n_objecthead 5051988 . objecthead structs made [18:09:12] MAIN.n_obj_purged 5969 . Number of purged objects [18:09:15] MAIN.exp_mailed 60261440 237.37 Number of objects mailed to expiry thread [18:09:18] MAIN.exp_received 54758133 215.69 Number of objects received by expiry thread [18:09:28] ^ cp1050 that was in LRU_Fail state (also up a long time, was due for a manual restart but hadn't gotten to it yet) [18:09:42] note again exp_mailed-exp_received > n_objecthead [18:27:51] cp1064 is the last one left on my hitlist to do manually (based on uptime and time-to-cron), just waiting a bit [18:40:16] I think I'm going to let it go for a bit, I'm tracking the stats above on it to see how things look at the LRU_Fail moment to confirm [18:42:30] I have a little perl script running in a loop every 47s based on varnishstat -1 output [18:42:33] mailed: 49557201 recv: 46052132 n_objhead: 4028749 backlog: 3505069 bpct: 87.0014240152464 [18:42:47] right now the mailbox backlog on cp1064 is ~87% of the n_objecthead [18:43:21] my theory is we'll see it go into LRU_Fail 503 if it reaches 100% [18:44:32] which if it just keeps climbing at its current rate, it will reach in ~45 minutes [18:54:05] n_objhead varies a bit both up and down, but the backlog just keeps growing, and even when n_objhead drops a little the percentage backlog/n_objhead still goes up every time [19:00:00] comparing cp1071 (relatively fresh) and cp1064 (seems doomed soon)...
~0.5% iowait vs ~5% iowait [19:00:17] Domains, Traffic, DNS, Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2646737 (Aklapper) @Naveenpf: Not **here** in T144508. This task is only about pointing wikipedia.in to 180.179.52.130. This task is **not** abo... [19:00:25] Device: tps kB_read/s kB_wrtn/s [19:00:41] cp1071 typical sample: [19:00:50] sdb 263.40 873.60 20410.80 [19:00:54] cp1064 typical sample: [19:01:06] sdb 4138.20 14880.00 12407.40 [19:01:36] it's writing slower, reading much heavier, much higher tps [19:02:59] some of that's undoubtedly not from the expiry thread itself. it's just that the worker threads, when trying to evict space for new objects, have to scan increasingly larger fragments of the LRU list to find unbusy objects, since most of them are tied up in the mailbox [19:08:45] the backlog% increase seems to be subtly accelerating, it probably won't be as long as I was thinking to get there [19:14:30] well for whatever reason 100% backlog/n_objecthead isn't the mark [19:14:39] it's at 100.5% and still chugging for now [19:15:00] the last one I looked at was around 110% when I looked, but had been in LRU_Fail for a bit [19:16:26] iowait has dropped back off a bit from previous highs though. patterns are changing a little [19:17:12] and that percentage now bounces around a bit, it doesn't always strictly go up [19:18:31] in any case, apparently n_objhead isn't a strict match for the item count on the LRU [19:27:13] hit it finally [19:27:16] LRUBBBB_ [19:27:16] (accidental paste of a grafana "All status_type @ All / upload" dashboard panel) [19:27:30] oops, bad paste [19:38:20] it was in the ballpark of 103% when it failed [19:39:24] perhaps even when they seem to be roughly 100%, there's still a race the worker threads can win as it's still processing mailbox entries [19:59:20] salt -v -t 5 --out=raw -b 101 -G cluster:cache_upload cmd.run 'varnishstat -1|egrep "n_objecthead|exp_"|sort|tr -s " "|cut -d " " -f2|paste -sd-/|sed -e "s/^/(/" -e "s,/,)/,"|bc -l' [19:59:58] ^ this output is super-helpful. most nodes now show zero backlog, but as the number there approaches 1.0 LRU_Fail is imminent, just a matter of time [20:00:44] it's terribly ugly, but I was trying to work within the crazy quoting/escaping confines of cmd.run and still get the job done [20:01:08] it basically spits out the value of (exp_mailed-exp_received)/n_objecthead [20:01:33] cp1074 was already at 1.03 so I went ahead and restarted it [20:01:54] (and is also another one that's been up for days, and was still ~2-3h away from its first daily cron restart) [20:05:10] cp3036 and cp3039 are the only other ones with significant backlog% that wouldn't naturally restart soon-ish, so I'll hit those before I sign off, spaced out a bit [20:05:38] after that I really think the daily cron restarts will take care of it for now [20:09:00] so, oxygen's logs can be used to extract 1/1000 data for upload.wm.o + status:200 and get all the CLs [20:09:42] from that data we can try binning different ways into size classes and multiplying out the storage size for decent hitrate, etc...
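The salt one-liner above computes the mailbox backlog ratio fleet-wide; the 47-second perl loop (not shown in the log) printed the same numbers for a single host. A plain-shell sketch of that per-host calculation, assuming the usual "name value rate description" column layout of varnishstat -1:

    # Print the expiry-thread mailbox backlog as a percentage of n_objecthead
    # every 47 seconds; values approaching ~100% preceded LRU_Fail above.
    while true; do
        varnishstat -1 | awk '
            /^MAIN.exp_mailed /   { mailed  = $2 }
            /^MAIN.exp_received / { recv    = $2 }
            /^MAIN.n_objecthead / { objhead = $2 }
            END {
                backlog = mailed - recv
                printf "mailed: %d recv: %d n_objhead: %d backlog: %d bpct: %.2f\n",
                       mailed, recv, objhead, backlog, 100 * backlog / objhead
            }'
        sleep 47
    done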
[20:10:06] it won't tell us the stats of the underlying dataset in swift, but the stats of actually-required objects are probably more relevant anyways [20:10:11] s/required/requested/ [20:11:49] notably-obvious from existing varnishstat is that the "bigobj" storage for the 100MB+ files doesn't get much traffic compared to the rest, but that's kind of expected, and we have to multiply the stats out to figure out a bytes hit ratio anyways [20:14:21] I'm trying [20:14:37] to think through what the math should be to come up with bin sizes and storage percentages [20:14:54] there has to be a better way than just experimenting on the list of response CLs [20:15:38] we should be able to at least make a script that gives us the optimal size breakdowns for a given set of binning cutoffs [20:16:03] and then shift those around until it's easy to get ~50% split between some grouping of bins, to fit them on the SSDs [20:17:43] other than the smallest (0-something) size bin, for the rest we probably want to keep the range down to where e.g. the biggest is no more than an X multiple of the smallest, where X is somewhere in the 10-100-ish ballpark [20:18:04] it's just hard to reasonably do that on the small end, but probably matters less, too [20:26:42] 0-16K, 16K-256K, 256K-4M, 4M->64M, 64M-1G [20:27:15] ^ 5 bins, with a 16x spread everywhere but the smallest bin. even that one's pretty ok if we assume <1K objects are minimal [20:28:04] and then can just fine-tune the barriers until the byte-hit-ratio from the 1/1000 data would work well and we can easily split them across two equal disks without wasting much space
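A sketch of the "script that gives us the optimal size breakdowns for a given set of binning cutoffs" idea: feed it one Content-Length per line (however those get extracted from the 1/1000 sampled oxygen logs; that step is not shown here, and the input filename is made up) and it reports how many requests and how many bytes fall into each candidate class, which is what would drive how much of the per-node storage each file should get:

    awk '
    BEGIN {
        # candidate cutoffs from above: 16K, 256K, 4M, 64M, 1G
        nbins = split("16384 262144 4194304 67108864 1073741824", cut, " ")
        label[1] = "0-16K";   label[2] = "16K-256K"; label[3] = "256K-4M"
        label[4] = "4M-64M";  label[5] = "64M-1G";   label[6] = ">1G (pass?)"
    }
    {
        cl = $1 + 0
        bin = nbins + 1
        for (i = 1; i <= nbins; i++) if (cl < cut[i] + 0) { bin = i; break }
        reqs[bin]++; bytes[bin] += cl; totreq++; totbytes += cl
    }
    END {
        for (i = 1; i <= nbins + 1; i++)
            printf "%-12s reqs: %10d (%5.2f%%)  bytes: %15.0f (%5.2f%%)\n",
                   label[i], reqs[i], 100 * reqs[i] / totreq,
                   bytes[i], 100 * bytes[i] / totbytes
    }' sampled-content-lengths.txt

Shifting the cutoffs and re-running against the same sample would show how the per-class byte shares move, which is the knob for getting an even ~50/50 split across the two SSDs without starving the classes that dominate the bytes hitrate.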