[00:39:21] with spersist, they repeatedly crash :P [00:40:58] it's making an heroic effort and keeping traffic relatively-healthy in spite of that, but the crashes aren't workable. I can't even chase them down fast enough :/ [00:42:02] so, reverting persist puppetization now, and thinking about what the next-best steps are [00:43:05] clearly, one option is reverting to v3, but that's a painful process too [00:47:39] we could try to find other ways to mitigate LRU_Fail in the sfile backends, e.g. by increasing FE hitrate (by prematurely applying some experimental optimizations we were saving for later), by cutting back the storage size, or by reducing TTL? [00:47:46] I'm not even sure if reducing TTL would affect LRU issues [00:48:05] (unless it draws some distinction in actually-expired vs not-yet-expired) [00:48:31] or we could restart backends constantly before they reach the heavy nuking phase... [00:48:54] (e.g. put in a staggered cron to restart them twice a day?) [00:52:02] for now I'm just reverting them to file storage, putting us back where we were [02:42:54] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2631544 (10BBlack) Since the above update: 1. Tried converting eqiad + codfw to `persistent`, and fighting through a bit on levels of 503s in case they were just from aggr... [02:45:15] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2631547 (10BBlack) Also, after the various restarts above, I re-set (at runtime with varnishadm) all cache_upload backends to 31 seconds for `lru_interval` in the hope tha... [06:49:39] happy times I see! [06:51:49] so, ulsfo has been quiet for the past ~5h [06:52:05] same as codfw [06:52:42] whatever happened between 1 and 1:30 seems beneficial [06:56:41] that's roughly the time when we went back to file storage [06:56:44] mmh [07:48:06] libvmod-netmapper uploaded to carbon [07:48:15] (1.4-1) [08:25:33] uh in codfw we have nuke_limit still set to 300 instead of 1k [08:26:03] lru_interval is 31 though [08:28:37] currently we have: nuke_limit=300 in eqiad and codfw, nuke_limit=1000 in ulsfo and esams [08:29:22] lru_interval=31 everywhere [08:31:15] I'd puppetize the lru_interval bump, reducing the amount of locking for lru list access seems like a noble goal [09:09:06] currently looking at ulsfo hosts [09:09:20] CPU usage seems proportional to varnishd uptime [09:10:47] varnishd has been running for ~4 days on cp4005 and cp4007 and cpu_user is ~14 there, while it's ~8 on other hosts where varnishd has been restarted ~18h ago [09:12:07] comparable number of objects in cache on all nodes, what is very different is the number of objects nuked of course [09:12:32] ~20M on 4005 and 4007 and ~2M on other hosts [09:13:09] I expect cp4005 and cp4007 to be the next to throw rockets [09:36:35] -sfile docs say: [09:36:38] > Note that if varnishd has to create or expand the file, it will not pre-allocate the added space, leading to fragmentation, which may adversely impact performance on rotating hard drives. Pre-creating the storage file using dd(1) will reduce fragmentation to a minimum. [09:37:55] if the reason for the "Could not get storage" errors is actually fragmentation, this could help [09:57:52] hey [09:58:26] so, yeah, puppetizing lru_interval=31 seems like a good idea. 
I just left the nuke_limit stuff out of whack from restarts I guess, and I'm really not sure whether we should mess with it yet or not. it was a shot in the dark earlier. [09:58:39] FE hitrate has gone up notably from the 1MB limit [10:00:01] re: fragmentation: on varnish3 we had patches to ensure both persist and file used fallocate() properly on linux (it didn't use it at all before). I think you dropped the file one because upstream was already using fallocate in v4? [10:00:21] if that's not working right, that could be an issue, yeah [10:02:39] yeah we dropped the patch because of https://github.com/varnishcache/varnish-cache/commit/eb1ca9d710bbfbf0ab9306b6e9b2789bbcf15c1e [10:02:50] I guess we could strace a file backend starting up with an empty FS, and also a restart on existing files, looking for ftruncate and fallocate calls, to be sure the logic works out [10:03:28] since varnish has various logic to skip fallocate() [10:05:49] in any case: I had lru_interval even higher at one point while nuke_limit was all 1k, and we still eventually had multiple LRU_Fail incidents. So unless the FE hitrate makes a big difference, it will probably recur. [10:07:00] I'm pretty sure on sfile the LRU_Fail is the root of all problems. There's hints in some other discussion threads about past LRU/storage-alloc bugs causing CL:0 responses when streaming/chunked, with no great resolution. [10:07:17] yep [10:09:25] the downside of lru_interval (and the reason it might be bad to take it to an extreme value) is that as it grows, LRU eviction choices become inherently less-accurate. [10:10:09] bblack: ok to merge https://gerrit.wikimedia.org/r/#/c/310240/ then? [10:10:30] yup [10:10:35] great [10:11:12] it seems like ++nuke_limit (assuming it's not useless) will make LRU_Fail less likely, but also potentially eat more cpu% [10:11:47] so maybe at least re-varnishadm them back to 1k, maybe puppetize for now as well so it survives future restarts? [10:12:35] I can see that it gets used in the current code in general. I think the ticket complaining about it being useless was that there's now an inner loop of some kind within each nuke_limit attempt [10:12:50] not sure [10:15:23] the big long pile of 503 in the graph last night that ended ~01:00 is from the spersist transition and the repeated child process crashes while using it, and yeah it ends when we go back to sfile [10:15:50] which is looking great right now, but a lot of them (eqiad, codfw) haven't got enough storage uptime to reach LRU_Fail issues again yet [10:16:26] yeah that's why I was keeping an eye on cp4005 and cp4007 which haven't been restarted recently [10:17:02] I do think further FE hitrate increase probably reduces the chance of LRU_Fail by dramatically reducing the backends' normal request rates [10:18:04] you could also try dropping the cacheable size limit there further. Maybe to 512k or 256k? it takes a while to see the effect in the graph, and there's probably a point of diminishing returns where it's not useful to lower it any further. [10:18:40] or deploying the hit/2+ patch, but we also know that patch will cause more pain on frontend restarts without more work [10:21:01] what still puzzles me is that the issue didn't happen when going straight to the applayer [10:21:24] why would that have anything to do with fragmentation?
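A minimal sketch of the strace check described above, for a depooled host; the scratch instance name, listen address, backend, storage path and size are illustrative placeholders, not the production settings:

    # empty-filesystem case first; then rerun without the rm for the
    # restart-on-existing-file case and compare the two traces
    rm -f /srv/sda3/varnish.main1
    strace -f -e trace=fallocate,ftruncate -o /tmp/varnishd-sfile.trace \
        varnishd -F -n sfiletest -a 127.0.0.1:8080 -b 127.0.0.1:80 \
        -s main1=file,/srv/sda3/varnish.main1,360G
    # after stopping it, look for the space-allocation syscalls:
    grep -E 'fallocate|ftruncate' /tmp/varnishd-sfile.trace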
[10:23:39] I don't know if we ever left codfw->direct or eqiad up and running on sfile long enough to see it [10:24:16] but you're right in that I think we've only observed it with varnish->varnish fetching backends [10:24:35] would it be a worthwhile to set back codfw->direct as an experiment? [10:24:52] that's not English but you get my point ^ [10:24:54] well we have eqiad as direct anyways. if left stable, it should eventually hit the problem [10:25:11] true, we can observe eqiad for that [10:25:14] it may take a day or two depending on access pattern [10:25:50] it's obviously a "rare" condition that eventually causes this locked-in LRU_Fail condition, somewhat randomly-possible, increasing chances as the storage fills [10:26:17] it also usually eventually resolves itself, after throwing 503 on miss for up to an hour [10:26:23] yep [10:26:40] I think multiple machines got affected by it at roughly the same time in ulsfo yesterday [10:26:54] I did get a chance to check a codfw node while it was ongoing, and short tests seemed to confirm that no misses were succeeding during that window [10:27:05] so the 503 plateau basically represents the whole rate of misses passing through [10:27:32] mmh [10:27:48] which makes me think: shouldn't the frontend try another backend when that happens? [10:28:04] no, I don't think it would [10:28:26] only if it became unhealthy, and health checks are /varnishcheck or whatever, which doesn't touch storage and succeeds as a synthetic response [10:29:08] there's the whole 503-retry thing, but I don't think it saves us here [10:29:13] oh, /varnishcheck works fine during those events [10:29:29] I actually think our retry-503-once logic that worked in v3 may not work right in v4, but that's an entirely separate point. [10:29:45] (and a minor one for most purposes, we can always sort it out later) [10:30:51] yeah, synthetic responses as well as cache hits work fine during the event. it's just the misses that want to fetch new objects into cache that fail. [10:33:32] bblack: I'll set nuke_limit to 1k with varnishadm then, puppetized here to survive restarts https://gerrit.wikimedia.org/r/#/c/310266/ [10:33:44] in any case, the fallocate thing might be good to double-check too. I wouldn't be shocked if varnish, for example, fails it or doesn't do it silently, based on the past [10:33:49] ema: ok [10:35:43] I'll be gone in ~1.5h and in a car for a while, then I'll be stuck sitting at the airport early for a few hours and might jump back online, then flying and getting to the hotel and getting settled in for another ~6h or so. [10:36:12] so basically it's all on you, I'm just trying to info-dump before my windows of relative unavailability :) [10:36:17] :) [10:37:40] faidon/mark might be done traveling before I am, not sure. they're not up to date on everything, but might prove invaluable to bounce ideas off of :) [10:38:22] I'm asking for possible ideas on #varnish too [10:38:33] good idea :) [10:38:56] I've been asked if n_lru_nuked increased and whether we've seen EXP_Inbox entries [10:39:16] none of which happened unfortunately [10:39:36] my worst fear is that basically they know "file" is unworkable at this combination of storage size + dataset size + request rate peaks + size distributions, and spersist is borked, and malloc+swap seems like a horrible idea, and it's basically "if you have this problem, you need VarnishPlus and MSE!" :) [10:40:36] is malloc+swap really such a bad idea?
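For reference, the runtime change being discussed above, as a sketch: run on each cache_upload backend (the default varnishd instance; the frontends would need -n frontend), and note that param.set changes are lost on restart unless also puppetized:

    varnishadm param.show nuke_limit       # confirm current value (300 in eqiad/codfw)
    varnishadm param.set nuke_limit 1000
    varnishadm param.set lru_interval 31
    varnishadm param.show nuke_limit       # verify
    varnishadm param.show lru_interval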
[10:40:55] well, we have no way to control it really, that's what bugs me the most [10:41:28] as in, we have to give the OS ~700GB of swapspace in general, and then malloc ~700GB, and hope it works out ok, but OS policy may get tricky since all other memory on the machine could get swapped out, too [10:41:44] e.g. it may decide to swap out some other critical daemon that only wakes up periodically, and cause some kind of stall [10:41:54] which is why we don't really use swap in the general case [10:42:07] also, discussion threads seem to indicate it won't perform as well as sfile, which makes sense [10:42:34] we also currently have vm.swappiness set to zero I think, that has to change and again has global effects [10:43:30] but really I don't get why malloc and file should have such radically different LRU behaviors. you'd think file could have essentially the same logic as malloc, just tracking its own allocations within the mmap'd address space... [10:44:20] I guess maybe they optimized it beyond that to suck less if you have slow disks, re: random access patterns and such? maybe? [10:45:18] we could write a new backend called filemalloc that uses the same code as malloc, but mmaps the file first and uses the mmap'd chunk as a fixed pool for jemalloc allocation calls with some pool argument? :) [10:46:37] the non-standard (as in, doesn't mirror normal malloc() API) jemalloc API does let you specify a specific arena to allocate from [10:47:22] hmmm arenas aren't pools though [10:50:59] it would be nice if some well-implemented malloc library had a way to do that [10:51:50] to mmap a file and set it up as some kind of pool/handle P that the malloc lib manages, and then call foomalloc(P, ...) which only works within that pre-allocated address space and manages it [10:53:52] (I've written such a thing before as a very simplistic special-purpose allocator, it's not hard to write a short C file that accomplishes that sort of thing. the problem is then you're missing out on all the important advanced optimization trickery of "real" malloc libraries) [10:57:36] ema: if you decide to look down the malloc+swap pathway: perhaps there's a kernel tunable to keep swappiness low for code and only swap out data? also, perhaps we could add a quick patch to varnishd to use mlockall() optionally, and use that in the frontend varnishd, so that the OS never stupidly swaps out frontend memory to make room for a heavy backend operation. [10:58:17] it would basically just be some CLI argument on startup that triggers very early on to do mlockall(MCL_CURRENT|MCL_FUTURE) syscall [10:58:24] then varnish-frontend can't swap [10:58:43] right [10:58:52] or upgrade to ATS [10:58:59] (how hard can it be) [10:59:04] :) [10:59:49] probably not hard, and it's probably worth spending some time on that decision-point all over again once we reach some kind of stability here, especially if that stability is "we can't do v4 backends for high traffic clusters" [11:00:22] but it's probably still a quarter+ project to eval->deploy ATS as a backend-only solution, maybe longer, and unpredictable [11:00:55] I think ~1yr ago it was still the right call to move to v4 first before considering ATS, but what we're seeing here calls everything into question probably. [11:02:26] I wonder if it's even possible to run v4 frontends + v3 backends? sounds painful figuring out how to deploy both in a single host [11:02:52] that raises another interesting possibility, though....
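(As an aside on the malloc+swap idea above, a rough sketch of the host-level pieces it would involve; the device name, sizes and swappiness value are hypothetical, and the mlockall() protection for the frontend would still need the varnishd patch described:)

    cat /proc/sys/vm/swappiness     # reportedly 0 on the cache hosts today
    mkswap /dev/sdX4                # hypothetical spare partition sized for the ~700GB
    swapon /dev/sdX4
    sysctl -w vm.swappiness=10      # value would need tuning and puppetization
    cat /proc/swaps                 # verify the swap space is active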
[11:03:22] probably one of the key things changing behaviors between v3/v4 is also the default do_stream, which is going to change the parallelism of allocations and busy objects, etc... with maybe LRU fallouts. [11:03:55] we could default do_stream=false for v4 backends (and we still have that size-based VCL logic to turn it on for larger things), and see if that mitigates LRU_Fail, too. [11:04:19] good point yeah [11:04:40] if we go down that path, maybe limit it only to upload (and future, text), because maps/misc seem ok with default-stream, and I know we already removed some stream-size hacks from misc too and no point putting them back. [11:05:02] alright [11:05:32] I'm gonna go out for lunch, please keep on writing down your thoughts here though :) [11:17:54] shutting down laptop to pack up, probably be back online for a bit in 2-3h [11:17:59] good luck :) [11:56:07] i'm in a transfer airport now [11:56:08] what's up? [12:15:56] faidon is already in SF btw, do not hesitate to call him if you need help [12:54:01] i've read up on this channel backlog now [12:54:27] * mark will grab some food [14:20:21] <_joe_> cache-misc seems to be down in ganglia [14:20:37] _joe_: looking [14:20:59] oh that's probably the gmond issue due to vsm permissions [14:21:52] <_joe_> ema: ok, just pointing that out :) [14:22:03] _joe_: thank you :) [14:23:40] <_joe_> ema: also, a quick sanity check on https://gerrit.wikimedia.org/r/#/c/310271/1/modules/role/manifests/cache/misc.pp would be appreciated [14:27:50] _joe_: is the renaming of palladium into pybal_config intentional? [14:28:32] <_joe_> yes [14:28:46] <_joe_> the director is for config-master, not "palladium" [14:28:57] <_joe_> and the corresponding role is pybal-config [14:30:35] ok [14:32:11] <_joe_> relevant puppet-compiler output https://puppet-compiler.wmflabs.org/4059/cp4002.ulsfo.wmnet/ [14:33:00] yeah I was staring at https://puppet-compiler.wmflabs.org/4060/cp4004.ulsfo.wmnet/ [14:33:07] _joe_: lgtm [14:35:09] _joe_: cache_misc back in ganglia, restarting ganglia-monitor was enough this time apparently [14:35:59] in other cases we had to restart the aggregators too [14:36:24] <_joe_> heh, ok [14:37:26] godog: speaking of which, when can we use prometheus? :) [14:37:34] -varnish-exporter has been accepted into unstable [14:38:34] ema: yep, in codfw and eqiad you are already good to go, and I have https://gerrit.wikimedia.org/r/309996 out for ulsfo/esams [14:38:48] \o/ [14:41:57] ema: for varnish-exporter I'm assuming puppetization would be similar to e.g. 
mysqld-exporter but two services for frontend/backend [14:42:26] godog: right, adding "-n frontend" to the frontend should be enough to differentiate [14:43:32] yup, /usr/bin/prometheus_varnish_exporter -n frontend [14:44:37] nice [15:24:11] ema: so, no further fallout yet I guess [15:25:15] bblack: right [15:25:47] I'm looking closely at cp4005, which has lru_nuked ~23M [15:25:56] last time it crashed, it was ~25M [15:25:59] ok [15:26:18] of course it's not necessarily related, but perhaps a good indication of uptime [15:26:46] if it does fall into LRU_Failed behavior and you catch it fairly early, I wonder if dynamically raising nuke_limit much higher at least temporarily would break the pattern [15:27:28] we can try yeah [15:27:41] modulo that ticket about nuke_limit being broken (-ish?), that's basically what's going on: it's not that there's nothing available on the LRU list, it's that it gives up searching after a bit and just 503s [15:28:20] and if/when that happens, I'll pre-create the storage file with dd [15:28:21] did #varnish have any interesting ideas? [15:28:41] not really, no [15:28:57] fallocate() should be just as reliable as DD, if we're sure it's getting executed at all when the file gets created [15:29:07] heh DD, anyways [15:29:27] I don't know exactly what sfile's behavior is on restart with existing file, though. does it retruncate->refallocate? [15:30:19] I don't know either [15:30:30] current docs suggest dd anyways [15:31:08] yeah it's just not a great general solution for random restarts/reinstalls/puppetization/etc [15:31:16] it's way slower than fallocate in execution time terms, too [15:31:37] we know fallocate() works right if it's being called, we've relied on it for years and had problems before we patched it in [15:32:12] the question-mark is whether varnishd definitely calls it in our specific case for create and/or re-create [15:33:54] the code reads like it will fail hard if fallocate fails [15:34:43] assuming the ext4 check passes (it should) [15:35:09] yeah [15:36:44] besides, even then the fallout should be iowait and perf issues, not LRU failure, I think [15:40:27] I will just double check with strace next time a host needs to be depooled :) [15:58:29] bblack: http://lab.nexedi.cn/donkey-hotei/slapos/commit/957185fc087b2c56ca865002c30e2fe1a0cbe12e [17:03:37] uh this is interesting [17:03:51] MAIN.n_vcl is 1 on all codfw upload hosts [17:04:06] while it's > 160 on ulsfo machines [17:04:35] that's the number of loaded VCLs, incremented for instance when issuing systemctl reload varnish [17:21:35] still looking good, I'm off for today o/
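Two follow-ups from the discussion above, sketched with illustrative values (the real storage path and size are whatever puppet configures): pre-creating the storage file with dd ahead of a backend restart, and inspecting the loaded-VCL pile-up:

    # on a depooled backend, before restarting varnishd
    systemctl stop varnish
    dd if=/dev/zero of=/srv/sda3/varnish.main1 bs=1M count=368640   # ~360GB, illustrative
    systemctl start varnish

    # loaded-VCL count and cleanup of stale (cold) VCLs
    varnishstat -1 | grep MAIN.n_vcl
    varnishadm vcl.list
    varnishadm vcl.discard <name>      # for each old, cold VCL in the list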