[02:46:51] I've got some analysis on the oxygen 1/1000 data over the past 61 days running. since it's 1/1000 it's not a perfect picture, but I think it's close enough that we can work with it for binning up storage
[02:47:29] preliminary results indicate it's probably reasonable to go with the bins above and get a decent balance between the SSDs at some reasonable file sizes that fill them....
[02:48:37] it'll be another hour or two for the last run over the data to complete, and then I'll summarize the methods -> data -> reasonable plan tomorrow
[02:49:38] basically given the size ranges earlier as bins numbered 0-4, the preliminary split looks something like (needs some adjustment at the upper edge of course):
[02:51:14] bin0: 3%, bin1: 20%, bin2: 43%, bin3: 30%, bin4: 4%
[02:51:35] which places bins 0, 2, 4 on /dev/sda and bins 1 and 3 on /dev/sdb to get 50/50-ish on space
[02:51:57] and the hit/access rates and byte-hit-rates seem like they'd be reasonably balanced under that kind of plan, too
[02:52:37] (with most of the i/o traffic ending up on bins 2 and 3)
[02:53:23] bin2 ends up being about the same size as one of the "main" storage chunks today, so it's a bit larger a chunk than I'd like
[02:54:00] but maybe it's ok when the size range within it is limited to 256K-4M instead of 0-100M. eviction/nuke stuff and total object count should be way lower.
[02:59:51] also notable in the data, there was a single object >=1GB that was accessed, not sure how many times (yet), over the past 61 days
[03:00:02] size was ~1.9GB
[03:00:47] and the average size of objects that fall into the 64M->1G bucket is only ~141M
[03:02:19] (and they're in the ballpark of 1% of objects, but only something like 0.0002% of reqs
[03:02:24] )
[03:04:18] (and then in terms of bytes transferred, that bucket works out to ~0.5%)
[05:28:30] <_joe_> nice
[07:55:13] long backlog today :)
[07:55:35] the automated restarts seem to be doing a good job at keeping the 503s down so far
[07:57:19] it would be awesome to have something like !tldr
[07:57:34] :S
[07:57:39] sorry :S
[07:57:44] aaarg :D
[07:58:01] :)
[07:58:11] keyboard doesn't like me today, not sure if I have to work or not :D
[07:58:52] anyhow, our data consistency checks alerted yesterday but they seem quiet now
[07:59:32] elukey: vk didn't like some of the restarts apparently
[07:59:56] Sep 18 14:52:58 cp1072 varnishkafka[86743]: VSLQ_Dispatch: Varnish Log abandoned o
[07:59:59] r overrun.
[08:01:15] mmm this is weird, maybe there is a timeframe in which it can't get the new shm log handle correctly?
[08:01:54] I'll try to prioritize https://phabricator.wikimedia.org/T138747
[08:01:58] that should solve the issue
[08:02:17] yep
[08:02:28] also note that systemd didn't restart vk when it failed like that
[08:02:40] whaaat?
[08:03:01] <_joe_> ema: care to take a look at https://gerrit.wikimedia.org/r/#/c/311378/ when you have 10 minutes?
[08:03:20] _joe_: sure
[08:03:38] ema: sigh, let's have a chat today about this
[08:03:44] whenever you have time
[08:04:10] <_joe_> uhm regarding vk, let me take a look at the systemd unit
[08:05:54] elukey: ok, vk doesn't log anything on startup right?
[08:05:59] I was looking at journalctl's output on cp1072 and it seems it doesn't
[08:06:03] _joe_: it's Restart=on-failure, so I guess it should do the right thing
[08:06:03] but maybe vk dies with a non-failing exit status?
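
An aside on the Restart=on-failure question just above: if varnishkafka's fatal "log abandoned" path calls exit(0), systemd sees a clean shutdown and on-failure never fires. This is only an illustrative sketch of that mechanism, not varnishkafka's actual code (the real exit path is behind the varnishkafka.c link cited just below):

    /* Illustrative sketch only: a daemon whose fatal-log path exits 0 looks
     * like a clean stop to systemd, so Restart=on-failure never restarts it.
     * Exiting non-zero (or looping and retrying) is what would trigger it. */
    #include <stdio.h>
    #include <stdlib.h>

    /* hypothetical stand-in for the dispatch loop's error condition */
    static int log_abandoned(void) { return 1; }

    int main(void)
    {
        if (log_abandoned()) {
            fprintf(stderr, "VSLQ_Dispatch: Varnish Log abandoned or overrun.\n");
            exit(0);               /* SUCCESS in systemd's eyes: no restart    */
            /* exit(EXIT_FAILURE);    would make Restart=on-failure kick in    */
        }
        return 0;
    }
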
[08:06:24] <_joe_> ema: yes that's very possible
[08:06:42] <_joe_> systemctl status varnishkafka.service before restarting it would tell us more
[08:07:19] <_joe_> ema: Restart= different from "always" should only be done for services that have been designed with systemd in mind, basically
[08:08:37] https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L2153
[08:08:47] exit code zero, systemd is right
[08:09:39] heh, there you go
[08:09:50] * elukey cries in a corner
[08:09:52] _joe_: there are a bunch of trailing tabs here https://gerrit.wikimedia.org/r/#/c/311378/1/modules/puppetmaster/templates/nginx-puppetdb.conf.erb
[08:09:56] and this one is my fault since I wrote that code
[08:15:07] _joe_: I'm not 100% sure if dashes in director names are ok
[08:18:37] Names of VCL objects cannot contain '-'
[08:20:11] <_joe_> yes, right
[08:20:13] <_joe_> amending
[09:26:51] 10Traffic, 10netops, 06Operations: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2647722 (10faidon) 05Open>03Resolved a:03faidon Telkom did not respond despite my repeated emails, but I've just removed the explicit route preference and tra...
[09:43:34] https://gerrit.wikimedia.org/r/#/c/311387/ <- varnish-frontend-restart, it's gonna come in handy for libvmod-netmapper upgrades
[10:05:14] oh we just got a puppet fail on cp3039, varnish-backend-restart and puppet shouldn't run at the same time
[10:05:32] I'll patch the script to disable puppet and re-enable it when done
[10:08:52] ema: it's not enough IMHO, puppet might be already running, I think you should try to disable but detect if it's already running and then decide what to do (skip this restart, loop-sleep until puppet ends, aborting after a timeout, etc.)
[10:09:31] also, if the cron script crashes in the middle for whatever reason puppet will not be re-enabled :(
[10:10:52] volans: valid points
[10:11:52] for point (2) we could re-enable puppet as a cleanup action in any case
[10:15:28] sure, I didn't even check the actual code that is executed, it was general advice about race conditions ;)
[10:18:57] ema: also given the crontab nature of puppet, if it conflicts on a host it will likely conflict on each run
[10:20:26] volans: right, I think fqdn_rand has a parameter to avoid that
[10:21:01] as in fqdn_rand(60, 'varnish-restart') guarantees a very different value from fqdn_rand(60)
[10:21:31] something along those lines could be a solution for point (1), or a helper script called run-after-puppet
[10:21:48] ema: yes but modules/base/templates/puppet.cron.erb looks like it is using a different algorithm
[10:22:48] volans: yeah, the general point was that we could try doing something along those lines to make sure the cron jobs are far apart
[10:23:02] yep, agree!
[10:23:08] but "run-after-puppet" might be a more generic and better solution :)
[11:31:13] 10Traffic, 10netops, 06Operations: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2647884 (10Cadar) Hi Faidon, thanks for the support on the access question. We seem to be up and connected to Wikipedia fine on this side. I've noticed similar issu...
[12:00:06] 18:08:19 varnish's internal choice algorithm makes no size distinction, so if we want to size-split we have to pick 1/4 explicitly in VCL, not just 1/2 size classes.
[12:00:06] 18:08:26 <_joe_> if we had more, it would make this way worse than the varnish original choice
[12:00:06] 18:08:39 so we're using std.random() choice between the two files in a class, vs their space-checking and rotation
[12:00:06] 18:09:08 yes
[12:00:12] bblack, ema: yes, indeed
[12:00:19] https://gerrit.wikimedia.org/r/#/c/29288/2
[12:03:05] std.random was purely put in place because you couldn't give varnish a set of backends to use (based on size), but had to give one explicitly - and therefore picking one randomly seemed the next best option
[12:03:20] 4 years ago already, geez
[12:04:35] and as you can see, at that time the small object store was file, the large object store was persistent
[12:04:58] and I can't say I remember it all that well, but I vaguely recall running into similar lru fail problems then as we are now
[12:13:05] i wonder how much work it would be to make varnish use a set of backends based on VCL, so its internal storage selection logic isn't defeated
[12:13:08] ...and whether that's even useful
[12:24:08] mark: probably the problems you've encountered led to this patch here https://phabricator.wikimedia.org/diffusion/ODCV/browse/3.0.6-plus-wm/debian/patches/0010-varnishd-nukelru.patch
[12:26:35] possibly
[12:26:40] although i don't recall that one at all
[12:31:49] although I clearly added martin's patch to add that function in the first place: https://gerrit.wikimedia.org/r/#/c/15124/1/bin/varnishd/cache_expire.c
[12:37:13] an aside: this bug was a real fucker too: https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/debs/varnish+branch:patches/optimize-epoll-thread
[12:38:04] basically deadlocked varnish in some cases
[13:07:22] varnish (3.0.3plus~rc1-wm9) precise; urgency=low
[13:07:22] * Log a message when EXP_NukeLRU can't destroy an object
[13:07:24] i suppose so
[13:09:52] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648055 (10BBlack) Where we're at on basic bug-mechanics explanation: 1. `LRU_Fail` has nothing to do with the nuke_limit. `LRU_Fail` means that within an atte...
[13:21:59] so yeah, looking back through my commit history etc
[13:22:24] it shows that i was seeing similar LRU issues back then, prompting me to add that error message, but at the same time experimenting with persistent side by side
[13:22:42] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648085 (10BBlack) In the meta sense, it seems like it's a "known issue" that the `file` storage backend simply doesn't scale for certain combinations of workloa...
[13:22:50] which for some reason doesn't experience the same issue
[13:24:27] heh, bblack just stated the same thing on the ticket
[13:25:04] yeah I have a lot to record, I'm trying to break it up into multiple ticket updates topically so it's easier to read through :)
[13:25:32] thanks for the effort
[13:25:41] I wish I had done the same at the time, might have saved us some time now :)
[13:26:11] i see I was changing many things at the same time, from squid->varnish, file->persistent, streaming range patches in a short period heh
[13:27:13] so you think breaking up into more storage classes might help?
[13:27:21] and not divide over two ssds?
[13:30:14] right: not divide one storage class over 2 SSDs, and have more fine-grained storage size-classes (5 of them), and choose their relative sizes based on our stats data
[13:30:33] I'm typing up that part now :)
[13:31:36] yeah
[13:31:58] could even consider using only a single ssd then
[13:32:03] it's not like there's really redundancy now
[13:32:20] * mark shuts up and awaits further ticket updates
[13:38:07] ema, bblack - I tried to solve the "Log abandoned/overrun" issue that we discussed earlier on with https://gerrit.wikimedia.org/r/#/c/311415/3
[13:38:22] there is some stuff to tune/tweak so any suggestion is really welcome
[13:38:29] atm I tried it and it works as expected
[13:38:45] %3 VSLQ_Dispatch: Varnish Log abandoned or overrun.
[13:38:45] %3 VSLQ_Dispatch: Attempt to reconnect to the Varnish log..
[13:38:45] %3 VSLQ_Dispatch: Log reaquired!
[13:39:09] that is more or less what varnish tools do
[13:39:29] the above use case is easily reproducible with a varnish restart
[13:39:36] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648172 (10BBlack) Currently we're surviving by scheduling a very conservative (in the sense of avoiding LRU_Fail) daily restart of all the cache_upload backend...
[13:39:44] meanwhile with a varnish stop I obtain an exit code = 1 (log abandoned)
[13:39:54] not sure if this is exactly the behavior that we want
[13:40:05] but the code review should be a good baseline to have a discussion :)
[13:42:37] elukey: yeah probably since VK is meant to persist (unlike, say, varnishlog), we should have it loop on retrying
[13:43:01] with some configurable small delay between retries, probably <1s but not to the point that it locks up significant spinning CPU cycles.
[13:43:45] if we make it too long, we run the risk of missing several events that overrun while we're not looking on startup
[13:44:00] so perhaps one of the other existing time values we use for that kind of window is appropriate?
[13:52:13] have you seen http://varnish-cache.org/docs/5.0/whats-new/changes-5.0.html#whatsnew-changes-5-0 ?
[13:52:19] more specifically, "Added separate thread for object expiry"
[13:52:22] whatever that means?
[13:54:49] if I had to guess, I'd say it proba
[13:55:14] ...bly means they've split the two functions of the v3/v4 expiry thread
[13:55:50] which is both to deal with mailed-in object alloc/nuke data from the worker threads, and also deal with "natural" expiry based on ttl+grace+keep
[13:56:02] it may very well help
[13:56:20] so let's just upgrade to 5.0 straightaway?
;)
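
Going back to the varnishkafka retry discussion a few messages up: the "loop on retrying with a small delay" idea has roughly the shape sketched below. The shmlog_* functions are placeholder stand-ins (not the real varnishkafka or libvarnishapi entry points), and the 100ms delay is just one plausible sub-second value:

    /* A sketch of the retry-on-abandon idea for a Varnish shmlog consumer
     * such as varnishkafka.  Everything here is a simplified stand-in: the
     * shmlog_* functions below are NOT the real libvarnishapi/varnishkafka
     * entry points, just placeholders showing the intended control flow. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>   /* usleep() */

    #define RETRY_DELAY_US (100 * 1000)   /* configurable; "<1s" per the discussion */

    /* --- placeholder stubs so the sketch compiles on its own -------------- */
    static int fake_calls;
    static bool shmlog_open(void)  { return true; }   /* attach to the shm log */
    static void shmlog_close(void) { }
    static int  shmlog_dispatch(void)                 /* <0 = abandoned/overrun */
    {
        return (++fake_calls % 3 == 0) ? -1 : 0;
    }
    /* ----------------------------------------------------------------------- */

    static void shmlog_open_retry(void)
    {
        /* Keep trying until varnishd's shm log is actually available; the same
         * helper can be called at startup and after the log is abandoned. */
        while (!shmlog_open()) {
            fprintf(stderr, "Varnish log not available, retrying...\n");
            usleep(RETRY_DELAY_US);
        }
    }

    int main(void)
    {
        shmlog_open_retry();
        for (int i = 0; i < 10; i++) {       /* bounded here only for the sketch */
            if (shmlog_dispatch() < 0) {
                /* abandoned/overrun (varnishd restart, VCL reload churn, ...):
                 * reconnect instead of exiting with status 0 */
                fprintf(stderr, "Log abandoned or overrun, reacquiring...\n");
                shmlog_close();
                shmlog_open_retry();
            }
        }
        return 0;
    }
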
[13:56:28] but, we're seeing the problem with the mailbox thing crop up before they've run long enough to have significant natural expiry load
[13:56:40] as our TTLs are ~7 days, and LRU_Fail happens after 4-6 days
[13:57:05] but if we tune things well enough to stay up a week, then sure splitting off natural expiry might help :)
[13:57:46] we could ask them what prompted them to change that
[14:10:39] bblack: I updated the code review with https://gerrit.wikimedia.org/r/#/c/311415/4
[14:11:03] but now if you do a varnish stop then vk will try every 100ms to reacquire the log handle indefinitely
[14:11:19] it kinda makes sense but not sure if this is what you meant
[14:11:37] adding a maximum retry could be an option
[14:11:42] (even if I don't like it a lot)
[14:50:15] elukey: ideally "stop" should invoke different behavior than disconnect
[14:50:29] stop should mean just stop :)
[14:50:46] but I haven't looked at your code review at all yet, so I have no idea what I'm saying or what the difficulties are :)
[14:51:00] will do Soon (tm)
[14:51:05] :D
[14:51:42] so from the point of view of vk, the log is abandoned and afaik it has no way to figure out that it will be abandoned for a while
[14:51:56] oh sorry
[14:52:06] I read "if you do a varnish stop then vk" as "if you do a vk stop then vk"
[14:52:12] yeah
[14:52:13] ahhh no no
[14:52:37] well, we should (and maybe we don't) have the service unit dependencies hooked up for that
[14:52:47] where if 'service varnish stop' happens, it stops vk too
[14:52:53] but I know there are some odd issues around that
[14:53:20] in the true-dependency sense, we want varnish to depend on vk, so that vk gets all the transactions and doesn't miss some at the start/end of varnish running
[14:53:31] well for some value of "true"
[14:53:49] but on the other hand, really vk depends on varnish (is a consumer of varnish's service)
[14:54:34] there must be some systemd magic to link them together, but I'm sure it's a PITA to figure out
[14:54:53] to say that vk starts before varnish and ends after varnish, but stopping varnish also stops vk because they're bound together
[14:55:19] BindsTo= and like 27 other related systemd unit file variables
[14:56:05] or we can just live with the 100ms loop while varnish is down
[14:56:19] it's probably not that painful, I imagine the CPU load of checking for "no shm log" is pretty low per 100ms
[14:57:27] i didn't check that, but vk already polls every 10ms if there is nothing returned from the shm log and cpu wise it is fine
[14:57:42] ok
[14:58:04] hmmm we just lost 2x vk in esams at virtually the same time
[14:58:29] yep
[14:59:04] Sep 19 14:50:04 cp3034 varnishkafka[141460]: VSLQ_Dispatch: Varnish Log abandoned or overrun.
[14:59:07] exit status 0
[14:59:13] so
[14:59:23] looking at that in syslog in the broader context, it's happening on VCL reload as a trigger
[14:59:45] VCL was reloaded because the etcd-driven backends list changed due to the autorestart of an esams varnish backend (daily for the storage bug)
[15:00:06] ok this explains why vk dies more often now
[15:00:30] right, on average it's something like once per 37 minutes we're restarting an upload backend somewhere in the world, staggered around the DCs
[15:00:39] (with etcd depooling)
[15:01:02] or another way to think of it: we're doing 78x automated etcd changes per day to varnish pool state
[15:01:12] (for cache_upload)
[15:07:53] elukey: looking at the patch, overall it looks good (I think at runtime it will basically keep retrying every 100ms indefinitely)
[15:08:49] we could also go for Restart=always if there are no reasons to avoid that, in addition to reconnecting
[15:08:51] elukey: but shouldn't we do that at startup, too? as in make a function that wraps varnish_api_open_handles() in a 100ms sleep loop until it succeeds, and call it both at the top of the runtime loop and at startup?
[15:09:43] well, if we did the sleeploop inside vk on startup, we could order things such that the varnishd service unit depends on the vk service unit, and vk is already polling @100ms when varnishd first starts, thus doesn't miss the initial inrush of events.
[15:10:25] (well maybe not "depends on", as we don't want systemd to shut down varnishd if vk crashes for some odd reason. but at least an After= startup ordering thing)
[15:35:06] (a lot of meetins! will read everything I promise :)
[15:35:10] *meetings
[15:47:29] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2648639 (10BBlack) I've done some analysis on our 1/1000 sampled data on oxygen yesterday, so the next couple of posts go into the methodology, data, and results...
[17:05:45] bblack: so the function that wraps varnish_api_open_handles at startup is needed to handle the case in which vk is up but varnishd is not?
[17:06:13] (I am trying to figure out the changes to do and the best way to do them :)
[17:08:24] so basically vk could start even without varnish up, start polling and then attach to the shm handle once ready
[17:08:36] without simply saying "no shm handle found, exiting"
[17:08:40] ah this is interesting
[17:08:45] didn't think about it
[17:09:15] checking what vk does with the last change
[17:09:46] %3 VSM_OPEN: Failed to open Varnish VSL: Abandoned VSM file (Varnish not running?) /var/lib/varnish/debian-jessie/_.vsm
[17:10:11] * mark went to varnish-software.com and it loaded pretty slowly
[17:11:20] ok so I'll work on it tomorrow and try to fix this use case
[17:15:35] (let me know if what I've written doesn't make sense, going afk for a bit!)
[17:18:34] elukey: yeah makes sense, thanks
[17:55:46] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2649474 (10BBlack) Diving a bit deeper on interpreting the above into useful planning: first, again, a few salient higher-level points: * **Bin4 (64M-1G) Issues...
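
To make the size-class plan in the ticket comment above more concrete, here is a rough sketch of the size-to-bin and bin-to-device mapping implied by the numbers from earlier in the day. Only the bin2 (256K-4M) and bin4 (64M-1G) ranges are stated in this log; the bin3 range is inferred from adjacency and the bin0/bin1 boundary (64KB here) is purely a placeholder:

    /* Sketch of the planned size-class split.  The bin2 (256K-4M) and bin4
     * (64M-1G) ranges come from the discussion; the bin0/bin1 edge (64KB)
     * is a placeholder, since that edge isn't given in this log.  Bins 0,
     * 2, 4 go to /dev/sda and bins 1, 3 to /dev/sdb for the ~50/50 space
     * split (the percentages are the preliminary share of cached bytes). */
    #include <stdio.h>

    #define KB (1024ULL)
    #define MB (1024ULL * KB)
    #define GB (1024ULL * MB)

    /* upper bound (exclusive) of each bin; bin4 tops out at 1G in the plan */
    static const unsigned long long bin_max[5] = {
        64 * KB,    /* bin0: placeholder upper edge, ~3%  */
        256 * KB,   /* bin1: ~20% */
        4 * MB,     /* bin2: ~43% */
        64 * MB,    /* bin3: ~30% (inferred range 4M-64M)  */
        1 * GB,     /* bin4: ~4%  */
    };

    static const char *bin_dev[5] = {
        "/dev/sda", "/dev/sdb", "/dev/sda", "/dev/sdb", "/dev/sda",
    };

    static int size_to_bin(unsigned long long obj_size)
    {
        for (int i = 0; i < 5; i++)
            if (obj_size < bin_max[i])
                return i;
        return -1;   /* >=1G: outside the plan (one such object in 61 days) */
    }

    int main(void)
    {
        const unsigned long long sizes[] = { 10 * KB, 1 * MB, 20 * MB, 141 * MB, 2 * GB };
        for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
            int b = size_to_bin(sizes[i]);
            if (b < 0)
                printf("%llu bytes: larger than any bin\n", sizes[i]);
            else
                printf("%llu bytes -> bin%d (%s)\n", sizes[i], b, bin_dev[b]);
        }
        return 0;
    }
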
[18:04:29] nice :)
[18:04:58] 10Domains, 10Traffic, 06Operations, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2649521 (10Aklapper) Clarification: If anyone wants to discuss the use of .wiki TLDs, please use T88873 - Thanks. (I merely closed T145...
[18:30:55] back on the subject of the daily-restart workarounds: now that we're really into the cycle where none of them have been up longer than a day, the exp_mailed - exp_received backlog is zero on all backends presently (well except cp3039, where I happened to catch it at the temporary value of 1)
[18:33:11] and plenty of the nodes have nuked objects in varnishstat, up to values around ~7M nuked objects for some of the oldest instances
[18:33:23] so it's not like it starts backlogging on the mailbox as soon as it fills storage and starts nuking
[18:33:47] it takes a while for the backlog to start forming, which implies this is more about fragmentation, which hopefully implies the storage split will be effective
[18:58:22] does exp_received actually increment at all once it hits the backlog-forming stage?
[18:59:04] ah, yes it does, according to varnishstat info in the backlog
[19:02:39] yeah it does, just not fast enough
[19:03:51] it's probably a self-reinforcing problem once the backlog gets significant at all
[19:04:06] yes
[19:04:13] have you singled out that thread in perf profiling when it was in that state?
[19:04:23] the more objects there are backlogged in the mailbox and tied up as "busy", the harder it is for workers to find nukeable candidates, which increases the pain of incoming mailbox items, etc
[19:04:49] not in perf, but yes in general cpu utilization. as it backlogs it takes more and more CPU, loadavg and iowait rise, etc
[19:05:02] (the expiry thread being the one doing most of the excess CPU compared to normal)
[19:05:19] strace on the expiry thread shows that in syscall terms it spends all its time in futex() operations
[19:05:45] which makes sense: all it ever really does is participate in pthread lock cycles with the workers and mutate in-memory structures
[19:06:43] that's another point in favor of the storage split. right now we have 2x useful stevedores for most objects, and the 2x bigobj ones that are rarely used (relatively speaking)
[19:06:53] yes
[19:06:54] and there's one lockable LRU list per stevedore/file
[19:07:02] now we'll have a lockable LRU per 5x size classes
[19:07:08] another lock contention problem: https://stevenengelhardt.com/2012/07/27/varnish-expiration-thread-lock-contention/
[19:09:09] anyways, next step is to puppetize the varnishd storage params + VCL work to do the size-class split and figure out the little details of space allocation and GiB vs GB and fs overhead, etc...
[19:09:21] this 5.0 change may also be related: https://github.com/varnishcache/varnish-cache/commit/f2fdfeade6b8edf3041e6d66bb0b376509894d24
[19:09:27] and make it conditional on a hieradata thingy so we can turn it on for one node and disable the restart cron there and see how it goes, etc
[19:09:48] which is also listed in the 5.0 changelog besides the multiple threads one: "Reduced needless rushing of requests on the waitinglist"
[19:10:42] right, along with the 5.0 thing about making a separate expiry thread. I haven't looked into what that means yet.
[19:10:48] yes
[19:10:56] but I'm assuming for now it means splitting the true-expiry and lru-nuking roles of the current singular expiry thread
[19:11:55] if I were braver and maybe slightly insane, I might try to just upgrade cache_upload to varnish5 first before mucking with storage size classing
[19:12:04] hehe
[19:12:05] but I think we're out on a limb enough as it is, and this will probably help regardless
[19:12:25] it doesn't look like a huge delta indeed
[19:12:39] the hit_for_miss stuff needs to be investigated maybe
[19:13:25] yeah
[19:13:46] the per-foo VCLs have the potential to really unwind some of our VCL templating complexity and make it easier to merge clusters, too
[19:13:55] there's some neat stuff in there
[19:14:19] yup
[19:14:59] late lunch, back after a while
[20:02:24] bblack: if I'm reading the code right, 5.0 simply moves all LRU-related code out of the expiry thread code, and moves it to the storage code instead
[20:02:48] so all LRU object searching happens there, and only at the end, if it found an object to nuke, it mails it to the expiry thread
[20:03:43] so I think that might run in all worker threads, although I'm not sure about that
[20:07:21] and that makes sense too, making LRU a concern for (each) storage backend instead of varnish main code
[20:18:36] that's kinda what happens in v4 too, but I'm sure there's a subtle diff
[20:19:04] in v4 the worker thread searches the LRU for a nuke candidate, and does ObjSlim() to free the content-storage of the nuked item to make space available to itself quickly
[20:19:30] but then mails the expiry thread to have it clean up the metadata and the objhead itself, etc (and manage the expiry head for natural expiry)
[20:20:33] the LRU is really only used for nuking, not for expiry
[20:20:45] expiry is a separate binary head ordered by expiration date, rather than LRU
[20:20:50] s/head/heap/
[20:22:58] I guess probably the difference is the worker manages the LRU part completely (including metadata), and while it still mails the expiry thread, that's just so the expiry thread can update the expiry binary heap
[20:23:55] I donno, I'll have to stare
[20:25:43] yeah
[20:27:50] it seems pretty naive code in general
[20:28:41] i mean, to find space for a new very large object it might potentially have to go through a huge number of tiny objects in lru order, all one by one with NukeOne
[20:29:01] so for that, the storage rework might indeed help quite a bit
[20:29:40] well for that matter, the ones it finds won't necessarily be linear, either :)
[20:30:04] exactly
[20:30:05] it may find the 1000x 1K objects it needs to nuke for a 1MB object, but they may be scattered all over storage
[20:31:31] I see they're still doing that for MSE
[20:31:38] storing everything in one linear chunk
[20:32:17] you do wonder, above a certain block size you see diminishing returns
[20:32:31] less complex than managing fragmented objects of course, but still
[20:33:35] I thought one of the MSE blog posts mentioned internally classing objects by size?
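
A toy model of the nuke-one-at-a-time cost being described above (not Varnish's actual data structures): freeing space for one large object means walking the LRU and evicting small objects one by one until enough bytes are reclaimed, which is exactly the work that size-classed storage bounds. The 4KB/100MB numbers below are arbitrary:

    /* Toy model of "nuke one object at a time off the LRU" -- counts how many
     * LRU victims a single large allocation needs when the store is full of
     * small objects.  Not Varnish code; just an illustration of the cost. */
    #include <stdio.h>
    #include <stdlib.h>

    struct obj {
        size_t len;
        struct obj *next;    /* singly linked, head = least recently used */
    };

    static struct obj *lru_head;

    /* free LRU objects one by one until 'need' bytes are reclaimed;
     * returns the number of victims, or -1 if the LRU ran dry (LRU_Fail-ish) */
    static long nuke_for(size_t need)
    {
        size_t freed = 0;
        long victims = 0;

        while (freed < need) {
            struct obj *o = lru_head;
            if (o == NULL)
                return -1;
            lru_head = o->next;
            freed += o->len;
            free(o);
            victims++;
        }
        return victims;
    }

    int main(void)
    {
        /* fill the "store" with 1 million 4KB objects */
        for (int i = 0; i < 1000000; i++) {
            struct obj *o = malloc(sizeof(*o));
            if (o == NULL)
                return 1;
            o->len = 4096;
            o->next = lru_head;
            lru_head = o;
        }
        /* one 100MB miss now has to evict ~25600 small objects, one by one,
         * and in the real thing each of those is also mailed to the expiry
         * thread and their storage may be scattered (fragmentation) */
        printf("victims for a 100MB object: %ld\n", nuke_for(100u * 1024 * 1024));
        return 0;
    }
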
[20:33:37] i would even say, above a certain object size, using a modern file system might be preferable :)
[20:33:43] yeah that may be
[20:34:31] and yeah, it would be interesting to just implement a backend that translated the cache's hash value for an object to an fs path tree and used the FS "naively" on an SSD
[20:34:46] yes
[20:35:08] also notable in the 5.0 commits, now that I'm looking, is that phk has started making commits related to persist again
[20:35:19] i saw a few yeah
[20:35:24] as in, cleaning up persist and its supporting infrastructure, either to revive it from deprecation and fix it, or make a better one?
[20:35:40] perhaps part of their gradual separation from varnish software
[20:36:01] I guess on some level phk and/or various other devs must realize the persist/file/storage problem we're seeing here in the open source product
[20:36:15] so many 5.x commits point at the same basic things
[20:36:22] yes
[20:37:49] i wonder if we'll see these issues with text as well or not at all
[20:37:55] less diversity in object sizes indeed, but still
[20:37:57] big content set
[20:38:28] we've never had text running on file either, and only very briefly on upload when I saw the same issues I think
[20:40:03] the content set on text anecdotally seems much smaller than upload
[20:40:08] in bytes terms anyways
[20:40:27] we'll see when we get there. maybe some shade of the same issue, but probably less severe and simpler to resolve
[20:41:08] if they're more interested in having an open source persistent storage backend again, we may be able to put slightly more effort into fixing that up
[20:45:24] yeah
[20:45:56] I don't know what the plan really is there. back when it was deprecated, I think the wording was "keep it in place so the infrastructure around it doesn't rot, so a new one can be written later", or it sounded like that
[20:46:09] which I assume meant more like "keep the infra for MSE", but maybe it means keep the infra for an OSS replacement
[20:46:09] right
[20:46:27] or both, MSE at first, OSS now
[20:47:28] the 5.0 commits have a lot of other interesting architectural cleanups, e.g. around 304s and INM/IMS, how grace works, etc...
[20:48:13] it's probably worth experimenting with a 5.0 transition early once we have some time, once again cache_misc or maps first
[20:48:17] i wonder if Varnish Plus 5.0 is available today
[20:48:19] yes
[20:48:31] it might be very painless and help a lot
[20:48:37] or upload really, on a single box
[20:49:25] anyways, I need to stab at puppetizing the storage thing
[20:49:40] yup that seems like an improvement in any case
[20:50:02] until we have MSE anyway ;)
[20:50:34] yeah it makes sense to handle it internally in a storage allocator/manager
[20:51:04] that's what every good malloc implementation does too: set up arenas for different size-range classes first to avoid nightmare fragmentation scenarios
[20:51:09] yup
[20:51:25] but if your malloc is too dumb for that, I guess you set up your own size-classes memory pools on top :P
[20:51:28] which is why you kind of want to reuse that code instead of reinvent the wheel
[20:52:25] well that was the other mental path I went down actually, somewhere in the past few days of rambling
[20:52:32] i saw that
[20:53:02] in theory, jemalloc (and similar) could have a custom API entrypoint that uses their internal algorithms, but sets up "hey you have 320GB to use at address space X" instead of getting it from mmap/sbrk directly
[20:53:06] and that could be a big mmap'd file
[20:53:11] yes
[20:53:19] but I trawled through glibc and jemalloc and tcmalloc, none of them offer such an interface
[20:53:27] and heh, have that backed by dmcache again
[20:53:45] (probably at least in part because they're tuned around the concept of sbrk() working ok, and/or not tuned for how things work out with mmap and OS caching and access speed differentials)
[20:54:54] but still, one could extend jemalloc or similar to work reasonably for such a usecase/API
[20:55:17] and then implement a very thin variant of -smalloc for varnish that mmaps a file and uses that API
[20:55:56] but if we're going out on that kind of experiment for a simpler and better file-store backend, using the native FS is just as good an idea IMHO
[20:57:04] (the only problem there is for a non-persist backend, there's a bunch of other metadata lost on restart, so basically you have to wipe the FS on startup somehow, quickly, which is ugly)
[20:57:51] that gets back to the other anti-persist argument: persist breaks xkey on restart unless you somehow let xkey sync its data with persist
[20:58:13] right
[20:59:16] now I want to get started on a fs based varnish storage backend
[20:59:17] argh ;)
[21:02:22] as for non-persistent fs based backend
[21:02:51] just have varnish create files and rm() them immediately while it keeps all files open for the session?
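
For reference, the create-then-unlink idea in the question just above looks like the minimal sketch below: the backing file is unlinked right after creation, the process keeps only the fd (and an mmap of it), and the kernel reclaims the space automatically when the last reference goes away, including on a crash. The path and size here are placeholders:

    /* Minimal sketch of "create the file, unlink it immediately, keep it open":
     * an unlinked file's blocks live on until the last fd/mapping is gone, so
     * a non-persistent fs-backed store needs no startup wipe.  Path and size
     * below are placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/srv/sda3/varnish-store.bin";   /* placeholder */
        const off_t size = 1024 * 1024 * 1024;              /* placeholder: 1GB */

        int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* drop the directory entry right away; the inode survives via the fd */
        if (unlink(path) < 0) { perror("unlink"); return 1; }

        if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

        void *store = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (store == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... a storage backend would hand 'store' to its allocator here ... */

        munmap(store, (size_t)size);
        close(fd);
        return 0;
    }
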
[21:03:04] on varnish stop/crash, kernel cleans up automatically ;-)
[21:03:27] might be some issues with that though heh
[21:03:51] unlink() I mean of course
[21:13:24] or just declare that varnish owns the underlying device rather than the file, have the fstype and tuning args as varnish options, and mkfs on startup
[21:13:35] yeah sure
[21:15:39] that may be better as it'd allow tinkering around much easier
[21:17:16] or just run a script that does it, for additional flexibility
[21:17:30] i dunno, perhaps you even want to muck around with lvm/device mapper stuff or whatever
[21:23:06] not sure if already handled but from cronspam I can see errors like
[21:23:10] ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
[21:23:24] cp1063 and cp1072
[21:28:40] (brb)
[21:28:58] bleh
[21:29:14] _joe_: ^ ?
[21:30:25] one was ~2h ago, one was ~30m ago. both in eqiad, and I'm guessing that means those hosts missed their backend restart because of it
[21:30:40] I'll deal with it manually in a bit if nothing else
[21:46:18] <_joe_> bblack: ugh that's bad
[21:46:31] <_joe_> sigh, let me take a look
[21:47:30] _joe_: thanks!
[21:48:05] <_joe_> we had some etcd cluster shake-up around 1.5 hours ago it seems
[21:48:10] <_joe_> and some raft dropped messages
[21:48:12] <_joe_> sigh
[21:48:35] <_joe_> I have to move forward with a) upgrading that cluster b) moving auth to a proxy
[21:48:43] <_joe_> auth apparently causes issues
[21:48:48] <_joe_> perf issues
[21:49:38] <_joe_> things like Sep 19 19:08:03 conf1001 etcd[618]: found conflict at index 81090979 [existing term: 8388, conflicting term: 8389]
[21:49:41] <_joe_> Sep 19 19:08:03 conf1001 etcd[618]: replace the unstable entries from index 81090979
[21:49:45] <_joe_> are not good, heh
[21:50:37] * bblack tries to make an unsnarky comment and fails
[21:52:14] <_joe_> so trouble started at 14:25
[21:52:39] <_joe_> I think it should be stable now, but I'll check in half an hour
[21:53:12] <_joe_> bblack: on conf100X, you can just grep -v guest /var/log/etcd.log
[21:53:17] <_joe_> to see what is going on
[21:53:32] <_joe_> if you see more problems, please page me over the phone
[21:54:17] ok
[21:59:36] we never missed a restart FWIW. I'm guessing those errors don't turn out to be non-zero exitvals for confctl, so it just carried on
[21:59:48] in one case it failed one of the repools at the end
[22:00:02] in the other everything is pooled, so I can only assume it failed one of the depools before the restart
[22:09:47] <_joe_> sigh
[22:09:58] <_joe_> conftool not returning non-zero is a bug
[22:34:31] ok so cp1099 is running the storage size-class experiment live now with the 5 bins as in T145661
[22:34:32] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661
[22:34:42] and no restart cron, so we can see how it fares out past the 24h mark