[00:14:41] 10Domains, 10Traffic, 06Operations, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644754 (10Peachey88) Also, our TLD is on millions(?)/billions(?) of printed materials, it's not a decision we can take lightly. Even wi...
[10:47:43] varnish 5 was released
[10:48:00] interesting gem from the release notes:
[10:48:00] We need more money
[10:48:01] Until now Varnish Software has done a lot of work for the Varnish Cache project, but for totally valid reasons, they are scaling that back and the project either needs to pick up the slack or drop some of those activities.
[10:48:55] VSLP is merged in-tree now, called the "shard" director
[11:16:37] * elukey hopes that the logging API has not been changed
[12:32:56] 10Domains, 10Traffic, 06Operations, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2644754 (10tom29739) Why not just redirect the .wiki domains to the .org domains?
[12:56:18] 10Domains, 10Traffic, 06Operations, 06WMF-Legal: Use .wiki domains instead of .org on wiki sites owned by wikimedia foundation - https://phabricator.wikimedia.org/T145907#2645554 (10Aklapper) 05Open>03declined Declining this task as the summary says //instead of .org//. >>! In T145907#2645540, @tom297...
[13:50:42] ema: +1 on the restart cron, the actual script isn't in the commit though
[13:59:51] re: varnish5, if they haven't made big breaking changes for VCL compat, logging compat, etc... it might be possible to move from 4->5 pretty quickly
[14:02:13] the upside other than general freshness is going to be: per-anything VCLs could let us merge up clusters in a simpler way for more hw efficiency, vslp is builtin, and eventually when their H/2 support stabilizes we might be able to use it (as in have a lighter TLS-only proxy in front, if it supports all the TLS things from nginx we care about)
[14:03:31] but I don't see it fixing our fundamental sfile issues we're facing right now, so fixing that and getting past the v3/v4 syntax compat mess first takes precedence.
[14:05:55] on the mismatched objcore counts.... I donno, it still seems to hint at the idea that the expiry thread is involved in this somehow....
[14:10:06] cp1074 has been up for a while, and shows:
[14:10:30] MAIN.n_objectcore 3222945 . objectcore structs made
[14:10:33] MAIN.n_expired 1183908 . Number of expired objects
[14:10:36] MAIN.n_lru_nuked 13276053 . Number of LRU nuked objects
[14:10:39] MAIN.n_lru_moved 18851690 . Number of LRU moved objects
[14:10:45] so clearly it can't just be that we trigger as soon as the storage is full
[14:10:58] we've nuked and moved far more objects than are currently present
[14:12:24] so we're left with either a refcnt leak on the LRU list making them all busy (which eventually resolves itself somehow...)
[14:12:53] or again the specific case of the expiry thread somehow locking up the LRU list
[14:13:09] I guess it could be somehow-related to fragmentation, too
[14:20:04] I see the backend restart script was already deployed
[14:31:30] so I merged the cron thing, but fqdn_rand does a horrible job apparently, when auditing the splay results
[14:32:25] out of 39 hosts, somehow 3 of them (2x in a single DC, even) get 17:17. there's other statistical anomalies too :P
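A minimal sketch of the kind of splay audit described above, assuming the generated hour/minute pairs have already been collected from the hosts (e.g. via salt, as mentioned further down); the hostnames and times below are illustrative placeholders, not the real results:

    # Tally hosts that landed on the same hh:mm; all values here are made up.
    times = {
      'cp1071.eqiad.wmnet' => [17, 17],
      'cp2002.codfw.wmnet' => [20, 20],
      'cp2005.codfw.wmnet' => [20, 20],
      'cp3030.esams.wmnet' => [17, 17],
    }

    dups = times.group_by { |_host, hhmm| hhmm }.select { |_hhmm, entries| entries.size > 1 }
    dups.each do |hhmm, entries|
      puts format('%02d:%02d -> %s', hhmm[0], hhmm[1], entries.map(&:first).join(', '))
    end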
[14:32:55] including a surprising number that get the hour and minute set to the same numbers
[14:33:18] we'll probably have to fix it up a bit
[14:34:09] maybe use an inline ruby template to do a real hash over the hostname first (since fqdn_rand apparently doesn't), and then do the math to pick a value for 0-1439 and split to hh:mm ourselves
[14:43:12] hmmmm on further reflection: we probably can do better in ruby, but we can also do better by using fqdn_rand() better
[14:43:33] there's one fqdn_rand() output per hostname, and it's effectively mod-N for whatever the range argument is
[14:43:39] which explains why so many have hh=mm
[14:44:02] the only ones where hh differs from mm are the ones where mm>23
[14:44:26] but fqdn_rand takes an additional fixed seed argument, which can be used to fix that shortcoming...
[14:50:50] we'll see how that works out shortly
[14:51:23] in the meantime, I'd like to experiment with raising FE hitrate further in light of everything related
[14:53:50] trying s/1MB/512KB/ for now, to see if there's a significant shift and in which direction
[14:57:58] there's still a lot of obvious bias, and notably cp2002 and cp2005 got the exact same time lol
[14:58:13] but it's better than it was without the separate keys added to hh and mm
[14:58:48] clearly aside from the keying/mod-N issue, it just doesn't hash the hostnames well and therefore sometimes a 1-bit change doesn't do anything, etc...
[14:59:45] actually '2' and '5' are 3 bits apart, so I don't even know what's up with that
[15:31:06] <_joe_> fqdn_rand sends the md5 of the fqdn to https://github.com/puppetlabs/puppet/blob/ed209118fceb6433b51c88c10883cb7415b36adb/lib/puppet/util.rb#L563
[15:32:25] <_joe_> which, as far as I can see, uses https://ruby-doc.org/core-2.2.0/Random.html#method-c-new
[15:34:03] yeah I donno, there's a lot of room for whatever in how things are type-converted, etc...
[15:34:08] <_joe_> but, thinking about it, the probability 2 out of 8 machines get the same number out of 60
[15:34:16] <_joe_> it's not small enough
[15:34:47] the main diff in what I just merged is that it's picking a single deterministic random value out of 0-1439 (minutes per day) and then extracting hh:mm, instead of independently doing a 60 and a 23
[15:34:58] (even if their fqdn_rand is perfect, that's probably better)
[15:35:05] <_joe_> it is
[15:35:48] but it seems very fishy to me that with 2x separate fqdn_rand() with different secondary seeds, cp2002 and cp2005 ended up with the same pair of numbers
[15:36:31] <_joe_> actually, if fqdn_rand was absolutely perfect, you have a ~ 10% chance that two values out of 8 are the same
[15:36:31] md5's avalanche can't be that bad, and hopefully however they're truncating/folding md5 output can't be that bad either. something subtle is amiss there.
[15:37:06] independently?
[15:37:12] <_joe_> yes
[15:37:26] <_joe_> 7/60
[15:37:30] we're getting both a collision of fqdn_rand(60, 'foo') and a collision of fqdn_rand(23, 'bar') on the same host
[15:37:35] that's 7/60 chance?
[15:37:38] <_joe_> oh, no
[15:37:46] <_joe_> sorry, I didn't get that
[15:37:52] <_joe_> that is way less probably
[15:37:55] well I said that in a confusing way
[15:37:55] <_joe_> *probable
[15:38:09] <_joe_> let me calculate the probability of that
[15:38:24] x = fqdn_rand(60, 'foo'), y = fqdn_rand(23, 'bar')
[15:38:25] <_joe_> 3%
[15:38:31] and we're seeing 2x hosts with the same pair of x,y
[15:38:45] which also happen to only have one letter difference (3 bits difference in ascii) in their hostname
[15:38:45] <_joe_> out of? how many?
[15:38:50] out of 39 total hosts
[15:38:53] <_joe_> oh
[15:39:03] <_joe_> I thought from the same cluster/dc
[15:39:11] <_joe_> out of 39 the probability becomes
[15:39:13] <_joe_> let me see
[15:39:25] well there are 39 total on all DCs. the only matchup happened to be in the same DC, with very ascii-bits-similar hostnames
[15:40:33] <_joe_> well out of 39 if the random function is perfectly distributed you're almost guaranteed to have two matching pairs
[15:40:47] <_joe_> what were the two hosts? cp2002 and cp2005?
[15:41:53] <_joe_> sorry, I did my calculation wrong. the probability is 2%
[15:41:56] <_joe_> heh
[15:42:05] yeah cp2002 and cp2005
[15:42:21] anyways, with the new scheme with the single 0-1439 number, we get no accidental collisions at all
[15:42:38] there's a few cases where 2x hosts in a single DC are closer than we'd ideally like, but it's "ok" enough on average
[15:43:04] a better algorithm for what we're trying to do here, which might make an interesting puppet function....
[15:43:41] would be to randomly sort the total set of hostnames, and then perfectly-splay them within X minutes and output a split array of cron fields for that.
[15:44:17] so in this case there would be 37 minute spacing between any two nodes over the course of 24h, and the order would just be randomized, not the time value.
[15:45:00] but then you'd need the total hostname set that's applicable as input, but in this case that's easy to get, too
[15:45:27] <_joe_> confirmed this happens when trying the code on CLI
[15:45:51] <_joe_> that's because ruby's Random.new(seed).rand(max) is a hashing function
[15:45:56] <_joe_> not an RNG
[15:46:19] but Random.new(seed) expects an integer seed
[15:46:32] what's going on with getting from $fqdn to that integer seed?
[15:46:55] (whatever it is, it gets the same value for cp2002 and cp2005, which is suspicious)
[15:47:39] (and yeah, technically we don't need to use an RNG at all here, we could've just directly converted a hash output to the range 0-1439)
[15:47:39] <_joe_> it doesn't
[15:47:57] <_joe_> irb(main):003:0> seed = Digest::MD5.hexdigest(['cp2002.codfw.wmnet', '5ae39fee'].join(':')).hex
[15:48:00] <_joe_> => 193026423034824124299065232868023570016
[15:48:03] <_joe_> irb(main):004:0> seed1 = Digest::MD5.hexdigest(['cp2005.codfw.wmnet', '5ae39fee'].join(':')).hex
[15:48:06] <_joe_> => 90547549910744580087119607545599302920
[15:48:17] <_joe_> irb(main):006:0> Random.new(seed).rand(60)
[15:48:17] <_joe_> => 20
[15:48:17] <_joe_> irb(main):007:0> Random.new(seed1).rand(60)
[15:48:17] <_joe_> => 20
[15:48:27] fascinating
[15:48:30] <_joe_> they just hash to the same slot
[15:48:38] regardless of the "60"?
[15:48:42] <_joe_> that's because they called a hashing function "RAND"
[15:48:51] <_joe_> no, let me test that though
[15:49:03] because it works for 24 as well, at least
[15:49:17] <_joe_> lol yeah
[15:49:19] <_joe_> ahahahahahahah
[15:49:38] so probably even in my inline version, I'm better off not using Random() at all
[15:49:48] and just doing some truncation/module on the md5 output and using that directly
[15:49:53] *modulo
[15:50:32] which to_i already basically does for me
[15:51:26] <_joe_> irb(main):015:0> (1..100).each do |n|
[15:51:26] <_joe_> irb(main):016:1* if Random.new(seed).rand(n) == Random.new(seed1).rand(n)
[15:51:29] <_joe_> irb(main):017:2> puts n
[15:51:32] <_joe_> irb(main):018:2> end
[15:51:34] <_joe_> irb(main):019:1> end
[15:51:36] <_joe_> this prints out every number from 1 to 72
[15:51:42] <_joe_> LOLWUT
[15:51:54] <_joe_> ahahah I <3 ruby
[15:52:42] <_joe_> so it's not ALWAYS equal
[15:53:15] <_joe_> now if I didn't have more self esteem than that, I'd go look at how this "Random" class is implemented
[15:53:21] require 'digest/md5'; x = Digest::MD5.hexdigest('fqdn').to_i(16) % 1440; hh = x / 60; mm = x % 60; print hh.to_s() + ':' + mm.to_s();
[15:53:26] ^ basically that would be better
[15:53:41] <_joe_> prolly yes
[15:53:42] <_joe_> :P
[15:53:51] <_joe_> surely
[15:54:32] yeah a nice TODO would be to make a puppet parser func around that, which takes an extra seed on top of the fqdn like fqdn_rand does, and has as inputs the total minute range, and outputs an array of all the cron fields
[15:54:53] then it can be used to set up e.g. hourly, or daily, or weekly, etc
[15:56:10] <_joe_> https://dpaste.de/vm8U
[15:57:53] heh
[16:01:11] <_joe_> try it with the compiler, it should give you an idea of how good the spacing is
[16:02:20] it's simpler to just push it and salt it and gather the outputs with salt :)
[16:02:44] if it's messed up, I can further fix/revert before too many of them run regardless
[16:06:38] the numerical space is so small, there's closer-than-ideal spacing within a DC no matter what, randomly
[16:07:08] the outputs with just md5%1440 seem a little better overall, but randomly produce a couple more troubling edge cases than the previous one heh
[16:07:44] the best would be to pursue the other path of less-randomness
[16:08:37] and explicitly interleave DCs, too
[16:08:49] it's hard to generalize the interleaving part, though
[16:09:33] order them as cp1x, cp2x, cp3x, cp4x, cp1y, cp2y, cp3y, cp4y, .... (some DCs will run out before others but whatever)
[16:10:01] and then just explicitly space them out by ~1440/39 minutes each
[16:11:47] for now I'm just going to revert the latest "improvement". it's theoretically-better, but the randomly-bad edge cases turned out worse
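A sketch of what the one-liner above expands to, and roughly what the proposed puppet parser function would wrap: md5 over the fqdn plus a caller-supplied seed (joined with ':' the way fqdn_rand does), reduced to a minute-of-day and split into cron hour/minute fields. The helper name and seed string here are illustrative, not existing code:

    require 'digest/md5'

    # Hash fqdn + seed, reduce to 0..(range-1) minutes of the day, split to hh:mm.
    def cron_splay(fqdn, seed, range_minutes = 1440)
      x = Digest::MD5.hexdigest("#{fqdn}:#{seed}").to_i(16) % range_minutes
      [x / 60, x % 60]
    end

    hh, mm = cron_splay('cp2002.codfw.wmnet', 'varnish-backend-restart')
    puts format('%02d:%02d', hh, mm)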
[16:20:27] <_joe_> so, a discussion with a puppet dev in #puppet poked me into looking into the whole thing better
[16:20:36] <_joe_> the answer is puppetlabs did it again
[16:20:45] <_joe_> reading into ruby's own code, the issue is Puppet::Utils.deterministic_random just uses the initial random seed, basically, as it calls Random.new.rand which, as expected, just performs hashing, no real randomness
[16:20:56] <_joe_> which is what puppet needs, but apparently they (reasonably) don't care about having an evenly distributed hashing
[16:21:05] <_joe_> where they == Ruby devs
[16:21:26] <_joe_> so they went from a reasonably evenly distributed hashing function like md5
[16:21:39] <_joe_> to some lame-ass hashing provided by Random.new.rand
[16:21:54] <_joe_> because the first random call gives you just the result of that hashing
[16:22:12] <_joe_> while successive calls get you a result from the RNG with that seed
[16:22:14] <_joe_> wtf
[16:22:48] <_joe_> "If you use Random.new, you should thus call it as rarely as possible, for example once as MyApp::Random = Random.new and use it everywhere else. Creating a new random generator each time you want a random number is a really bad idea. If you do this, you will get the random properties of the initial seeding algorithm which are atrocious compared to the properties of the random generator itself."
[16:22:54] <_joe_> ahah
[16:23:09] <_joe_> can you believe them?
[16:25:58] <_joe_> https://patchwork.ozlabs.org/patch/671069/
[16:33:26] :)
[19:11:14] ok fixed "good enough" for this use-case anyways, and hopefully reusable
[19:13:10] whatever you have fixed, I guess: thank you :]
[19:15:28] the only minor issue I'm aware of is the DC interleave order is slightly strange in unexpected ways, because of the oddball form of the code that interleaves the arrays
[19:15:40] (it ends up putting cp3 "first" in this case, because it's the longest subset)
[19:24:21] so far I don't see any big uptick in front hit rate from 1MB->512KB, so probably revert that later, maybe experiment with fine-tuning back the other direction too
[19:24:42] (maybe the ideal value is 2MB, or who knows)
[19:25:11] we can probably still pick up a big win from the 2-hit-wonder thing too, but it's easier to play with the size cutoff in isolation before that goes in
[19:25:32] off for a bit
[20:56:34] ema: another hint: check out exp_mailed vs exp_received in varnishstat. it's actually falling behind on the node I'm looking at now (cp1074)
[20:56:56] like it's not processing the expired objects that have been mailed to the expiry thread fast enough
[21:04:37] looking at the cache_expire.c code now to see if there's a dumb way to fix that, but so far it looks like io perf + object rates == expiry thread simply can't move fast enough
[21:09:30] there are three kinds of "sleep" that can happen in the expiry thread (where "sleep" really means wait on a new inbox item to appear, for at most N time before taking some other action like looking for expired items)
[21:09:39] (with a condwait)
[21:10:20] one is a fixed sleep for 1ms, one is a fixed sleep for pi seconds, and one is apparently based on when some object it's looking at is due to expire in the future, which seems a little crazy
[21:12:51] a fixed sleep for pi seconds? O_O
[21:12:58] that looks like an April fools feature :P
[21:13:24] ah and here's the magic: when an object is mailed to the expiry thread, its refcnt is incremented, to be decremented when the expiry thread receives and deals with it
[21:13:35] so, this paints a clearer picture that matches what we see
[21:14:31] for whatever reason, exp_mailed outpaces exp_received (expiry thread falling behind on its inbox). eventually the amount it's behind by grows to most or all of the objects on the LRU. thus they all have a reference (from the mailbox), so EXP_NukeOne can't free any object from the LRU to store a new one
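A rough sketch of the check being described: read exp_mailed and exp_received out of varnishstat and report how far the expiry thread's mailbox has fallen behind. It assumes the usual one-shot "NAME VALUE RATE description" output of `varnishstat -1`, run against the backend instance being examined:

    # Compute the expiry-thread mailbox backlog from varnishstat counters.
    counters = {}
    `varnishstat -1`.each_line do |line|
      name, value = line.split
      counters[name] = value.to_i if name =~ /\AMAIN\.exp_(mailed|received)\z/
    end

    backlog = counters['MAIN.exp_mailed'] - counters['MAIN.exp_received']
    puts "expiry mailbox backlog: #{backlog} objects"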
[21:15:28] while I was looking earlier, cp1074 was behind by ~500K objects, but it eventually caught back up
[21:15:37] (it's now in the ~1K range behind)
[21:16:00] the trigger for the condition is simply falling much further behind on that inbox
[21:16:24] no doubt storage fragmentation and io perf (which is in turn affected by storage fragmentation) plays a role in falling behind
[21:16:45] possibly some of the magic hardcoded tuning of the expiry thread (especially those fixed condwait times) is a factor too
[21:17:38] return (now + 355./113.);
[21:17:48] Platonides: ^ the pi sleep :)
[21:18:12] but supposedly any of these "sleeps" are conditional on the mailbox. if a new mailbox item arrives it should get woken up.
[21:18:27] but then again, it only gets woken once per mailed item...
[21:18:59] but also, if it processes a mailed item, it loops relatively-tightly until the mailbox is empty
[21:19:13] still, some kind of magic there can possibly be improved to prevent falling behind on the mailbox
[21:19:25] bblack: /13
[21:19:34] bblack: there seems to be a lot of magic there already
[21:19:45] yeah :)
[21:20:14] I'm tempted to just ignore for now figuring out the deep details of how the scenario arises, and just reduce/cap the possible condwait sleeps
[21:20:53] could cut the pi one to pi/10, and could cap the oc->exp_when one to the same pi/10 value
[21:21:04] it may be that occasional overlong sleeps are what lets it build up
[21:21:17] I donno
[21:21:34] why doesn't the expiry thread wake up on receiving a mail?
[21:21:41] it does supposedly
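A very rough Ruby model of the mailbox/wakeup behaviour described above (the real code is C in cache_expire.c; names and the timeout value here are illustrative, not Varnish's): io threads mail objects and signal a condvar once per item, and the single expiry thread drains its inbox tightly before waiting again with a bounded timeout (in the real code that bound is 1ms, pi seconds, or the next object's expiry time, per the discussion above):

    require 'thread'

    mailbox = []
    mtx = Mutex.new
    cv  = ConditionVariable.new

    # Producer side (an io thread mailing an object): push and signal once.
    mail_object = lambda do |obj|
      mtx.synchronize { mailbox << obj; cv.signal }  # one wakeup per mailed item
    end

    # Expiry side: drain the inbox, then wait until signalled or timed out.
    expiry_pass = lambda do |timeout|
      mtx.synchronize do
        until mailbox.empty?
          obj = mailbox.shift
          # ... here the real thread would drop obj's mailbox ref and expire/nuke it if due ...
        end
        cv.wait(mtx, timeout)  # woken early if something new is mailed
      end
    end

    mail_object.call(:some_object)
    expiry_pass.call(0.001)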
[21:22:21] it's possible that all of the expiry thread code works "right" even for our case, but it's simply falling behind (as in, one expiry thread cannot keep up with the inbox workload, period, due to io and/or fragmentation)
[21:22:40] I'm just hoping there's some other inefficiency there for an easier win
[21:24:08] the one that returns oc->exp_when is especially suspicious, but I can see how it might make sense
[21:24:48] (I think the idea there is if we tried to expire the oldest object on the LRU and it's not ready yet, we don't need to wake up until either an inbox item arrives, or until the possibly far future time that said oldest LRU item finally expires)
[21:26:17] it's also possible that the expiry thread is normally able to keep up fairly well, and the trigger is a rare event/object that takes the expiry thread a very long time to process
[21:26:29] like, say, expiring a 1GB+ object from storage
[21:27:01] and during the whole ridiculously-long time it spends deleting that one object, the mailbox gets so full that all or almost-all objects in the LRU are "busy" in the mailbox
[21:28:27] we've seen the condition persist for an hour-ish, but that doesn't necessarily mean 1h to remove 1 huge item, either
[21:29:39] it could be that under normal conditions the expiry thread is only barely able to keep up, and thus after a relatively-short event deleting a 1GB object, it takes quite a while to get the inbox back to near-zero, and thus much smaller deletion events can perpetuate that state for a while...
[21:29:52] I donno
[21:30:31] 500 i/o threads and 1 expiry thread though. it could simply fall behind naturally with such large storage
[21:32:46] it looks like cache_expire.c code might be structured to allow for more than 1 expiry thread, maybe
[21:32:55] we'd probably be in untested territory at that point though
[21:35:46] it's close anyways. maybe a few little fixups.
[21:36:52] could go further and have N expire mailboxes. io threads could efficiently round-robin which inbox they drop requests into, and one thread consuming each inbox
[21:41:35] unfortunately ganglia's logging exp_mailed and exp_received as rates, not absolutes
[21:42:06] which is kinda fine, except it doesn't let us easily go back and see the point-in-time value of exp_mailed - exp_received, which I suspect is near the same size as the whole LRU when the condition hits
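The "one slow deletion" theory above is easy to sanity-check with back-of-envelope arithmetic; both inputs below are made-up illustrative figures, not measurements:

    # If objects are mailed to the expiry thread at ~2000/s and one huge
    # deletion stalls it for ~5 minutes, the inbox grows by ~600K entries,
    # comparable to the ~500K backlog observed on cp1074 earlier.
    mail_rate  = 2000      # objects/sec mailed to the expiry thread (made up)
    stall_secs = 5 * 60    # time spent deleting one huge object (made up)
    backlog    = mail_rate * stall_secs
    puts backlog           # => 600000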