[09:03:49] so, pybal. By writing unit tests for https://gerrit.wikimedia.org/r/#/c/403677/ (use pooled-and-up servers in can-depool) I've realised that our (my?) current understanding of thresholds does not reflect reality :)
[09:04:16] in the case of swift frontends, we've got 4 servers in total and the depool threshold is 0.5
[09:05:05] one would be tempted to think that the meaning of that is: pybal will make sure that minimum 2 servers are always serving traffic
[09:06:02] now, by not taking (administratively) depooled servers into account that was obviously false, and the patch above should fix that part
[09:07:10] with that being fixed though, the can-depool logic is: if the number of servers serving traffic is greater *or equal* than `all servers * threshold`, then we can depool
[09:08:15] which means, in the swift-fe case for instance: (2 servers serving traffic) >= (4 * 0.5), go ahead, depool
[09:08:44] and we end up with only one server serving traffic, which in the swift-fe case is not ok to my understanding
[09:12:06] it would be great to have a common understanding of what we mean with depool-threshold, and then make sure the code reflects that
[09:12:31] at the moment there are 3 different definitions in pybal's source code
[09:12:34] > Threshold of up servers vs total servers below which pybal can't depool any more
[09:12:41] (up servers)
[09:12:48] > The total amount of pooled servers may never drop below a configured threshold
[09:12:52] (pooled servers)
[09:13:05] > the threshold below which no more down servers will be depooled
[09:13:09] (down servers)
[09:16:12] personally, I'd use "servers serving user traffic" instead of 'up/pooled/(not) down' which is easier to understand I think
[09:20:21] or "hosts serving traffic" which might sound a bit better :)
[09:20:49] and, I think the >= should be a strict > in canDepool
[09:21:12] so that we refuse depooling a host if only 2/4 are serving traffic and the threshold is .5
[09:24:03] <_joe_> ema: I will read what you wrote, but quite a few things are based on the way pybal behaves currently
[09:25:35] <_joe_> ema: but i generally agree
[09:28:57] _joe_: the point is that I'm not sure we all agree on "the way pybal behaves currently" :)
[09:30:32] <_joe_> ema: I think your patch doesn't do a desirable thing.
[09:32:03] <_joe_> so, what we want, as you stated, is that at least N servers in a pool keep serving traffic. That means that out of all the servers in the pool (including the ones that are pooled = False in pybal), at least a percentage will not be depooled
[09:32:54] <_joe_> your patch would make it so that a percentage of the currently logically pooled ones will not be depooled
[09:33:33] <_joe_> which means the actual number of servers will vary with our logical pooling/depooling
[09:34:25] logical pooling/depooling meaning "hosts pooled/depooled in etcd", right?
[09:34:30] <_joe_> yes
[09:34:33] ok
[09:34:42] <_joe_> so yeah let's decide on a terminology
[09:34:51] <_joe_> we have three logical state switches
[09:35:06] <_joe_> up/down => status in etcd
[09:35:17] <_joe_> sorry
[09:35:23] you see? :)
[09:35:26] <_joe_> up/down => status in monitoring
[09:35:37] <_joe_> enabled/disabled => status in etcd
[09:35:51] <_joe_> and pooled/depooled => status in ipvs
[09:35:53] <_joe_> right?
[09:35:59] yes!
[09:36:39] <_joe_> so what we want is that of all the N servers in the pool, at least M% are always in status 'pooled'
[09:36:43] <_joe_> right?
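
A minimal sketch of the can-depool comparison discussed above, using the swift-fe numbers from the conversation. This is not pybal's code; the function and argument names are invented for illustration, and only the >= vs strict > distinction comes from the log.

    def can_depool(serving, total, threshold, strict=False):
        """Return True if one more server may be depooled."""
        limit = total * threshold
        return serving > limit if strict else serving >= limit

    # swift-fe example: 4 servers, depool threshold 0.5, 2 still serving traffic
    print(can_depool(serving=2, total=4, threshold=0.5))               # True: >= allows dropping to 1 server
    print(can_depool(serving=2, total=4, threshold=0.5, strict=True))  # False: > keeps 2 servers serving
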
[09:36:54] <_joe_> that is, serving traffic
[09:37:00] correc
[09:37:03] correct
[09:37:21] <_joe_> so when we want to depool a server, we should first check if there are enough pooled. If not, we should not depool
[09:37:54] <_joe_> then, whenever another server gets pooled, we want to depool any server that is down and/or disabled but pooled, correct?
[09:38:19] indeed
[09:38:33] <_joe_> ok, your patch doesn't do that
[09:38:38] <_joe_> at first sight
[09:38:50] <_joe_> you do if server.up and server.pooled
[09:39:02] <_joe_> while it should just count server.pooled
[09:39:41] <_joe_> but this is very delicate, I tried to touch that in the past with bad results
[09:39:49] <_joe_> we should build tests for this behaviour
[09:40:55] agreed, looking at server.pooled=True seems the right behavior
[09:41:04] <_joe_> seems :)
[09:41:30] <_joe_> it's pretty tricky imho
[09:41:53] thoughts on the >= vs > part?
[09:42:23] <_joe_> ok that's tricky again
[09:42:39] <_joe_> say you have a pool of 2 servers
[09:43:25] <_joe_> if you put in 0.5, and both go down, what will happen with >= and with >?
[09:44:16] <_joe_> and what if one goes down?
[09:44:29] <_joe_> if one goes down, with > you don't depool it
[09:44:45] <_joe_> with the current logic, I mean
[09:44:57] <_joe_> because downservers = 1, servers = 2
[09:45:15] <_joe_> 1 is not > than 2*0.5
[09:46:06] the logic currently deployed uses >=, meaning that we *do* depool the one host going down
[09:46:10] <_joe_> if you have 4 servers, and 3 go down, you should be able to depool 2 with >= ( 2 >= 4*0.5), 1 with 0.5
[09:46:27] <_joe_> sorry, 1 with >
[09:47:18] <_joe_> the problem in the current logic in https://gerrit.wikimedia.org/r/#/c/403677/2/pybal/coordinator.py seems to be that it uses server.up and not server.pooled
[09:47:40] <_joe_> so it actively disregards the state disabled/up/not pooled
[09:47:44] <_joe_> in its count
[09:48:31] <_joe_> I will take a re-look at the code and do a code review
[09:48:35] <_joe_> it's friday after all
[09:48:38] yes and I think that you're right when you say that server.pooled is the only state we should look at when counting pooled servers (it sounds so obvious stated like this!)
[09:49:25] <_joe_> so I think the problem we had yesterday was that we had a pool of 4 servers, with 2 disabled/up/not pooled
[09:49:32] <_joe_> and 1 went down
[09:49:43] <_joe_> since we had 3 servers up, it was depooled
[09:49:49] <_joe_> leaving 1 server in the pool
[09:50:11] <_joe_> that went down, ofc, and since we have >=, it was depooled too (we still had 2 "up" servers)
[09:50:26] the problem was indeed due to the fact that we were using server.up (monitoring status) instead of server.pooled (ipvs)
[09:51:06] so we ended up with 0 hosts serving traffic, ugh
[09:53:25] <_joe_> maybe :)
[09:54:22] <_joe_> but we need to check the logs to be sure
[09:55:11] well, surely we did end up without hosts pooled in IPVS, I've got proof :)
[09:55:13] <_joe_> so, what is the time when that happened?
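
A rough simulation of the incident scenario just described: 4 servers, threshold 0.5, two of them disabled in etcd but still up. This is illustrative only, not pybal code; the point is that counting "up" servers in the threshold check lets the pool drain to zero, while counting "pooled" servers stops at one.

    class Server:
        def __init__(self, name, enabled=True, up=True):
            self.name, self.enabled, self.up = name, enabled, up
            self.pooled = enabled and up  # initial state

    def can_depool(servers, threshold, count_attr):
        serving = sum(1 for s in servers if getattr(s, count_attr))
        return serving >= len(servers) * threshold

    def mark_down(servers, name, threshold, count_attr):
        s = next(x for x in servers if x.name == name)
        s.up = False
        if s.pooled and can_depool(servers, threshold, count_attr):
            s.pooled = False  # i.e. actually removed from ipvs

    for attr in ("up", "pooled"):
        pool = [Server("A"), Server("B"),
                Server("C", enabled=False), Server("D", enabled=False)]
        mark_down(pool, "A", 0.5, attr)
        mark_down(pool, "B", 0.5, attr)
        print("counting %s: %d server(s) left pooled" % (attr, sum(s.pooled for s in pool)))

Counting "up" ends with 0 servers pooled; counting "pooled" ends with 1. Note that even when counting pooled, >= still allows the first depool at 2 >= 2, which is exactly the >= vs > question raised earlier.
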
[09:55:14] https://grafana.wikimedia.org/dashboard/db/pybal?panelId=1&fullscreen&orgId=1&from=1515594444777&to=1515608866088
[09:55:22] <_joe_> ema: ok
[09:55:34] <_joe_> let me write down what happened and note it in the task
[09:56:04] <_joe_> actually the task says the right thing
[09:56:13] <_joe_> well, almost
[09:56:45] _joe_: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180110-swift
[09:59:45] let me amend the patch to only use server.pooled, I'll add a few tests leaving >= as it is
[10:03:26] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3893173 (10Joe) I don't think this is entirely correct. Pybal has 3 states: # `enabled/disabled`: the logical state in etcd, so `pooled=yes` or...
[10:03:45] <_joe_> ema: one thing I'm not sure of is -
[10:03:53] <_joe_> say you depool a server via etcd
[10:04:03] <_joe_> but when you do that, it cannot be depooled
[10:04:15] <_joe_> pybal will refuse to depool it, correctly
[10:04:49] <_joe_> but say at a later time one of the servers that were depooled comes up and is pooled again, will pybal then automatically depool the other server?
[10:04:55] <_joe_> I have to recheck the code
[10:05:07] <_joe_> I think the answer is *yes* but let's check
[10:07:46] <_joe_> ema: ouch
[10:07:46] I actually think that the canDepool logic is only triggered by changes due to monitoring status, not to enabled/disabled actions
[10:07:58] <_joe_> are you sure?
[10:08:02] nope! :)
[10:08:12] <_joe_> anyways, I found something that makes our idea bogus
[10:08:40] <_joe_> see coordinator.Coordinator.refreshModifiedServers
[10:08:54] <_joe_> server.pooled = server.enabled and server.up
[10:09:07] mmmh
[10:09:10] <_joe_> so that is *not* the ipvs state
[10:09:17] <_joe_> it's just a compound variable
[10:09:18] no bueno
[10:09:43] <_joe_> you need the 2.0 code that has logical handling of ipvs
[10:09:48] <_joe_> to do what we want to do
[10:09:56] <_joe_> we should work on 2.0 :P
[10:10:20] yes we do need to know what the actual ipvs state is
[10:13:43] yeah so `canDepool()` is only called by `depool` and `repool`, which are called by `resultDown()` and `resultUp()` respectively
[10:14:24] it does not look like we call canDepool when changing server.enabled (etcd)
[10:14:32] <_joe_> yes
[10:14:52] <_joe_> what we do is we merge the config changes
[10:15:00] <_joe_> then we reinitialize all servers
[10:15:07] <_joe_> with self.assignServers
[10:15:33] <_joe_> so by this logic, a logically depooled server will always be depooled, it seems. Uhm
[10:15:51] <_joe_> not what I expected
[10:17:11] <_joe_> uhm
[10:17:40] <_joe_> yeah we need better control on ipvs to do this properly
[10:18:06] <_joe_> still
[10:18:41] <_joe_> say we use server.pooled which is = (server.up and server.enabled)
[10:18:54] so the two issues are (a) T184715 (b) we should check if hosts can be depooled when disabling them via etcd
[10:18:54] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715
[10:19:31] <_joe_> ema: did you test this in any way?
[10:19:51] <_joe_> maybe fire up a pool on pybal-test2001 and try to logically depool all the servers in it?
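
A simplified paraphrase of the 1.x control flow as just described in the conversation, not the actual pybal source; any name not mentioned above is invented for the sketch. The point it illustrates: canDepool() only gates the monitoring path (resultDown/resultUp -> depool/repool), while the etcd/config path rewrites `pooled` with no threshold check at all.

    class Server:
        def __init__(self, name, enabled=True, up=True):
            self.name, self.enabled, self.up = name, enabled, up
            self.pooled = enabled and up

    class Coordinator:
        def __init__(self, servers, depool_threshold=0.5):
            self.servers = servers
            self.depool_threshold = depool_threshold

        # monitoring path: threshold is consulted
        def resultDown(self, server):
            server.up = False
            self.depool(server)

        def resultUp(self, server):
            server.up = True
            server.pooled = True          # repool (simplified)

        def depool(self, server):
            if server.pooled and self.canDepool():
                server.pooled = False     # i.e. remove from ipvs

        def canDepool(self):
            # the contested check: counts server.up and uses >=
            up = sum(1 for s in self.servers if s.up)
            return up >= len(self.servers) * self.depool_threshold

        # etcd/config path: no threshold check
        def onConfigUpdate(self):
            for s in self.servers:
                s.pooled = s.enabled and s.up   # refreshModifiedServers()
            # ...assignServers() would then sync ipvs to these flags
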
[10:20:20] <_joe_> of course you'll need to create the data on etcd
[10:20:56] I'd start by tackling (a) first which seems like the most pressing issue at the moment
[10:21:18] and then yes, for (b) we should use fake etcd data and pybal-test2001
[10:23:28] <_joe_> so a, rephrasing it, is "we cannot count the logically depooled servers as up when deciding if we need to pool or depool a server for monitoring"
[10:23:31] <_joe_> right?
[10:23:52] yes
[10:24:32] or, we cannot only rely on monitoring status when deciding if we've got enough pooled servers
[10:30:40] so back to the >= part
[10:30:55] len(pooledServers) = 1
[10:31:02] len(self.servers) = 2
[10:31:08] threshold = 0.5
[10:31:27] 1 >= 2 * 0.5
[10:31:37] canDepool says True, we now have 0 hosts pooled
[10:35:24] <_joe_> no
[10:36:01] <_joe_> in the current logic, since your server is already marked as down by the monitoring
[10:36:07] <_joe_> you'll have downservers = 2
[10:37:50] <_joe_> so that's why it uses server.up and not server.pooled
[10:38:00] <_joe_> in resultDown, you do the following
[10:38:07] <_joe_> 1 - set up=false
[10:39:50] <_joe_> 2 - call self.depool
[10:40:16] <_joe_> which in turn first checks that server.pooled = True
[10:40:19] <_joe_> aaarghhh
[10:40:20] so maybe we should check for server.up and server.pooled after all
[10:40:42] <_joe_> server.up and server.enabled
[10:41:11] <_joe_> I need to create a diagram
[10:41:19] and hope that what that means is actually "pooled in ipvs"
[10:41:21] <_joe_> and think better about this
[10:41:28] <_joe_> told you it was tricky
[10:41:32] yup
[10:45:18] another option, which is the first that came to mind and is still on T184715, is to use `totServers - down - disabled`
[10:45:19] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715
[10:47:08] disabled in etcd, that is
[10:47:19] <_joe_> ema: I think so
[11:18:12] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3896429 (10ema) >>! In T184715#3896284, @Joe wrote: > `pooled/not pooled`: the status of the server in the ipvs pool As we've found out today, `p...
[12:03:04] "As we've found out today, pooled is actually defined as enabled and up. This should translate to "pooled in ipvs", but there's no guarantee about that."
[12:03:07] why is there no guarantee about that?
[12:03:14] other than ipvs state getting desynced from pybal
[12:09:41] no other reason, but we know that it can happen (eg: T169765)
[12:09:41] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765
[12:10:31] so it's not strictly correct to say that `enabled and up` is the same as `pooled in ipvs`
[12:17:26] different ticket I think?
[12:18:51] oh yes sorry, T134893
[12:18:51] T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893
[12:19:10] well definitely pybal intends 'enabled and up' to be the same as 'pooled in ipvs'
[12:19:29] so that assumption can be held in the rest of the code, and any violations should be fixed instead
[12:19:35] of course that's much more solid with the 2.0 work
[12:19:47] as in 1.0 pybal is effectively working blindly
[12:20:30] you could even say, if desync happens that's violation of an assertion and it would probably be better to restart pybal entirely at that point (currently)
[12:24:45] right, which essentially is what we do (manually) if check_pybal_ipvs_diff alerts
[12:26:02] I have been staring at canDepool too long today, clearly
[12:26:18] I'm starting to think that we also need to know *which* host we want to depool
[12:27:57] going for lunch seems like a good idea at this point though :)
[12:42:46] :)
[13:49:03] <_joe_> did pasta bring you enlightenment, ema?
[13:50:03] * mark advised against pizza/pasta for dinner in SF
[13:50:15] we wouldn't hear the end of it
[13:53:05] lololol
[14:12:13] <_joe_> ahah
[14:12:43] <_joe_> you know there is a twitter account that follows and reposts all the angry comments of italians about pasta recipes around the internet?
[14:12:50] <_joe_> it was pretty hilarious :P
[14:12:54] i did not know
[14:15:09] oh
[14:15:19] but it was a Michelin Recommended place
[14:15:31] but what do the french know about italian food right
[14:16:44] <_joe_> lol
[14:18:23] <_joe_> actually, I'd eat pasta in any high-level restaurant around the world. It's the cheap italian restaurants I fear and stay away from :)
[14:24:59] _joe_: I thought pasta would have helped, but jenkins does not seem convinced
[14:25:26] it's pasta with german ingredients right
[14:29:52] I think regardless of 1.0/2.0, the problem is a lack of distinct state-tracking, causing overloading of the meaning of some of the states. Ideally there should be: up/down reflects real monitored status (regardless of thresholds), enabled/disabled reflects real etcd status (regardless of thresholds), and pooled/depooled reflects ipvs actions actually taken (which is subject to thresholding).
[14:30:36] and then when etcd or monitoring trigger a state change in the first two that's towards goodness, if a backend becomes up+enabled but is currently depooled, you pool it.
[14:31:03] that is how state is tracked
[14:31:08] (and if that pooling brings us back above a threshold, you scan for other backends that are down|disabled but still pooled due to threshold and kill one of those)
[14:31:56] and when etcd or monitoring triggers a downwards state change, it's recorded (down, or disabled), but "depooled" is only set if we're able to depool due to thresholds.
[14:32:55] they're 3 distinct states, none of them can be inferred from the current state of the other two, between up/down, enabled/disabled, pooled/depooled.
[14:33:24] (aside from the possibility of loss-of-sync with etcd and/or ipvs, which is a whole separate issue. Above I'm assuming we never lose sync with reality in that sense)
[14:35:05] 10:08 < _joe_> see coordinator.Coordinator.refreshModifiedServers
[14:35:07] 10:08 < _joe_> server.pooled = server.enabled and server.up
[14:35:30] ^ is the conversational evidence that the model I'm describing is not the current one.
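
A sketch of the model bblack describes above, under the same assumption he states (pybal never loses sync with ipvs or etcd): up and enabled are recorded unconditionally, is_pooled only changes when an ipvs action is actually taken, and a repool triggers a scan for servers that are still pooled only because of the threshold. All names are illustrative; this is not pybal 2.0 code.

    class Backend:
        def __init__(self, name):
            self.name = name
            self.up = True         # monitoring state, recorded as-is
            self.enabled = True    # etcd state, recorded as-is
            self.is_pooled = True  # ipvs actions actually taken

        @property
        def should_be_pooled(self):
            return self.up and self.enabled

    class Pool:
        def __init__(self, backends, threshold):
            self.backends, self.threshold = backends, threshold

        def _can_depool(self):
            pooled = sum(1 for b in self.backends if b.is_pooled)
            return pooled - 1 >= len(self.backends) * self.threshold

        def on_state_change(self, backend):
            if backend.should_be_pooled and not backend.is_pooled:
                backend.is_pooled = True       # ipvs: add back
                self._depool_pending()         # maybe there is now room to depool others
            elif not backend.should_be_pooled and backend.is_pooled:
                if self._can_depool():
                    backend.is_pooled = False  # ipvs: remove
                # else: keep it pooled; the threshold forbids removal

        def _depool_pending(self):
            for b in self.backends:
                if b.is_pooled and not b.should_be_pooled and self._can_depool():
                    b.is_pooled = False

Note that _can_depool here checks that the pool stays at or above the threshold after the removal, i.e. the stricter reading of the threshold argued for earlier in the conversation.
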
[14:37:59] <_joe_> bblack: let's say the three are mostly independent in the code but there are a few critical points. My work on 2.0 and creating basically 3 state machines that can be moved from state to state independently had as one of its goals to make the separation of state-tracking clearer
[14:38:57] basically, the 1.0 code's meaning of "pooled" is overloaded. Sometimes it means "is it actually pooled in ipvs?" and sometimes it means "*Should* it be pooled in ipvs, if thresholding allows?"
[14:39:02] <_joe_> bblack: in practice, after we call refreshModifiedServers() we also call Coordinator._serverInitDone, that dispatches the commands to the servers in the pool
[14:39:26] <_joe_> bblack: no, the meaning is "it should be pooled/depooled in ipvs"
[14:39:50] I don't mean the intended meaning, I mean the actual code's behavior, what's causing the current bug(s)
[14:40:24] joe is right
[14:40:27] <_joe_> the current bug is a bug in how we count how many servers are down
[14:40:37] meanwhile I've pleased jenkins, reviews welcome https://gerrit.wikimedia.org/r/#/c/403677/
[14:40:58] but threshold is not taken into effect with server list refreshes
[14:41:06] <_joe_> that too
[14:41:39] <_joe_> mark: also in the current count "disabled/up/not pooled" is counted as UP in the depool threshold logic
[14:41:51] that was a feature
[14:41:56] but it no longer makes sense with etcd
[14:42:01] <_joe_> right
[14:42:14] there was always supposed to be a difference between commenting servers and setting them down
[14:42:22] (commenting of course making them completely invisible)
[14:42:24] <_joe_> and that's ok
[14:42:29] but i agree that it's now just confusing and unhelpful
[14:42:30] <_joe_> we have the same
[14:42:55] <_joe_> so servers that are disabled can remain in the pool if too many are down, in my mind
[14:43:10] <_joe_> basically the depool threshold should work both for the thundering herd case
[14:43:22] <_joe_> and for the "opsen shooting themselves in the foot" case
[14:43:55] <_joe_> as well for the "opsen depooling server just when a traffic surge happens", which is basically what happened here
[14:47:04] what about the pooledDownServers list?
[14:47:18] is that just pooled+down, or pooled+(down|disabled)?
[14:52:15] hang on I'm still trying to understand whether pooled_in_ipvs iff 'server.up and server.enabled'
[14:52:27] that does not seem to be the case if we reach the depooling threshold, right?
[14:53:03] monitoring sets server.up = False if a host goes down, yet it's not depooled from ipvs because threshold
[14:54:01] it's not true on the downwards side of things
[14:54:02] <_joe_> ema: but it doesn't set pooled=False
[14:54:23] but I think it is (or should be?) true that if server.up and server.enabled, the server is pooled.
[14:54:36] # P1: up => pooled \/ !enabled \/ !ready
[14:54:36] # P2: pooled => up \/ !canDepool
[14:55:02] <_joe_> it's more complex than an on/off problem, but now I really have to wrap up this week's work
[14:55:21] those are the invariants the original code tried to maintain
[14:55:30] and
[14:55:30] # P0: pooled => enabled /\ ready
[14:56:52] but then 'up and enabled' is not the same thing as 'pooled in ipvs'. It just implies it, but 'pooled in ipvs' does not imply 'up and enabled'
[14:57:08] correct
[14:57:27] if a server is up and enabled, it should be pooled. but just because a server is pooled does not mean it's up and enabled.
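
The three invariants quoted above, spelled out as plain boolean checks. Purely illustrative: `ready` and `can_depool` are taken here as given flags rather than computed from anything.

    def holds_p0(pooled, enabled, ready):
        # P0: pooled => enabled /\ ready
        return (not pooled) or (enabled and ready)

    def holds_p1(up, pooled, enabled, ready):
        # P1: up => pooled \/ !enabled \/ !ready
        return (not up) or pooled or (not enabled) or (not ready)

    def holds_p2(pooled, up, can_depool):
        # P2: pooled => up \/ !canDepool
        # i.e. a pooled server is either up, or is only still pooled because
        # the depool threshold currently forbids removing it
        return (not pooled) or up or (not can_depool)

P2 is exactly the "down (or disabled) but still pooled because of thresholding" case being discussed.
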
[14:58:59] (well, in my sentence above, pooled means "actually pooled in ipvs" - which is not the same thing as "pooled" in the code if j is right that "pooled" in the could means, "should", not "is")
[14:59:17] bleh, s/could/code/
[15:00:19] yes correct
[15:00:22] logically speaking, "should be pooled" is "up && enabled", and the inverse of it is also true "should not be pooled" is "down|disabled"
[15:00:51] but "is actually pooled" (in terms of ipvs actions taken, which take thresholding into account) is distinct from that.
[15:01:08] yes
[15:01:19] well, no
[15:01:27] thresholding should impact 'pooled'
[15:01:50] so if pooled = True, it should actually be pooled in ipvs, thresholding notwithstanding
[15:03:49] ok, apparently this conversation is very confusing, we have different definitions of "should" going back a page or so I think
[15:04:50] I don't mean should as in "we took these ipvs actions, therefore it *should* be true", I was meaning "should" as in "should be pooled/depooled based on per-backend state, but may not actually be depooled due to threshold"
[15:05:13] yeah and the latter meaning is not correct
[15:05:31] <_joe_> I think we should not touch that part of pybal unless someone wants to spend time to rationalize it a bit. It outgrew its original design in complexity and my experience modifying it is it's very easy to make mistakes
[15:05:43] very much so
[15:06:09] logically, there's 4 state-like things here (again, just assuming that ipvs commands always work and we're never out of sync with it, ditto etcd):
[15:06:15] that's why i said... 4 years ago now I think, we shouldn't touch it until we redesign with FSMs
[15:06:21] up/down state - directly reflects monitoring
[15:06:22] <_joe_> and when I say someone, I mean we should make that a goal at some point and throw some time of ema/me/whoever in evolving pybal
[15:06:30] enabled/disable state - directly reflects etcd
[15:06:42] thanks joe :P
[15:07:03] <_joe_> means you managers :P
[15:07:11] no you didn't mean me manager
[15:07:20] should_be_pooled? == (up && enabled)
[15:07:38] that's why I wrote, for the announcement, "a _vain attempt_ at regaining respect..."
[15:07:38] is_pooled? -> distinct state, reflects whether we actually took ipvs action up/down actions, which depends on the above and the thresholding.
[15:07:42] but that was cut out by c levels ;-)
[15:08:31] <_joe_> mark: I know nothing of c-somethings
[15:09:07] if should_be_pooled is true, the server will always actually be is_pooled, too.
[15:09:19] but due to thresholding, sometimes should_be_pooled is false, but is_pooled is true.
[15:09:45] current pooled variable is closer to is_pooled than should_be_pooled
[15:10:11] really we should call it is_pooled_or_will_pool ;-)
[15:10:15] I think different parts of the code treat it as one or the other meanings interchangeable, is the source of some current issues.
[15:13:41] should_be_pooled is not useful
[15:14:02] it only becomes useful if it takes thresholding into account, which the current code for 'pooled' does
[15:15:34] right, it's only useful as an abstraction of just writing "enabled&&up", it's not a distinct state of its own
[15:16:12] so it makes more sense to have a state including the thresholding
[15:16:22] yes, which 'pooled' should be
[15:16:26] it is
[15:16:40] but there's a line of code which says:
[15:16:41] except probably some edge case around changing server lists/etcd :)
[15:16:42] server.pooled = server.enabled and server.up
[15:16:49] yes
[15:16:49] which does not take thresholding into account
[15:20:28] the code completely ignores thresholding for servers modified in a server list refresh
[15:20:49] onConfigUpdate calls refreshModifiedServers,
[15:20:54] which doesn't look at thresholding at all
[15:21:39] and then the whole list is passed to assignServers which pools all servers with pooled=True
[15:22:24] def maintainState(self):
[15:22:25] """Maintains a few invariants on configuration changes"""
[15:22:25] # P0
[15:22:25] if not self.enabled or not self.ready:
[15:22:25] self.pooled = False
[15:22:48] ^ that also sets self.pooled=false based on enabled/up, not thresholds+ipvs
[15:23:09] ipvs.py does set pooled=True/False when it takes actions, too
[15:24:47] for the current code to work with etcd+ipvs as-is and not have logic issues, "pooled" should reflect only ipvs actions taken, rather than reflect an aggregate of enabled&&up (or other similar concepts)
[15:25:14] no i don't think that's correct
[15:25:54] but i agree the notion has gotten confused, so to fix it it probably makes sense to split it entirely
[15:26:08] or it will just create more confusion
[15:26:15] (or go straight for FSMs/2.0)
[15:27:09] but in any case, currently, 'pooled' is always supposed to reflect the situation in ipvs, thresholding or not, and it's clear there's a bug with at least server list refreshes
[15:27:10] in which case would we want the current 'pooled' state to not be directly reflective of ipvs actions taken (and of course, threshold limit blocks ipvs action and thus the pooled=false state change on down/disabled transition, when applicable)?
[15:27:48] yes, thresholding should block setting pooled=True
[15:28:01] but blergh
[15:28:04] should block setting pooled=False
[15:28:10] er yes
[15:28:54] maybe what I'm missing here is that we're losing the effective ipvs state when we refresh a new server list from etcd?
[15:29:24] pybal only changes state for 'modified' servers
[15:29:53] but then threshold is no longer taken into account when it (re)sets .pooled
[15:30:04] basically I'm trying to think of reasons why anything but ipvs.py taking ipvs actions would modify "pooled"
[15:30:08] that seems clearly wrong
[15:30:29] so say we change the weight or whatever
[15:30:39] then pybal seems to reset .pooled to "enabled and up"
[15:31:16] but if the server was currently disabled|down, that will set .pooled=false, when it could have previously been .pooled=true due to thresholding.
[15:31:30] yes
[15:31:51] and now .pooled does not reflect ipvs state, which is going to confuse other logic that assumes it does
[15:32:26] indeed
[15:32:51] let's see what assignServers does...
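
A tiny illustration of the failure mode just described (again a sketch, not pybal code): servers that are down but still pooled thanks to the threshold get their pooled flag reset by a config refresh, e.g. a weight change, and the subsequent ipvs sync would then remove them.

    class Server:
        def __init__(self, name, enabled=True, up=True, pooled=True):
            self.name, self.enabled, self.up, self.pooled = name, enabled, up, pooled

    # two-server pool, threshold 0.5: both are down but were kept pooled by the threshold
    pool = [Server("s1", up=False, pooled=True),
            Server("s2", up=False, pooled=True)]

    # a config refresh (say, a weight change) re-derives pooled from enabled and up,
    # ignoring the threshold and the actual ipvs state entirely:
    for s in pool:
        s.pooled = s.enabled and s.up

    print([s.pooled for s in pool])   # [False, False] -> the ipvs sync would empty the pool
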
[15:33:19] well
[15:33:22] that has its own server list
[15:33:53] it calculates the differences and generates ipvsadm commands for them
[15:34:11] so that could actually take a pooledDown server out of ipvs
[15:34:30] this specifically would probably not cause desync, but elsewhere in pybal perhaps
[15:36:49] as a small fix, probably renaming 'pooled' to 'pool', and introducing a separate 'is_pooled' might make things more clear while being not very invasive
[15:37:33] the current code intends to have thresholding affect 'pool', but this bug violates that
[15:39:13] but meh
[15:39:28] ipvs.py addServer and removeServer modify 'pooled' (should probably become is_pooled)
[15:40:08] and then 'assignServers' does not, it simply takes the entire list which is generated from .pooled and compares to its own list
[15:41:26] i think i agree with joe that we might as well go with FSMs if we have to change the logic anyway
[15:41:37] which has been the plan for 4 years now anyway :P
[15:56:20] all this got even more fun with the notion of 'weight=0' btw
[15:56:29] which is... pooled but not really?
[15:57:29] but arguably it makes things easier once you switch to it fully
[16:08:25] from the ipvs perspective, weight=0 is still "pooled" in the sense that it's in the list, but perhaps it shouldn't count for thresholding, which gets tricky
[16:09:03] i would argue that in the future we should call weight=0 not pooled
[16:09:03] probably the most-logical way to think of that is that if you're really dealing with real weighting, the threshold should be relative to the total weight, not server count, but ughhhhh that's a huge change.
[16:09:08] just make all servers known to ipvs
[16:09:25] and enabled & up/down, thresholding affect the weight
[16:09:41] e.g. if threshold is 0.5 and you have S1.weight=100 S2.weight=100 S3.weight=200, once S3 goes down you can't depool S1 or S2.
[16:09:52] yeah
[16:10:06] but then you have to have different notions of the server's "full" and "current" weights
[16:10:53] but weight=0 and depooled are still two different things from the ipvs perspective.
[16:11:06] right
[16:12:40] then there's the whole "make smart decisions about which pending down|disabled server(s) do you depool first when a repool brings you back above the threshold"
[16:13:09] you could argue for a number of different sane algorithms other than "iterate the list and depool the first applicable thing"
[18:22:18] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3897465 (10Cmjohnson)
[18:26:40] 10Traffic, 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3897475 (10chasemp) 05Open>03stalled p:05Normal>03Low @ayounsi and I talked about this a bit and the use case is...
[22:39:16] 10Domains, 10Traffic, 10Operations, 10Research, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3897978 (10bmansurov) @Krenair and @Dzahn to answer your earlier questions, please see T107389 for more info.
[22:44:07] 10Domains, 10Traffic, 10Operations, 10Research, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3897982 (10Dzahn) @bmansurov thanks for that link and detailed explanation. ping me when it's time to launch.
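
A rough sketch of the "threshold relative to total weight" idea floated above (not anything pybal implements), using bblack's example: threshold 0.5 and weights S1=100, S2=100, S3=200. After S3 is depooled, neither S1 nor S2 may be depooled, because that would drop the serving weight below half of the total.

    def can_depool_by_weight(weights, pooled, candidate, threshold):
        """May `candidate` be removed while keeping enough serving weight?"""
        total = sum(weights.values())
        remaining = sum(weights[s] for s in pooled if s != candidate)
        return remaining >= total * threshold

    weights = {"S1": 100, "S2": 100, "S3": 200}
    pooled = {"S1", "S2", "S3"}

    assert can_depool_by_weight(weights, pooled, "S3", 0.5)      # 200 would remain, exactly the limit
    pooled.remove("S3")
    assert not can_depool_by_weight(weights, pooled, "S1", 0.5)  # only 100 would remain
    assert not can_depool_by_weight(weights, pooled, "S2", 0.5)
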
[22:54:46] 10Domains, 10Traffic, 10Operations, 10Research, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3898050 (10bmansurov) @Dzahn OK, thanks!