[09:03:49] so, pybal. By writing unit tests for https://gerrit.wikimedia.org/r/#/c/403677/ (use pooled-and-up servers in can-depool) I've realised that our (my?) current understanding of thresholds does not reflect reality :)
[09:04:16] in the case of swift frontends, we've got 4 servers in total and the depool threshold is 0.5
[09:05:05] one would be tempted to think that the meaning of that is: pybal will make sure that minimum 2 servers are always serving traffic
[09:06:02] now, by not taking (administratively) depooled servers into account that was obviously false, and the patch above should fix that part
[09:07:10] with that being fixed though, the can-depool logic is: if the number of servers serving traffic is greater *or equal* than `all servers * threshold`, then we can depool
[09:08:15] which means, in the swift-fe case for instance: (2 servers serving traffic) >= (4 * 0.5), go ahead, depool
[09:08:44] and we end up with only one server serving traffic, which in the swift-fe case is not ok to my understanding
[09:12:06] it would be great to have a common understanding of what we mean with depool-threshold, and then make sure the code reflects that
[09:12:31] at the moment there are 3 different definitions in pybal's source code
[09:12:34] > Threshold of up servers vs total servers below which pybal can't depool any more
[09:12:41] (up servers)
[09:12:48] > The total amount of pooled servers may never drop below a configured threshold
[09:12:52] (pooled servers)
[09:13:05] > the threshold below which no more down servers will be depooled
[09:13:09] (down servers)
[09:16:12] personally, I'd use "servers serving user traffic" instead of 'up/pooled/(not) down' which is easier to understand I think
[09:20:21] or "hosts serving traffic" which might sound a bit better :)
[09:20:49] and, I think the >= should be a strict > in canDepool
[09:21:12] so that we refuse depooling a host if only 2/4 are serving traffic and the threshold is .5
[09:24:03] <_joe_> ema: I will read what you wrote, but quite a few things are based on the way pybal behaves currently
[09:25:35] <_joe_> ema: but i generally agree
[09:28:57] _joe_: the point is that I'm not sure we all agree on "the way pybal behaves currently" :)
[09:30:32] <_joe_> ema: I think your patch doesn't do a desirable thing.
[09:32:03] <_joe_> so, what we want, as you stated, is that at least N servers in a pool keep serving traffic. That means that out of all the servers in the pool (including the ones that are pooled = False in pybal), at least a percentage will not be depooled
[09:32:54] <_joe_> your patch would make it so that a percentage of the currently logically pooled ones will not be depooled
[09:33:33] <_joe_> which means the actual number of servers will vary with our logical pooling/depooling
[09:34:25] logical pooling/depooling meaning "hosts pooled/depooled in etcd", right?
[09:34:30] <_joe_> yes
[09:34:33] ok
[09:34:42] <_joe_> so yeah let's decide on a terminology
[09:34:51] <_joe_> we have three logical state switches
[09:35:06] <_joe_> up/down => status in etcd
[09:35:17] <_joe_> sorry
[09:35:23] you see? :)
[09:35:26] <_joe_> up/down => status in monitoring
[09:35:37] <_joe_> enabled/disabled => status in etcd
[09:35:51] <_joe_> and pooled/depooled => status in ipvs
[09:35:53] <_joe_> right?
[09:35:59] yes!
[09:36:39] <_joe_> so what we want is that of all the N servers in the pool, at least M% are always in status 'pooled'
[09:36:43] <_joe_> right?
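
A minimal sketch of the can-depool comparison discussed above, using the swift-fe numbers from the conversation. This is not pybal's code; the function and argument names are invented for illustration, and only the >= vs strict > distinction comes from the log.

    def can_depool(serving, total, threshold, strict=False):
        """Return True if one more server may be depooled."""
        limit = total * threshold
        return serving > limit if strict else serving >= limit

    # swift-fe example: 4 servers, depool threshold 0.5, 2 still serving traffic
    print(can_depool(serving=2, total=4, threshold=0.5))               # True: >= allows dropping to 1 server
    print(can_depool(serving=2, total=4, threshold=0.5, strict=True))  # False: > keeps 2 servers serving
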
[09:36:54] <_joe_> that is, serving traffic
[09:37:00] correc
[09:37:03] correct
[09:37:21] <_joe_> so when we want to depool a server, we should first check if there are enough pooled. If not, we should not depool
[09:37:54] <_joe_> then, whenever another server gets pooled, we want to depool any server that is down and/or disabled but pooled, correct?
[09:38:19] indeed
[09:38:33] <_joe_> ok, your patch doesn't do that
[09:38:38] <_joe_> at first sight
[09:38:50] <_joe_> you do if server.up and server.pooled
[09:39:02] <_joe_> while it should just count server.pooled
[09:39:41] <_joe_> but this is very delicate, I tried to touch that in the past with bad results
[09:39:49] <_joe_> we should build tests for this behaviour
[09:40:55] agreed, looking at server.pooled=True seems the right behavior
[09:41:04] <_joe_> seems :)
[09:41:30] <_joe_> it's pretty tricky imho
[09:41:53] thoughts on the >= vs > part?
[09:42:23] <_joe_> ok that's tricky again
[09:42:39] <_joe_> say you have a pool of 2 servers
[09:43:25] <_joe_> if you put in 0.5, and both go down, what will happen with >= and with >?
[09:44:16] <_joe_> and what if one goes down?
[09:44:29] <_joe_> if one goes down, with > you don't depool it
[09:44:45] <_joe_> with the current logic, I mean
[09:44:57] <_joe_> because downservers = 1, servers = 2
[09:45:15] <_joe_> 1 is not > than 2*0.5
[09:46:06] the logic currently deployed uses >=, meaning that we *do* depool the one host going down
[09:46:10] <_joe_> if you have 4 servers, and 3 go down, you should be able to depool 2 with >= ( 2 >= 4*0.5), 1 with 0.5
[09:46:27] <_joe_> sorry, 1 with >
[09:47:18] <_joe_> the problem in the current logic in https://gerrit.wikimedia.org/r/#/c/403677/2/pybal/coordinator.py seems to be that it uses server.up and not server.pooled
[09:47:40] <_joe_> so it actively disregards the state disabled/up/not pooled
[09:47:44] <_joe_> in its count
[09:48:31] <_joe_> I will take a re-look at the code and do a code review
[09:48:35] <_joe_> it's friday after all
[09:48:38] yes and I think that you're right when you say that server.pooled is the only state we should look at when counting pooled servers (it sounds so obvious stated like this!)
[09:49:25] <_joe_> so I think the problem we had yesterday was that we had a pool of 4 servers, with 2 disabled/up/not pooled
[09:49:32] <_joe_> and 1 went down
[09:49:43] <_joe_> since we had 3 servers up, it was depooled
[09:49:49] <_joe_> leaving 1 server in the pool
[09:50:11] <_joe_> that went down, ofc, and since we have >=, it was depooled too (we still had 2 "up" servers)
[09:50:26] the problem was indeed due to the fact that we were using server.up (monitoring status) instead of server.pooled (ipvs)
[09:51:06] so we ended up with 0 hosts serving traffic, ugh
[09:53:25] <_joe_> maybe :)
[09:54:22] <_joe_> but we need to check the logs to be sure
[09:55:11] well, surely we did end up without hosts pooled in IPVS, I've got proof :)
[09:55:13] <_joe_> so, what is the time when that happened?
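
A rough simulation of the incident scenario just described: 4 servers, threshold 0.5, two of them disabled in etcd but still up. This is illustrative only, not pybal code; the point is that counting "up" servers in the threshold check lets the pool drain to zero, while counting "pooled" servers stops at one.

    class Server:
        def __init__(self, name, enabled=True, up=True):
            self.name, self.enabled, self.up = name, enabled, up
            self.pooled = enabled and up  # initial state

    def can_depool(servers, threshold, count_attr):
        serving = sum(1 for s in servers if getattr(s, count_attr))
        return serving >= len(servers) * threshold

    def mark_down(servers, name, threshold, count_attr):
        s = next(x for x in servers if x.name == name)
        s.up = False
        if s.pooled and can_depool(servers, threshold, count_attr):
            s.pooled = False  # i.e. actually removed from ipvs

    for attr in ("up", "pooled"):
        pool = [Server("A"), Server("B"),
                Server("C", enabled=False), Server("D", enabled=False)]
        mark_down(pool, "A", 0.5, attr)
        mark_down(pool, "B", 0.5, attr)
        print("counting %s: %d server(s) left pooled" % (attr, sum(s.pooled for s in pool)))

Counting "up" ends with 0 servers pooled; counting "pooled" ends with 1. Note that even when counting pooled, >= still allows the first depool at 2 >= 2, which is exactly the >= vs > question raised earlier.
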
[09:55:14] https://grafana.wikimedia.org/dashboard/db/pybal?panelId=1&fullscreen&orgId=1&from=1515594444777&to=1515608866088
[09:55:22] <_joe_> ema: ok
[09:55:34] <_joe_> let me write down what happened and note it in the task
[09:56:04] <_joe_> actually the task says the right thing
[09:56:13] <_joe_> well, almost
[09:56:45] _joe_: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180110-swift
[09:59:45] let me amend the patch to only use server.pooled, I'll add a few tests leaving >= as it is
[10:03:26] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3893173 (10Joe) I don't think this is entirely correct. Pybal has 3 states: # `enabled/disabled`: the logical state in etcd, so `pooled=yes` or...
[10:03:45] <_joe_> ema: one thing I'm not sure of is -
[10:03:53] <_joe_> say you depool a server via etcd
[10:04:03] <_joe_> but when you do that, it cannot be depooled
[10:04:15] <_joe_> pybal will refuse to depool it, correctly
[10:04:49] <_joe_> but say at a later time one of the servers that were depooled comes up and is pooled again, will pybal then automatically depool the other server?
[10:04:55] <_joe_> I have to recheck the code
[10:05:07] <_joe_> I think the answer is *yes* but let's check
[10:07:46] <_joe_> ema: ouch
[10:07:46] I actually think that the canDepool logic is only triggered by changes due to monitoring status, not to enabled/disabled actions
[10:07:58] <_joe_> are you sure?
[10:08:02] nope! :)
[10:08:12] <_joe_> anyways, I found something that makes our idea bogus
[10:08:40] <_joe_> see coordinator.Coordinator.refreshModifiedServers
[10:08:54] <_joe_> server.pooled = server.enabled and server.up
[10:09:07] mmmh
[10:09:10] <_joe_> so that is *not* the ipvs state
[10:09:17] <_joe_> it's just a compound variable
[10:09:18] no bueno
[10:09:43] <_joe_> you need the 2.0 code that has logical handling of ipvs
[10:09:48] <_joe_> to do what we want to do
[10:09:56] <_joe_> we should work on 2.0 :P
[10:10:20] yes we do need to know what the actual ipvs state is
[10:13:43] yeah so `canDepool()` is only called by `depool` and `repool`, which are called by `resultDown()` and `resultUp()` respectively
[10:14:24] it does not look like we call canDepool when changing server.enabled (etcd)
[10:14:32] <_joe_> yes
[10:14:52] <_joe_> what we do is we merge the config changes
[10:15:00] <_joe_> then we reinitialize all servers
[10:15:07] <_joe_> with self.assignServers
[10:15:33] <_joe_> so by this logic, a logically depooled server will always be depooled, it seems. Uhm
[10:15:51] <_joe_> not what I expected
[10:17:11] <_joe_> uhm
[10:17:40] <_joe_> yeah we need better control on ipvs to do this properly
[10:18:06] <_joe_> still
[10:18:41] <_joe_> say we use server.pooled which is = (server.up and server.enabled)
[10:18:54] so the two issues are (a) T184715 (b) we should check if hosts can be depooled when disabling them via etcd
[10:18:54] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715
[10:19:31] <_joe_> ema: did you test this in any way?
[10:19:51] <_joe_> maybe fire up a pool on pybal-test2001 and try to logically depool all the servers in it?
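
A simplified paraphrase of the 1.x control flow as just described in the conversation, not the actual pybal source; any name not mentioned above is invented for the sketch. The point it illustrates: canDepool() only gates the monitoring path (resultDown/resultUp -> depool/repool), while the etcd/config path rewrites `pooled` with no threshold check at all.

    class Server:
        def __init__(self, name, enabled=True, up=True):
            self.name, self.enabled, self.up = name, enabled, up
            self.pooled = enabled and up

    class Coordinator:
        def __init__(self, servers, depool_threshold=0.5):
            self.servers = servers
            self.depool_threshold = depool_threshold

        # monitoring path: threshold is consulted
        def resultDown(self, server):
            server.up = False
            self.depool(server)

        def resultUp(self, server):
            server.up = True
            server.pooled = True          # repool (simplified)

        def depool(self, server):
            if server.pooled and self.canDepool():
                server.pooled = False     # i.e. remove from ipvs

        def canDepool(self):
            # the contested check: counts server.up and uses >=
            up = sum(1 for s in self.servers if s.up)
            return up >= len(self.servers) * self.depool_threshold

        # etcd/config path: no threshold check
        def onConfigUpdate(self):
            for s in self.servers:
                s.pooled = s.enabled and s.up   # refreshModifiedServers()
            # ...assignServers() would then sync ipvs to these flags
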
[10:20:20] <_joe_> of course you'll need to create the data on etcd
[10:20:56] I'd start by tackling (a) first which seems like the most pressing issue at the moment
[10:21:18] and then yes, for (b) we should use fake etcd data and pybal-test2001
[10:23:28] <_joe_> so a, rephrasing it, is "we cannot count the logically depooled servers as up when deciding if we need to pool or depool a server for monitoring"
[10:23:31] <_joe_> right?
[10:23:52] yes
[10:24:32] or, we cannot only rely on monitoring status when deciding if we've got enough pooled servers
[10:30:40] so back to the >= part
[10:30:55] len(pooledServers) = 1
[10:31:02] len(self.servers) = 2
[10:31:08] threshold = 0.5
[10:31:27] 1 >= 2 * 0.5
[10:31:37] canDepool says True, we now have 0 hosts pooled
[10:35:24] <_joe_> no
[10:36:01] <_joe_> in the current logic, since your server is already marked as down by the monitoring
[10:36:07] <_joe_> you'll have downservers = 2
[10:37:50] <_joe_> so that's why it uses server.up and not server.pooled
[10:38:00] <_joe_> in resultDown, you do the following
[10:38:07] <_joe_> 1 - set up=false
[10:39:50] <_joe_> 2 - call self.depool
[10:40:16] <_joe_> which in turn first checks that server.pooled = True
[10:40:19] <_joe_> aaarghhh
[10:40:20] so maybe we should check for server.up and server.pooled after all
[10:40:42] <_joe_> server.up and server.enabled
[10:41:11] <_joe_> I need to create a diagram
[10:41:19] and hope that what that means is actually "pooled in ipvs"
[10:41:21] <_joe_> and think better about this
[10:41:28] <_joe_> told you it was tricky
[10:41:32] yup
[10:45:18] another option, which is the first that came to mind and is still on T184715, is to use `totServers - down - disabled`
[10:45:19] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715
[10:47:08] disabled in etcd, that is
[10:47:19] <_joe_> ema: I think so
[11:18:12] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3896429 (10ema) >>! In T184715#3896284, @Joe wrote: > `pooled/not pooled`: the status of the server in the ipvs pool As we've found out today, `p...
[12:03:04] "As we've found out today, pooled is actually defined as enabled and up. This should translate to "pooled in ipvs", but there's no guarantee about that."
[12:03:07] why is there no guarantee about that?
[12:03:14] other than ipvs state getting desynced from pybal
[12:09:41] no other reason, but we know that it can happen (eg: T169765)
[12:09:41] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765
[12:10:31] so it's not strictly correct to say that `enabled and up` is the same as `pooled in ipvs`
[12:17:26] different ticket I think?
[12:18:51] oh yes sorry, T134893
[12:18:51] T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893
[12:19:10] well definitely pybal intends 'enabled and up' to be the same as 'pooled in ipvs'
[12:19:29] so that assumption can be held in the rest of the code, and any violations should be fixed instead
[12:19:35] of course that's much more solid with the 2.0 work
[12:19:47] as in 1.0 pybal is effectively working blindly
[12:20:30] you could even say, if desync happens that's violation of an assertion and it would probably be better to restart pybal entirely at that point (currently)
[12:24:45] right, which essentially is what we do (manually) if check_pybal_ipvs_diff alerts
[12:26:02] I have been staring at canDepool too long today, clearly
[12:26:18] I'm starting to think that we also need to know *which* host we want to depool
[12:27:57] going for lunch seems like a good idea at this point though :)
[12:42:46] :)
[13:49:03] <_joe_> did pasta bring you enlightenment, ema?
[13:50:03] * mark advised against pizza/pasta for dinner in SF
[13:50:15] we wouldn't hear the end of it
[13:53:05] lololol
[14:12:13] <_joe_> ahah
[14:12:43] <_joe_> you know there is a twitter account that follows and reposts all the angry comments of italians about pasta recipes around the internet?
[14:12:50] <_joe_> it was pretty hilarious :P
[14:12:54] i did not know
[14:15:09] oh
[14:15:19] but it was a Michelin Recommended place
[14:15:31] but what do the french know about italian food right
[14:16:44] <_joe_> lol
[14:18:23] <_joe_> actually, I'd eat pasta in any high-level restaurant around the world. It's the cheap italian restaurants I fear and stay away from :)
[14:24:59] _joe_: I thought pasta would have helped, but jenkins does not seem convinced
[14:25:26] it's pasta with german ingredients right
[14:29:52] I think regardless of 1.0/2.0, the problem is a lack of distinct state-tracking, causing overloading of the meaning of some of the states. Ideally there should be: up/down reflects real monitored status (regardless of thresholds), enabled/disabled reflects real etcd status (regardless of thresholds), and pooled/depooled reflects ipvs actions actually taken (which is subject to thresholding).
[14:30:36] and then when etcd or monitoring trigger a state change in the first two that's towards goodness, if a backend becomes up+enabled but is currently depooled, you pool it.
[14:31:03] that is how state is tracked
[14:31:08] (and if that pooling brings us back above a threshold, you scan for other backends that are down|disabled but still pooled due to threshold and kill one of those)
[14:31:56] and when etcd or monitoring triggers a downwards state change, it's recorded (down, or disabled), but "depooled" is only set if we're able to depool due to thresholds.
[14:32:55] they're 3 distinct states, none of them can be inferred from the current state of the other two, between up/down, enabled/disabled, pooled/depooled.
[14:33:24] (aside from the possibility of loss-of-sync with etcd and/or ipvs, which is a whole separate issue. Above I'm assuming we never lose sync with reality in that sense)
[14:35:05] 10:08 < _joe_> see coordinator.Coordinator.refreshModifiedServers
[14:35:07] 10:08 < _joe_> server.pooled = server.enabled and server.up
[14:35:30] ^ is the conversational evidence that the model I'm describing is not the current one.
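
A sketch of the model bblack describes above, under the same assumption he states (pybal never loses sync with ipvs or etcd): up and enabled are recorded unconditionally, is_pooled only changes when an ipvs action is actually taken, and a repool triggers a scan for servers that are still pooled only because of the threshold. All names are illustrative; this is not pybal 2.0 code.

    class Backend:
        def __init__(self, name):
            self.name = name
            self.up = True         # monitoring state, recorded as-is
            self.enabled = True    # etcd state, recorded as-is
            self.is_pooled = True  # ipvs actions actually taken

        @property
        def should_be_pooled(self):
            return self.up and self.enabled

    class Pool:
        def __init__(self, backends, threshold):
            self.backends, self.threshold = backends, threshold

        def _can_depool(self):
            pooled = sum(1 for b in self.backends if b.is_pooled)
            return pooled - 1 >= len(self.backends) * self.threshold

        def on_state_change(self, backend):
            if backend.should_be_pooled and not backend.is_pooled:
                backend.is_pooled = True       # ipvs: add back
                self._depool_pending()         # maybe there is now room to depool others
            elif not backend.should_be_pooled and backend.is_pooled:
                if self._can_depool():
                    backend.is_pooled = False  # ipvs: remove
                # else: keep it pooled; the threshold forbids removal

        def _depool_pending(self):
            for b in self.backends:
                if b.is_pooled and not b.should_be_pooled and self._can_depool():
                    b.is_pooled = False

Note that _can_depool here checks that the pool stays at or above the threshold after the removal, i.e. the stricter reading of the threshold argued for earlier in the conversation.
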
[14:37:59] <_joe_> bblack: let's say the three are mostly independent in the code but there are a few critical points. My work on 2.0 and creating basically 3 state machines that can be moved from state to state independently had as one of its goals to make the separation of state-tracking clearer
[14:38:57] basically, the 1.0 code's meaning of "pooled" is overloaded. Sometimes it means "is it actually pooled in ipvs?" and sometimes it means "*Should* it be pooled in ipvs, if thresholding allows?"
[14:39:02] <_joe_> bblack: in practice, after we call refreshModifiedServers() we also call Coordinator._serverInitDone, that dispatches the commands to the servers in the pool
[14:39:26] <_joe_> bblack: no, the meaning is "it should be pooled/depooled in ipvs"
[14:39:50] I don't mean the intended meaning, I mean the actual code's behavior, what's causing the current bug(s)
[14:40:24] joe is right
[14:40:27] <_joe_> the current bug is a bug in how we count how many servers are down
[14:40:37] meanwhile I've pleased jenkins, reviews welcome https://gerrit.wikimedia.org/r/#/c/403677/
[14:40:58] but threshold is not taken into effect with server list refreshes
[14:41:06] <_joe_> that too
[14:41:39] <_joe_> mark: also in the current count "disabled/up/not pooled" is counted as UP in the depool threshold logic
[14:41:51] that was a feature
[14:41:56] but it no longer makes sense with etcd
[14:42:01] <_joe_> right
[14:42:14] there was always supposed to be a difference between commenting servers and setting them down
[14:42:22] (commenting of course making them completely invisible)
[14:42:24] <_joe_> and that's ok
[14:42:29] but i agree that it's now just confusing and unhelpful
[14:42:30] <_joe_> we have the same
[14:42:55] <_joe_> so servers that are disabled can remain in the pool if too many are down, in my mind
[14:43:10] <_joe_> basically the depool threshold should work both for the thundering herd case
[14:43:22] <_joe_> and for the "opsen shooting themselves in the foot" case
[14:43:55] <_joe_> as well for the "opsen depooling server just when a traffic surge happens", which is basically what happened here
[14:47:04] what about the pooledDownServers list?
[14:47:18] is that just pooled+down, or pooled+(down|disabled)?
[14:52:15] hang on I'm still trying to understand whether pooled_in_ipvs iff 'server.up and server.enabled'
[14:52:27] that does not seem to be the case if we reach the depooling threshold, right?
[14:53:03] monitoring sets server.up = False if a host goes down, yet it's not depooled from ipvs because threshold
[14:54:01] it's not true on the downwards side of things
[14:54:02] <_joe_> ema: but it doesn't set pooled=False
[14:54:23] but I think it is (or should be?) true that if server.up and server.enabled, the server is pooled.
[14:54:36] # P1: up => pooled \/ !enabled \/ !ready
[14:54:36] # P2: pooled => up \/ !canDepool
[14:55:02] <_joe_> it's more complex than an on/off problem, but now I really have to wrap up this week's work
[14:55:21] those are the invariants the original code tried to maintain
[14:55:30] and
[14:55:30] # P0: pooled => enabled /\ ready
[14:56:52] but then 'up and enabled' is not the same thing as 'pooled in ipvs'. It just implies it, but 'pooled in ipvs' does not imply 'up and enabled'
[14:57:08] correct
[14:57:27] if a server is up and enabled, it should be pooled. but just because a server is pooled does not mean it's up and enabled.
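
The three invariants quoted above, spelled out as plain boolean checks. Purely illustrative: `ready` and `can_depool` are taken here as given flags rather than computed from anything.

    def holds_p0(pooled, enabled, ready):
        # P0: pooled => enabled /\ ready
        return (not pooled) or (enabled and ready)

    def holds_p1(up, pooled, enabled, ready):
        # P1: up => pooled \/ !enabled \/ !ready
        return (not up) or pooled or (not enabled) or (not ready)

    def holds_p2(pooled, up, can_depool):
        # P2: pooled => up \/ !canDepool
        # i.e. a pooled server is either up, or is only still pooled because
        # the depool threshold currently forbids removing it
        return (not pooled) or up or (not can_depool)

P2 is exactly the "down (or disabled) but still pooled because of thresholding" case being discussed.
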
[14:58:59] (well, in my sentence above, pooled means "actually pooled in ipvs" - which is not the same thing as "pooled" in the code if j is right that "pooled" in the could means, "should", not "is")
[14:59:17] bleh, s/could/code/
[15:00:19] yes correct
[15:00:22] logically speaking, "should be pooled" is "up && enabled", and the inverse of it is also true "should not be pooled" is "down|disabled"
[15:00:51] but "is actually pooled" (in terms of ipvs actions taken, which take thresholding into account) is distinct from that.
[15:01:08] yes
[15:01:19] well, no
[15:01:27] thresholding should impact 'pooled'
[15:01:50] so if pooled = True, it should actually be pooled in ipvs, thresholding notwithstanding
[15:03:49] ok, apparently this conversation is very confusing, we have different definitions of "should" going back a page or so I think
[15:04:50] I don't mean should as in "we took these ipvs actions, therefore it *should* be true", I was meaning "should" as in "should be pooled/depooled based on per-backend state, but may not actually be depooled due to threshold"
[15:05:13] yeah and the latter meaning is not correct
[15:05:31] <_joe_> I think we should not touch that part of pybal unless someone wants to spend time to rationalize it a bit. It outgrew its original design in complexity and my experience modifying it is it's very easy to make mistakes
[15:05:43] very much so
[15:06:09] logically, there's 4 state-like things here (again, just assuming that ipvs commands always work and we're never out of sync with it, ditto etcd):
[15:06:15] that's why i said... 4 years ago now I think, we shouldn't touch it until we redesign with FSMs
[15:06:21] up/down state - directly reflects monitoring
[15:06:22] <_joe_> and when I say someone, I mean we should make that a goal at some point and throw some time of ema/me/whoever in evolving pybal
[15:06:30] enabled/disable state - directly reflects etcd
[15:06:42] thanks joe :P
[15:07:03] <_joe_> means you managers :P
[15:07:11] no you didn't mean me manager
[15:07:20] should_be_pooled? == (up && enabled)
[15:07:38] that's why I wrote, for the announcement, "a _vain attempt_ at regaining respect..."
[15:07:38] is_pooled? -> distinct state, reflects whether we actually took ipvs action up/down actions, which depends on the above and the thresholding.
[15:07:42] but that was cut out by c levels ;-)
[15:08:31] <_joe_> mark: I know nothing of c-somethings
[15:09:07] if should_be_pooled is true, the server will always actually be is_pooled, too.
[15:09:19] but due to thresholding, sometimes should_be_pooled is false, but is_pooled is true.
[15:09:45] current pooled variable is closer to is_pooled than should_be_pooled
[15:10:11] really we should call it is_pooled_or_will_pool ;-)
[15:10:15] I think different parts of the code treat it as one or the other meanings interchangeable, is the source of some current issues.
[15:13:41] should_be_pooled is not useful
[15:14:02] it only becomes useful if it takes thresholding into account, which the current code for 'pooled' does
[15:15:34] right, it's only useful as an abstraction of just writing "enabled&&up", it's not a distinct state of its own
[15:16:12] so it makes more sense to have a state including the thresholding
[15:16:22] yes, which 'pooled' should be
[15:16:26] it is
[15:16:40] but there's a line of code which says:
[15:16:41] except probably some edge case around changing server lists/etcd :)
[15:16:42] server.pooled = server.enabled and server.up
[15:16:49] yes
[15:16:49] which does not take thresholding into account
[15:20:28] the code completely ignores thresholding for servers modified in a server list refresh
[15:20:49] onConfigUpdate calls refreshModifiedServers,
[15:20:54] which doesn't look at thresholding at all
[15:21:39] and then the whole list is passed to assignServers which pools all servers with pooled=True
[15:22:24] def maintainState(self):
[15:22:25] """Maintains a few invariants on configuration changes"""
[15:22:25] # P0
[15:22:25] if not self.enabled or not self.ready:
[15:22:25] self.pooled = False
[15:22:48] ^ that also sets self.pooled=false based on enabled/up, not thresholds+ipvs
[15:23:09] ipvs.py does set pooled=True/False when it takes actions, too
[15:24:47] for the current code to work with etcd+ipvs as-is and not have logic issues, "pooled" should reflect only ipvs actions taken, rather than reflect an aggregate of enabled&&up (or other similar concepts)
[15:25:14] no i don't think that's correct
[15:25:54] but i agree the notion has gotten confused, so to fix it it probably makes sense to split it entirely
[15:26:08] or it will just create more confusion
[15:26:15] (or go straight for FSMs/2.0)
[15:27:09] but in any case, currently, 'pooled' is always supposed to reflect the situation in ipvs, thresholding or not, and it's clear there's a bug with at least server list refreshes
[15:27:10] in which case would we want the current 'pooled' state to not be directly reflective of ipvs actions taken (and of course, threshold limit blocks ipvs action and thus the pooled=false state change on down/disabled transition, when applicable)?
[15:27:48] yes, thresholding should block setting pooled=True
[15:28:01] but blergh
[15:28:04] should block setting pooled=False
[15:28:10] er yes
[15:28:54] maybe what I'm missing here is that we're losing the effective ipvs state when we refresh a new server list from etcd?
[15:29:24] pybal only changes state for 'modified' servers
[15:29:53] but then threshold is no longer taken into account when it (re)sets .pooled
[15:30:04] basically I'm trying to think of reasons why anything but ipvs.py taking ipvs actions would modify "pooled"
[15:30:08] that seems clearly wrong
[15:30:29] so say we change the weight or whatever
[15:30:39] then pybal seems to reset .pooled to "enabled and up"
[15:31:16] but if the server was currently disabled|down, that will set .pooled=false, when it could have previously been .pooled=true due to thresholding.
[15:31:30] yes
[15:31:51] and now .pooled does not reflect ipvs state, which is going to confuse other logic that assumes it does
[15:32:26] indeed
[15:32:51] let's see what assignServers does...
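
A tiny illustration of the failure mode just described (again a sketch, not pybal code): servers that are down but still pooled thanks to the threshold get their pooled flag reset by a config refresh, e.g. a weight change, and the subsequent ipvs sync would then remove them.

    class Server:
        def __init__(self, name, enabled=True, up=True, pooled=True):
            self.name, self.enabled, self.up, self.pooled = name, enabled, up, pooled

    # two-server pool, threshold 0.5: both are down but were kept pooled by the threshold
    pool = [Server("s1", up=False, pooled=True),
            Server("s2", up=False, pooled=True)]

    # a config refresh (say, a weight change) re-derives pooled from enabled and up,
    # ignoring the threshold and the actual ipvs state entirely:
    for s in pool:
        s.pooled = s.enabled and s.up

    print([s.pooled for s in pool])   # [False, False] -> the ipvs sync would empty the pool
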
[15:33:19] well
[15:33:22] that has its own server list
[15:33:53] it calculates the differences and generates ipvsadm commands for them
[15:34:11] so that could actually take a pooledDown server out of ipvs
[15:34:30] this specifically would probably not cause desync, but elsewhere in pybal perhaps
[15:36:49] as a small fix, probably renaming 'pooled' to 'pool', and introducing a separate 'is_pooled' might make things more clear while being not very invasive
[15:37:33] the current code intends to have thresholding affect 'pool', but this bug violates that
[15:39:13] but meh
[15:39:28] ipvs.py addServer and removeServer modify 'pooled' (should probably become is_pooled)
[15:40:08] and then 'assignServers' does not, it simply takes the entire list which is generated from .pooled and compares to its own list
[15:41:26] i think i agree with joe that we might as well go with FSMs if we have to change the logic anyway
[15:41:37] which has been the plan for 4 years now anyway :P
[15:56:20] all this got even more fun with the notion of 'weight=0' btw
[15:56:29] which is... pooled but not really?
[15:57:29] but arguably it makes things easier once you switch to it fully
[16:08:25] from the ipvs perspective, weight=0 is still "pooled" in the sense that it's in the list, but perhaps it shouldn't count for thresholding, which gets tricky
[16:09:03] i would argue that in the future we should call weight=0 not pooled
[16:09:03] probably the most-logical way to think of that is that if you're really dealing with real weighting, the threshold should be relative to the total weight, not server count, but ughhhhh that's a huge change.
[16:09:08] just make all servers known to ipvs
[16:09:25] and enabled & up/down, thresholding affect the weight
[16:09:41] e.g. if threshold is 0.5 and you have S1.weight=100 S2.weight=100 S3.weight=200, once S3 goes down you can't depool S1 or S2.
[16:09:52] yeah
[16:10:06] but then you have to have different notions of the server's "full" and "current" weights
[16:10:53] but weight=0 and depooled are still two different things from the ipvs perspective.
[16:11:06] right
[16:12:40] then there's the whole "make smart decisions about which pending down|disabled server(s) do you depool first when a repool brings you back above the threshold"
[16:13:09] you could argue for a number of different sane algorithms other than "iterate the list and depool the first applicable thing"
[18:22:18] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3897465 (10Cmjohnson)
[18:26:40] 10Traffic, 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3897475 (10chasemp) 05Open>03stalled p:05Normal>03Low @ayounsi and I talked about this a bit and the use case is...
[22:39:16] 10Domains, 10Traffic, 10Operations, 10Research, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3897978 (10bmansurov) @Krenair and @Dzahn to answer your earlier questions, please see T107389 for more info.
[22:44:07] 10Domains, 10Traffic, 10Operations, 10Research, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3897982 (10Dzahn) @bmansurov thanks for that link and detailed explanation. ping me when it's time to launch.
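
A rough sketch of the "threshold relative to total weight" idea floated above (not anything pybal implements), using bblack's example: threshold 0.5 and weights S1=100, S2=100, S3=200. After S3 is depooled, neither S1 nor S2 may be depooled, because that would drop the serving weight below half of the total.

    def can_depool_by_weight(weights, pooled, candidate, threshold):
        """May `candidate` be removed while keeping enough serving weight?"""
        total = sum(weights.values())
        remaining = sum(weights[s] for s in pooled if s != candidate)
        return remaining >= total * threshold

    weights = {"S1": 100, "S2": 100, "S3": 200}
    pooled = {"S1", "S2", "S3"}

    assert can_depool_by_weight(weights, pooled, "S3", 0.5)      # 200 would remain, exactly the limit
    pooled.remove("S3")
    assert not can_depool_by_weight(weights, pooled, "S1", 0.5)  # only 100 would remain
    assert not can_depool_by_weight(weights, pooled, "S2", 0.5)
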
[22:54:46] 10Domains, 10Traffic, 10Operations, 10Research, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3898050 (10bmansurov) @Dzahn OK, thanks!