[06:32:25] elukey@cp1008:~$ sudo cat /etc/varnishkafka/eventlogging.conf | grep broker [06:32:32] kafka.metadata.broker.list = kafka1012.eqiad.wmnet,kafka1013.eqiad.wmnet,kafka1014.eqiad.wmnet,kafka1020.eqiad.wmnet,kafka1022.eqiad.wmnet,kafka1023.eqiad.wmnet [06:32:43] we forgot poor pink unicorn!! [06:42:09] (fixed) [06:43:38] thanks :) [06:44:44] I was trying to figure out why the vk dashboard showed some bytes flowing to kafka1012 etc.., I thought I messed up with the graphs, but it was pink unicorn :D [07:07:01] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) [07:35:11] heh, I thought that not passing the main VCL file to varnishd but just using ExecStartPost=reload-vcl would have been a zen way to work around the vcl-label dependency, but no [07:36:13] varnish refuses to start the child process if you pass -f '' [07:36:49] however, you can do that and pass a file with varnishadm commands to -I, if at the end of the jazz there's a valid config, varnish starts the child [07:40:07] so this is a case of varnish trying to be smarter than the admin [07:40:19] why would you refuse to start the child? [07:40:44] a workaround to that is avoiding -f altogether and passing -b example.org [07:41:34] that way varnish starts the child using example.org as a backend, then you can finally load the real config [07:43:25] anyways, -f '' -I $cmdfile seems the cleanest option at this point [07:48:09] which is stupid because we have a script already that does the right thing (reload-vcl), and instead we need to convert that to a sequence of varnishadm commands [08:02:49] there is a varnishadm command to start the child process though! so yeah reload-vcl should do that at the end of its dance with a new cli flag [08:48:41] rambling2patch -> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/445081/ [09:00:53] LOL on rambling2patch :) [10:12:44] vgutierrez: ema: want to discuss the depool threshold stuff, T184715? [10:12:45] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 [10:13:02] sure [10:13:11] maybe _joe_ should be around as well :) [10:13:14] yeah [10:13:17] it's tricky stuff [10:13:22] surprisingly tricky hehe [10:16:51] so the crux is basically [10:17:01] i believe in the past pybal would never ever pool any server when it's not enabled [10:17:18] depool threshold or not [10:18:11] the depool threshold feature was meant to guard against automatic depooling based on error conditions, but not to guard against manual depooling by an operator [10:18:18] this was all long before etcd and scap etc of course [10:18:36] so now it's not just "manual" depooling anymore :) [10:21:17] _joe_ seems to think that in the past however pybal did guard against that through the depool threshold feature [10:21:33] <_joe_> no, I think I assumed that was true [10:21:35] and in any case, that nowadays it should [10:21:36] yup.. and that's basically what I've implemented on my patch [10:21:40] yes [10:21:41] <_joe_> sorry I wasn't reading [10:21:51] <_joe_> I'm in the middle of merging a series of changes [10:21:55] yeah sorry :) [10:22:08] yey.. 
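(For reference, a sketch of what that -I command file could look like when varnishd is started with -f '': the file paths and the misc_vcl/main_vcl names here are made up, only the wikimedia_misc label matches what later shows up in vcl.list output in this log. Per the discussion above, once these commands leave varnishd with a valid active VCL it starts the child on its own.)

    vcl.load misc_vcl /etc/varnish/wikimedia-misc-frontend.vcl
    vcl.label wikimedia_misc misc_vcl
    vcl.load main_vcl /etc/varnish/wikimedia-frontend.vcl
    vcl.use main_vcl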
I agree with you mark, we should have some kind of protection to scap or etcd or whatever tool going crazy [10:23:14] so we have two ways of depooling servers [10:23:17] one is by setting enabled=false [10:23:27] the other is by simply removing from pybal pools entirely, so it forgets about its existence [10:24:03] so I suppose we /could/ say that if you want to absolutely make sure that pybal depools a server under all circumstances, have it removed entirely from the pool? [10:24:17] but I don't know if that makes sense for all server pool sources now [10:24:21] with etcd, and even k8s now :) [10:26:01] BTW, right now enabled=False gets the server removed from ipvs. Maybe it's a good time to consider setting its weight to 0, that would drain existing connections without being disruptive [10:26:30] yes, this actually gave rise to the whole FSM work [10:26:39] it's very very difficult to implement that with current pybal [10:26:44] i once did a naive quick attempt, and it was horrible [10:27:51] it's been in gerrit for years, i discarded it recently [10:28:08] yup.. I've seen that working on the patch I've submitted the other day [10:28:13] <_joe_> yeah so, if we have a quick and sure way to implement this feature (refuse logical depooling if not enough servers are in the pool) in the current version of the code, I'd do it [10:28:18] https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/187346/ [10:28:42] _joe_: quick way: https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/443967/ [10:29:08] as mark pointed out it changes the semantics so it would need more love in the future to keep pybal sane [10:29:31] really to keep pybal sane we should move towards the FSM work, as giuseppe started [10:29:34] <_joe_> vgutierrez: yeha I think we should make coordinator a set of FSMs [10:29:43] +1 to both of you [10:30:03] because apparently already having 3 boolean variables as state is enough to confuse us all ;) [10:30:04] <_joe_> mark: even without going into the ipvs stuff I did and never finished, the basic FSM implementation should be usable [10:30:09] yes [10:30:20] * volans following along, nothing to add so far [10:30:23] i'd like to work on it, after we move to python3 (which is soon I think) [10:30:27] <_joe_> and it comes with unit tests :P [10:30:31] yup :) [10:33:19] or we make scap based on cumin and have it ensure it never depools too many servers hehehe [10:34:33] in the meantime, and perhaps until we have the FSM work in [10:34:58] would it make sense for pybal to *reject* any configuration which does not have sufficient enabled servers? [10:35:02] with a big fat error condition and alerting [10:35:11] define reject [10:35:21] not act on it, and alert [10:35:39] so basically, scap keeps setting more and more servers to disabled [10:36:01] at some point, it crosses a certain threshold (which may be different from depool threshold), and then pybal does not act on that configuration while the threshold is not met [10:36:03] dik [10:36:04] idk [10:36:05] so on a pybal start it would mean to ignore it, and if pybal is already running, just run the last-valid config? [10:36:13] yes [10:36:17] it's not pretty i guess [10:36:25] but it's also a pretty rare condition which should not happen :P [10:36:31] we don't sync pybal nodes [10:36:50] so last-valid config could be different from primary to secondary lvs... 
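(A minimal sketch of the semantics being debated here, not pybal's actual code: the depool threshold is read as the minimum fraction of configured servers that must stay pooled, and, unlike the current behaviour described in T184715, administratively disabled servers are counted the same as monitor-down ones.)

    # Sketch only; the attribute names and the threshold-as-fraction reading
    # are assumptions, not pybal internals.
    def can_depool(servers, depool_threshold):
        total = len(servers)
        if total == 0:
            return False
        # Servers that would actually take traffic: enabled by the operator
        # (etcd/conftool) *and* passing health checks.  Today's logic only
        # counts the health-check side, which is the gap discussed above.
        poolable = sum(1 for s in servers if s.enabled and s.up)
        return poolable >= total * depool_threshold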
[10:37:03] yes [10:37:13] it wouldn't be pretty at all :) [10:37:17] but the idea is that this gives rise to an icinga alert and gets resolved quickly by operator [10:37:20] no :( [10:38:09] but instead of pybal trying to be smart about what servers which are actually disabled but then get randomly pooled anyway, it rejects it and alerts as it clearly needs operator attention [10:38:40] "$operator or $automatic_substitute did something stupid, halp" [10:39:02] it would still save the site [10:39:17] i can't say I like it either :( [10:39:46] btw, maybe we should expose the etcd config version in prometheus metrics [10:39:53] i think there's an integer or something for that right? [10:39:57] don't like it too much, but if we do this the check should page IMHO [10:40:04] agreed [10:41:58] and then there's also still the question how the two situations interact, i.e. many servers down and many servers disabled [10:45:01] a server marked as down should still be in the pool, getting its connections drained [10:45:28] yeah, weight=0 is another discussion [10:45:32] btw, I think that happens anyway? [10:45:33] BTW... what would happen if scap/human goes nuts and sets all the servers with weight=0 right now? [10:45:41] existing connections have connection state in ipvs [10:45:54] vgutierrez: pybal wouldn't guard against that either [10:46:00] right [10:46:04] weight=0 is not treated specially [10:46:17] we should take that scenario into consideration as well [10:47:12] basically right now pybal is only trying to guard against its own health monitoring decision, not against external input [10:52:32] BTW, this could have any kind of side-effect? there is any scenario were depooling more than the depool threshold is necessary? [10:53:29] perhaps bringing an entire rack row of servers down? [10:53:47] but, in theory we should be able to take that, and depool threshold should be low enough accordingly [10:53:52] so probably not really? [10:54:19] I'm just asking, I'm the guy here, so I'm missing a lot of stuff :) [10:54:23] *new guy [10:54:38] so in the past that would always work, right [10:54:52] no matter what depool threshold is set to, if you disable half the servers in the pool, pybal will just honor that [10:55:30] however with ema's patch it will now make sure not to depool more servers if any remaining ones report down, if depool threshold prevents it [10:56:47] that doesn't seem healthy either to be honest [10:57:17] what could it do better you think? [10:59:48] at least I'd filter on unhealthy conditions.. I could find acceptable to keep pooled a server that replies 503 to everything in a reasonable amount of time, but not a server that replies 2XX in 30 seconds on average [11:00:23] hmmm [11:00:40] yeah, but pybal only has a very limited view of that, only its own health check of course [11:00:54] it's not like it has an overview of all status codes the server returns for real traffic [11:01:01] (that would be another nice monitor maybe ;) [11:01:18] that would be amazing for deploying strategies :) [11:01:23] and, depending on what the pybal service is, it may still have varnish and what not in front which has its own timeouts and error pages if backends dont respond [11:01:44] yey.. 
but varnish right now doesn't see the backends [11:01:50] just the VIP handled by pybal [11:01:55] yeah [11:02:12] so a pooled server that's answaring really slowly could be a PITA [11:02:13] but it will retry after timeout expires, or on some error status [11:02:17] but it might read the headers and know which backed was and tell pybal to reduce it's weight :-P [11:02:20] * volans runs away [11:02:35] * vgutierrez chases volans to Rome [11:02:39] * mark adds "Lua integration" to the feature requests [11:02:41] ;) [11:02:51] per-service logic [11:04:14] that could be pretty awesome actually [11:04:27] anyway, depool threshold was included exactly to prevent a very small (or no) set of backend servers getting swamped by too many requests [11:04:58] it's better to try to keep enough servers in the pool, whether they are returning errors or even timing out, because if we remove too many servers they *will* timeout with certainty ;) [11:05:23] and if you remove all backends, that even kills the lvs box [11:05:26] (or at least it did in the past :) [11:05:33] I'd love to have some L7 capabilities [11:05:53] aka being able to reply synthetic 5XX errors [11:06:22] that's really not IPVS anymore at that point right [11:06:28] yup [11:07:23] so before we get too far off into the distance ;) [11:07:39] can we see any quick fix for current pybal, pre-FSM, that doesn't change existing semantics too much? [11:07:58] or would it be better in the mean time to keep things as-is and ensure scap doesn't do anything so stupid until we have something better? [11:08:29] I can refactor my patch, and basically refuse configs that don't honor the depool threshold [11:08:47] i would introduce a separate config variable for it then though [11:08:50] maybe disabled-threshold [11:09:13] it's a different thing [11:09:30] but yeah of course in the end they do interact :/ [11:11:07] we can wait until joe and ema are able to give input at some point [11:11:08] if you think about it, allowing a configuration that doesn't honor the user-set depool threshold doesn't make a lot of sense... on one side you are saying that you need always up a 70% of the servers but you are asking to disable a 40% of them [11:11:40] that is true [11:12:00] how would you handle the case where there's already a significant number of hosts down [11:12:17] and then a new config comes in which disables almost enough servers to nearly meet the threshold [11:12:25] by itself the config would be accepted (just) [11:12:38] but together with the (other) hosts also being down, there are not enough servers able to get pooled... [11:12:42] it gets messy [11:12:50] I'd reject the config cause pybal isn't able to fulfill all the config requirements in the current scenario [11:12:58] yes but the scenario can change [11:13:04] seconds later the health checks can recover [11:13:07] what does pybal do then? [11:13:24] assuming no new config comes in :) [11:13:41] nothing [11:13:58] so stay in reject-alert state [11:13:59] ok [11:14:00] the config has been rejected, it's the human/scap responsability to set it again [11:14:08] fair enough [11:14:08] (if needed) [11:14:29] let's at least put this new functionality after a boolean config option though [11:14:37] if we don't use a separate threshold from depool-threshold [11:14:41] perhaps default enabled, but able to disable [11:15:33] naming that option will be fun hehe [11:15:38] how would this prevent scap from continuing the deploy live without actually depooling hosts? 
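(A sketch of the reject-and-alert behaviour proposed above, using the separate disabled-threshold knob mentioned in the discussion; the option name and everything else here is illustrative, not an actual pybal patch.)

    # Refuse to act on a configuration that administratively disables too many
    # servers; keep the last valid config and page an operator instead.
    def apply_config(current_config, new_config, disabled_threshold, alert):
        total = len(new_config)
        enabled = sum(1 for server in new_config if server['enabled'])
        if total and enabled < total * disabled_threshold:
            alert("refusing config: only %d/%d servers enabled" % (enabled, total))
            # The rejected config is not retried; it's up to the human (or
            # scap) to push a sane one again, as discussed above.
            return current_config
        return new_config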
[11:15:54] I guess scap will just depool on etcd/conftool and be happy that that change succeeded [11:16:07] yes it doesn't affect scap behavior in any way i think [11:16:15] so scap really shouldn't do this, ever [11:16:18] and needs to be fixed to ensure that [11:16:30] but while it does, at least pybal prevents completely melting the site [11:16:48] i think in the past outage, hhvm restarts did not succeed [11:16:58] so if hhvm is down everywhere, having servers pooled does not actually help [11:17:35] you might get some error messages from apache in front of it but that's it? [11:18:51] some 5XX I guess [11:19:49] cause apache isn't able to process requests anymore [11:21:01] yeah [11:21:25] btw, if we changed the semantics of "enabled" in pybal and have it overridden for depool threshold by sometimes pooling servers anyway if really needed... [11:21:39] it might perhaps be interesting to have "enabled" implemented as just another monitor check ;) [11:21:49] then we get the depool threshold functionality for free [11:21:56] without additional logic [11:23:04] enabled as a monitor check? like GET backend/enabled? [11:24:25] well, no [11:24:48] it would be in the config (etcd) as usual, but pybal would only use it in a specific monitor [11:24:56] which then reports "down" if disabled, "up" if enabled [11:25:05] and one monitor down is enough for pybal to depool in normal cases [11:25:09] but depool threshold can then override that [11:25:17] oh ok [11:25:27] really we could also do more with monitor weights, or voting, i.e. depool threshold could pool servers with least monitors down or whatever [11:26:07] brb [11:26:33] one of the key disconnects here, I think, is there is no way for pybal to "reject" a depooling all the way back to scap, causing it to fail/react. at least not directly. [11:27:50] dunno if we could implement some sort of API on pybal that could be used by scap [11:28:22] the "api" for depooling is to set an etcd key (without explicit knowledge of already-monitor-down hosts or the threshold), and then asynchronously pybal sees the key change and can make some threshold decision, but it's async so it can't tell scap "no, I refuse to depool that server" directly in response to its request. [11:28:22] that's correct [11:28:30] so pybal already has an api [11:28:35] i just don't think scap uses it [11:29:27] and it's not easy for it to do so either, with multiple pybal instances etc [11:31:25] right [11:33:10] if we could identify the current primary LVS for the service and talk to pybal directly, scap could at least confirm some state. [11:33:27] so pybal now reports bgp-med in prometheus metrics [11:33:41] that enables for example grafana to always determine the lowest med (primary) at any given point in time [11:33:47] (pybal could basically export all the information needed: the pool's threshold, the current count of disabled/down, etc) [11:33:49] and thus know which instance is primary [11:34:01] that's already exported over prometheus metrics [11:34:12] and separately the api for the server status [11:34:23] the API that config-master shows? 
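(A rough sketch of the "deploy tool polls every pybal before the next depool" idea. Only the existence of the /pools endpoint and its one-server-per-line output come from the discussion; the instrumentation port and the exact line format are assumptions.)

    # Hypothetical pre-depool safety check run by scap/conftool against all
    # pybal instances watching a pool.
    import urllib.request

    def pool_state_ok(pybal_host, pool, min_pooled, port=9090):
        url = "http://%s:%d/pools/%s" % (pybal_host, port, pool)
        with urllib.request.urlopen(url, timeout=5) as resp:
            lines = resp.read().decode().splitlines()
        # Assumed format: one server per line, mentioning "pooled" when the
        # server is actually in ipvs.
        pooled = sum(1 for line in lines
                     if 'pooled' in line and 'not pooled' not in line)
        return pooled >= min_pooled

    def safe_to_continue(pybal_hosts, pool, min_pooled):
        # Every pybal watching the pool must agree before the next step.
        return all(pool_state_ok(h, pool, min_pooled) for h in pybal_hosts)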
[11:34:32] (because I thought that was just etcd data) [11:34:45] not sure what config-master does, but pybal has an http endpoint with this data [11:34:49] i think used for some icinga alerts at least [11:34:54] https://config-master.wikimedia.org/pybal/eqiad/appservers-https [11:34:57] ok [11:35:42] /pools and /alerts [11:35:47] and /metrics for prometheus metrics [11:36:09] Serves /pools/ [11:36:09] It will print out the state of all the servers in the pool, one per line. [11:36:13] json output [11:36:34] so it's definitely already possible [11:39:04] so, shouldn't we be able to solve this with confirmations? e.g. if scap asks etcd to depool serverX, it can then poll the correct pybal and find out if that actually happened or not? [11:39:30] it seems possible, but a bit messy [11:39:59] and similarly on repool - to confirm repool and also confirm it's not newly monitored-down because of a newly-deployed-code issue before downing another [11:40:20] i would be inclined to say we should have a generic service handle that instead of scap [11:40:37] so scap and whatever else has a single stable endpoint to talk to [11:41:06] yeah, perhaps. ideally all this same stuff should apply to other users of these interfaces (e.g. depooling cache servers with scripts) [11:41:15] yeah [11:41:43] but then that other service will need to not, in the net, worsen the reliability level we'd have with etcd+pybal themselves. [11:42:05] it's a thorny area [11:43:15] thinking aloud in random probably-horrible directions: you could also turn around the ownership of the persistent states. [11:43:42] have pybal ask scap what servers it considers "poolable"? [11:44:24] as in, instead of a model where foo edits etcd and pybal monitors it to make changes (and can re-read it on a restart, etc...) - the pybal which is currently the main/active one for a service owns the etcd data, and other things ask a pybal API directly to pool/depool things, and if it decides that's ok it updates the etcd data (for its fallback peers or future self after a restart). [11:44:40] just to add another angle of seeing this, we could validate at conftool level that the config is 'valid' before saving it, regarding pooled/unpooled/weight and have pybal just take into account down/up hosts [11:44:46] in other words, etcd is more like a database underlying pybal, instead of an interface between pybal and other things [11:44:48] with the usual logic [11:45:20] volans: that would make sense I think [11:46:13] volans: valid based on threshold + etcd state, but not pybal monitored-state? [11:46:18] bblack: yes [11:46:44] valid logically given a set of defined backend, not taking into account the current live up/down state [11:46:49] i actually think that's better than trying to hack this into pybal [11:47:03] and it also means different pybal instances don't get desynced if they reject etcd configurations [11:47:07] what happens then when the threshold is only exceeded by two non-overlapping sets of down-in-etcd and down-in-monitoring subsets of the pool? [11:47:10] because it would never make it into etcd in the first place [11:47:29] bblack: yeah that's what we were discussing above too ;) [11:47:40] still better than what we have now though [11:48:19] if you translate the pybal world over to thinking like some traditional "application" (e.g. pybal==mediawiki, etcd==mysql, ...) 
[11:48:34] what we have today is other applications "communicating" to the app by writing to its DB directly [11:49:09] etcd is effectively the persistent state / DB of pybal's pools, but it's not in control of it. [11:49:43] indeed [11:49:48] IIRC etcd is just one of the possible backends for pybal though [11:49:48] all it can do is decide not to act on it [11:50:08] correct, there's also the "file" backend (over http or locally) in either pybal format or json format [11:50:10] and there's k8s api now [11:51:57] but the problem with my line of thinking here, is it's forcing the hard part of it off into a different corner that's hard to deal with [11:52:13] which is election of and shared knowledge of which pybal is primary, etc [11:52:22] yes [11:52:27] which is now done by the routers [11:52:29] right now, effectively the router decides based on the best med it receives from live ones [11:52:46] in this model, pybal might have to know if it's the current best, and apps would need to know which pybal to contact as well [11:52:52] indeed [11:53:03] whereas right now pybal is completely oblivious to the fact it's handling traffic or not at all [11:53:22] it doesn't even know atm whether its ipvsadm commands succeed ;) [11:54:14] an option that would be closer to the current model but might work, would be something like this: [11:55:13] 1) Pybal's logic enforces thresholds universally (regardless of whether a host is out of traffic pooling due to etcd-depool or monitor-down). For large pools, we might want two thresholds (a softer one that merely warns, and the hard one it limits at) [11:55:31] 2) Apps (e.g. scap) write depools to etcd as they do today [11:56:25] 3) Apps (e.g. scap) also have knowledge (configured via puppet? or possibly exported via etcd or something as well? I don't know) of the total set of pybals servicing this pool, and they poll all the pybals just asking whether the pool's state is healthy before each action. [11:57:08] would be interesting to use prometheus for this maybe [11:57:15] but not sure it's wise to depend on it [11:57:18] I'm leery of using a metrics tool for functional stuff [11:57:22] agreed [11:57:50] but it's easy to configure/provide to scap a list of "here's N IPs for all the pybals that watch pool X" [11:58:17] yes [11:58:19] and it has to do a GET /pools/X/state to all of them and get back an "ok" instead of a "warn" or "fail" before it does each new depool command [11:58:30] but if scap actually uses conftool for this, maybe we should put this in conftool instead [11:58:41] i have never used either of the two [11:58:43] so if they start not coming back or overlap-sets problem, scap has a way to fail out and notice, programmatically [11:59:28] obviously this will create an issue where a backup LVS being dead will prevent scap as well (state GET will fail to it, and they're all equal in its eyes) [11:59:39] but arguably that's a good thing. LVS maint should be brief and should be blocking. [12:00:27] if we decide an LVS is dead and can't be quickly replaced but we want to continue deploy operations, we can remove it from the configured set first. [12:01:15] * volans grabbing some lunch [12:01:23] and yeah, conftool could be the mediator for all of this [12:01:44] but then we really need to declare that everyone/thing uses conftool, not etcd directly. [12:01:59] (better than reimplementing in scap + N other systems) [12:03:52] for small pools we'd probably configure the warn and fail thresholds the same. 
[12:04:41] for the appservers-like case where we might want them to differ, automated/normal use of conftool fails at the warning level, but there's a flag that lets manual conftool actions depool past the warning level so long as the critical failure level isn't reached. [12:05:18] (obviously, attempts to repool should succeed regardless of threshold, since they can only move things in the good direction or do nothing at all) [12:12:07] yeah [12:12:43] the only thing I'm not sure about is whether pybal should indeed universally enforce the thresholds in this case, especially since conftool or scap are already watching the state of pybal before proceeding anyway [12:13:40] besides the somewhat messy changes in pybal needed for that at this time (change of semantics), it also makes it much harder to force pybal to do anything on real manual operator intervention [12:17:05] true [12:17:34] I'd counter that the threshold is there for a reason though, and should also be easily configurable. [12:17:35] i think it's generally good for us SREs to always have an easy override at hand for the rare case where it's needed - we should just make sure our automatic tools like scap don't get to use that [12:17:52] yeah but right now it's not, as it requires a pybal restart [12:18:01] put the threshold in etcd as well heh, then if an admin decides they're going to do something crazy and manual, they can drop the threshold explicitly. [12:18:06] yes [12:18:40] (either that or have pybal able to reload config without a restart) [12:18:53] (well, without the hard effects of a restart anyways) [12:19:03] we'll get there too [12:19:23] yey.. RFC 4724 [12:19:30] Graceful Restart Mechanism for BGP [12:19:41] bigger problem right now is wiping the ipvs state [12:19:49] so giuseppe's netlink stuff helps with that [12:19:53] yup [12:20:49] yeah [12:21:13] the tradeoff with graceful restart is it's timeout based, so it slows reaction time if pybal's actually going to fail to come back online. [12:21:19] but at a small timeout value it can be reasonable. [12:21:27] yes [12:22:41] (situational of course... we'd want graceful restart for a "restart for config change or code update", but not trigger graceful-restart when the daemon stops because the machine is shutting down) [12:22:50] indeed [12:23:33] a nice way to handle that without creating new daemon verbs would be a state-setting command to the daemon that only lasts a short while [12:23:55] or a kill signal? ;) [12:24:12] admin/tool can tell pybal "next daemon stop should use RFC 4724" just before the restart, and pybal sets an internal bit to operate that way, which expires 10 seconds later or whatever. [12:24:45] yeah you could use e.g. SIGUSR1 to set the bit [12:24:49] yup [12:25:13] (and then proceed with a normal restart to keep systemd happy) [12:25:52] adding whole new verbs like a commanded "graceful-restart", or having USR1 actually trigger the restart itself happening, gets complicated in the systemd world. [12:26:30] it also makes sense for normal restarts so pybal can lower the med [12:26:38] and start the failover early before it terminates [12:27:03] right now that's not a big deal as we leave ipvs state in place but it's also not ideal [12:27:58] it terminates the bgp connection and the router quickly moves over I guess [12:28:25] could even start an lvs connection state sync hehe [12:28:43] although i'd much prefer static source ip hashing without lvs state needed in the first place... 
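(Sketch of the short-lived "next stop should be graceful" flag idea from above: SIGUSR1 sets a bit that expires after ~10 seconds, so only a restart the operator or tooling just asked for gets the RFC 4724 treatment. Names and structure are made up.)

    import signal
    import time

    GRACEFUL_WINDOW = 10  # seconds, per the discussion above

    class GracefulStopFlag:
        def __init__(self):
            self.requested_at = None
            signal.signal(signal.SIGUSR1, self._on_sigusr1)

        def _on_sigusr1(self, signum, frame):
            # Operator or tooling announces that the *next* shutdown is a
            # planned restart and should keep routes alive (graceful restart).
            self.requested_at = time.time()

        def graceful(self):
            # The bit expires, so an unrelated shutdown later (e.g. the whole
            # machine going down) is not accidentally treated as graceful.
            return (self.requested_at is not None
                    and time.time() - self.requested_at < GRACEFUL_WINDOW)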
[12:30:35] for that matter, if we had a nice hashing behavior on lvs->app fanout we could go all-active from the router as well (with it doing some kind of L3/L4 hashing at its own level), and not worry so much about meds and failover and primary-ness. [12:30:43] yes [12:33:12] we have a ticket about that somewhere I think, stateless TCP for ipvs [12:33:46] https://phabricator.wikimedia.org/T175203 [12:34:27] oh wait that's slightly different, the specific one beneath it is: https://phabricator.wikimedia.org/T86651 [12:35:24] yeah [12:35:32] bah, my old svn file link no longer works hehe [12:35:56] so does IPVS now support stateless udp? [12:36:03] then i assume stateless tcp is also doable? [12:37:00] 10 years ago the problem was basically that IPVS did any state tracking unconditionally without being able to influence it from the lvs director [12:38:31] well, it does do stateless UDP (which we use), but I'm not sure that means a mere director can implement stateless TCP. their codepaths are quite different outside the director. [12:39:48] true [12:40:29] clearly we should just use some iptables packet-mangling rules to encapsulate our inbound TCP packets in equivalent UDP headers, pass them through IPVS for routing, then de-encapsulate them in another iptables mangle rule on the way out the door :) [12:41:10] haha [12:41:18] or just replace ipvs entirely [12:41:35] that too, but it's hard to be efficient at it in userspace [12:42:21] then again who knows, the whole "it must be in-kernel to be efficient" thing that spawned ipvs routing in the first place may be outdated in the modern era, it's worth contemplating whether a user-level app could deal with it now. [12:42:45] (using raw sockets I guess) [12:43:27] or alternatively, implement it all in eBPF through one of the multiple evolving pathways for that [12:43:45] (where pybal would just control a table some eBPF code is reading to do the traffic hashing statelessly) [12:47:07] yeah [12:47:36] *sigh*, XioNoX I don't see here: https://phabricator.wikimedia.org/T184293 lvs1015 network setup done, and rebooting the system in PXE mode doesn't show anything on install1002, do you need a task for that? [12:51:25] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Vgutierrez) @ayounsi could you enable lvs1015 network ports? thanks! [12:51:39] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (10mark) We had a long and interesting discussion about this on [[ http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-traffic/20180711.txt | IRC ]]... [12:52:09] mark: thx for submitting the TL;DR on the task <3 [12:52:40] i linked to irc, there were too many options to really list everything ;) [12:57:13] 10Traffic, 10Operations, 10UniversalLanguageSelector: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270 (10Petar.petkovic) [12:58:30] so! next reviews I need... https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+/434163/ ;-) [12:59:20] noted :) [13:00:06] mark: so.. on the short-term actions for the can-pool stuff we are waiting for _joe_ feedback, right? [13:00:09] what do you need on lvs1015 network ports? 
[13:00:26] vgutierrez: yeah, would be good to wait what joe and perhaps ema think [13:00:55] mark: the network side config, assigning the ports to the proper vlans, setting descriptions and enabling the ports [13:02:12] i'll have a stab, better i don't get too rusty on networking ;) [13:04:48] hm, are the new lvs balancers supposed to sit on the public vlan as well? [13:04:58] no, private [13:05:11] is there any existing one already configured? [13:05:12] err, depends on your definition of "sit on" I guess [13:05:20] then i'll just copy the interface ranges that arzhel has configured there [13:05:34] they are supposed to have public VLANs available as well, but the primary hostname/IP on the default vlan for the eth0 equivalent is private [13:05:37] sure sure [13:05:40] 1016 is up and running [13:05:44] ok, thanks [13:05:45] mark: lvs1016 [13:05:46] right :) [13:10:49] hmmm now I see the pci ids on the ethernet bios and I start to like the new NIC interface names in linux... [13:10:52] <04:00:00> BCM57810 - F4:E9:D4:DB:30:20 MBA:v7.14.2 CCM:v7.14.3 │ [13:10:53] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [13:10:55] │ <04:00:01> BCM57810 - F4:E9:D4:DB:30:22 MBA:v7.14.2 CCM:v7.14.3 │ [13:10:58] │ <05:00:00> BCM57810 - F4:E9:D4:CF:36:60 MBA:v7.14.2 CCM:v7.14.3 │ [13:11:01] │ <05:00:01> BCM57810 - F4:E9:D4:CF:36:62 MBA:v7.14.2 CCM:v7.14.3 [13:11:20] enp4s0f0 and so on... [13:11:46] yeah that's the nice part [13:12:09] now if we could just get a tiny LCD near the network ports that reads the same numbers the bios and linux sees, so it's obviously when plugging in cables... [13:12:47] but just cause these DELL bios doesn't report PCI slot number.. otherwise we would see something like ens1 [13:12:55] or even just a multi-color single LED next to each port, so you can set in bios that 04:00:00 = dark red, 04:00:01 = fuscia, 05:00:00 = light blue, etc.... [13:13:29] I think the p-number is pci-bus [13:13:56] are they actually in numbered slots, or are all the expansions slots effectively their own separate busses with only a slot-0? [13:15:27] vgutierrez: should be done [13:15:38] but i found 2 inconsistencies between the different switches... [13:16:18] rebooting it in PXE mode.. let's say if it shows in install1002 logs or not [13:16:22] s/say/see [13:16:26] WTF is wrong with me! [13:17:03] well after you see, you'll probably say, so I think it's just a temporal issue. [13:17:12] /o\ [13:17:16] need new protocols to sync the brain with the normal flow of time [13:17:29] so I'm suffering glitches in matrix [13:17:43] maybe you are the glitch in the matrix! [13:19:32] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10mark) >>! In T184293#4415691, @Vgutierrez wrote: > @ayounsi could you enable lvs1015 network ports? thanks! I added lvs1015 to interface-range LVS-balancer on asw2-c-eqia... [13:19:45] XioNoX: https://phabricator.wikimedia.org/T184293#4415745 [13:24:47] vgutierrez: does it work? [13:25:06] the server is getting lazy on the power cycle.. I had to issue another one [13:26:38] i think it probably does not work [13:26:49] the LVS-balancer interface-range does not seem correct [13:27:17] am I correct in assuming that lvs1010-lvs1012 are not in use? 
[13:28:20] that's right [13:28:37] https://phab.wmfusercontent.org/file/data/lfcczvehvr342ydkh3zn/PHID-FILE-knklxywtn2faxy2owfjm/lvs1015-mac.jpg --> this screenshot is bugging me as well [13:28:52] cause the first 4 NICs listed there should be disabled [13:29:23] ok let me fix it then [13:30:55] Logical Vlan TAG MAC STP Logical Tagging [13:30:55] interface members limit state interface flags [13:30:55] xe-7/0/19.0 294912 DN tagged [13:30:55] public1-c-eqiad 1003 294912 Discarding tagged [13:30:55] private1-c-eqiad 1019 294912 Discarding untagged [13:30:57] should work now [13:32:17] mark: yey <3 [13:32:19] mark: Jul 11 13:32:06 install1002 dhcpd: DHCPDISCOVER from f4:e9:d4:db:30:20 via 10.64.32.2: network 10.64.32.0/22: no free leases [13:32:29] the expected mac showing in install1002 [13:32:33] good [13:32:43] right there's currently 6 active LVSes in eqiad, in traditional setup of a pair of hosts per traffic class, but the pairs are 1+4, 2+5, and 16+6 (1, 2, 4, 5, 6, 16) [13:33:24] right [13:33:34] and lvs1010-1012 are probably that batch we never got to work [13:33:40] but i thought i'd better check before assuming ;) [13:33:59] yeah, that batch runs from 1007-1012, I honestly don't recall the various decom/remove states of them all, but none are in active service [13:34:52] we swapped in 1016 for 1003 early just to alleviate practical issues with the traffic level on 1003's interface. [13:35:44] the eventual (well, near-term eventual) layout when all the new ones are online will be 1013=high-traffic1 1014=high-traffic2 1015=low-traffic 1016=secondary-for-all-the-others [13:35:53] yeah nice [13:36:29] yeah valentin has me blocked on his pybal reviews, so I might as well fix his networking blockers eh... [13:36:33] ;p [13:36:36] :) [13:39:47] mark: lvs1015 it's already being imaged, thx <3 [13:39:52] yw [13:40:33] bblack: btw, same as in lvs1016, bios updates were required for the 10G ethernets, and changing MSI-X settings in their BIOS as well [13:46:30] yeah I'm guessing they'll all need it [13:46:59] as long as we get these 4 working in under a year, we'll be way ahead of the previous eqiad lvs replacement attempt :) [13:47:16] that wasn't even the worst [13:47:24] so i'm now decommisioning those "osm" servers in esams [13:47:28] heh [13:47:28] the ones CT bought [13:47:30] 45 days to cable each one... [13:47:36] half a year .P [13:47:37] never even got racked [13:47:48] so I did repurpose a few of them, they are lvs300x now ;) [13:47:59] and used one for spare parts as well [14:07:54] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Vgutierrez) From lldpcli everything looks good: ```name=lldpcli show neighbors root@lvs1015:~# lldpcli show neighbors | egrep "Interface|PortDescr" Interface: enp4s0f0,... [14:09:29] ema: that's why the 502s? [14:09:44] well, maybe one of the reasons for something anyways [14:10:01] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Vgutierrez) [14:10:45] bblack: right, surely a valid reason for not getting anything useful back from the misc appservers :) [14:11:33] bblack: I've just managed to repro a 502 on pinkunicorn, (using config-master.wikimedia.org as a misc domain served by pu) [14:11:40] ok [14:11:50] did one of the varnishds crash to cause the 502? [14:11:59] (that's what I was seeing I think, varnishd crash -> nginx 502) [14:12:08] yes, frontend child crash [14:12:12] nice! 
[14:12:23] yup! [14:12:43] probably something horrible to design around about vcl nmaing and uuids and loading/reloading and cold/warm-ness, etc [14:12:49] yeah [14:12:53] which will require us to add some awful timeouts or sanity-checks to reload-vcl [14:13:43] I've added a new switch to reload-vcl today so that we can start varnishd properly (luxury!) https://gerrit.wikimedia.org/r/#/c/445081/ [14:14:13] and yeah, given that the crash is on a vcl "temperature" assert I suspect we'll have to do something horrible [14:14:36] lol [14:15:38] let me rewrite that commit message for you: "Apparently the design of Varnish's VCL switching feature didn't really consider the use-case of actually using it in automated production, hence this set of hacks!" [14:16:57] heh [14:17:37] yeah I still don't really "get" the nature of the temp assert crash, but surely these changes will at least put us on better footing for narrowing it down, if they don't eliminate it categorically [14:20:05] so, the repro I had before was on a single VCL -> multiple VCL upgrade without daemon restart (simple reload) [14:20:33] now after restarting the daemon entirely (load/label/use/start) I cannot get a crash any longer [14:27:47] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) @elukey I still see flows to gerrit (2620:0:861:3:208:80:154:85) https from: 2620:0:861:108:10:64:53:26 2620:0:861:108:10:64:... [14:28:28] bblack: what's the status of dysprosium ? (it's still connected to the old asw-c-eqiad) [14:29:35] uh, I think that's the old name of cp1099 [14:29:37] let me check [14:29:57] yeah that was my original varnish test box [14:30:11] yeah [14:30:32] for extra confusion dysprosium was also used for a Ganeti instance (DMARC tests), but that one should be gone by now [14:30:32] so if you're looking at port labels, it may be that that's the live connection for the in-use cp1099, just mislabeled as the old hostname [14:30:56] ok, makes sens, will rename [14:31:35] asw-c-eqiad:xe-8/0/32.0 [14:31:40] is what cp1099 shows in lldp [14:35:17] got another repro, the vcl is indeed cold /o\ [14:35:19] available auto/cold 0 vcl-0204b9c1-93e2-48c7-ad94-9e90f24cf789 (1 label) [14:35:22] available label/warm 0 wikimedia_misc -> vcl-0204b9c1-93e2-48c7-ad94-9e90f24cf789 (1 return(vcl)) [14:35:25] active auto/warm 1 vcl-78195dfc-b731-4f1f-ab8d-eb723fedbb3e [14:36:08] requests for pinkunicorn go through fine, anything that goes through the misc vcl crashes the varnishd child [14:37:34] see `curl -v https://config-master.wikimedia.org/ --resolve config-master.wikimedia.org:208.80.154.42:443` vs normal request to pinkunicorn [14:41:47] aaand, we've got more patches to backport! https://github.com/varnishcache/varnish-cache/issues/2445 [14:45:16] 10Traffic, 10Operations, 10Pybal: Unhandled error stopping pybal: 'RunCommandMonitoringProtocol' object has no attribute 'checkCall' - https://phabricator.wikimedia.org/T157786 (10mark) 05Open>03Resolved a:03mark This has been addressed in acdd0ebf74e5dd9e06c3216b9a93063ab8e91574 [14:47:25] hmmm [14:47:54] so basically this bug is that if the VCL isn't being actively used by threads, after a while it goes cold automatically under "auto" rules? 
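(For the record, forcing a VCL out of the "auto" temperature is a one-liner over the CLI; vcl.state is a stock varnishadm command, and the VCL name below is taken from the vcl.list output above.)

    vcl.state vcl-0204b9c1-93e2-48c7-ad94-9e90f24cf789 warm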
[14:48:36] that would explain why things seems to work briefly after restart, but eventually fail with the crashing assert (since there's very little traffic triggering misc vcl use, and it's per-worker-thread, so waking it up in one won't necessarily stop the coldness happening in others) [14:49:09] right so the label stays warm as expected, the actual vcl that it points to, instead, goes cold [14:49:13] the mentioned workaround is to set an explicit warm state instead of leaving it auto [14:49:51] (we could do that on reload-vcl, but then it would never go auto-cold to discard them later, so there's some ugly follow-on effects from that strategy) [14:52:23] the other ticket they reference seems to be post-5.1 though [14:55:23] I mean the series of 4 commits in https://github.com/varnishcache/varnish-cache/issues/2432 seem to directly address this, but also they seem to indicate the bug didn't exist (at least not in the same form) prior to 5.2 [14:56:51] meeting time soon [15:49:19] 10netops, 10Operations, 10fundraising-tech-ops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (10cwdent) [15:59:07] haha [15:59:10] > Not sure it's something I can use because I think my manager will call it a hack and worry about the VCLs going cold. [15:59:14] https://github.com/varnishcache/varnish-cache/issues/2560 [15:59:50] so yeah that ticket confirms that 5.1 is affected too [16:01:09] the followup is awesome [16:01:12] That's because 5.2 is not a supported branch, so no releases are to be expected, except for security advisories. [16:03:51] basically only 4.1 is "Support", 6.0 is "Fresh", and everything is bleh [16:04:21] I don't think Fresh makes any gaurantees about the timing of a future move to the state Retired or Supported [16:04:27] http://varnish-cache.org/releases/index.html [16:05:35] regardless, given our rationales and current position, we'll have to move to 6.0 sometime [16:06:08] yeah [16:06:20] and hope it stays fresh/supported longer than 5! [16:07:44] what's really awesome is the confusion in: http://varnish-cache.org/docs/6.0/whats-new/changes-6.0.html [16:08:04] they've defined a new vcl language level "vcl 4.1;" to supersede "vcl 4.0;" and add unix domain socket support [16:08:14] which has nothing to do with varnish release 4.1 of course [16:09:32] of course [16:09:35] thank god the umem stevedore has been brought back on Solaris on 6.0 [16:12:14] it would be nice to try out unix domain sockets for nginx->varnish-fe too [16:12:59] we might still need to do a connection-per-request (to avoid the problems that we've seen before with bad/broken responses breaking a persistent connection for the next client request in line) [16:13:08] but at least they'll be cheaper/faster and not pile up TIME_WAIT [16:13:36] yeah that's gonna be interesting [16:14:51] one additional wtf: http://varnish-cache.org/docs/6.0/reference/vcl.html#versioning [16:14:54] > The version of Varnish this file belongs to supports syntax 4.0 only. [16:14:58] really? [16:25:59] 10netops, 10Operations, 10fundraising-tech-ops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (10cwdent) @ayounsi the last one I posted was incomplete, I found the problem and 1531326142 should fix it [17:07:26] 10netops, 10Operations, 10fundraising-tech-ops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (10ayounsi) Pushed. [18:21:35] bblack: do you have a minute? 
I wanted to discuss your comment to https://phabricator.wikimedia.org/T199146 about using internal endpoint [18:33:04] SMalyshev: it will take a while. I tried to respond on the ticket and realized our response is somewhat unreasonable :) [18:34:09] SMalyshev: you can at least not use webproxy though. but for the other part... we'd like to say something like "hey send that traffic to the hostname appservers-ro.discovery.wmnet instead", but then there's a lot of complications in that being a generic solution for this kind of thing. [18:34:20] bblack: ok, so we have https://phabricator.wikimedia.org/T199219 now - if you have any ideas please add there. Not urgent - since it works this way just fine (and we're not using proxy) but would be glad to hear your ideas [18:34:57] with the proxy it was my fault, I read the wrong config - Blazegraph uses proxy to do outside federation, but WDQS updater doesn't use it to talk to wikidata [18:35:10] ok [18:41:26] SMalyshev: updated the ticket with something slightly better than the above, but Stalled might still be appropriate :) [18:42:17] bblack: thanks, I'll try it and see if it works [21:02:30] 10netops, 10Cloud-Services, 10Operations: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10chasemp) >>! In T193496#4228340, @ayounsi wrote: > https://apps.db.ripe.net/db-web-ui/#/lookup?source=ripe&key=185.15.56.0%2F24AS14907&type=route created. > ``` > 185... [21:59:36] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10Nuria) [22:08:21] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10Nuria)