[07:28:56] https://github.com/microsoft/ethr
[08:48:14] nice! I was about to post https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/
[08:48:32] it is not a new one, but I've read it this morning, really nice
[08:49:55] elukey: also see https://twitter.com/HenryR/status/1333336495260745729
[08:51:03] will do thanks!
[08:52:04] I was wondering about a similar use case for our etcd nodes (the conf[12]xxx), together with events like pybal restarting etc..
[08:54:36] (the twitter posts are a little pedantic, as the title announces :D)
[09:02:46] paravoid: https://twitter.com/heidiann360/status/1332711011451867139 is also interesting, didn't have the time to check it though
[09:05:20] and https://hh360.user.srcf.net/blog/2020/11/fast-flexible-paxos/
[09:13:40] elukey: they are pedantic, and writing what should be an article on twitter with one tweet per paragraph shows some sort of deep issue with the assumption that progress is a thing
[09:14:10] and good morning
[09:17:38] :D
[14:01:47] XioNoX: kind of funny that it won't do UDP traceroute... nice looking tool though
[14:23:46] yeah but the depths of how byzantine a failure mode can be are difficult to plumb. You can't algorithm your way out of a design you only think is redundant, but really isn't :)
[14:24:40] especially when working with the lower layers of the complex stack of architecture stuff, it pays to pay attention to the true isolation of redundant components of the system
[14:30:13] as I understand it from a morning read of their blog post, there were two key architectural problems they could address: their network-switch redundancy didn't really offer true redundancy (either ditch it entirely in favor of the simplification of host-level redundancy in different parts of the network and accepting occasional rack-switch outages, or make them a truly independent pair of networks;
[14:30:19] I'd choose the former), and then the other was the apparent load reliance on some replica that the designers clearly didn't consider critical in a crisis (this is sorta like when we were saying maps load needed both DCs' servers - you can't decide something is just a redundancy aid, and then rely on it for production load-handling).
[14:30:36] saying "we need to upgrade etcd to handle byzantine faults" is sorta sidestepping all this in a grossly complex way
[14:34:55] glancing at some of the tweet-blog-response (ewww), it sounds like that person is making a similar point, perhaps in a much more intelligent and nuanced way, plus the additional point that the failure they observed doesn't require the full definition of "byzantine"
[15:07:48] bblack: morning :) completely unrelated, but I left a comment in https://phabricator.wikimedia.org/T169765#6668752 about pybal in esams and conf1006. Can you shed some light on it, just to understand what we'd need to do to move conf1006 to another rack safely?
[15:18:46] are we planning to do so, or is this a hypothetical?
[15:18:52] elukey: ^
[15:19:18] ah re-reading, I see it's real
[15:19:41] bblack: yes, not a huge priority, we are defragging eqiad basically :D
[15:20:07] I wasn't involved in debugging the original issues. I think the design-level problem is known and understood (we know the way things work now is not how it should be), so we don't need to rehash how to make things truly better.
[15:20:46] so what remains is just "how do we get through this un-ideal situation", and I imagine the best answer is to reconfigure the affected pybals to use a different etcd server and restart them, then move 1006, then revert.
[15:20:59] (those pybal config changes requiring a sequence of restarts, like when we add new services in a DC to LVS)
[15:21:23] but I'd defer to the etcd experts on that :)
[15:21:56] hieradata/role/esams/lvs/balancer.yaml:profile::pybal::config_host: conf1006.eqiad.wmnet
[15:22:01] ^ seems to be what you'd mess with
[15:24:25] my deference is because, for all I know, there are some more-subtle etcd considerations here. Like, too many LVSes pointed at one etcd server hits some limit or causes some problem. Or the data is somehow sharded and at least the other servers can't do the job for esams. Or similar things.
[15:24:26] bblack: yep yep, I had a chat with Alex and this seemed to be a good path forward, but it seems not ideal in general.. For example, if conf1006's top of rack switch goes down (it happened for conf2003 some weeks ago) it would be nice if pybal could automatically fetch from another conf100x remaining in the cluster
[15:24:36] ah okok
[15:24:52] yeah, that's the known design-level problem you're referring to.
[15:25:07] but solving that is probably out of scope for your immediate work on moving an etcd machine
[15:25:45] yep yep, but I wanted to know how to deal with these scenarios in case some fires happen, and if we have a long term plan etc.. :)
[15:26:55] I imagine if fires happen, we'd emergency-reconfigure to another etcd server and restart esams pybals. Which is basically what we're guessing is appropriate for the move. We just have the luxury of time in this case, and even then I have to put asterisks on the plan because I'm not personally 100% sure there aren't caveats at the etcd level.
[15:27:00] perhaps we should document it! :)
[15:27:20] (or maybe it already is and we didn't find the doc)
[15:28:25] the long-term plan is a much trickier question. Clearly if our L4LB solution remains largely as-is (the current ipvs + pybal solution with roughly the feature set and capability and config it has today), we should put some priority on fixing how it uses etcd by making some pybal code changes.
[15:29:21] there's a design doc forthcoming as part of an OKR for this quarter (so, Soon?) about long-term L4LB plans, whose scope is big enough to contemplate moving to a completely-different solution if warranted.
[15:29:43] so I think it makes sense to see how that plays out before we dig into any further heavy pybal changes from our backlog of defect tickets about how it operates today :)
[15:32:07] sure, thanks for the explanation :)
[15:40:09] hi all, would appreciate a second set of eyes on this change https://gerrit.wikimedia.org/r/c/operations/puppet/+/645351/1/modules/cfssl/manifests/ocsp.pp. Both PCC and the puppet master seem to get stuck while trying to compile. No error message or stack trace, the process just seems to hang forever and I can't see any obvious issues
[18:05:00] herron: o/ are you around?
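To make the pybal/etcd plan discussed above (15:20-15:26) concrete, here is a minimal sketch of the temporary hiera change it implies, based on the key quoted at 15:21:56. The replacement hostname conf1004 is a hypothetical stand-in for whichever other conf100x node would actually be chosen; this is only an illustration of the approach, not the reviewed change itself.

    # hieradata/role/esams/lvs/balancer.yaml -- current value (quoted above)
    profile::pybal::config_host: conf1006.eqiad.wmnet

    # temporary value during the move; conf1004 is a hypothetical example
    # of "another conf100x remaining in the cluster"
    profile::pybal::config_host: conf1004.eqiad.wmnet

Per the plan above, the sequence would be: merge the temporary value, roll-restart the esams pybals, move conf1006 to its new rack, then revert the key and restart again.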
[18:08:59] or shdubsh :)
[18:09:19] hey
[18:10:22] o/
[18:10:43] if you have a minute, I'd ask for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/645398
[18:11:05] we are setting up a kafka test cluster, but the brokers by default are set to page if they go down
[18:11:20] * shdubsh 👀
[18:11:29] yeah all of them :D
[18:12:14] so Razzi is making this explicit in hiera, buuut the logstash nodes have the role+include setting in site.pp, so I just want to make sure that we are fixing it in the right config files
[18:18:13] pcc looks like it's doing the right thing. there's only one set of kafka brokers in eqiad and codfw
[18:18:57] shdubsh: yes, but I was wondering about the role::kafka::logging yaml in hiera, should that be kept in sync?
[18:21:13] it probably should, yeah
[18:22:50] ack then, thanks :)
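For context on the kafka review above: the change under discussion makes "these brokers should not page" explicit in hiera for the test cluster. A minimal sketch of that kind of override is below; the key name is hypothetical (the real key lives in the gerrit change, not quoted in this log) and is only meant to illustrate the intent.

    # hieradata for the kafka test cluster (hypothetical key name,
    # illustrating the idea of downgrading broker-down alerts from
    # paging to non-paging for the test cluster only)
    profile::kafka::broker::monitoring::is_critical: false

The production clusters would keep the default behaviour of paging when a broker goes down, and the open question at 18:18:57 is whether the equivalent role::kafka::logging hiera should carry the same explicit setting.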