[06:30:16] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229 (ayounsi) NEW p:Triage→High
[06:30:32] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#11987568 (ayounsi)
[06:30:34] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11987569 (ayounsi)
[06:31:12] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11987573 (ayounsi) @BCornwall good idea! I opened {T428229}
[07:33:08] Traffic, Liberica, Prod-Kubernetes, Kubernetes, ServiceOps new (Next quarter): Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436#11987644 (MLechvien-WMF) p:Triage→Medium
[08:51:44] netops, Infrastructure-Foundations, ServiceOps new, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11987826 (cmooney) To confirm the bug is fixed in release 26.3.2: ` DHCP Release:26.3.2 Section:Resolved issues Functional area:System When using DHCP relay, a DHC...
[09:05:16] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#11987845 (cmooney)
[09:12:26] netops, Infrastructure-Foundations, SRE: Cookbook sre.network.configure-switch-interfaces failing on upgraded Juniper switch - https://phabricator.wikimedia.org/T428071#11987870 (ayounsi) As far as I understand the cookbook does `show configuration interfaces xe-0/0/41 | display json ` and not `show...
[09:32:01] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Cookbook sre.network.configure-switch-interfaces failing on upgraded Juniper switch - https://phabricator.wikimedia.org/T428071#11987916 (cmooney) Ok thanks! My bad on the command getting run. Let's see how we get on with the patch <3
[10:16:29] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11988013 (cmooney) >>! In T427393#11987566, @ayounsi wrote: > @BCornwall good idea! I opened {T428229} Nice one. I think we can...
[12:04:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Cookbook sre.network.configure-switch-interfaces failing on upgraded Juniper switch - https://phabricator.wikimedia.org/T428071#11988320 (ayounsi) a:ayounsi
[13:20:14] blblack: magru is split like that on purpose I think? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1218784
[13:46:20] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11989080 (ssingh) >>! In T414411#11986915, @BCornwall wrote: > We discussed this and the general consensus seemed to be to just decomm the server and wait for the refresh which is happening shortly...
[14:08:27] cdanis: ah yeah, thanks. I somehow lost track of that in the ticket (that's still open/low-prio about splitting transport for other cases)
[14:22:05] blblack: oh and one of the other things I noticed: a curl against kafka.wm.o c-hashed differently depending on if the hostname had a trailing / or not
[14:22:16] which might be related to the purge issue also?
[14:24:56] yeah probably
[14:25:23] either way, drmrs will get upgraded Soon and all this chash and cross-node stuff will go away
[14:25:47] we can go back to just blaming our own puppetization if node-local operations are not sequenced correctly
[14:27:12] Q1 :)
[14:27:27] I have taken notes from the discussion yesterday on the next steps
[14:27:31] we will discuss them
[14:28:30] HTTPS, Traffic, SRE, Traffic-Icebox, and 2 others: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11989226 (mikez-WMF) Hi, Just for visibility if anyone is interested: - This relevant [[ https://people.wikimedia.org/~sukhe/ech-trial-repor...
[14:30:51] because it's Friday and I'm in a mood and it's a valid side-topic: our puppet repo has grown way too complicated over time. I really feel like it's a drag on us to be operating this way for CM at this point.
[14:31:02] but fixing it falls into a refactoring-like bucket that's hard to prioritize
[14:31:40] is this a "drastically simplify Puppet" kind of idea or is this a "move things out of Puppet" kind of idea
[14:32:21] cdanis: that's just bait for bblack
[14:32:39] 🤫
[14:42:27] cdanis: I'm trying not to solutionize and just complain :)
[14:42:33] but I think both are options
[14:42:34] fair enough!
[14:43:03] in the simplify puppet bucket of nascent thoughts in my head though:
[14:43:31] * sukhe gets the chairs out
[14:43:57] 1) Properly stratify our puppet repo for dependency purposes using stages. Basically move a bunch of the common layer to a baseline stage that other more-specific config can depend on wholesale.
[14:45:25] 2) Un-entangle a bunch of the inter-module dependencies and shared code that cross team/cluster/etc boundaries. I think redundancy is better than shared complexity. If two different clusters are using haproxy in two completely different ways, it doesn't often make sense to invest in some universal abstraction layer that serves both needs :P
[14:46:48] 3) Fix all our bullshit around how IPv4 is provisioned at install time and how IPv6 is done at every level, because that's a complete mess and affects a lot of other things
[14:47:48] basically --abstraction and ++team/cluster-independence. it can still be a shared repo, but with less actual sharing :P
[14:49:45] even where something isn't currently shared cross-team/cluster/etc: a lot of times our manifests are just over-abstracted anyways. Stop thinking you can predict future use-cases: do the minimal-complexity thing for your use-case. Maybe instead of a generic configuration file template that supports all unknown futures and has 42 variables, it could've been mostly static content with like 3
[14:49:51] variable things that you're actually varying on today.
[14:51:31] for that matter: if you find yourself chaining together a whole bunch of puppet complexity to do some setup, maybe it would be better off writing an idempotent shell or python script as appropriate and just having puppet execute it. The nice thing about that is that it's CM-system agnostic and probably simpler to understand.
[14:52:26] I feel like we're just way too entrenched in endless puppet complex at this point, and a lot of it's not really essential or necessary, it's accidental and it binds us.
[14:52:43] +1
[14:53:20] I am amazed how it even works out if you look at some of the puppetization
[14:55:19] the cp hosts are particularly bad
[14:57:25] the cp hosts would be a worthy target to re-do
[15:00:04] on the non-puppet side of things: Ansible is kind of terrible in some sense vs puppet, in that it lacks so much expressive power. But in a way, that's kind of a bonus. It forces you to keep shit simpler :P
[15:00:42] (and more CM-agnostic. especially if you're careful and again, put complex "setup script" things in true scripts for execution on the hosts, not in CM logic complexity)
[15:00:52] I have an even more radical proposal
[15:00:59] or at least the outlines of one
[15:01:12] meeting time, but I'm all ears, will read
[15:01:19] also meeting time :)
[15:01:43] oh yeah
[15:24:23] blblack: I love the idea of anycasting edge->core misc traffic
[15:25:32] I think eventually we might even want to anycast a lot of text/upload traffic in general. we're just so far behind on advancing various related plans.
[15:26:28] but we do have a pair of anycasts going forward that are initially mainly for authdns. HTTP can be in there eventually, and then we can talk about moving cache_misc to an IP in that range as well and doing the cold-potato routing of it or whatever
[15:26:38] misc might be a nice first testcase for HTTP anycast anyways
[15:27:31] (we'd still do geodns too I think, but anycast might be the default in the many cases where it works well - there's a lot of engineering and research left to do on all things related yet)
[15:28:26] the route fluctuations we have seen in our current anycast setup even for DNS and wikidough do worry me about the HTTP usecase though.
[15:28:40] yeah there's tradeoffs for sure
[15:29:08] you're trading some fluctuation-RSTs vs much faster failover in some scenarios
[15:29:22] but the more radical ideas earlier were basically along the lines of re-doing the whole "cache node" packaging and deployment as a bunch of podman k8s quadlet files, with a container for each reverse proxy and for whichever other supporting logging/purging/healthchecking/management daemons, all run in a network namespace together (which might even be the host's)
[15:29:30] and some of that fluctuation stuff can be route-engineered away, if we put more effort into it (which we'd have to before moving text/upload)
[15:30:15] like, why shouldn't it be possible to test out a bunch of purging ordering or config deployment scenarios in a CI pipeline?
[15:30:34] in a world where we have edge nodes globally-synced on TFO keys and TLS STEKs, and have QUIC, fluctuations may matter a whole lot less, too.
[15:31:15] NEL gives us a way to evaluate those fluctuation-RSTs for Chromium, at least, fwiw
[15:31:19] yeah I guess. if we do do QUIC, hopefully a lot of Traffic would be using that over TCP anyway
[15:31:21] yeah
[15:32:07] there's all kinds of things we should be advancing towards in this space, and figuring it out as we go. we just haven't had priority on related things in a long time, and each step in the process is a small thing that's hard to justify
[15:32:45] but the recent moves forward sukhe has engineered on our dual anycasts for authdns is a positive step! :)
[15:34:37] cdanis: yeah I tend to agree, mostly. Even outside the world of podman k8s quadlet files, we're far too integrated with the host hardware+software.
[15:34:49] (in the cache stack)
[15:34:54] yeah +1
[15:35:06] we don't want to lose the ability to optimize for the big hardware we deploy, either, but we can have both
[15:35:33] and given the k8s-ification around here, podman as a vehicle makes sense
[15:35:59] we've never had real integration tests on cache behavior, either
[15:36:11] yeah
[15:36:14] and, like, you could get that nearly "for free" while developing this
[15:36:29] it's also kind of neat thinking about being able to pass in an interface (which might be a ipip or something!) into a network namespace, as an API for "here's what you're listening on publicly"
[15:37:09] this kinda blends with some thinking I've had for a long time on this front. Which is that: in our ideal world, what we really want is one daemon that succinctly does the job of haproxy+varnish+ATS. We hack together 3 daemons out of necessity, not choice. The more we can make that stack act and feel like one piece of integrated software, the better.
[15:37:14] ++++
[15:37:56] this also dovetails nicely into future possible plans around minipop nodes (whether hardware or virtual)
[15:38:04] yeah!
[15:38:36] I see it as hoisting a lot of complexity closer to where it 'should' be anyway, while also making the prereqs on the execution environment simpler
[15:39:09] but right now, it's puppetization that's doing some of that integration job, poorly. Where some bit of abstract yaml translates into related configuration fragments for 2-3 different daemons.
[15:39:23] whatever we move to, we have to have a solution that doesn't devolve too much on that front
[15:39:59] at least, if it does, it will have much better automated testability 😅
[15:40:04] but yeah, agreed
[15:40:59] either that or we start really faking like it's one piece of software, and define our own configuration file for the "cache_stack", and some script on the host which consumes that config and spits out relevant VCL/lua/etc....
[15:41:30] but that potentially gets costly and too strange and custom in its own way. it may just be a terrible idea.
[15:41:47] we can call them "InitContainers"
[15:41:49] ;)
[15:41:55] :P
[22:21:53] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11990446 (BCornwall)
[22:22:22] Traffic, Infrastructure-Foundations: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229#11990447 (BCornwall)
[22:23:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11990448 (BCornwall) Open→Resolved