[07:57:41] kormat: I just completed the reimage of all 78 hadoop worker nodes to Buster, and I preserved some mental sanity only because of your work on reuse-parts.cfg. Thanks <3
[09:06:55] elukey: that sounds like a good return on my sanity investment when writing it. you're welcome 💜
[11:51:31] elukey: is ml-ctrl your machine?
[11:52:00] I am running one of netbox's scripts via the decomm cookbook
[11:52:03] and it popped up
[11:52:05] effie: it is part of the new ML k8s cluster yes
[11:52:18] ah Tobias is working on it makes sense
[11:52:26] klausman: --^
[11:52:39] effie: mm what records?
[11:53:01] I am confused, we are adding .svc ones but I thought it was not automated via netbox
[11:53:10] ah right it is a no-op, now I remember
[11:53:18] effie: yes please go ahead, thanks :)
[13:06:39] yep, T263429
[13:06:39] T263429: Netbox support for svc allocation - https://phabricator.wikimedia.org/T263429
[13:07:02] sorry, wrong one: T270071
[13:07:02] T270071: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071
[13:19:06] those tickets touch on some really deep issues in how we handle truthiness :)
[13:19:21] in ways that maybe aren't explicitly acknowledged there, but maybe are part of discussion elsewhere?
[13:19:37] and by that I mean:
[13:20:40] foo.svc.(eqiad|codfw).wmnet <-> $ipaddress mappings don't exist in a vacuum. They're also tightly bound in most cases to the definitions of foo.discovery.wmnet.
[13:21:11] but the hostname for that in static DNS is basically "magic" (not greppable), and the runtime definition of it comes from puppet rather than the dns repo
[13:21:41] (hieradata/common/service.yaml drives that with the per-service-per-dc IPs there)
[13:22:04] and then there's the extra bit that commonly trips us up around ops/dns CI issues
[13:22:37] that all of those service definitions are also mocked for CI in the ops/dns repo under utils/mock_etc/
[13:23:31] and there's this sequencing that has to happen for CI to work correctly, where you have to create+deploy new things between ops/puppet + ops/dns in a specific sequence, or else either some deployment step will fail, or worse, CI will lie to you and say things are fine when they're not.
[13:24:45] ignoring the netbox angle for a moment and just diving a little into the rest of it as it exists today:
[13:25:04] the ops/dns + ops/puppet split on these things (and a few others that were less important) bugged me a lot in the past
[13:25:37] I did in the past propose an idea where it all came from the puppet repo, basically, so that we didn't have this odd data split
[13:26:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/342887
[13:26:18] it's abandoned of course
[13:26:37] because that also wasn't a great tradeoff, for all the reasons we didn't want crucial DNS stuff depending on puppet in an outage, etc
[13:27:03] and some stuff at the time (pre-netbox-dns-integration) about how our eventual new single source of truth might be a better path, I think :)
[13:30:35] I'm not sure, off the top of my head, what this should all mean in today's world, where we now have netbox<->dns integration and some better-formed ideas about sources of truth
[13:30:45] but I thought it might be interesting context! :)
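To make the svc <-> discovery binding concrete: a minimal sketch of a consistency check, assuming dnspython is available and the query runs inside the production network (these names are WMF-internal). "wikifeeds" is just the example service discussed later in the conversation; this is an illustration, not actual WMF tooling.

```python
import dns.resolver


def addrs(name: str) -> set[str]:
    """Return the set of A-record addresses for a name."""
    return {rr.address for rr in dns.resolver.resolve(name, "A")}


svc = addrs("wikifeeds.svc.eqiad.wmnet") | addrs("wikifeeds.svc.codfw.wmnet")
disc = addrs("wikifeeds.discovery.wmnet")

# If the two layers of truth are in sync, the discovery name can only ever
# answer with some subset of the per-DC svc IPs.
assert disc <= svc, f"discovery {disc} not within svc {svc}"
```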
[13:37:06] but looping back to the abstract view, the most-abstract truth about these entries is not a set of arbitrary, disconnected data like: "foo.svc.eqiad.wmnet => 192.0.2.1"
[13:37:58] it's "fooservice has addresses 192.0.2.1 in eqiad and 192.0.2.2 in codfw, as part of an active-active|active-passive x-dc meta-cluster", and it's from that level of truth that all other things flow.
[13:40:07] hieradata/common/service.yaml encodes this currently at the right level of abstraction
[13:40:14] e.g. the "wikifeeds" entry there has, among other data:
[13:40:24] ip:
[13:40:24]   codfw:
[13:40:24]     default: 10.2.1.47
[13:40:25]   eqiad:
[13:40:25]     default: 10.2.2.47
[13:40:27] +
[13:40:33] discovery:
[13:40:33]   - dnsdisc: wikifeeds
[13:40:33]     active_active: true
[13:40:58] so in the logical sense, it's the real origin of the truth
[13:42:02] (which we then copy manually to some other places, today)
[13:42:43] we have a plan (and also a patch pending for a while from jo.hn) to have a netbox -> puppet integration, so that we could use some data from netbox directly in puppet (similar to the netbox -> dns integration)
[13:43:09] I'm wondering if maybe we could add this to the use cases to support and have 10.2.1.47/10.2.2.47 come from netbox maybe
[13:44:14] having 'wikifeeds.svc.{dc}.wmnet' as keys hardcoded here
[13:48:29] so basically, replace the IP addresses in that stanza in puppet with references that are filled in from netbox's data on that address (rather than a runtime DNS lookup)
[13:49:10] which does sort of solve the "truth" issue for the two addresses, between netbox+puppet. and then of course also netbox->dns for the per-dc entries.
[13:49:12] yes, and the netbox data will be locally cached on the puppetmasters so no direct dependency on netbox, like the dns-integration
[13:49:43] but then we're still left with the ugly third leg of this between puppet<->ops/dns, and the CI mess there
[13:49:50] (for the discovery entry)
[13:50:37] in theory, the "clean" solution would be to have netbox know about the meta-structure, and also deploy, from the same truth, the ops/dns bits about wikifeeds.discovery.wmnet.
[13:50:46] but I don't know that this level of metadata maps well to netbox data
[13:51:45] which part are you thinking of, the "wikifeeds 300/10 IN DYNA geoip!disc-wikifeeds" and/or the "disc-wikifeeds => { map => mock, dcmap => { mock => 192.0.2.1 } }" parts?
[13:58:38] bblack: ^
[13:59:05] I guess it depends on how we'd integrate those things
[13:59:10] but maybe more a model like:
[13:59:32] (a) netbox auto-generates the first part with the DYNA record + (b) CI pulls netbox data so that it doesn't need a mock setup for testing?
[13:59:53] (b) is already in place
[13:59:59] yeah, for zonefiles
[14:00:02] yes
[14:00:10] the part that's being mocked is a little different, it's a config file
[14:00:17] right
[14:00:36] which currently goes puppet -> confd template -> confd creates the real file that is being mocked there, and the real file has the IPs from netbox.
[14:01:14] lol
[14:01:38] or we can just document the setup on wikitech and draw a dependency diagram that looks like a bird's nest and call it done! :)
[14:02:22] this sounds promising too
[14:02:29] but yeah, the point is taken. even if netbox had all the truths, it doesn't have the confd-templating part
[14:03:25] putting on my other other ${hat}:
[14:03:48] the current plans for upstream gdnsd will eventually make this setup simpler for these purposes too, and maybe if we wait a year we'll have that deployed here.
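A rough sketch of what "service.yaml is the real origin of the truth" means in practice: from one stanza you can mechanically derive both the per-DC svc records (the netbox/zonefile side) and the discovery metadata (the ops/dns + confd side) that today get copied by hand. The top-level key name is an assumption about the file's layout; the stanza shape is taken from the wikifeeds paste above. Not WMF code, just an illustration of the data flow.

```python
import yaml

with open("hieradata/common/service.yaml") as f:
    data = yaml.safe_load(f)

# Assumption: services may sit under a single top-level hiera key;
# fall back to treating the whole file as the service map.
services = data.get("service::catalog", data)

for name, svc in services.items():
    # per-DC svc names/IPs - what netbox->dns materializes per zonefile
    for dc, ips in svc.get("ip", {}).items():
        print(f"{name}.svc.{dc}.wmnet -> {ips['default']}")
    # discovery metadata - what currently crosses into ops/dns via CI hacks
    for disc in svc.get("discovery", []):
        mode = "active-active" if disc.get("active_active") else "active-passive"
        print(f"{disc['dnsdisc']}.discovery.wmnet ({mode})")
```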
[14:04:26] the syntax is still being worked out, but the intent is that the relevant config would be *in* the zonefile in some syntax, like e.g. (logically speaking, but random syntax off the top of my head):
[14:05:19] discovery.wmnet zonefile: wikifeeds 300 IN DYN:A { type => mapped, mapname => wmf_geoip, dcs => { eqiad => 192.0.2.1, codfw => 192.0.2.2 } }
[14:06:08] there would still be another part where we have to feed etcd into some system that manages the map and the up/down states, but the IPs wouldn't be in that part
[14:06:48] here eqiad/codfw will have the real IPs 10.2.1.47/10.2.2.47 or still the mocked ones?
[14:06:55] yeah the real IPs
[14:08:09] the goal is to get rid of most of the config-level stuff about this, and also to push most of the complex/customizable bits out of the daemon, and let people implement the hard parts about geoip mapping and state-management, etc in python or whatever $random_external_tool
[14:08:37] yep
[14:08:58] so that we can iterate faster and better on features there (e.g. how we calculate distances, integrating our own latency data, etc, etc)
[14:09:46] the external tool manages everything that would today be based on GeoIP data files and the like, and also manages any state-stuff (like the admin_state stuff we do today) at its own level
[14:10:04] it just generates a mapfile, and re-generates it on state changes, and tells gdnsd explicitly when to reload the mapfile
[14:10:37] but a mapfile is much easier to generically mock, too
[14:12:10] ack
[14:12:37] anyways, way off topic. but I do think the tooling will get there $soon, in the next WMF-FY anyways.
[14:12:49] so we could wait and solve this when it's easier
[14:14:40] (basically as soon as dnssec is "done", I'll start working on that stuff)
[14:14:53] other options, if they help to reduce confusion, could be to either not generate the svc files from the netbox->dns automation OR remove the IPs from netbox and use the manual dns file as the source of truth for now, as before. A small step back into the past, just for those 2 zonefiles. I'm open to suggestions
[14:15:48] well, the upside for now is that accounting for them in netbox prevents accidents
[14:15:51] keeping in mind that discovery is not the only use case; as the task highlights, there are some other corner cases that are hard to solve in the very simple netbox way of mapping dns
[14:16:59] doing it on the path you're on now still makes logical sense, there's just a couple layers of truth involved
[14:18:10] netbox has the per-dc svc hostname->IP, and puppet hieradata has the bigger logical picture which consumes those IPs and generates discovery. The only real issue is that "generates discovery" involves these CI hacks between ops/dns+ops/puppet just to move what really is DNS metadata over to ops/dns, which ironically netbox also pushes things towards in other ways for other reasons.
[14:19:07] but there's not a better or easier solution I can think of at the moment
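A hand-wavy sketch of the division of labour described above: the external tool owns the mapping and admin_state logic, emits a plain mapfile, and explicitly pokes the daemon to reload it. The mapfile format and the control command are pure assumption here, since (per the conversation) the real gdnsd interface doesn't exist yet.

```python
import os
import subprocess
import tempfile


def write_mapfile(path: str, dc_up: dict[str, bool]) -> None:
    """Point each DC at itself while healthy, else at a surviving DC.

    Assumes at least one DC is up; format is invented for illustration.
    """
    healthy = [dc for dc, up in dc_up.items() if up]
    with tempfile.NamedTemporaryFile(
        "w", dir=os.path.dirname(path), delete=False
    ) as tmp:
        for dc, up in dc_up.items():
            tmp.write(f"{dc} => {dc if up else healthy[0]}\n")
    os.replace(tmp.name, path)  # atomic swap: the daemon never sees a partial file


# e.g. on an etcd/admin_state change reported by whatever watches state:
write_mapfile("/var/lib/gdnsd/wmf_geoip.map", {"eqiad": True, "codfw": False})

# hypothetical control command - the explicit-reload mechanism is future work
subprocess.run(["gdnsdctl", "reload-maps"], check=True)
```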
[18:10:11] alright, I've been hunting this for an hour and I can't figure it out - on a new host, scap is defaulting to using deploy1001. I've purged literally every mention of deploy1001 from the host (it's still a config default in the scap package, I noticed) but it's still trying. Where do the .config files come from?
[18:10:57] they're generated from *somewhere* external to the host I'm guessing, removing the cached ones doesn't stop the issue from happening
[18:11:48] by .config files I mean /srv/deployment/REPONAME/deploy-cache/.config files
[18:15:11] <_joe_> hnowlan: is this for restbase?
[18:15:23] <_joe_> so scap3
[18:16:10] <_joe_> I'm off, but mutante or thcipriani probably remember the answer
[18:16:59] <_joe_> tyler's not here, so maybe ask in the private channel or on #-releng too
[18:17:17] hnowlan: when we replaced tin with deploy1001, sometimes we had to edit this: /srv/deployment/phabricator/deployment-cache/.config
[18:17:33] but that seems to be what you already did
[18:17:53] it's in the repos themselves on new hosts
[18:18:49] mutante: heh, I saw your SAL entry for that :D
[18:18:50] tin.eqiad.wmnet would keep showing up there even on new hosts that were created after tin was long gone
[18:18:55] that's what I know
[18:19:01] and that the fix was to just edit that file
[18:32:58] it's from a file called DEPLOY_HEAD on the deploy server in each repo - 47 repos are affected
[18:35:57] scap generates them on a regular deploy - right, that's close enough to an answer. I'll do a sed tomorrow morning to fix it
[18:38:22] hnowlan: aha! similar but not exactly the same. great!
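The fix hnowlan describes, sketched in Python rather than the sed one-liner: rewrite the stale deploy server name in each repo's DEPLOY_HEAD on the deploy host. The exact location of DEPLOY_HEAD inside each repo's tree is an assumption (hence the wide rglob), the stored hostname form is assumed to be the FQDN, and the replacement host is a placeholder the chat never names. Illustrative only, not the command that was actually run.

```python
from pathlib import Path

OLD = "deploy1001.eqiad.wmnet"   # stale host still referenced by scap
NEW = "deployXXXX.example.wmnet"  # placeholder - the chat doesn't name the new host

# Walk every repo under /srv/deployment looking for DEPLOY_HEAD files;
# per the chat, 47 repos were affected.
for head in Path("/srv/deployment").rglob("DEPLOY_HEAD"):
    text = head.read_text()
    if OLD in text:
        head.write_text(text.replace(OLD, NEW))
        print(f"rewrote {head}")
```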