[07:30:14] cloudvirt2004-dev.codfw1dev is unhappy :/, memory corruption issues, I'll try to drain it
[07:40:39] please make it happy :/
[07:44:09] we are getting new memory, but will take some time (https://www.youtube.com/watch?v=VkUi4qdZStQ&t=165s)
[07:49:42] I liked that short film more in spanish xd
[07:57:54] cute :)
[08:05:42] this is the ticket, right? T374467
[08:05:43] T374467: 2024-09-10: hardware error on cloudvirt2004-dev - https://phabricator.wikimedia.org/T374467
[08:13:56] dcaro: do you think we have some kind of tools k8s outage?
[08:14:46] not really, only one user reporting issues
[08:15:24] I don't understand yet why kyverno is complaining about the generated pods though
[08:15:36] and the slow responses
[08:16:35] that field should have been injected by 1# the jobs-api, and 2# by the mutation rule in the kyverno policy
[08:16:39] no?
[08:17:05] that's what I thought yes
[08:17:26] it's a warning though, so it should not prevent it from running, right?
[08:17:40] functional tests are passing ok so far
[08:17:50] I think it is a hard fail at this point
[08:18:26] misleading warning message then
[08:18:27] :/
[08:19:49] runAsGroup contains an id
[08:20:19] I wonder if what happened is that somehow jobs-api failed to fill the field with the right value
[08:20:23] maybe an LDAP problem?
[08:20:56] oh
[08:21:00] this cronjob is extremely old
[08:21:12] age 322d
[08:21:14] I was thinking the same, probably recreating the job is enough
[08:21:26] we should push the job migration plan
[08:21:48] this means the job template wasn't created by a recent version of jobs-api, and therefore we are relying on kyverno injecting the value
[08:21:56] job migration plan?
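[editor's note] The injection discussed above (jobs-api or the kyverno mutation rule filling in `runAsGroup`) amounts to a defaulting step over the pod spec's `securityContext`. A minimal sketch in Python, not the actual jobs-api or kyverno code; the group id used below is a made-up example:

```python
def default_run_as_group(pod_spec: dict, group_id: int) -> dict:
    """Mimic a mutation rule: inject securityContext.runAsGroup
    only when the job template did not already set it."""
    ctx = pod_spec.setdefault("securityContext", {})
    # add-if-absent, like a kyverno mutate rule with an add anchor
    ctx.setdefault("runAsGroup", group_id)
    return pod_spec

# An old job template (created before jobs-api filled the field):
old = {"containers": [{"name": "task"}]}
default_run_as_group(old, 52771)
assert old["securityContext"]["runAsGroup"] == 52771

# A recent template already carries the value; it is left untouched:
new = {"securityContext": {"runAsGroup": 1234}, "containers": []}
default_run_as_group(new, 52771)
assert new["securityContext"]["runAsGroup"] == 1234
```

If neither layer fills the field (old template plus a failing mutation), the pod spec ends up without `runAsGroup` at all, which matches the hard-fail seen above.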
[08:22:19] yep, migrating existing jobs to the newer versions, instead of relying on users recreating the jobs eventually
[08:22:39] ok, I was unaware of that initiative
[08:22:43] so we can then only handle the latest job version, instead of having to handle all of them
[08:23:00] I suggested it a few times already in several venues
[08:23:34] ack
[08:24:06] sounds like a fun project anyway :-)
[08:24:59] it goes hand in hand with having the business models for jobs-api saved in their own structures
[08:25:35] (so instead of having to drift from version X to version X+N, you can just generate version X+N from the business model)
[08:51:36] I had created a task yep T359649
[08:51:36] T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version - https://phabricator.wikimedia.org/T359649
[08:52:28] I had commented 🤦‍♂️
[09:17:28] arturo: about T374513, do you remember if andre.w found anything?
[09:17:29] T374513: Lint problems for NeutronAgentDownForLong and NeutronAgentDown - https://phabricator.wikimedia.org/T374513
[09:17:40] 👀
[09:18:29] I think he briefly mentioned a policy change in the neutron API or similar?
[09:18:40] or a role change?
[09:22:12] makes sense, will have to investigate
[09:41:50] dhinus: are you around today?
[09:42:06] arturo: I am :)
[09:42:24] I would like your help/advice with https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/40
[09:43:23] first some context will be needed, do you prefer IRC or videochat?
[09:44:03] let's try IRC and switch to meet if it gets too long
[09:44:36] so the main reason why I'm working on migrating DNS stuff to tofu is to keep track of all the special DNS records we have in the system
[09:44:53] some of them are in the MR already, as examples.
I "cherry picked" them for early import
[09:45:05] but I don't plan to import them all at the same time
[09:45:19] partially because I don't even know all of them
[09:45:49] tracking via tofu is important for the reasons we all know, but also because of requests like this:
[09:46:02] T374278
[09:46:03] T374278: Update wmcloud.org MX records - https://phabricator.wikimedia.org/T374278
[09:46:18] (which is included in the MR)
[09:47:25] now, it turns out, the logic to import records into tofu has a bug
[09:47:26] :-(
[09:48:00] I think I have identified the bug, and I sent a patch upstream: https://github.com/terraform-provider-openstack/terraform-provider-openstack/pull/1778
[09:48:52] but I see there is some PR backlog in the repo and I'm not expecting it to be merged soon
[09:49:07] I don't want to block our own opentofu DNS work on this upstream bug
[09:49:15] so (and I'm getting to the actual question)
[09:49:19] I would like to:
[09:49:36] 1# briefly drop the records from the openstack DB, by hand
[09:49:45] 2# then run tofu to re-create them
[09:49:54] basically, forget about imports until the bug is fixed
[09:50:00] wdyt?
[09:50:12] sounds good to me, I'm trying to understand the bug...
[09:50:37] if you specify the zone_id, shouldn't it be enough for tofu to use the correct project?
[09:51:06] no :-(
[09:51:16] when tofu tries to import, it runs a request to the openstack API that lacks the project_id, and therefore the special header
[09:51:31] `X-Auth-Sudo-Tenant-ID`
[09:51:47] which makes designate reply that there is no such record
[09:52:06] the special header is only crafted into the request if the project_id is set in the recordset resource
[09:52:17] but potentially it could auto-fill the special header if the zone_id is assigned to a specific project?
[09:52:43] i.e. without having to pass "openstack_dns_recordset_v2.recordset_1 recordset_id/zone_id/project_id"?
[09:53:18] the project_id of the zone is not available anywhere on the recordset resource if not explicitly set
[09:53:43] but the zone_id is. I'm trying to understand if the openstack API lets you create records in multiple projects with the same zone_id
[09:53:56] or if the zone_id is tied to project X, and all recordsets must be tied to the same project
[09:54:23] moreover, the special header will only be included if an explicit project_id is set as an attribute in the resource, see here
[09:54:23] https://github.com/terraform-provider-openstack/terraform-provider-openstack/blob/main/openstack/dns_zone_v2.go#L61
[09:55:44] that's for dns_zone_v2 though, not for dns_recordset_v2
[09:55:54] they all call this function
[09:56:15] see here https://github.com/terraform-provider-openstack/terraform-provider-openstack/blob/main/openstack/resource_openstack_dns_recordset_v2.go#L108
[09:56:31] I see
[09:57:46] so my theory is that recordset creation works as long as you have project_id in the resource definition
[09:58:09] and recordset import won't ever work because there is no way to set the project_id unless the patch I'm proposing lands
[09:58:13] yep, it's probably something they forgot to add to the import logic
[09:58:39] we may find later that recordset creation _also_ doesn't work :-P
[09:58:47] I haven't run a tofu apply yet with this code
[09:59:13] maybe the patch could be modified so that it somehow finds out the correct behaviour automatically based on the zone_id, but let's see what the upstream maintainers say
[09:59:32] I think your plan of deleting and recreating without the import is good
[09:59:36] let's try that on a single record
[09:59:45] ok
[09:59:46] if that works, we can do the same for the rest
[10:00:04] sounds reasonable
[10:05:08] I'll try with one record in each deployment
[10:05:13] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/40
[10:06:06] * dcaro lunch
[10:06:26] codfw1 should not have any more puppet problems
[10:06:33] let me know if you see any
[10:06:35] dcaro: ok, thanks
[10:06:36] arturo: the plan looks good
[10:06:45] have you deleted that record manually?
[10:06:59] dhinus: not yet
[10:07:20] so it's not detecting that it already exists?
[10:07:26] I guess because you can create duplicate records
[10:07:35] they will just have different uuids
[10:08:30] because of how DNS works, you can have duplicate records without violating any functional constraint
[10:08:44] so I guess designate just allows it
[10:08:46] hmm true
[10:08:58] does it mean that we could run apply first and delete the duplicate later?
[10:09:08] yes, no?
[10:09:13] in this way we would avoid that brief moment without a record
[10:09:14] let's try
[10:09:15] that's actually a good idea
[10:09:55] first, let me identify the command I would need to run to delete the original record, just in case
[10:12:38] designate auth is truly unique, in the wrong way, look:
[10:12:58] first fails unless the --all-projects arg is passed https://www.irccloud.com/pastebin/iLKpycjd/
[10:15:42] ok I've recorded the two affected records here:
[10:15:43] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/40#note_104692
[10:16:26] and to delete them it's just s/show/delete/g in the command line
[10:18:17] dhinus: ok to merge & apply?
[10:19:03] yes, go for it!
[10:19:12] ok, doing now
[10:19:21] "designate auth is truly unique, in the wrong way" is worth adding to bash :)
[10:19:35] heh
[10:20:14] .
[10:20:17] oops
[10:20:20] nm
[10:20:39] dhinus: error (-:
[10:20:47] duplicated record after all https://www.irccloud.com/pastebin/MYWKOY2c/
[10:21:07] you can tell how imaginative I was with the earlier theory of designate allowing dup records
[10:21:38] I'll remove the affected ones, then re-run apply
[10:21:41] ok
[10:22:23] worked this time!
[10:22:25] https://www.irccloud.com/pastebin/qdNujb2T/
[10:22:52] same in the other deployment!
[10:22:58] ok, we are good!
🎉
[10:23:03] is the new record identical to the one that was deleted?
[10:23:15] let me check
[10:24:46] the TTL is missing, I did not include it, hoping it would have a default
[10:25:07] other than that, it seems exactly the same
[10:25:17] well, the records are sorted in a different order
[10:25:24] compare
[10:25:32] old https://www.irccloud.com/pastebin/62LxlyaL/
[10:25:43] new https://www.irccloud.com/pastebin/MXmfTc4Y/
[10:26:17] should be fine
[10:26:48] BTW look at how the tofu ID contains the zone:
[10:26:49] module.records["codfw1dev.wmcloud.org."].openstack_dns_recordset_v2.record["ns"]: Creation complete after 1s [id=5748595a-11c6-4099-bfc2-b95b6ae67c21/2df1edc0-71f7-49e3-aff9-a51266b387b7]
[10:27:01] id= uuid / uuid
[10:27:05] yep
[10:27:32] I wonder if I would need to match that same order in my upstream patch
[10:28:35] in that output, it is id=zone/recordset
[10:31:07] maybe... then you could add project/zone/recordset
[10:31:27] ok
[10:31:35] I'm also unsure if import with just the recordset id can work, or if the openstack API requires you to always include the zone_id
[10:31:54] yes, the zone_id is in the endpoint URL actually
[10:32:15] see https://docs.openstack.org/api-ref/dns/dns-api-v2-index.html#create-recordset
[10:33:09] the codfw1dev record is in PENDING state, I suspect designate may be having a hard time
[10:33:15] ... the crocodiles
[10:33:31] the new one or the deleted on?
[10:33:33] *one
[10:33:38] the new one
[10:33:58] https://usercontent.irccloud-cdn.com/file/EGwHcT2O/image.png
[10:34:49] there's another one in PENDING that is unrelated to your MR
[10:35:14] oh yeah, tf-infra-test.codfw1dev.wmcloud.org.
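[editor's note] "the records are sorted in a different order ... should be fine" is correct: the records inside a DNS recordset carry no ordering guarantee, so an old/new comparison should treat them as a multiset. A small Python sketch with hypothetical NS-style records:

```python
from collections import Counter

def same_recordset(old_records: list[str], new_records: list[str]) -> bool:
    """Two recordsets are equivalent iff they hold the same records,
    regardless of the order the designate API returns them in."""
    return Counter(old_records) == Counter(new_records)

# the same records, reordered between the old and the new recordset
old = ["ns0.example.org.", "ns1.example.org."]
new = ["ns1.example.org.", "ns0.example.org."]
assert same_recordset(old, new)
# a genuinely missing record is still caught
assert not same_recordset(old, ["ns0.example.org."])
```

`Counter` rather than `set` keeps duplicate records (which designate allows, as seen above) from masking a difference.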
[10:35:33] definitely designate is not 100% happy somewhere in codfw1dev
[10:36:49] time to check the designate diagrams from a.ndrew :P
[10:37:07] and see which crocodile failed
[10:37:22] I need to step out for a moment, brb
[11:19:28] I have restarted designate in codfw1dev, let's see how that goes
[11:28:19] dhinus: I will add the TTL default https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/41
[11:33:11] any reason for not using a lower default?
[11:40:00] on a brief diagonal scan, I've found many records using this value
[11:40:17] it is 1h, which feels about right anyway
[11:40:23] do you think it should be lower?
[11:40:45] I would consider something like 300 or 600, but if 3600 is common in our records, I'm fine with keeping it
[11:40:56] we can discuss lowering it in the future
[11:42:09] I don't have strong opinions, it feels like an arbitrary value anyway
[11:42:14] a lower one could be very useful for things like toolsdb where we sometimes want to point a name to a new IP after a failover
[11:42:22] for most records it won't make a difference
[11:42:50] right
[11:42:54] so maybe we can just set a lower one where needed, and keep 3600 as the default
[11:43:22] or maybe I'm only looking at 'stable' records like MX and NS, which are unlikely to change
[11:43:40] so perhaps those are the ones that need overriding, and we can set a 600 default
[11:44:22] yeah I'm also unsure...
but for now I think we should concentrate on migrating to tofu keeping the same values
[11:44:28] so if most of them are 3600, that's a good default
[11:44:53] ok
[11:56:58] :-(
[11:57:00] https://www.irccloud.com/pastebin/MtqeSJv4/
[11:58:32] deleted by hand, ran tofu apply again, now everyone is happy
[12:11:06] jhathaway: when you are awake, this is the patch https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/42 for T374278, please review/approve it
[12:11:06] T374278: Update wmcloud.org MX records - https://phabricator.wikimedia.org/T374278
[13:14:21] Raymond_Ndibe: I have rebuilt the ingress and non-nfs workers on toolsbeta, now we have plenty of room for the tests
[13:38:06] thanks arturo, I don't think I have approval rights? but I did comment on the merge request
[14:01:46] thanks, will deploy it in a bit
[17:11:30] * dcaro off
[17:11:41] I'll be around later for a bit too if anything is needed, cya!
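[editor's note] The TTL defaulting settled on above (an explicit per-record value wins, everything else gets 3600) is simple enough to state as code. A Python sketch of the rule, with hypothetical record names:

```python
DEFAULT_TTL = 3600  # the 1h default agreed on above

def effective_ttl(record: dict) -> int:
    """Explicit per-record TTL wins (e.g. a low TTL for a
    toolsdb-style record repointed after a failover); any record
    that leaves it unset gets the 3600 default."""
    ttl = record.get("ttl")
    return ttl if ttl is not None else DEFAULT_TTL

assert effective_ttl({"name": "mx"}) == 3600
assert effective_ttl({"name": "mx", "ttl": None}) == 3600
assert effective_ttl({"name": "toolsdb", "ttl": 300}) == 300
```

In the tofu module itself this would typically live as a variable default or a null-coalescing expression on the recordset's `ttl` attribute.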