[08:19:03] morning
[08:19:13] could someone please add me to https://gitlab.wikimedia.org/groups/toolforge-repos/-/group_members
[08:29:06] morning, done
[08:32:34] thanks!
[09:56:26] there is this diff in tofu-infra
[09:56:27] https://www.irccloud.com/pastebin/XeYdtSOt/
[09:56:49] I assume there is a conflict between tofu-infra and this tf-infra-test?
[09:58:10] the project might have been removed outside tofu
[09:58:41] * dcaro looking into the ceph warning, an osd process died
[09:59:39] I have created T379141
[09:59:40] T379141: tofu-infra: conflict with tf-infra-test - https://phabricator.wikimedia.org/T379141
[10:18:30] see my IRC messages from yesterday: the project was deleted in T379076
[10:18:31] T379076: Remove tf-infra-test project - https://phabricator.wikimedia.org/T379076
[10:23:30] ok, then that's better
[10:28:59] hmm wait, the project exists again now? I'm sure it was gone yesterday
[10:29:26] dhinus: yeah, tofu-infra created it again
[10:29:37] that is expected, no?
[10:30:04] oh right, I thought you only ran "plan" but not "apply"
[10:30:06] tofu-infra found a diff, and corrected it
[10:31:22] I saw a diff alert this morning. I did not know the reason, but decided to run apply
[10:31:38] yeah, in this case it's fine, it was a small diff
[10:43:53] I'm wondering if we always want to run apply when there is a diff alert; in most cases probably yes
[10:44:01] unless the diff is very large and/or destructive
[10:45:35] we have a similar dilemma with puppet. I'm generally fine with puppet running on its own every 30 minutes. We could have something similar for tofu-infra, with some safeguards in case the diff is massive (which may indicate malfunction)
[10:48:11] yeah, similar situation to puppet. I think I feel safer with a manual check right now; maybe after tofu-infra is more stable and battle-tested I would be ok with an automatic run :)
[10:48:40] * arturo nods
[11:14:08] hmm, I think it's quite a bit more troubling to get your openstack project recreated/deleted without notice than a service being restarted on every puppet run :/, or a DB being removed on every run, etc. +1 for manual checks (at least add a task when an unexpected diff is found)
[11:14:30] (in this case we would have missed that the project was recreated, for example)
[11:15:18] there's some puppet stuff that's scary though, like the disk partitioning and such, that would be at the same level
[13:15:00] topranks: I would like to do the IPv6 allocations and netbox recording for openstack in eqiad
[13:17:09] arturo: ok yep
[13:17:18] I guess we just copy the blueprint from codfw 1:1
[13:17:25] yeah
[13:17:27] i.e. all the sub-allocations should be the same, just the parent prefix is different
[13:18:17] the top-level is already done for it, we just need to add the infra vlans
[13:18:27] there will likely be more of them due to the multi-rack/switch setup in eqiad
[13:18:33] ok
[13:18:37] eqiad: https://netbox.wikimedia.org/ipam/prefixes/1091/prefixes/
[13:18:47] codfw: https://netbox.wikimedia.org/ipam/prefixes/1078/prefixes/
[13:19:41] in general - as this is tied to the vxlan stuff - are we happy with the lower MTU for the vxlan-using networks?
[13:20:09] that was the one remaining niggle in my mind
[13:20:20] we saw no problem with openstack, remember?
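Going back to the auto-apply-with-safeguards idea from earlier in the morning (10:43-10:48): a rough sketch of what such a guard could look like, assuming a wrapper script around `tofu` plus `jq`; the checkout path and the change threshold are made up for illustration and are not what tofu-infra actually uses.

```bash
#!/bin/bash
# Sketch only: auto-apply small diffs, refuse large or destructive ones.
# Assumes tofu and jq are available; MAX_CHANGES and the path are example values.
set -euo pipefail

MAX_CHANGES=5
cd /srv/tofu-infra   # hypothetical checkout location

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
tofu plan -out=auto.plan -detailed-exitcode && exit 0 || rc=$?
[ "$rc" -eq 2 ] || exit "$rc"

plan_json=$(tofu show -json auto.plan)
changes=$(echo "$plan_json" | jq '[.resource_changes[] | select(.change.actions != ["no-op"])] | length')
deletes=$(echo "$plan_json" | jq '[.resource_changes[] | select(.change.actions | index("delete"))] | length')

if [ "$deletes" -gt 0 ] || [ "$changes" -gt "$MAX_CHANGES" ]; then
  echo "Diff is large or destructive ($changes changes, $deletes deletes); leaving it for a manual check" >&2
  exit 2
fi

tofu apply auto.plan
```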
[13:20:36] we didn't run every application possible connecting to every potential internet endpoint though
[13:21:08] openstack did correctly set the lower mtu on the VM interfaces
[13:21:22] which means TCP MSS will be set to a workable value by VMs
[13:21:27] the vxlan traffic would circulate on the same physical links on cloudvirts
[13:21:39] so a server won't try to send "too big" packets that then fail due to an ICMP blackhole
[13:21:41] but......
[13:21:42] meaning that changing the MTU of the network later may be as painful as doing it today
[13:21:48] have we considered every possibility?
[13:22:08] yeah, the restriction is because we have 1500 MTU set on the physical servers
[13:22:24] thus the vxlan encap eats into 1500, and VMs cannot have the industry-standard 1500 MTU as a result
[13:22:48] normally when using something like vxlan you use jumbo frames to allow for the overhead on top
[13:22:53] and still give the overlay 1500
[13:23:16] this seems slightly less critical here as OpenStack is in charge of everything, and is setting the lower mtu on the physical hosts
[13:23:40] (unlike, say, a physical network where someone might plug in and their machine just assumes 1500 is ok)
[13:23:41] yeah. I'm trying to figure out if this is a concern that we should resolve today, or push for later
[13:24:14] it seems to me it may not cause an issue
[13:24:24] if it would be equally painful to update today compared to later, I would say let's do it later
[13:24:54] it would be much harder to change later I suspect, which is why I raise it now
[13:25:05] why much harder later?
[13:25:42] well, at the very least you'd need to re-create all the VMs that already exist with 1450 MTU set, after upgrading the hosts to use jumbo frames
[13:26:05] if having that mtu is going to be a problem, better that they are created at 1500 MTU when first made
[13:26:49] if we had jumbos on the cloudvirts, the VMs using a smaller MTU would not be an issue, no? everything should still work until the VM is re-created, no?
[13:26:50] plus I don't know where else that setting may "end up", as in whether it's defined for the openstack networks or anything that may need to be deleted / recreated to adjust. Just speculating really.
[13:27:04] that's correct yeah
[13:27:19] and probably the owner of a service that was having problems would be incentivised to rebuild their vms
[13:27:49] it should be ok really
[13:28:20] the main worry would be non-TCP protocols between cloud instances and things on the internet
[13:28:28] which there are not many of
[13:28:41] ok -- I will create a ticket, so at least we have this on the radar even if we don't change anything today
[13:29:31] yeah, I just want to call it out before we start moving VMs to the vxlan networks so it's not overlooked
[13:29:50] fair
[13:30:15] the safest option, from a "compatibility" point of view, is to ensure we can deploy VMs with 1500 MTU on the new networks, but it seems low-risk to me given they will have a "working" mtu set by openstack (even though it's lower)
[13:30:59] what we *should* have done was set up the cloud-private networks with a 9000 byte MTU from day one on the hosts
[13:31:47] :-(
[13:32:31] the only real difficulty in changing is the chicken-and-egg problems we may get trying to migrate them, with some changed and others still on 1500. not at all an impossibility to adjust, and PMTUd will work for us as it's all internal, but still something of a headache.
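For reference on the numbers in this discussion: with the usual VXLAN-over-IPv4 encapsulation the per-packet overhead is 50 bytes, which is where the 1450 VM MTU comes from when the underlay is capped at 1500. A quick sketch of how one might sanity-check this from inside a VM; the interface name and ping target are just examples.

```bash
# VXLAN-over-IPv4 overhead: outer IP (20) + UDP (8) + VXLAN header (8) + inner Ethernet (14) = 50 bytes,
# so with a 1500-byte underlay the inner/VM MTU tops out at 1500 - 50 = 1450.

# What MTU did the VM actually get? (interface name is an example)
ip link show dev eth0

# Probe with "do not fragment" set: ICMP payload = path MTU - 20 (IP) - 8 (ICMP),
# so 1422 should go through on a 1450 path and 1472 should fail with "message too long".
ping -c 3 -M do -s 1422 8.8.8.8
ping -c 3 -M do -s 1472 8.8.8.8
```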
[13:32:54] * arturo nods
[13:36:56] arturo: I allocated the IPv6 networks for the two 'transport' networks that cloud hosts touch
[13:36:57] https://netbox.wikimedia.org/ipam/prefixes/1091/prefixes/
[13:37:11] i.e. the ones that the cloudgw and cloudnet sit on
[13:37:54] we can assign IPs from them the same as the equivalents in codfw
[13:38:19] however, note that as soon as an IP with a dns name is added there we need to create & merge a patch in the DNS repo for the new reverse range
[13:38:40] ok
[13:38:43] not too tricky, just very chicken-and-egg; ping me if you add anything and I can take care of it
[13:38:52] I just created this T379154
[13:38:53] T379154: openstack: vxlan: potential changes to cloudvirt MTU to enable jumbo frames - https://phabricator.wikimedia.org/T379154
[13:51:21] thanks, replied
[13:52:07] dhinus: Raymond_Ndibe: in case you did not see, I found the issue with pods not terminating during the upgrade: there was a worker node stuck on NFS that was not reporting to prometheus (due to the VM being in 'confirm migration' state instead of active)
[13:52:36] arturo: the problem with moving to jumbos is more complex than I thought, as the cloud-private subint can't exceed the parent port mtu :(
[13:52:46] topranks: I see
[13:53:12] https://www.irccloud.com/pastebin/VaWEtCN9/
[13:53:21] good that we have the task, so at least it's recorded
[13:53:28] but I'm maybe leaning towards not touching it
[13:53:33] ok
[13:55:52] dcaro: ack, thanks for the update!
[13:56:40] dcaro: can you post a follow-up comment to my last comment in T362867?
[13:56:41] T362867: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 - https://phabricator.wikimedia.org/T362867
[13:56:52] 👍
[13:57:16] though I'm not 100% sure the issue reported by lucas was caused by the stuck NFS, but it seems likely
[13:59:36] topranks: I would also be interested in allocating the eqiad1 equivalent of `cloud-flat-1-codfw1dev-v6` https://netbox.wikimedia.org/ipam/prefixes/1084/ for VMs, basically
[14:00:45] * arturo food time
[14:02:41] sure yeah, 2a02:ec80:a000:1::/64
[14:03:00] you can go ahead and create it, set the attributes/description the same way as for the codfw one
[14:03:04] but not much to it
[14:03:04] dhinus: I did some tests also with liveness probes, timeouts, concurrency policy and such while I was at it. From the liveness probe tests, it could have been the same issue (it would behave like that, as it has to read the config from NFS to start uwsgi), but it could be a different one too, yep. Added a note to keep an eye on it, but it might be sorted
[14:03:22] arturo: let me know if you have any trouble, or I can do it also
[14:51:12] I have a meeting clash, so I can't make it to the WMCS/DPE sync in 10 minutes. Sorry.
[15:15:44] apologies, I was making a cup of tea and completely forgot about the WMCS/DPE sync. it looks like no one else is around today?
[15:17:33] it looks like we don't have another one until February 26, maybe we could schedule one later this month?
[15:58:03] Sounds good to me, I was in another one and did not get a notice as I had replied 'maybe' :/
[16:20:09] dhinus: https://developer.hashicorp.com/terraform/language/checks
[16:20:15] that is what we were looking for
[16:20:48] we can assert that list(defined_projects) == list(projects_from_vms)
[16:21:09] because if they are not equal, there are leaked VMs
[16:22:02] or, there will be leaked VMs when the plan is run, if you are in an MR for deleting a project
[16:22:44] I guess we can also use this check block to validate project names and other stuff
[16:24:13] arturo: nice!
[16:30:48] related: https://bugs.launchpad.net/nova/+bug/1288230
[16:35:55] also related: https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/project-cleanup.html
[16:36:04] unfortunately it does not work if the project has already been deleted
[16:39:19] and obviously I made the same mistake of not running it before deleting tf-infra-dev in codfw. so that one also has leaked resources :/
[16:47:48] also related: https://github.com/terraform-provider-openstack/terraform-provider-openstack/issues/1774
[16:50:25] Heh, need help finding the leaked vm?
[16:51:03] Rook: did you find any trick? I'm stuck at listing all servers and grepping for similar-sounding names, then "openstack server show
[16:51:31] I can't find any way to list all servers AND their associated project in one command
[16:52:25] Oh I hadn't found that either, no
[16:54:07] That being said I think it's 8bb53a94-dee2-4aec-8aeb-e752b8bb0cf0
[16:55:16] Oh and also c31e3cdd-eece-44bb-bf2b-fa9d49bee66f ...might have had leftovers from a previous run
[16:57:23] now I'm curious how you found them so quickly :)
[16:59:07] I'm deleting both of those with "openstack server delete"
[16:59:48] done
[17:02:35] I think that might be all the servers. Though I'm wondering if there is other stuff. There is plenty that looks stranded in the trove project. Though that is probably a different problem. I'll look around a little more.
[17:04:28] dhinus: I created this https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/118 but it needs more work
[17:04:32] I have to go offline now
[17:04:34] Rook: yep, I'm sure there is more stuff. I tried running `openstack project cleanup --dry-run` on another project and it lists a few things
[17:04:45] arturo: thanks, let's catch up tomorrow
[17:15:42] Alrighty, I think that is at least all of the servers. Finding a few volumes and hints that there may be a magnum cluster there
[17:16:58] if I remember correctly you can see all magnum clusters in horizon, regardless of the project they belong to
[17:18:03] Yes, I think that ability was related to having admin on any project
[17:18:06] horizon: "Error: Unable to retrieve the clusters."
[17:18:21] `root@cloudcontrol2004-dev:~# openstack coe cluster template delete 3aeb04ab-a896-4ca3-bbc7-bec8565e29e5
[17:18:21] Delete for cluster template 3aeb04ab-a896-4ca3-bbc7-bec8565e29e5 failed: ClusterTemplate 3aeb04ab-a896-4ca3-bbc7-bec8565e29e5 is referenced by one or multiple clusters (HTTP 400) (Request-ID: req-22dcf6e5-9f8f-4e01-9bb4-f63f704ca4fe)` is what is getting me wondering
[17:18:38] where did you see that?
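As a side note on the point above about not being able to list all servers together with their project in one command: a slow but workable sketch, assuming admin credentials are sourced. It makes one extra API call per server, and older clients may expose the field as tenant_id rather than project_id.

```bash
# Sketch: list every server across all projects with its project ID, and flag
# servers whose project no longer exists (i.e. leaked from a deleted project).
# Assumes admin credentials are sourced; slow, one 'server show' per VM.

openstack project list -f value -c ID | sort > /tmp/projects.txt

openstack server list --all-projects -f value -c ID -c Name | while read -r id name; do
  project=$(openstack server show "$id" -f value -c project_id)
  if ! grep -qx "$project" /tmp/projects.txt; then
    echo "LEAKED: $name ($id) belongs to missing project $project"
  fi
done
```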
[17:19:17] It's the output from running `openstack coe cluster template delete 3aeb04ab-a896-4ca3-bbc7-bec8565e29e5`, which is the id for `tf-infra-test-127`
[17:19:48] volumes `34f768ba-b454-4423-951f-f9b437e5a22c` and `b5b58484-4527-44be-9a79-d42f22e3b834` have `os-vol-tenant-attr:tenant_id | tf-infra-dev`
[17:20:39] the volumes can go I think. not sure about the cluster issue
[17:21:13] do you also get a 400 error if you run "coe cluster list"?
[17:22:07] Did the paws-dev and pawsdev and k8s-dev projects get removed?
[17:22:19] And yes, I also get a 400
[17:22:58] There are heat leftovers of those three projects and tf-infra-dev from `openstack stack list`
[17:24:00] Rook: no idea about those other projects, I did not touch them
[17:24:50] I'm not seeing them in horizon or `openstack project list` any longer. I am seeing `tf-infra-dev-c9a5a626-73f7-49ab-8898-cdc9df89c217` in the latter though
[17:25:45] Not much in it though
[17:26:01] found the answer: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/56
[17:26:35] well, only for pawsdev, the other two are not in that MR
[17:27:30] k8s-dev is there actually, so that was also removed by that MR
[17:28:06] paws-dev with the dash was probably removed another time, before we started tracking the projects in tofu
[17:28:25] I use pawsdev :( Oh well
[17:28:40] I removed paws-dev when it was replaced with pawsdev https://phabricator.wikimedia.org/T355954
[17:29:05] https://tenor.com/en-GB/view/shipit-revert-crash-gif-4770661
[17:30:06] Nice thing about paws is the ease of deployment from scratch. So it shouldn't be a problem to revive it in codfw1dev when needed
[17:32:25] reproducible deploys ftw
[17:33:49] I am somewhat surprised that openstack doesn't clean up when projects are deleted. Looking around some, it seems widely known. I suppose it is due to the modular nature of the project?
[17:34:12] yep, I'm also very surprised, but yes that's probably the reasoning
[17:34:37] they added that "project cleanup" (earlier called "project purge") command that partially addresses it
[17:36:30] I would like at least to be able to list resources belonging to "deleted-project-name"
[17:37:37] or to get a big warning if I try to delete a project with resources in it
[17:38:29] anyways, I'm heading off. thanks for tracking down some of those orphaned resources!
[17:38:37] Have a good evening!
[17:40:23] thanks, you too!
[18:10:10] * dcaro off
[21:02:28] need review on this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087968 whenever anyone comes online. This should fix this https://phabricator.wikimedia.org/T360626
[22:17:51] !issync
[22:17:51] Syncing #wikimedia-cloud-admin (requested by bd808)
[22:17:53] Set /cs flags #wikimedia-cloud-admin wikibugs +Vv
[22:17:55] Set /cs flags #wikimedia-cloud-admin stashbot +v
[22:17:57] Set /cs flags #wikimedia-cloud-admin arturo -es
[22:17:59] Set /cs flags #wikimedia-cloud-admin wmopbot -Ae
[22:18:01] Set /cs flags #wikimedia-cloud-admin Majavah -Aefiorstv
[22:18:03] Set /cs flags #wikimedia-cloud-admin *!*@libera/staff/* -Airtv
[22:18:05] Set /cs flags #wikimedia-cloud-admin Raymond_Ndibe -es
[22:18:07] Set /cs flags #wikimedia-cloud-admin rook -es
[22:18:09] Set /cs flags #wikimedia-cloud-admin litharge +Vv
[22:18:11] Set /cs flags #wikimedia-cloud-admin dcaro +FR
[22:18:13] Set /cs flags #wikimedia-cloud-admin taavi +Afiortv
[22:18:15] Set /cs flags #wikimedia-cloud-admin wm-bot +v
[22:18:17] Set /cs flags #wikimedia-cloud-admin balloons -ARefiorstv
[22:18:19] Set /cs flags #wikimedia-cloud-admin Az1568 -Afiortv
[22:18:21] Set /cs flags #wikimedia-cloud-admin dhinus -es
[22:18:23] Set /cs flags #wikimedia-cloud-admin blancadesal -es
[22:18:25] Set /cs flags #wikimedia-cloud-admin bstorm -Aefiorstv
[22:18:27] Set /cs flags #wikimedia-cloud-admin TheresNoTime +Afiortv
[22:18:29] Set /cs flags #wikimedia-cloud-admin komla -es
[22:18:31] Set /cs flags #wikimedia-cloud-admin ircservserv-wm +V
[22:57:36] !issync
[22:57:37] Syncing #wikimedia-cloud-admin (requested by bd808)
[22:57:37] No updates for #wikimedia-cloud-admin
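On the wish above (around 17:37) for a big warning when deleting a project that still has resources: the `openstack project cleanup --dry-run` command mentioned earlier could be wrapped into a simple pre-delete guard, along these lines. This is only a sketch; the project name is an example and the output check is naive (the dry-run report would likely need more careful parsing).

```bash
# Sketch of a pre-delete guard: refuse to drop a project that still owns resources.
# 'project cleanup --dry-run' only works while the project still exists, which is
# exactly the point where we want the warning. Project name is an example.
project="tf-infra-test"

leftovers=$(openstack project cleanup --dry-run --project "$project" 2>&1)
if [ -n "$leftovers" ]; then
  echo "Project $project still seems to own resources; not deleting it:" >&2
  echo "$leftovers" >&2
  exit 1
fi

openstack project delete "$project"
```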