[00:29:04] dduvall: I removed that entry from the database because as far as I can tell it didn't get as far as actually creating any resources. I don't love that it didn't validate the template and also didn't let you undo your work though :(
[00:39:32] logged as https://github.com/vexxhost/magnum-cluster-api/issues/1063 for what it's worth
[06:56:33] greetings
[07:38:47] re: cloudweb hosts and their fate, it seems to me we could fold cloudweb functionality into cloudcontrol or cloudservices
[07:50:46] context being T428060 T411783 T392478
[07:50:46] T428060: codfw: move public baremetal servers to per rack vlan - https://phabricator.wikimedia.org/T428060
[07:50:47] T411783: Move cloudweb hosts to cloud racks? - https://phabricator.wikimedia.org/T411783
[07:50:47] T392478: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes - https://phabricator.wikimedia.org/T392478
[15:30:58] andrewbogott: looking at that error you included in the magnum issue `TypeError: failed to extract field ClusterLabels.cinder_csi_enabled` that's interesting because that label isn't even defined in our tofu :/
[15:31:46] It is in the template though, I think? Is it possible that tofu is editing/iterating on an existing template rather than making a fresh one? (just a guess)
[15:31:56] and `TypeError: 'str' object is not an instance of 'bool'` suggests it _should_ be a bool value... i'll have to debug the tofu plan before trying again to see exactly what it's submitting
[15:32:36] Last night I was able to convince myself both ways: bool but should be str, str but should be bool :/
[15:33:54] haha, wee! yeah so much abstraction fun
[15:52:57] probably you are mostly distracted by the same meeting as me, but... I'm curious if you tell tofu a new template name if you get something with fewer phantom values
[15:55:16] i'm skipping the live meeting today... E_TOO_MUCH_MEET :)
[15:55:46] never the wrong choice
[16:55:24] it looks as though the openstack tofu module converts all label values to string, but the cluster-api driver expects bool in some cases which includes `cinder_csi_enabled`
[16:56:08] https://github.com/vexxhost/magnum-cluster-api/blob/9378220832f91622a982635bde44f533b05bc2df/src/magnum.rs#L48
[16:56:45] the default value is also true so maybe i'll try removing that label from our template and see if that avoids the error
[17:03:14] hm, that seems bad. I haven't tripped over it yet but it might be always lurking
[17:29:59] maybe the upstream fix is as simple as tweaking the type of that label there, not sure but there's a bool-ish string label just above that mentions labels are always strings :D https://github.com/vexxhost/magnum-cluster-api/blob/main/src/magnum.rs#L40
[17:30:23] i'll make a PR and see what happens
[17:31:12] my first ever Rust contribution, blindly changing a struct field type. what could go wrong?
[17:33:37] * andrewbogott buying dduvall a "what would claud do?" wristband
[17:41:31] * dduvall barfs a little
[18:03:40] andrewbogott: fyi https://github.com/vexxhost/magnum-cluster-api/pull/1064
[18:04:42] we'll see what happens!
[18:07:21] i would have run the tests locally but the very first lines of `hack/stack.sh` use sudo so no thanks, upstream!
[19:38:39] tried without the `cinder_csi_enabled` label and the cluster has been in `CREATE_IN_PROGRESS` for ~ 10 mins. no instances yet :/
[19:38:46] at least it got that far?
[19:39:28] 10 minutes is a bit on the long side but not excessive. Let me see what the capi agent thinks it's doing
[19:39:54] oh, nope, it hasn't made any worker nodes, so something is definitely broken
[19:41:33] https://www.irccloud.com/pastebin/b08Jl8pD/
[19:41:41] seems odd
[19:42:08] so yeah, doesn't appear to be doing much
[19:42:44] Oh, with the new driver there is no 'stack'
[19:43:17] if you want I can give you access to the capi engine and you can see the logs, that might exceed your curiosity though
[19:43:24] oh ok!
[19:43:47] sure, i'll take access. i love rabbit holes
[19:46:20] see if you can ssh to capi-worker-1.magnum.eqiad1.wikimedia.cloud
[19:48:13] then 'sudo kubectl get pods --all-namespaces' and you can see the workers that are (or should be) making your cluster
[19:51:36] the interesting ones are capi-controller-manager (runs the show) and capo-controller-manager (relays resource creation back to openstack as requested by capi-controller)
[19:51:50] what's weird is those workers are claiming that they're doing things
[19:53:23] ...or I'm looking at logs from yesterday
[20:07:10] hmm, in the output of `kubectl logs -n capi-system -f capi-controller-manager-548fffdb7b-tm49z` i do see `Scale up on hold because KubeadmControlPlane magnum-system/kube-5zopn-sv662 is provisioning (\"ControlPlaneIsStable\" preflight check failed).`
[20:07:56] yeah, the preflight failed likely means that it doesn't like a quota or a setting or something. I'm trying to set up a parallel test over here to see if I can get a simpler working example from the one in the email...
[20:08:07] nice
[20:08:14] i'll bbiab. lunchtime here
[20:08:22] In previous cases like this it was mad that there wasn't a floating IP to use. But you're specifying not to use one so...
[20:08:24] then i can poke at it more
[20:08:32] right
[20:14:01] looks like there's more detail in `kubectl logs -n capi-kubeadm-control-plane-system -f capi-kubeadm-control-plane-controller-manager-7989c74d69-7m4tc`
[20:14:06] ok, really going to lunch now :)
[20:17:35] yep, that and the capo namespace are the logs I'm looking at
[20:17:41] but I don't see why it's unhappy specifically
[20:20:06] My simple test template is working just fine. So there's at least a little bit of order in this sea of chaos
[20:33:19] oh, dduvall, your template has 'kube_tag': 'v1.35.4' when trying to launch a cluster on v1.34.8 -- so that's likely at least one deal-breaker.
[20:33:51] I would fix that tag and then remove all the other version tags in the template unless you actually know you care.
[20:34:26] And, sorry, this is probably from you copy/pasting the tofu code I linked you to which runs on a slightly different setup (with, you guessed it, 1.35.4 worker nodes)
[21:39:42] !issync
[21:39:43] Syncing #wikimedia-cloud-admin (requested by bd808)
[21:39:44] Set /cs flags #wikimedia-cloud-admin blancadesal -Afiortv
[21:39:46] Set /cs flags #wikimedia-cloud-admin arturo -Afiortv
[21:39:48] Set /cs flags #wikimedia-cloud-admin bliviero +Afiortv
[21:39:50] Set /cs flags #wikimedia-cloud-admin rook -Afiortv
[21:43:34] bliviero: you should have new irc op powers in #wikimedia-cloud-admin, #wikimedia-cloud-daily, #wikimedia-cloud-feed, and #wikimedia-cloud. I finally noticed config changes that t.aavi had submitted MRs for months ago.
[21:44:12] I guess the rights for you were freshly sent, but still :)
[21:44:53] no... I can't read. March 19 is not fresh.
[22:13:48] andrewbogott: ack! yeah i copied your tofu. i just reverted back to 1.34 but same issue
[22:15:45] i dug a bit more into the CRDs that magnum uses and found this: `$ sudo kubectl describe -n magnum-system kubeadmcontrolplane kube-xs6k8`
[22:15:50] https://www.irccloud.com/pastebin/SfA9QHzO/
[22:16:38] seems like it's failing pretty early on `EtcdClusterHealthy`
[22:18:22] and yeah there are no instances or `OpenStackMachine` resources for the cluster yet
[22:26:27] here's a minimal template that works:
[22:26:31] https://www.irccloud.com/pastebin/kALW0enU/
[22:27:33] let's see if I can make the more complicated tofu version build...
[22:29:36] i think i might have found the issue. my template doesn't specify `master_lb_enabled`
[22:32:14] and in the status of the `OpenStackCluster` i see `Failed to reconcile control plane endpoint: unable to determine control plane endpoint`
[22:33:11] oh you're right, that is in the docs example...
[22:33:27] in theory it should be smart and only require that if you have multiple controllers, but...
[22:33:53] right, or rely on the node nft based routing?
[22:34:16] (i.e. just pick a master node and talk to it?)
[22:37:09] in any case, trying again
[22:37:20] with `master_lb_enabled: true`
[22:38:31] here is another working example:
[22:38:38] https://www.irccloud.com/pastebin/WnwXQhLa/
[22:38:54] I hope these pastes are helpful and not just overwhelming!
[22:39:23] https://www.irccloud.com/pastebin/XazPvQNB/
[22:39:29] w00t!
[22:39:34] and i see instances starting
[22:40:20] nice!
[22:40:30] I'm trying to keep a running list of things to add to a future troubleshooting doc: https://etherpad.wikimedia.org/p/magnum-cluster-api_troubleshooting
[22:40:46] please include any lessons you think might be useful to other users or future you
[22:41:26] My general takeaway with this driver is: it works great when it works but it would benefit from miles of documentation, validation, and error reporting.
[22:41:42] (which was true for the old driver too, tbh)
[22:41:50] yeah, better error handling
[22:42:11] the fact that it hangs instead of watching the openstackcluster/cluster status is not fun
[22:43:50] Part of the problem is that it's eventualist all the way down
[22:44:01] so I'm not sure even the actual capi engine knows when it's failed, it just keeps trying forever
[22:46:35] so I guess /now/ you can test and see if the SAN thing actually works :)
[22:48:46] oh right haha. i totally forgot about that for a minute
[22:49:02] rabbit holes ftw
[22:49:17] My conversation with the upstream dev about that was kind of weird but I'm still optimistic.
[22:49:41] Thank you for being an early adopter, and sorry that it took all day to get a proof-of-concept installing :/
[22:50:03] 1 day is super fast in zuul-migration time
[22:50:10] thanks for your help!
[22:50:30] we were supposed to migrate away from zuul... 6 years ago?
[22:50:45] s/zuul/zuul v2/
[22:51:17] 6 years isn't /that/ long for a migration!
[22:51:39] I'm about to go to dinner but lmk if you have any cert joy.
[22:51:48] will do. thanks again
[23:05:55] gah. on the master instance `[ 123.656532] cloud-init[1309]: error: apiServer.certSANs: Invalid value: "k8s-api.svc.zuul.eqiad1.wikimedia.cloud.": altname is not a valid IP address, DNS label or a DNS label with subdomain wildcards: a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character`
[23:06:22] they don't respect absolute hierarchy :D
[23:07:07] good sign that the label is supported though
[23:24:19] https://www.irccloud.com/pastebin/fZ2pHJEy/
[23:24:21] awesome
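
Editor's note on the 23:05:55 `certSANs` error: the rule quoted in that message can be checked locally before submitting a template. The following is a minimal sketch, assuming the trailing root dot is the only thing wrong with the name. The helper names `is_rfc1123_subdomain` and `normalize_san` are hypothetical, and the pattern is written from the wording of the error text rather than taken from kubeadm's source, so treat it as an approximation (it does not handle IP addresses or wildcard names).

```python
import re

# Lowercase RFC 1123 subdomain, as described in the error text: labels made of
# lowercase alphanumerics or '-', joined by '.', each label starting and
# ending with an alphanumeric character.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
_SUBDOMAIN_RE = re.compile(rf"{_LABEL}(\.{_LABEL})*")
_MAX_LEN = 253  # assumed overall length limit for a DNS subdomain


def is_rfc1123_subdomain(name: str) -> bool:
    """Return True if `name` fits the rule quoted in the error message."""
    return len(name) <= _MAX_LEN and _SUBDOMAIN_RE.fullmatch(name) is not None


def normalize_san(name: str) -> str:
    """Hypothetical helper: drop one trailing root dot from an absolute FQDN."""
    return name[:-1] if name.endswith(".") else name


fqdn = "k8s-api.svc.zuul.eqiad1.wikimedia.cloud."
print(is_rfc1123_subdomain(fqdn))                 # absolute form, with trailing dot
print(is_rfc1123_subdomain(normalize_san(fqdn)))  # same name without the trailing dot
```

The absolute form is rejected because the pattern requires a label after every '.', which matches the "must start and end with an alphanumeric character" part of the error.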