[06:52:28] greetings
[10:35:15] just as a reminder, I will be upgrading kyverno in tools in about half an hour
[11:01:01] I am starting the Kyverno upgrade
[11:06:02] upgrade is done, the cookbook is running the test suite now
[11:24:26] that all was a bit boring even
[11:39:47] boring is nice!
[11:49:45] if all software would just work as expected, then fewer people would hire SREs to reliably run their software. so the ideal scenario is software that runs reliably except during business hours
[12:12:00] lol
[12:39:25] taavi: what do you think of https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/63 ?
[12:39:48] can't wait to nuke toolschecker ngl
[12:41:46] godog: do we know for sure that metric will disappear completely (and not, say, be turned to 0) if the exporter fails to scrape?
[12:42:03] since it uses count() not sum()
[12:43:59] but seems fine as a port, I imagine there might be some ways to more reliably check that they're actually in the same cluster with Prometheus but let's not block getting rid of toolschecker on that
[12:44:18] taavi: yes metrics are scraped from etcd itself, no exporter in between thus they won't be there when etcd is down
[12:44:25] oh right
[12:44:38] ship it
[12:45:13] \o/
[12:45:16] thank you
[12:45:55] interview shortly then I'll proceed with toolschecker removal patches
[13:45:49] volans: as you predicted, SSH_AUTH_SOCK=/run/keyholder/proxy.sock was the bit missing from my ssh tests yesterday.
[13:45:59] So the next step is security groups, which requires something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1298325
[13:49:13] ack
[14:58:50] toolschecker was such magic when Chase and Madu first built it. Times change. :)
[16:11:42] heheh indeed, that was a pre-prometheus time
[16:12:39] i'm a bit surprised it wasn't actually just a vm speaking NRPE instead of the http server it has :P
[16:15:44] lolz
[16:25:34] dduvall, bd808, I just emailed vague instructions about how to test the latest magnum driver. I'm interested in your results if you have the patience to try it... it is definitely the future of magnum although I don't plan to rip out the old driver for a year or so.
[17:16:57] andrewbogott: right on. thank you! so assuming it works and we get an LB in front of the master nodes and provision a proxy, how do we ensure the certificate SANs include the proxy hostname?
[17:18:58] I confess that this isn't something I've thought about much. Can you tell me a bit more about how traffic is working in this case? I tend to think in simplistic terms: "the proxy does ssl termination and after that it's just http and there aren't any certs needed" but sounds like that doesn't apply in this case.
[17:19:44] dduvall: I think that you would be fronting that LB with the existing wmcloud.org proxy. It would terminate TLS for the client, but then could be configured to re-encrypt traffic to the LB which would work just fine with a cert missing the public name's SAN.
[17:20:20] I had that working at one point, but then took the wmcloud proxy out because it is another moving thing to break
[17:20:31] It's also the case that we can definitely just give you a floating IP and let you manage things yourself if that somehow makes life easy.
[17:21:21] We are trying/hoping to have a small amount of scraper protection in front of the proxy. We don't have that /now/ but that would be one advantage to keeping it in the mix in the long run.
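The SAN question above can be checked from the client side. The sketch below is illustrative only and is not taken from the Magnum or proxy setup discussed here: it assumes a certificate in the dict form returned by Python's `ssl.SSLSocket.getpeercert()`, the helper names `dns_sans` and `san_covers` are hypothetical, and the hostnames are placeholders. The wildcard handling is deliberately simplified to a single leftmost label.

```python
from typing import Any, Dict, List


def dns_sans(cert: Dict[str, Any]) -> List[str]:
    """Return the DNS-type subjectAltName entries of a getpeercert()-style dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def san_covers(cert: Dict[str, Any], hostname: str) -> bool:
    """Report whether any DNS SAN matches hostname (exact or one-label wildcard)."""
    host = hostname.lower().rstrip(".")
    for name in dns_sans(cert):
        name = name.lower().rstrip(".")
        if name == host:
            return True
        # Simplified wildcard rule: "*.example.org" matches exactly one extra label.
        if name.startswith("*.") and "." in host:
            if host.split(".", 1)[1] == name[2:]:
                return True
    return False


# Placeholder certificate data, shaped like ssl.SSLSocket.getpeercert() output.
example_cert = {
    "subjectAltName": (
        ("DNS", "k8s-master.internal.example"),
        ("DNS", "*.proxy.example.org"),
        ("IP Address", "10.0.0.5"),
    )
}

print(san_covers(example_cert, "zuul.proxy.example.org"))
print(san_covers(example_cert, "zuul.example.org"))
```

If the proxy hostname is not covered, a client that connects through the proxy and verifies the API server certificate directly would be expected to reject it, which is the situation the question is about.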
[17:24:39] if magnum needs a public v4 to be used properly then that's a bug in magnum that needs to be fixed IMHO
[17:25:34] a floating IP would for now make it easier (and i've been banging my head on zuul for a while now and am close to burnout tbh)
[17:25:50] taavi: it doesn't for the use cases that I'm expecting... but I don't think I understand what dan needs/is doing.
[17:25:52] i wish we could just add a label or something to get another hostname into the SANs but that doesn't seem possible
[17:26:18] you can certainly add a second proxy with a second name but the same backend...
[17:26:37] i mean i can try the new cluster-api magnum driver but i'm knee deep in trying new and somewhat broken things
[17:27:37] i quite regret upgrading us to zuul 14.x w/ the new zuul launcher as is. it seems like they're still ironing out problems (though it seems like nodepool had this same problem)
[17:31:44] bd808: i suppose i can test the `client <-(tls)-> proxy <-(tls)-> master node` setup now even without the octavia lb, yeah?
[17:34:00] bd808: the wmcloud.org proxy breaks TLS client authentication if you're using that with k8s
[18:01:49] chatgpt told me that the cluster-api driver should allow us to add an additional SANs entry, but i don't really believe that yet :D
[18:16:35] yeah, I'm digging in the docs to see how one would do that...
[18:24:31] ...and now digging in the code
[18:41:00] dduvall: are you already using labels in your magnum template?
[18:41:29] If yes, please try adding a 'api_server_cert_sans=comma,delimited,list,of,names' label and see if that does something useful?
[18:42:03] andrewbogott: yeah, we have a number of labels
[18:42:13] should that work with heat too or just cluster-api?
[18:42:16] I predict that that will work with the new driver but not with the heat driver. But it /might/ also have worked with heat, not sure
[18:42:21] ah got it
[18:42:29] I'm reading the cluster-api code now, I see it implemented there
[18:42:36] but I don't know if it's a carry-over from heat or a new feature
[18:43:15] * andrewbogott tries to find a checkout of the heat code...
[18:43:48] i grepped the stable/2026.1 branch of magnum but no dice. i can try out the new driver later today
[18:44:07] yes, I don't see that label implemented in the heat driver
[18:44:26] So I guess this motivates you to try cluster-api which suits me :) I hope it's not too painful.
[18:45:02] are you using nginx-ingress?
[18:45:26] within the cluster?
[18:45:29] yeah
[18:45:30] nothing yet
[18:46:11] ok, good, it doesn't exist in k8s 1.34.8 -- an issue for PAWS but not for you
[18:46:31] i don't think we'll need ingress. zuul will create its own namespaces/pods for each job. we'll likely need buildkitd in there to drive former pipelinelib type workloads
[18:46:38] ah ok. good to know!
[18:48:10] re: not in 1.34 oh yeah, we already dealt with that in the gitlab-cloud-runner cluster. we're using traefik now
[18:49:01] cool, that may be in my future
[18:49:10] Or the future of
[21:02:34] andrewbogott: is there an equivalent of https://docs.openstack.org/magnum/latest/user/#supported-versions somewhere for ubuntu/cluster-api?
[21:03:12] not as far as I could find, I cajoled one of the devs to paste their personal chart into irc let me see if I can still find it...
[21:03:13] i mean i can experiment but iirc it took some trial and error to find the right incantation of labels for the current setup
[21:03:23] oh that would be awesome
[21:03:27] (a working example)
[21:04:06] The template pasted in my email /should/ be a working example, it built a paws cluster in codfw1dev
[21:04:27] oh geez yeah sorry. i hadn't scrolled that far yet :)
[21:04:32] ty!
[21:08:34] huh, I literally didn't think it was possible to format something this badly in wikitext https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Magnum_setup#Compatibility_matrix
[21:09:09] any suggestions how to get this onto that page legibly? https://www.irccloud.com/pastebin/n07AKJwG/mcapi%20support%20matrix
[21:10:43] andrewbogott: make a real table instead of an ascii spew
[21:10:55] that sounds like work, but ok :(
[21:11:04] * bd808 will fix it
[21:11:28] nooo I wasn't trying to nerd-snipe, I'll do it
[21:12:07] * bd808 sees H1s and starts eye twitching
[21:12:36] you can stand down, I'm already filling in the rows
[21:19:19] dduvall: I'm about to go collect my CSA radishes. Are you mostly unblocked for the moment?
[21:20:35] If you want a tofu version of that template, this is it give or take a setting or two: https://github.com/toolforge/paws/blob/main/tofu/135a.tf-magnum-capi
[21:21:52] andrewbogott: oh yum! i just got an error but i can poke at it for now
[21:22:17] `| status_reason | failed to extract field Cluster.labels`
[21:22:28] checking my labels now
[21:23:09] I'll be back in a bit, and around off and on this evening.
[21:23:20] great, thanks!
[21:31:20] yikes, getting `DELETE_FAILED` now with the same `failed to extract field Cluster.labels` status
[21:37:54] that's not good
[21:38:38] this is zuul-k8s-v134 ?
[21:42:32] the logs say "TypeError: 'str' object is not an instance of 'bool'" near that error, seems telling
[21:45:44] TypeError: failed to extract field ClusterLabels.cinder_csi_enabled
[21:50:12] since the labels are driver-specific I don't think the APIs know enough about them to validate. I think you should either remove that label entirely (I think it defaults to true) or try True instead of true.
[21:50:25] Sorry, this is clearly going to be a real pain to get right.
[21:50:36] * andrewbogott really going for radishes now
[22:32:11] oh maybe it's sensitive to the types in some of the flag labels like `master_lb_enabled` whereas the heat stack templates aren't? i see your tofu uses string values for those
[22:32:31] and ours uses bool
[22:41:33] oh yeah, that could definitely be it.
[22:41:42] Are you currently wedged because it refuses to delete?
[22:46:07] yeah :(
[22:46:42] can you throw your radishes at it please? mario 2 style?
[22:47:30] sorry, super mario 2 (Nintendo USA)
[22:55:14] i need to pick up my girls from camp, but yeah andrewbogott if you get around to deleting that cluster for me i can retry with string values for those labels
[22:55:23] thanks for your help!
[22:55:44] yep, I'll see what I can figure out
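The retry proposed above (string values instead of booleans for flag labels) could be prepared with a small helper like the one below. This is a sketch under assumptions, not anything from Magnum or the cluster-api driver: `stringify_labels` is a hypothetical name, the example hostnames are placeholders, and the log does not confirm whether the driver wants `true` or `True`, so the capitalisation is left as a switch.

```python
from typing import Any, Dict


def stringify_labels(labels: Dict[str, Any], capitalize_bools: bool = False) -> Dict[str, str]:
    """Return a copy of labels with every value rendered as a string.

    Booleans become "true"/"false" (or "True"/"False" when capitalize_bools
    is set); lists and tuples become comma-delimited strings.
    """
    result: Dict[str, str] = {}
    for key, value in labels.items():
        if isinstance(value, bool):
            text = "True" if value else "False"
            result[key] = text if capitalize_bools else text.lower()
        elif isinstance(value, (list, tuple)):
            result[key] = ",".join(str(item) for item in value)
        else:
            result[key] = str(value)
    return result


# Label names come from the conversation; the values are illustrative only.
labels = {
    "master_lb_enabled": True,
    "cinder_csi_enabled": True,
    "api_server_cert_sans": ["zuul-k8s.example.org", "zuul-k8s.example.net"],
}
print(stringify_labels(labels))
print(stringify_labels(labels, capitalize_bools=True))
```

Keeping the conversion in one place makes it cheap to try both spellings against the driver without editing each label by hand.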