[07:50:52] does anyone know why https://gitlab.wikimedia.org/repos/cloud/toolforge/fourohfour exists as an outdated copy of https://gitlab.wikimedia.org/toolforge-repos/fourohfour?
[07:57:31] I think we wanted to move it to become a toolforge component instead of a regular tool at some point, but never got around to doing it
[07:57:59] T369364
[07:58:00] T369364: toolforge: integrate fourohfour as a custom component, rather than a normal tool - https://phabricator.wikimedia.org/T369364
[08:06:10] ok, mind if I delete the now-outdated repo? if we get around to doing that we can do a proper gitlab move which will add a redirect etc
[08:08:52] taavi: sure, go for it
[08:54:25] topranks: if you have some time I think we could enable ipv6 on cloud-private in eqiad as well today
[08:59:23] taavi: hey yep in about 30 mins I'll have some time to look if that suits?
[09:11:25] yep
[09:34:39] taavi: ok let me know when you're ready
[09:34:57] I guess we take a similar approach, add the IPs first with some cumin fanciness?
[09:35:20] slightly more complex here as each rack is its own subnet, I guess we can use a cumin filter for that and do 4 separate runs
[09:38:49] topranks: yeah indeed
[09:38:55] do you have a per-rack cumin filter handy?
[09:39:52] I rarely use it but this ought to work, for instance:
[09:39:57] P:netbox::host%location ~ "E4.*eqiad"
[09:40:15] I guess we could try some benign command first to validate it selects the correct hosts
[09:40:28] you can just do `sudo cumin SELECTOR` without a command to test a selector
[09:41:08] hahaha I should have known there was something like that
[09:46:21] for reference, here's the command from last time
[09:47:16] https://phabricator.wikimedia.org/P77152
[09:47:22] and here's that adapted for c8-eqiad
[09:47:22] https://phabricator.wikimedia.org/P77452
[09:51:49] topranks: c8 has cloudbackup1003 which we can use as a test if that looks ok to you
[09:55:17] sorry for the delay
[09:55:33] commented back on your paste - looks good to me
[09:56:04] I think we _may_ have forgotten to record one of the allocations in Netbox, but I think it's probably that mgmt network (sorry, can't remember the exact name of it)?
[09:56:15] yeah that's the octavia one i think
[10:02:35] topranks: done on cloudbackup1003. I'm going to do the same for 1004 in d5 to test that those two can talk with each other
[10:03:02] looks good
[10:03:21] and yep makes sense to do one in another rack +1
[10:05:09] addresses and routes deployed on 1004
[10:05:15] taavi@cloudbackup1003 ~ $ mtr -w 2a02:ec80:a000:202::5
[10:05:16] Start: 2025-06-10T10:04:42+0000
[10:05:16] HOST: cloudbackup1003 Loss% Snt Last Avg Best Wrst StDev
[10:05:16] 1.|-- irb-1151.cloudsw-c8.private.eqiad.wikimedia.cloud 0.0% 10 0.8 2.0 0.7 12.2 3.6
[10:05:16] 2.|-- irb-1104.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org 0.0% 10 0.8 2.7 0.8 10.1 3.0
[10:05:16] 3.|-- 2a02:ec80:a000:202::5 0.0% 10 0.2 0.2 0.1 0.2 0.0
[10:06:11] awesome
[10:07:17] I guess we can roll out the change with cumin then?
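
For reference, a minimal sketch of validating the per-rack selection before touching anything, assuming the combined `P{}` query form used later in the conversation works the same way for all three racks (the D5 pattern is extrapolated from the C8/E4 ones); running cumin with a selector and no command only prints the matching hosts, so it is a safe dry run:

    # dry-run each rack's selector; without a command cumin only resolves and lists the matched hosts
    for rack in C8 D5 E4; do
        sudo cumin "P{P:wmcs::cloud_private_subnet} AND P{P:netbox::host%location ~ \"${rack}.*eqiad\"}"
    done
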
[10:08:10] yeah, give me a second
[10:09:31] updated https://phabricator.wikimedia.org/P77452 with all the per-rack commands
[10:10:39] yep they all look correct to me
[10:11:57] and cumin queries like `sudo cumin 'P{P:wmcs::cloud_private_subnet} AND P{P:netbox::host%location ~ "C8.*eqiad"}'` seem to be working as expected
[10:12:50] starting from c8 then
[10:13:51] that's done, next up is d5
[10:14:34] d5 done
[10:15:00] cool
[10:15:05] e4
[10:15:38] topranks: that's the IP addresses live everywhere
[10:16:21] ok cool
[10:16:32] next up is adding them in netbox + dns?
[10:17:25] yep I'm happy to do that any time you are ready, I did a few connectivity checks there, looks ok
[10:17:34] so let me know if you're happy to proceed and I'll run the script
[10:18:26] I think we can proceed
[10:18:39] ok will kick it off
[10:30:17] taavi: reloading the dns now
[10:31:44] thanks!
[10:33:40] done now
[10:33:44] let's hope no surprises :P
[10:34:14] topranks: hmm, minor problem: the ipv6 gateway is at a different hostname (compare the ip address panel at https://netbox.wikimedia.org/dcim/interfaces/29993/), and so the puppet code is not finding it and so not persisting the routes
[10:34:31] doh
[10:35:49] taavi: sorry that link isn't working? we can change the hostname for the switch int I'm sure
[10:36:00] not 100% sure what is wrong, what is puppet looking for?
[10:36:49] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/wmcs/cloud_private_subnet.yaml#4
[10:37:44] it's looking for a dns name without the irb-115N. prefix
[10:37:46] hmm... we shouldn't have the 'irb' there?
[10:38:11] compare to https://netbox.wikimedia.org/ipam/ip-addresses/18886/, without the irb
[10:38:13] that's unfortunate but not the end of the world, let me change it
[10:42:44] taavi: ok it should be better now, hopefully it gets added next puppet run
[10:42:53] * taavi tries
[10:46:03] yep looks much better now
[10:46:05] thank you!
[10:47:04] np, great to get this done... I guess next is BGP to the cloudlb?
[10:49:04] yeah give me a second with that
[11:11:53] topranks: this ended up being surprisingly complicated, but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155181 and its parents do that
[11:13:48] heh ok, the resulting patch doesn't seem too bad at least
[11:13:49] +1
[11:28:10] topranks: alright, merged. can you set up the bgp sessions against cloudlb1001/2?
[11:28:34] sure, let me have a look
[11:34:19] dhinus: what's the error you are seeing with toolforge deploy? (I'm playing around with some components-api tests)
[11:40:07] taavi: bgp up for cloudlb1001, looks good
[11:40:08] https://phabricator.wikimedia.org/P77498
[11:40:10] will do the other one now
[11:43:17] ok both are up and working
[11:46:15] great
[11:48:54] while I think of it, we should add that octavia mgmt network in netbox
[11:50:35] 2a02:ec80:a100:100::/64 in codfw just has the description "octavia-lb-mgmt-net"
[11:50:55] I guess I can just add 2a02:ec80:a000:100::/64 and call it the same?
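
As a quick sanity check for the gateway hostname fix discussed above, something along these lines could be run on one of the cloud-private hosts; the bare switch name shown here is only an assumption derived from the `irb-1104.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org` hop in the mtr output, so substitute whatever name actually ends up in Netbox:

    # the hiera lookup wants the gateway's DNS name without the irb-115N. prefix,
    # so the bare switch name should resolve to the v6 gateway address
    host -t AAAA cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org
    # and after the next puppet run the v6 routes should show up as persisted
    ip -6 route show
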
[11:51:08] yeah I think so, or ask andrewbogott if you need more details
[11:51:41] I'll do that for now, seems to make sense, Andrew can correct me when he's online if he thinks it looks wrong
[11:52:31] ok done, all those networks can be seen here:
[11:52:32] https://netbox.wikimedia.org/search/?q=octavia
[11:52:46] andrewbogott: if anything there looks wrong let me know
[11:59:45] topranks: https://gerrit.wikimedia.org/r/c/operations/dns/+/1155194, that will need to be merged at the same time as https://netbox.wikimedia.org/ipam/ip-addresses/20557/ has the DNS name added
[12:01:23] yep
[12:02:25] just need to add the name in netbox, then run "sudo cookbook sre.dns.netbox --skip-authdns-update"
[12:02:38] after which we can force a re-check of that CR and merge
[12:03:03] cool, I'll do that now then
[12:03:26] dcaro: so I tracked down the errors in toolforge-deploy looking at the spicerack logs
[12:04:07] it failed twice in a row, the first time the "cleanup" command failed with error code 255, which was probably a connection problem between cumin and the target host
[12:04:40] the second time, the failure was expected because one functional test is failing
[12:06:06] I think the test started failing after T394273 was merged, that change is live on toolsbeta but not yet on tools
[12:06:06] oh, which one? (as in, is the test expected to fail because it caught an issue, or because it's broken?)
[12:06:07] T394273: [components-api] add tool config version check - https://phabricator.wikimedia.org/T394273
[12:06:25] the test error is "Input should be . Received value: 'silly'"
[12:06:59] I think that's fixed now if you rebase :)
[12:07:08] ah nice!
[12:07:13] let me try
[12:08:09] I also noticed that we have a few components that are not in sync with the version in toolforge-deploy
[12:08:19] pro tip: you should reserve the correct IP in netbox if you want DNS records to be generated correctly
[12:08:51] lol yep :D
[12:09:34] where did you go wrong? at a glance it looks right to me...
[12:09:56] actual pro tip: you can just 'edit' an IP address object in netbox and change the address to what it should be, keeping all other properties (like dns name, object ID)
[12:10:30] yeah fixed it already, but I originally created 2a02:ec80:a000:4000:: instead of 2a02:ec80:a000:4000::1
[12:10:43] ok col
[12:10:56] *cool
[12:11:48] dhinus: oh, just merged an MR on components-api, it might update the deploy mr :/
[12:11:55] (sorry, forgot to check if there was one open already)
[12:12:11] that's ok, I'm currently running the deploy cookbook on toolsbeta
[12:12:41] the open MR already contains multiple changes
[12:13:13] you can run the cookbook again when my run completes
[12:13:22] $ host -tAAAA openstack.eqiad1.wikimediacloud.org
[12:13:22] openstack.eqiad1.wikimediacloud.org has IPv6 address 2a02:ec80:a000:4000::1
[12:13:24] neat
[12:13:58] dcaro: I can confirm the failing test is now fixed after rebasing
[12:14:27] ack, ... we should try to avoid deploying many things at a time xd
[12:15:05] dhinus: is it also only running the components-api related tests?
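
Pulling the Netbox/DNS steps from the conversation above into one place, a rough sketch of the sequence (the cookbook invocation and the final check are the ones quoted earlier; re-checking and merging the gerrit change stays a manual step):

    # 1. add the DNS name to the IP address object in netbox, then generate the records:
    sudo cookbook sre.dns.netbox --skip-authdns-update
    # 2. force a re-check of the pending operations/dns change in gerrit and merge it
    # 3. confirm the new record is being served:
    host -t AAAA openstack.eqiad1.wikimediacloud.org
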
[12:15:06] taavi: nice :)
[12:15:19] dcaro: nope, it's running all tests
[12:15:31] it should print the results to gitlab in a few moments
[12:16:42] okok
[12:20:04] taavi: btw, bored waiting for something to happen, I wiped our dns resolver cache for that hostname
[12:20:10] lots of v6 traffic hitting the LBs now
[12:30:37] dhinus: this was missing for the tests filtering xd https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1155205
[12:37:52] :P that patch is failing on jenkins
[12:55:25] :facepalm: looking (just got out of a meeting)
[12:57:02] fixed, dhinus can I deploy the MR? (and will test that patch with it too)
[13:12:32] topranks: that entry you made for dallas/octavia/ipv6 looks right (although the one you pasted into irc was not)
[13:27:14] dcaro: sure go ahead
[13:28:08] ack :)
[13:46:07] ready for review https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1155205
[13:53:28] taavi: is this expected?
[13:53:30] https://usercontent.irccloud-cdn.com/file/JwraobTG/image.png
[13:54:01] it went away :/
[13:54:02] nm
[13:54:48] dcaro: yes, I'm rebooting cloudlbs to pick up config for new routes
[14:38:28] @chuckonwu the toolforge-weld link https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/blob/main/toolforge_weld/api_client.py?ref_type=heads#L168
[14:53:44] Thanks dcaro, looking
[15:04:21] turns out mwopenstackclients isn't fully ready for the dual-stack future, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155259
[15:11:14] 😞
[15:12:47] looks like it's a simple fix though :)
[15:44:10] dhinus or others, can I get a quick review of https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1155268? Should fix cookbooks in codfw1dev although I'm not 100% sure it's kosher to search/replace like that in a recording.
[15:52:01] andrewbogott: +1d
[15:52:22] thank you!
[15:52:36] I just noticed that the nginx container for components-api restarted a couple of times, the last restart happened because it started throwing 499s, has anyone seen that before?
[15:52:40] https://www.irccloud.com/pastebin/XsWUumwV/
[16:06:28] hmm... project-proxy-puppetserver certs seem to have expired :/
[16:06:45] https://www.irccloud.com/pastebin/23njsk7E/
[16:17:16] hmm... I think it might be a CA setting somewhere, openssl from the puppetserver says it's denied because it's self-signed, but not expired (from a client it accepts it without issues):
[16:17:33] https://www.irccloud.com/pastebin/Vy3E7Ido/
[16:19:58] puppet runs without issues though in the puppetserver
[16:20:31] (hmpf.... it's using the general puppetserver xd, ignore that last comment)
[16:22:48] hmm... shouldn't it have itself as puppet server?
[16:32:56] nope, it's configured to use puppet (the default anyhow)
[16:39:38] With the Magnum k8s clusters, is the etcd backplane a combined/collapsed service?
[16:39:39] I'm learning more about the zuul job runner system that we hope to use and have found that it will want to create a namespace and then a pod for each test. That turns into something like 15-20K creates and destroys per day.
[16:40:01] And that level of etcd churn has me wondering how a magnum cluster will hold up.
[16:43:09] I think that the best answer might be to just test, but it feels like it will be either very slow, or have trouble, yep. I think that the etcd nodes are also not on local disk for magnum right? (if not, then they are even slower)
[16:43:27] what does upstream zuul say about it? Anyone finding issues/having to scale up etcd?
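
One rough, empirical way to answer the "how will etcd hold up" question on a test cluster, assuming the etcd members expose the standard Prometheus metrics (and that the metrics listener does not require client certificates; if it does, a client cert/key would need to be passed to curl as well). ETCD_NODE is a placeholder:

    # wal fsync and backend commit latencies are the usual indicators of storage
    # that is too slow for etcd; watch them while generating namespace/pod churn
    curl -sk https://ETCD_NODE:2379/metrics \
      | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'
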
[16:52:15] The big deploys upstream use the OpenStack driver which creates and destroys whole vm instances per test. This is the thing that basically crushed our rabbitmq deployment in the olden days when CI used nodepool.
[16:53:00] Folks that are using Kubernetes are typically using a hosted k8s in somebody's cloud
[17:03:15] that makes sense yep. All those namespaces are temporary right? how many does it run concurrently at any point? (we can try to set up a custom etcd cluster with local disk and/or even separate the events etcd from the control plane one to alleviate load if needed, but that means that magnum might need to be tweaked to use external etcd)
[17:11:19] * dcaro off
[17:12:35] 20K/day would be ~1 every 4s, does not seem like "really high" traffic, but yep, the churn on etcd might be a bottleneck
[17:38:21] y'all are right that with magnum, etcd is on a ceph drive and might turn out to be troublesome. No way to know without testing, though, and we can also try to escalate to a high-performance ceph mount before throwing in the magnum towel entirely. (Or, actually, we could even build a magnum cluster on local storage with a few permission changes)
[17:39:10] bd808: there's also an option to have nodepool manage containers on a host w/out involving k8s, isn't there? iirc there were three paths and that was the middle.
[17:41:12] Looking for reviews on https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/79
[17:42:36] I approved -- do you want me to merge as well or do you have privs to do that? (I definitely don't know how to deploy)
[17:50:40] andrewbogott I can merge it
[17:51:46] i'm apparently too late, but the commit message could be improved to describe what was actually changed in that project
[17:52:17] and no need for a "[toolforge-weld]" prefix there, that information can already be seen from the project/directory name
[17:58:35] https://grafana.wikimedia.org/d/d4bf3e68-9396-4173-a59c-59e398172ba0/zuul-prometheus?orgId=1&from=now-7d&to=now&timezone=utc is a dashboard where you can kind of get a sense of how busy CI is at any given time. Each "build" is a CI job of some type and will turn into a k8s namespace create + Pod create + run + teardown.
[17:59:10] Things tend to come in bursts and we will have some cap on how many things we let happen in parallel.
[18:00:50] andrewbogott: There is a "static" runner solution, but we threw that one out already as not providing the same job isolation guarantees we think we currently get with Jenkins + Docker. Mostly because it doesn't have the same post-job teardown systems as Jenkins gives us today.
[18:02:39] oh yeah, four paths :)
[18:03:05] I thought there was "nodepool manages VMs," "nodepool manages containers on static container hosts," and "k8s manages containers"
[18:03:13] but maybe I imagined the middle one
[18:04:17] the driver system we are currently hoping to use is one where nodepool maintains a pool of empty k8s namespaces and then when a job launches it checks out a namespace, creates a Pod in it, and then runs a script against the pod to configure and start the job.
[18:04:41] You can think of it as using a Pod as a VM in the older nodepool setup.
[18:05:17] When the job finishes, artifacts are copied out to blob storage and then the Pod and namespace are destroyed.
[18:05:29] So the pods are recycled after every test, and the namespaces are also recycled?
[18:05:44] no, new each time
[18:06:22] bleh, I guess 'recycle' is not the word I meant. "melted down and rebuilt" like how a pop can is recycled.
[18:06:35] or maybe yes depending on what "recycled" means, yeah
[18:06:37] But why also multiple namespaces rather than one big one?
[18:07:15] Is that me not understanding how k8s works?
[18:07:28] I think as an extra isolation barrier. Nodepool will have the rights to make namespaces, but the user that the job runs under/as will not
[18:07:48] So it's only one pod per namespace
[18:08:27] I believe so yes, although there is probably a way to script a job that needs multiple pods to get its work done
[18:08:47] this is all super duper flexible at the zuul level
[18:09:44] nodepool is currently an external service that zuul talks with, but in a planned future the nodepool business logic is going to move inside of zuul itself.
[18:10:22] ok, that all makes sense.
[18:11:10] I can tell you what I told Antoine, that I'm not super concerned about nodepool overwhelming nova these days, but that also doing this with containers instead of VMs seems obviously better.
[18:11:38] I will be making a new Cloud VPS project request sometime today for the "zuul-runner" project that this stuff will live in. Our plan is to decommission the "integration" project along with the current zuul2 setup and have the next generation runners in a new project.
[18:12:02] ok!
[18:12:25] Yeah, I think k8s is the better backend for us in the long term.
[18:12:42] I predict that you will find magnum clumsy to work with at first but we should be able to get you what you need.
[18:12:47] it may just take us some toil to figure out how to run the underlying cluster
[18:13:13] I managed to make a toy cluster with it before for deployment-prep
[18:13:49] I am also working on setting up a newer magnum driver w/LB integration but that might not be ready in time for your experiments.
[18:14:16] the workflow for "upgrading" will be a challenge at some point, but like the PAWS and Quarry usage we should know how to rebuild the cluster from scratch.
[18:14:39] I saw a bunch of LBaaS stuff getting touched. That's exciting!
[18:14:44] yep, if you're able to build around a blue/green model then you don't have to think much about upgrades.
[18:15:20] There's an lbaas UI in Horizon now but it's not hooked up to magnum yet.
[18:17:09] The most interesting thing I need to start thinking about for this zuul-runner cluster will be how to get the k8s credentials out of opentofu and over into the prod vms that will be running the zuul control plane.
[18:17:41] that's a future me problem though at this point :)
[18:18:29] I need to catch up on the openbao proposal but last I heard it was still basically single-tenant. So you'll probably wind up solving that problem with copy/paste
[18:19:04] unless they're dynamic creds, in which case... I'd refer the question on to future Bryan
[18:19:10] yeah, or dropping a file in a well known location and then pulling it from prod
[18:19:22] yeah
[18:19:59] the zuul "brains" will be in ganeti vms in eqiad (and maybe codfw?)
[18:20:43] T393873 -- looks like both DCs
[18:20:44] T393873: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873
[18:22:09] oops. lunch time :)
[18:22:12] so there's another reason for multi-region cloud-vps (or, I guess, to not use cloud-vps)
[18:22:48] multi-region please :)
[18:24:33] I don't actually know how many jobs we spin up a month in the gitlab-runners in Digital Ocean, but I know they cost us $15KUSD/month.
[18:25:03] err.. no, $1500/month
[23:01:37] T396540 please and thank you. Hopefully I gave enough context to help y'all reason about the project, but I'm happy to answer more questions on the task too.
[23:01:37] T396540: Request creation of zuul-runners VPS project - https://phabricator.wikimedia.org/T396540