[08:10:37] * arturo online
[08:18:49] i'm planning to flip ipv4 cloud vps web proxy traffic to the new proxies
[08:20:15] that should be easily reverted if anything happens right?
[08:20:44] yeah, it's just a flip of the floating ip target
[08:21:00] 👍 sounds good to me then
[08:21:16] quick review? https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/55
[08:21:26] should help merging the auto-deps-update MRs
[08:22:52] dcaro: i'm guessing we don't have an easy way to include all the commits between main and the currently deployed state, instead of just the latest?
[08:23:26] there might be a way by checking tags (currently deployed tag vs latest tag)
[08:24:05] but for that mr, there's only one commit in toolforge-deploy, just jumping multiple versions
[08:25:07] like https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/764/diffs?commit_id=383547533237d196029de976ee3c5b02654e922b
[08:25:25] (it just copies that over to the MR, which we are not doing currently)
[08:30:33] oh right, that's showing the log for toolforge-deploy, not the component itself
[08:30:35] +1'd
[08:33:20] yep :), thanks!
[08:44:30] * dhinus paged ProjectProxyDown
[08:44:35] uh
[08:44:37] that might be me
[08:44:38] one second
[08:45:43] * taavi starts proxy-03 back up
[08:46:09] yeah, for some reason the old internal IP is still getting lots of internal traffic?
[08:46:29] I did check that the floating IP dns aliases had been updated already
[08:46:56] dhinus: those alerts should clear any second now. no actual impact on users, except for traffic from cloud vps to some other service behind the proxy
[08:47:01] ack thanks
[08:47:05] they cleared just now
[08:47:41] cool
[08:48:34] probably just DNS TTLs in that case, I'll leave that VM running for now and come back to it later today
[08:48:36] sorry bout that
[08:52:36] that's ok!
[08:58:10] harbor is down alert from tools/toolsbeta
[08:58:14] is there any operation going on?
[08:59:28] https://usercontent.irccloud-cdn.com/file/PTIVddAZ/image.png
[09:00:44] the containers are up
[09:00:51] this could be the front proxy
[09:00:55] something is going on yes
[09:00:58] could be the same thing as above? although the alerts are very late in that case
[09:01:07] curl sometimes reports that there's no proxy for that name
[09:01:14] taavi: yes, it could be the wmcloud.org proxy
[09:01:20]

No proxy is configured for this host name. Please see our documentation on Wikitech for more information on configuring a proxy.

[09:01:25] for root@toolsbeta-test-k8s-control-10:/tmp# curl -v https://tools-harbor.wmcloud.org/v2/toolforge/builds-api/tags/list
[09:01:34] (I was debugging why the deploy failed in toolsbeta)
[09:02:00] maybe the proxy redis DB is empty, or missing records somehow?
[09:02:44] an "easy" fix we could try is re-creating the proxy entries
[09:02:45] it works from my laptop :/
[09:02:59] oh, it could be the DNS caching thing, as taavi suggested earlier
[09:03:01] https://www.irccloud.com/pastebin/ibKmT95p/
[09:03:25] a-ha, the old proxies are failing to replicate redis from the new ones
[09:03:28] > 721:S 08 May 2025 09:03:14.994 # Can't handle RDB format version 10
[09:03:32] yep, dns resolves differently
[09:03:34] it's the split brain DNS thing
[09:03:40] taavi@cloudservices1005 ~ $ sudo rec_control wipe-cache "wmcloud.org$"
[09:03:40] wiped 87 records, 9 negative records, 839 packets
[09:03:40] taavi@cloudservices1005 ~ $ sudo rec_control wipe-cache "wmflabs.org$"
[09:03:40] wiped 123 records, 8 negative records, 595 packets
[09:03:43] let's see if that helps
[09:03:58] that seems to help yes
[09:04:05] seemingly yes, proxy-03 is no longer getting any traffic
[09:04:10] yep, helm pulls ok now
[09:04:19] taavi: can we dump the old redis into the new ones by hand?
[09:04:31] (and then stop replication)
[09:04:53] arturo: redis replication old->new works fine, the other way is broken
[09:05:02] oh I see
[09:05:07] i made the new ones primary when moving the floating IP
[09:05:32] ok, then I guess if proxy-03 is no longer seeing traffic, let's shut it down to avoid further confusion?
[09:05:37] doing
[09:06:14] the issue here came from the fact that I originally shut it down too early, before internal DNS caches had expired
[09:06:26] ack
[09:06:43] and when I started it back up, it couldn't fetch the redis data because the redis primary was running a newer redis version and the protocol doesn't allow that
[09:07:11] I see :-( unfortunate when software won't just do its thing
[09:07:38] we can work around that if we need to start the old proxies again for some reason, but I suspect we won't
[09:25:02] is it just me or have we been getting more reports of LDAP flakiness lately?
[09:27:15] yep, in the last few days I've seen it quite flaky, especially in the bastion
[09:27:20] (toolforge bastion)
[10:09:48] anyone around for a quick review? https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/56 I broke the deployment mr creation xd
[10:10:02] (it's so hard to test changes to those pipelines...)
[10:10:36] +1'd
[10:12:45] thanks!
[10:24:39] I retried creating a magnum cluster with tofu and this time it worked fine, so it's not /always/ broken
[10:26:20] but web proxy creation seems to be consistently broken, I opened T393679
[10:26:21] T393679: tofuinfratest fails to create web proxy - https://phabricator.wikimedia.org/T393679
[10:27:53] next script I want to run everywhere: https://phabricator.wikimedia.org/P75879, this adds AAAA records to all web proxies
[10:35:26] taavi: I don't understand why it needs to traverse all projects
[10:35:47] I assume the zones hosting most of the proxy FQDNs are in cloudinfra?
[10:36:10] dhinus: thanks for T391467 !
[10:36:11] T391467: gitlab ci: validate secrets settings in pipeline for tofu integration - https://phabricator.wikimedia.org/T391467
[10:37:06] arturo: we create a $PROJECT.wmcloud.org. DNS zone for projects upon creation. so for example the records for the `*.catalyst.wmcloud.org` web proxy live in the `catalyst.wmcloud.org.` DNS zone which exists in the `catalyst` project, not in cloudinfra
[10:37:32] ok, I see
[10:37:44] same thing applies to `wikimania-mautic.wmcloud.org` (as an example I could quickly find), the DNS record for that lives in the `wikimania-mautic` project
[10:40:47] taavi: the script LGTM
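For context, the reason such a script has to walk every project rather than just cloudinfra is that each proxy record lives in its project-owned zone. A rough sketch of that traversal with the designate OSC plugin follows; it is not the actual P75879 paste, and the record name and IPv6 address are placeholders, assuming admin credentials that can act on each project:

    PROXY_V6="<public IPv6 address of the web proxies>"   # placeholder, not the real address
    for project in $(openstack project list -f value -c ID); do
        # each project owns its own $PROJECT.wmcloud.org. zone, so the proxy
        # recordsets have to be touched project by project
        for zone in $(openstack --os-project-id "$project" zone list -f value -c name); do
            # "someproxy" is an illustrative record name, not a real proxy
            openstack --os-project-id "$project" recordset create "$zone" \
                "someproxy.${zone}" --type AAAA --record "$PROXY_V6"
        done
    done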
[10:47:07] * dcaro lunch
[10:54:02] i've run the script in wmflabsdotorg, will do it in cloudinfra (wmcloud.org) and then globally later
[10:54:11] ack
[11:14:15] pro tip: remember to remove --dry-run from the script when running it for real
[12:02:06] xd
[12:34:46] ok, and now harbor's alerting again
[12:34:59] i think prometheus is trying to talk to the proxy v6 address from a v4-only host :/
[12:39:31] i've dropped the AAAA records for those domains for now, will re-add them when T393697 is fixed
[12:39:32] T393697: Rebuild Toolforge Prometheus nodes in v6-dualstack network - https://phabricator.wikimedia.org/T393697
[12:39:51] ack, thanks!
[12:40:18] i've cleared the DNS caches again, so those alerts should resolve shortly
[12:40:21] again sorry for the noise
[12:41:58] np
[12:49:28] dcaro, dhinus: FYI there is a new spicerack release out (v10.2.0) just to be able to build spicerack for bookworm, it temporarily removes elasticsearch support on python 3.10+ (pip) or bookworm (deb), it should not affect you in any way but lmk if it does
[12:51:44] volans: ack
[12:55:26] taavi: fyi, I've re-opened T369891 in case you are interested :), the last error was actually when fetching the openapi.json from the components-api (so it's not only one api failing)
[12:55:26] T369891: [toolforge deploy] direct-api tests fail intermittently on toolsbeta - https://phabricator.wikimedia.org/T369891
[14:21:50] dcaro: regarding T393699, do you think we can shuffle around enough old/smaller nodes to get the refresh nodes onto 25G switches? If not I'm thinking we should maybe refresh them with smaller storage nodes and save giant OSDs for when we have 25G switches everywhere.
[14:21:50] T393699: [ceph] Figure out hosts placement in the racks - https://phabricator.wikimedia.org/T393699
[14:25:34] andrewbogott: I'm looking into what's where and how much space we have around (for my sanity https://docs.google.com/spreadsheets/d/1VImN3sIBWM1uqaNlW5Pzkfh4WmHqnI5RK7JuxqU4NQ0/edit?usp=sharing)
[14:25:55] I was actually thinking of getting the big hosts anyhow, even if we are not using them 100%
[14:26:15] (as in, plugged at 10G but not provisioning all the drives)
[14:26:42] otherwise we would have to wait a few years for the refresh to free up the space in the racks
[14:27:27] yeah, I thought of that too, we could just leave half the drives in the servers idle...
[14:27:34] but that means shrinking the cluster in the near term
[14:27:52] But it sounds like you still want to go ahead with refreshing to the jumbo-sized servers
[14:29:25] we might be ok shrinking the cluster just a little bit; the numbers where the network is not enough assume we utilize the full hosts to the same extent we utilize the current hosts
[14:29:56] ok
[14:30:06] So I'll tell rob to go ahead with the big servers and will stop fretting
[14:30:35] let me do some numbers first just in case though
[14:33:57] oops ok
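For a sense of the bandwidth math being discussed (the per-OSD throughput below is an assumed round number, not a measurement): a 10G link moves roughly 1250 MB/s and a 25G link roughly 3125 MB/s, so a handful of fast drives is already enough to saturate a 10G uplink:

    awk 'BEGIN {
        per_osd = 300                                   # assumed MB/s per OSD, illustrative only
        printf "OSDs to saturate 10G: %.1f\n", 1250 / per_osd
        printf "OSDs to saturate 25G: %.1f\n", 3125 / per_osd
    }'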
[14:39:10] If I want/need to run the simplest possible 1-node k8s cluster on cloudcontrols (for magnum support) do y'all have suggestions about what deployment tech to use? I'm tempted to copy catalyst's k3s model, I don't know of any other production examples for something trivial like that...
[14:39:51] (don't ask me how I feel about the k8saas product requiring an out-of-band k8s cluster to run)
[14:39:57] on cloudcontrols??
[14:40:30] doesn't have to be there but it's the obvious place
[14:40:52] I have already asked the devs the chicken/egg question about running the magnum control plane on magnum but none of the guides recommend that.
[14:41:15] I guess if it's on cloudcontrols it would be 3-node for HA
[14:42:04] i guess running that on vms on some special service project is not an option?
[14:42:29] it probably is, that would be fine
[14:42:46] I just don't generally think of putting the control plane on the cloud, but it's not a bad idea.
[14:43:03] I am not asking you to read this, but if you're really curious this is what I'm talking about: https://docs.openstack.org/magnum-capi-helm/latest/
[14:43:22] another option depending on what you need might be https://kubernetes.io/docs/tutorials/cluster-management/kubelet-standalone/
[14:43:41] i'm generally not a huge fan of putting yet more stuff on cloudcontrols, they're already weird enough
[14:44:02] that's reasonable. I guess we have other control plane stuff on VMs (proxies, puppet)
[14:45:27] So if I'm running this in cloud-vps, what tech would you use? HA would be nice but maybe we don't have a good way to do that, seems like all our current clusters only have one control node.
[14:46:18] to me the most obvious thing would be to have a kubeadm cluster like we already have in tools, since we already have all the tooling and expertise for that
[14:47:12] I think of that as being a strictly by-hand operation, is there automation for that nowadays?
[14:47:48] i mean all of the cluster operations (upgrades, adding/removing nodes) have been automated with puppet and cookbooks for a while now
[14:47:55] main manual thing is standing up a new cluster
[14:49:22] for today standing up is my primary challenge :) I've certainly done it before using upstream docs, do we have docs for the official/toolforge way or should I just stick with how I've done it in the past?
[14:50:32] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/New_cluster
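For comparison, the bare-bones single-node version of that procedure looks roughly like the commands below; this is a generic kubeadm sketch, not the documented Toolforge process from the wiki page above, and the pod CIDR and CNI manifest are placeholders:

    sudo kubeadm init --pod-network-cidr=192.168.0.0/16
    mkdir -p "$HOME/.kube"
    sudo cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
    sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"
    # allow workloads to schedule on the single control-plane node
    kubectl taint nodes --all node-role.kubernetes.io/control-plane-
    # install a CNI of choice (calico, flannel, ...) before anything schedules
    kubectl apply -f <your-cni-manifest.yaml>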
[14:50:39] kind/minikube might also be a good first-manual-test option
[14:51:09] kind is what the magnum docs suggest
[14:51:18] (it might also be easier to uninstall)
[14:51:26] but I'd like to introduce as few new technologies as possible :)
[14:51:35] we use kind inside lima-kilo
[14:51:46] but that's single-node right?
[14:52:01] yeah, for initial testing i maybe wouldn't do kubeadm, but if you need something that's solid for "production" use cases then i think we've proven kubeadm works for that already
[14:52:17] it can spin up workers too, though afaik in the same machine yes (might be able to be multi-machine, have not checked)
[14:52:54] ok
[14:52:55] why do you need a k8s cluster to start magnum?
[14:53:45] dcaro: long answer is linked above. Short answer, I suspect, is "when you have a hammer, every problem looks like a nail"
[14:54:16] we can start a second openstack cluster, create a k8s deployment with magnum there, to start the k8s deployment with magnum in the current openstack cluster
[14:54:17] xd
[14:56:05] oh, it's using k8s cluster api to create clusters
[14:57:33] 'can I use magnum to make the cluster that runs the magnum control plane' was definitely my first question
[14:57:50] but it runs the risk of not being able to ever recover from a service outage
[14:58:16] There are two competing replacement magnum drivers and both use cluster api
[14:58:22] so, a service cluster is in our future one way or another.
[16:04:20] * arturo offline
[16:10:24] dhinus: was there a task for the tofuinfratest proxy failures? I have a fix
[16:10:54] T393679
[16:10:56] T393679: tofuinfratest fails to create web proxy - https://phabricator.wikimedia.org/T393679
[16:11:04] thank you!
[16:12:18] fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143610
[16:12:40] great! btw, magnum worked once this morning (manual tofu run), then it failed again with the cronjob run, but I didn't investigate the error
[16:13:06] ^ is why I'm impatient to switch to the new driver
[16:13:16] although it's going to be more complicated than I had hoped
[16:13:32] yeah, I'm also curious to see if we get any random successful runs from tofu in the coming days
[16:14:04] things look slightly better than yesterday after cleaning up the old clusters/stacks
[16:14:27] I would just let the cronjob do its thing and check back in a few days if there was any successful cluster creation
[16:15:05] * dhinus offline
[16:15:07] taavi, are you interested in this on proxy-04.proxy-codfw1dev.codfw1dev.wikimedia.cloud
[16:15:10] https://www.irccloud.com/pastebin/32cjWSOj/
[16:15:11] ?
[16:15:39] andrewbogott: is that the old bullseye host?
[16:16:20] no, that's 02 which is shut down
[16:16:28] I think 04/05 are a HA pair?
[16:18:01] is the puppetmaster there in sync?
[16:18:50] I'll check
[16:18:56] btw puppet works properly on 05
[16:20:35] ah, it's a catch-22 situation
[16:20:39] yes, it's up to date
[16:20:55] the box needs a new firewall rule coming from puppet, but puppet being broken is blocking it
[16:22:19] fixed
[16:22:42] thank you!
[16:23:00] issue was that it was missing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138837
[16:23:12] next issue: apt update is failing with
[16:23:13] E: The repository 'http://mirrors.wikimedia.org/osbpo bookworm-dalmation-backports-nochange Release' does not have a Release file.
[16:23:31] that's a failure and not just a warning?
[16:23:38] hm, 'dalmation'? i thought it was 'dalmatian'?
[16:23:47] yes, that's why it's failing
[16:23:49] looks like a typo xd
[16:23:50] it's making the puppet run fail, i think the next one will succeed
[16:24:02] but also it should be epoxy now so that line should get removed entirely
[16:24:32] nope, there's also caracal still there
[16:24:46] i removed dalmation manually
[16:26:35] I guess we don't have absenting :/
[16:46:11] puppetdb puppet errors seem to be caused by https://gerrit.wikimedia.org/r/1141963, i'm asking about that in -sre
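For reference, the manual cleanup mentioned at 16:24 amounts to deleting the bad osbpo line and re-running apt; the sources file name below is an assumption about where puppet writes it on those hosts, not a verified path:

    sudo sed -i '/bookworm-dalmation-backports-nochange/d' /etc/apt/sources.list.d/osbpo.list
    sudo apt update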
[17:41:14] Loading the web proxies horizon tab seems to be slow/dead no matter which project I have selected.
[17:49:02] bd808: hmm, we might be missing security group rules to permit traffic to the proxy api over ipv6
[17:50:45] I can have a look once we're done with unscheduled religious events (so in practice tomorrow), but if that's enough for you to go by feel free to poke at it too
[17:51:36] Monte made T393725 for us
[17:51:37] T393725: Horizon proxy tab is showing "Gateway Timeout" - https://phabricator.wikimedia.org/T393725
[17:57:49] I'll look at the security groups
[18:02:25] taavi wins the no-look debugging prize :)
[19:07:48] bd808: next step is to debug an issue with no details other than "something somewhere is broken" :)
[19:15:24] I can usually diagnose that one in no time (since I'm the one who broke it)
[19:34:42] If Cloud is randomly broken somewhere Rabbit and then DNS would be my first guesses :)
[20:47:55] migrating to a new laptop so will be offline for a while
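For reference, the missing-rule theory from 17:49 would be addressed with something along these lines; the security group name and API port are placeholders, not the actual values used by the proxy API:

    openstack security group rule create --ingress --ethertype IPv6 \
        --protocol tcp --dst-port <proxy-api-port> --remote-ip ::/0 \
        <proxy-api-security-group>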