[07:28:17] morning
[07:40:19] I'm rebooting all the stuck workers
[08:18:44] hello
[08:32:12] foxtrot-ldap is failing already :/
[08:32:16] bitnami pulled their image
[08:32:18] ERROR: failed to build: failed to solve: docker.io/bitnami/openldap: failed to resolve source metadata for docker.io/bitnami/openldap:latest: docker.io/bitnami/openldap:latest: not found
[08:33:09] there is a notice at https://hub.docker.com/r/bitnami/openldap
[08:33:13] explaining the changes
[08:34:10] yep, I'll change to that one, but we might want to move to something soon-ish
[08:34:14] *something else
[08:35:09] some of the secure ones are free (to be checked); the public ones should be available via the archive, but no longer updated
[08:35:51] ldap does not seem to be in the secure ones list
[08:35:56] (or I'm not finding it)
[08:36:23] same here
[08:36:40] the only one I find is the legacy one https://hub.docker.com/r/bitnamilegacy/openldap
[08:36:43] that you can use for now
[08:37:47] https://docker-registry.wikimedia.org/dev/bullseye-openldap/tags/ could work?
[08:50:31] we have it pulled in toolsbeta harbor too
[08:50:35] is that the same image?
[08:51:15] it seems to be a different image
[08:51:21] not sure it's going to work, looking
[08:53:37] that image is trying to use non-existent repos already :/
[08:53:38] 3.129 E: The repository 'http://mirrors.wikimedia.org/debian bullseye-backports Release' does not have a Release file.
[09:01:47] perfect :/
[09:05:18] used our hosted image for now, but yep, would be nice to use something like that one (or a trixie-based one)
[09:07:11] I guess we can ask sim.on/mor.itz :D https://gitlab.wikimedia.org/repos/releng/dev-images/-/blob/main/dockerfiles/bullseye-openldap/README.md
[09:09:27] 👍
[10:04:44] I just merged a bunch of toolforge proxy changes which are (at least in theory) no-ops
[12:12:35] neat
[13:00:50] it seems there's some gerrit issue that's creating errors on the metricsinfra side
[13:00:59] 'https://gerrit.wikimedia.org/r/cloud/metricsinfra/prometheus-manager/': The requested URL returned error: 503
[13:01:26] there's maintenance going on
[13:01:27] Gerrit will be under maintenance on Monday, 6 Oct 2025, 12:00–13:00 UTC. During the maintenance, the system will be read-only (T387833).
[13:01:27] T387833: Gerrit failover process - https://phabricator.wikimedia.org/T387833
[13:01:59] indeed, I saw mentions of a rollback, are the 503s still ongoing?
[13:17:14] looks like we're back
[13:26:31] it looks ok now yep
[13:38:31] topranks: is it possible to set up a ganeti VM that talks to a cloudsw so it acts like it's in the cloud realm? I don't really know how much ganeti networking is software vs hardware
[13:39:26] andrewbogott: no not really
[13:39:36] well the limitation is the separate physical
[13:39:50] sry, separate physical network
[13:39:58] ok! I assumed not, just daydreaming.
[13:40:13] if the ganeti host is in a cloud rack connected to a cloudsw, no problem
[13:40:30] if it's in a prod rack connected to a prod switch it's not really possible
[13:41:07] right, so we'd have to build a custom ganeti cluster to support that, which would largely defeat the purpose
[13:41:10] I think we would need a ganeti cluster within the cloud racks to achieve that, correct me if I'm wrong
[13:41:39] yeah, basically
[13:41:46] * andrewbogott just trying to think of ways to get a jump-start on k8s-on-metal
[13:43:45] standing up a new Ganeti cluster isn't terribly complicated, especially with routed Ganeti. happy to help if that's needed
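
For the bitnami/openldap pull failure from the morning's discussion, a minimal hedged sketch of checking that the suggested replacements still resolve before pointing the foxtrot-ldap build at them. The image names come from the log above; the tags are assumptions and should be verified against each registry's tag list.

# Hedged sketch: confirm the replacement images resolve without pulling them.
# bitnamilegacy/openldap is archived and gets no further updates; the tag used
# for the Wikimedia dev image is an assumption -- pick a real one from
# https://docker-registry.wikimedia.org/dev/bullseye-openldap/tags/
docker manifest inspect bitnamilegacy/openldap:latest >/dev/null \
  && echo "bitnamilegacy/openldap resolves"
docker manifest inspect docker-registry.wikimedia.org/dev/bullseye-openldap:latest >/dev/null \
  || echo "no 'latest' tag here; use a tag from the tags page"
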
[13:44:53] jah but it requires hardware, so ordering new hardware doesn't save time when we're already waiting on hardware :)
[13:46:08] tbh if we had a cluster like that then I can think of a bunch of other use cases where that could be helpful
[13:47:39] true
[13:48:51] I think I'm missing a bit of context, what does ganeti in cloud give you that openstack does not?
[13:49:51] well, for example: openstack itself is not very resource intensive, so we could run openstack services on many small ganeti hosts and have better service separation between different openstack things.
[13:50:10] that's not something I'm anxious to do but I know it bugs taavi that we have a bunch of different things running on each cloudcontrol :D
[13:50:59] openstack things for example, having the ability to split the things on cloudcontrols (mariadb for example) to separate hosts with separate OS reimage patterns
[13:51:00] dcaro: but the real reason I asked in the first place was I was wondering if we could set up a toy toolforge-on-k8s-on-metal simulation on ganeti (since ganeti would be networked more like a metal host). That wouldn't be a long-term deployment regardless.
[13:51:23] having cloudcumins with access to the cloud network directly would simplify the setup somewhat
[13:52:00] so this would be an alternative to having openstack in openstack, or in containers, kinda thing right?
[13:53:02] sort of, although I'm not exactly sure what you mean by openstack in openstack
[13:53:27] running openstack inside vms managed by openstack itself
[13:53:33] https://wiki.openstack.org/wiki/TripleO
[13:53:36] there was an effort on that some time ago at least
[13:53:38] yep that
[13:53:40] Although I never took that idea seriously for us...
[13:54:18] the thing we were talking about was running a k8s cluster and deploying openstack on that. https://docs.openstack.org/kolla-ansible/latest/
[13:54:33] yep, that's the other one
[13:54:47] yep, I remember that one
[13:54:54] I guess that a very lightweight version of that would be just starting openstack inside containers manually
[13:54:55] The reason that didn't go anywhere is that we quickly discovered we'd have to build our own containers since k-a doesn't really support the ldap backend for keystone.
[13:55:09] or didn't at the time
[14:25:26] my internet decided to stop working right now :/
[14:26:22] so I'm on a 5g hotspot, let's see how it goes
[14:28:37] anyone with a mac can test https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/277 +
[14:28:38] ?
[14:29:18] dcaro: I can after the meeting :)
[14:29:24] thanks!
[15:28:54] dcaro: the build worked but for some reason it failed towards the end installing ingress-admission
[15:29:11] Failed to render chart: exit status 1: Error: Kubernetes cluster unreachable: Get "https://toolforge-control-plane:6443/version":
[15:29:18] but it deployed all the other components before that...
[15:31:34] hmm... I had the toolforge-control-plane container restart itself
[15:31:36] too
[15:31:38] by the end
[15:31:57] it said something about autoupgrades, "rebooted", and it came back up
[15:32:41] where did you see that it restarted?
[15:32:50] I'm trying to run ansible again
[15:33:19] docker logs toolforge-control-plane
[15:33:31] from within the lima vm
[15:34:56] I see some boot messages there... but not a clear "restarted" message
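
A rough sketch, run from inside the lima VM, of confirming whether the control plane is reachable again after a blip like the one above before re-running ansible. The host and port come from the error message; the assumption is that the toolforge-control-plane container name resolves from wherever the check runs.

# Assumes toolforge-control-plane resolves, as in the error message above.
# /version is normally served to anonymous clients, so no kubeconfig is needed.
docker ps --filter name=toolforge-control-plane --format '{{.Names}} {{.Status}}'
curl -ks https://toolforge-control-plane:6443/version || echo "API server still unreachable"
# Look for the stop/start burst mentioned above in the node's recent logs.
docker logs --since 15m toolforge-control-plane 2>&1 | grep -iE 'stopping|starting' | tail -n 20
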
[15:36:16] re-running ansible worked fine, so it could be that the control-plane was down for a few seconds
[15:37:41] it did not say restarted, but it showed 'stopping service X, stopping Y, ...'
[15:37:58] and then the startup messages 'starting X, starting Y, ...'
[15:38:23] I only see the "starting" ones
[15:38:39] okok
[15:39:15] https://www.irccloud.com/pastebin/zv5f2OOQ/
[15:39:21] an example of the stopping things
[15:39:29] https://phabricator.wikimedia.org/P83606
[15:39:59] I can test provisioning from scratch a couple more times tomorrow
[15:40:06] but it doesn't seem like a blocker, as re-running ansible fixed it
[15:40:25] ack
[15:43:29] I have to log off, see you tomorrow :)
[15:47:55] cya!
[15:55:13] * volans too
[16:01:57] dhinus: fyi, the db-saving MR was actually pointing to the wrong branch; it's actually ~1k lines, of which ~500 are tests
[16:02:05] (vs the 9k it showed before xd)
[16:38:05] * dcaro off
[17:11:22] If anyone is still around, can I get a +1 for https://phabricator.wikimedia.org/T406271 ?
[19:36:33] +1d
[19:41:16] ty
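
Since re-running ansible cleared the ingress-admission failure above, a small hedged retry wrapper could paper over short control-plane blips during provisioning. The playbook and inventory names below are placeholders, not the actual lima-kilo entry points.

# Placeholder playbook/inventory names; substitute the real lima-kilo ones.
for attempt in 1 2 3; do
  ansible-playbook -i inventory.yml site.yml && break
  echo "run ${attempt} failed; giving the control plane a moment before retrying..."
  sleep 30
done
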