[07:28:17] morning
[07:40:19] I'm rebooting all the stuck workers
[08:18:44] hello
[08:32:12] foxtrot-ldap is failing already :/
[08:32:16] bitnami pulled their image
[08:32:18] ERROR: failed to build: failed to solve: docker.io/bitnami/openldap: failed to resolve source metadata for docker.io/bitnami/openldap:latest: docker.io/bitnami/openldap:latest: not found
[08:33:09] there is a notice at https://hub.docker.com/r/bitnami/openldap
[08:33:13] explaining the changes
[08:34:10] yep, I'll change to that one, but we might want to move to something soon-ish
[08:34:14] *something else
[08:35:09] some of the secure ones are free (to be checked); the public ones should be available via the archive, but no longer updated
[08:35:51] ldap does not seem to be in the secure ones list
[08:35:56] (or I'm not finding it)
[08:36:23] same here
[08:36:40] the only one I find is the legacy one https://hub.docker.com/r/bitnamilegacy/openldap
[08:36:43] that you can use for now
[08:37:47] https://docker-registry.wikimedia.org/dev/bullseye-openldap/tags/ could work?
[08:50:31] we have it pulled in toolsbeta harbor too
[08:50:35] is that the same image?
[08:51:15] it seems to be a different image
[08:51:21] not sure it's going to work, looking
[08:53:37] that image is trying to use non-existent repos already :/
[08:53:38] 3.129 E: The repository 'http://mirrors.wikimedia.org/debian bullseye-backports Release' does not have a Release file.
[09:01:47] perfect :/
[09:05:18] used our hosted image for now, but yep, would be nice to use something like that one (or a trixie-based one)
[09:07:11] I guess we can ask sim.on/mor.itz :D https://gitlab.wikimedia.org/repos/releng/dev-images/-/blob/main/dockerfiles/bullseye-openldap/README.md
[09:09:27] 👍
[10:04:44] I just merged a bunch of toolforge proxy changes which are (at least in theory) no-ops
[12:12:35] neat
[13:00:50] it seems there's some gerrit issue that's creating errors on the metricsinfra side
[13:00:59] 'https://gerrit.wikimedia.org/r/cloud/metricsinfra/prometheus-manager/': The requested URL returned error: 503
[13:01:26] there's maintenance going on
[13:01:27] Gerrit will be under maintenance on Monday, 6 Oct 2025, 12:00–13:00 UTC. During the maintenance, the system will be read-only (T387833).
[13:01:27] T387833: Gerrit failover process - https://phabricator.wikimedia.org/T387833
[13:01:59] indeed, I saw mentions of a rollback, are the 503s still ongoing?
[13:17:14] looks like we're back
[13:26:31] it looks ok now yep
[13:38:31] topranks: is it possible to set up a ganeti VM that talks to a cloudsw so it acts like it's in the cloud realm? I don't really know how much ganeti networking is software vs hardware
[13:39:26] andrewbogott: no not really
[13:39:36] well the limitation is the separate physical
[13:39:50] sry, separate physical network
[13:39:58] ok! I assumed not, just daydreaming.
[13:40:13] if the ganeti host is in a cloud rack connected to a cloudsw, no problem
[13:40:30] if it's in a prod rack connected to a prod switch it's not really possible
[13:41:07] right, so we'd have to build a custom ganeti cluster to support that, which would largely defeat the purpose
[13:41:10] I think we would need a ganeti cluster within the cloud racks to achieve that, correct me if I'm wrong
[13:41:39] yeah, basically
[13:41:46] * andrewbogott just trying to think of ways to get a jump-start on k8s-on-metal
[13:43:45] standing up a new Ganeti cluster isn't terribly complicated, especially with routed Ganeti. happy to help if that's needed
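
For the bitnami/openldap pull failure from the morning's discussion, a minimal hedged sketch of checking that the suggested replacements still resolve before pointing the foxtrot-ldap build at them. The image names come from the log above; the tags are assumptions and should be verified against each registry's tag list.

# Hedged sketch: confirm the replacement images resolve without pulling them.
# bitnamilegacy/openldap is archived and gets no further updates; the tag used
# for the Wikimedia dev image is an assumption -- pick a real one from
# https://docker-registry.wikimedia.org/dev/bullseye-openldap/tags/
docker manifest inspect bitnamilegacy/openldap:latest >/dev/null \
  && echo "bitnamilegacy/openldap resolves"
docker manifest inspect docker-registry.wikimedia.org/dev/bullseye-openldap:latest >/dev/null \
  || echo "no 'latest' tag here; use a tag from the tags page"
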
[13:44:53] jah but it requires hardware, so ordering new hardware doesn't save time when we're already waiting on hardware :)
[13:46:08] tbh if we had a cluster like that then I can think of a bunch of other use cases where that could be helpful
[13:47:39] true
[13:48:51] I think I'm missing a bit of context, what does ganeti in cloud give you that openstack does not?
[13:49:51] well, for example: openstack itself is not very resource intensive, so we could run openstack services on many small ganeti hosts and have better service separation between different openstack things.
[13:50:10] that's not something I'm anxious to do but I know it bugs taavi that we have a bunch of different things running on each cloudcontrol :D
[13:50:59] openstack things for example, having the ability to split the things on cloudcontrols (mariadb for example) to separate hosts with separate OS reimage patterns
[13:51:00] dcaro: but the real reason I asked in the first place was I was wondering if we could set up a toy toolforge-on-k8s-on-metal simulation on ganeti (since ganeti would be networked more like a metal host). That wouldn't be a long-term deployment regardless.
[13:51:23] having cloudcumins with access to the cloud network directly would simplify the setup somewhat
[13:52:00] so this would be an alternative to having openstack in openstack, or in containers, kinda thing right?
[13:53:02] sort of, although I'm not exactly sure what you mean by openstack in openstack
[13:53:27] running openstack inside vms managed by openstack itself
[13:53:33] https://wiki.openstack.org/wiki/TripleO
[13:53:36] there was an effort on that some time ago at least
[13:53:38] yep that
[13:53:40] Although I never took that idea seriously for us...
[13:54:18] the thing we were talking about was running a k8s cluster and deploying openstack on that. https://docs.openstack.org/kolla-ansible/latest/
[13:54:33] yep, that's the other one
[13:54:47] yep, I remember that one
[13:54:54] I guess that a very lightweight version of that would be just starting openstack inside containers manually
[13:54:55] The reason that didn't go anywhere is that we quickly discovered we'd have to build our own containers since k-a doesn't really support the ldap backend for keystone.
[13:55:09] or didn't at the time
[14:25:26] my internet decided to stop working right now :/
[14:26:22] so I'm on a 5g hotspot, let's see how it goes
[14:28:37] anyone with a mac can test https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/277 +
[14:28:38] ?
[14:29:18] dcaro: I can after the meeting :)
[14:29:24] thanks!
[15:28:54] dcaro: the build worked but for some reason it failed towards the end installing ingress-admission
[15:29:11] Failed to render chart: exit status 1: Error: Kubernetes cluster unreachable: Get "https://toolforge-control-plane:6443/version":
[15:29:18] but it deployed all the other components before that...
[15:31:34] hmm... I had the toolforge-control-plane container restart itself
[15:31:36] too
[15:31:38] by the end
[15:31:57] it said something about autoupgrades, "rebooted", and it came back up
[15:32:41] where did you see that it restarted?
[15:32:50] I'm trying to run ansible again
[15:33:19] docker logs toolforge-control-plane
[15:33:31] from within the lima vm
[15:34:56] I see some boot messages there... but not a clear "restarted" message
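
A rough sketch, run from inside the lima VM, of confirming whether the control plane is reachable again after a blip like the one above before re-running ansible. The host and port come from the error message; the assumption is that the toolforge-control-plane container name resolves from wherever the check runs.

# Assumes toolforge-control-plane resolves, as in the error message above.
# /version is normally served to anonymous clients, so no kubeconfig is needed.
docker ps --filter name=toolforge-control-plane --format '{{.Names}} {{.Status}}'
curl -ks https://toolforge-control-plane:6443/version || echo "API server still unreachable"
# Look for the stop/start burst mentioned above in the node's recent logs.
docker logs --since 15m toolforge-control-plane 2>&1 | grep -iE 'stopping|starting' | tail -n 20
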
[15:36:16] re-running ansible worked fine, so it could be that the control-plane was down for a few seconds
[15:37:41] it did not say restarted, but it showed 'stopping service X, stopping Y, ...'
[15:37:58] and then the startup messages 'starting X, starting Y, ...'
[15:38:23] I only see the "starting" ones
[15:38:39] okok
[15:39:15] https://www.irccloud.com/pastebin/zv5f2OOQ/
[15:39:21] an example of the stopping things
[15:39:29] https://phabricator.wikimedia.org/P83606
[15:39:59] I can test provisioning from scratch a couple more times tomorrow
[15:40:06] but it doesn't seem like a blocker, as re-running ansible fixed it
[15:40:25] ack
[15:43:29] I have to log off, see you tomorrow :)
[15:47:55] cya!
[15:55:13] * volans too
[16:01:57] dhinus: fyi, the db-saving MR was actually pointing to the wrong branch; it's actually ~1k lines, of which ~500 are tests
[16:02:05] (vs the 9k it showed before xd)
[16:38:05] * dcaro off
[17:11:22] If anyone is still around, can I get a +1 for https://phabricator.wikimedia.org/T406271 ?
[19:36:33] +1d
[19:41:16] ty
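
Since re-running ansible cleared the ingress-admission failure above, a small hedged retry wrapper could paper over short control-plane blips during provisioning. The playbook and inventory names below are placeholders, not the actual lima-kilo entry points.

# Placeholder playbook/inventory names; substitute the real lima-kilo ones.
for attempt in 1 2 3; do
  ansible-playbook -i inventory.yml site.yml && break
  echo "run ${attempt} failed; giving the control plane a moment before retrying..."
  sleep 30
done
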