[07:25:41] whoever's clinic duty now, the etherpad seems to need rotating
[08:05:51] my understanding was that we were shifting a week and Raymond was taking this week and I would be taking next.
[08:07:06] but if that's not the case and I misunderstood let me know ;)
[09:09:00] greetings
[09:09:11] flushing backlog, ping me if sth urgent needs my attention
[09:09:48] o/
[09:10:01] I plan to upgrade toolsbeta to k8s 1.32 today
[09:10:08] yay
[09:23:44] looks like gitlab is throttling requests from toolforge build service: T428396
[09:23:44] T428396: Buildservice failing - https://phabricator.wikimedia.org/T428396
[09:23:58] godspeed taavi
[09:24:15] dhinus: see -sre and slack, it was having issues, then got restarted and didn't like a config, now should be back
[09:24:19] as of a minute ago
[09:25:03] ack thanks!
[09:28:01] dhinus: I am still a bit conflicted about our build pipeline relying on gitlab being up. empirically gitlab's availability is lower than the rest of toolforge infrastructure's, but also a large portion of tools host their source code there so it being down will break those builds regardless of whether the buildpacks are hosted there
[09:29:34] taavi: I also don't like that we use gitlab to host binary tools, but at the same time as you say many builds would still rely on it to get the source code
[09:30:11] for buildpacks we could consider having a local cache so that if gitlab is unavailable they will still work, but if a good chunk of tools have their code there it is pointless. And if we're pushing people to put more code there even more. So maybe the action item is to ask for a better SLO/SLI/SLA for gitlab ;)
[09:30:40] or you mean building via gitlab ci?
[09:34:32] I was talking about toolforge build service builds, so gitlab ci is not involved
[09:34:53] in the task above, the build failed downloading binary files from https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/packages
[09:35:08] which we should probably host elsewhere...
[09:36:07] but we also want a good SLO for gitlab because we are encouraging users to store the source code of tools there :)
[10:00:56] the ToolsDBReplicationLagIsTooHigh alert that just triggered is caused by a big "mysqldump" of mixnmatch that is currently running on the replica
[10:01:08] it should resolve when the dump completes, I will open a task to track it
[10:03:57] unable to fetch the kubeadm-config ConfigMap: failed to get node registration: node toolsbeta-test-k8s-worker-nfs-10 doesn't have kubeadm.alpha.kubernetes.io/cri-socket annotation
[10:04:00] huh, never seen that before
[10:04:21] * taavi retries
[10:06:08] nope, still happens
[10:10:46] apparently that suggests some issue during the initial node registration? added the annotation manually and now it seems fine
[10:23:54] the toolforge nfs alert fired, so I filed T428422 and kicked off the script to find the biggest tools
[10:23:54] T428422: 2026-06-08 Toolforge NFS cleanup - https://phabricator.wikimedia.org/T428422
[10:52:13] dhinus: is your team looking at T428214 as well?
[10:52:13] T428214: components-api failing to connect to internal api - https://phabricator.wikimedia.org/T428214
[11:00:33] not yet but we can take a look
[11:00:46] cc Raymond_Ndibe ^
[11:43:11] can someone check the announcement for upgrading toolforge k8s in a week: https://etherpad.wikimedia.org/p/tools-k8s-1.32-announcement
[11:44:23] LGTM, also TIL zonestamp.t.o
[12:00:10] taavi: about https://phabricator.wikimedia.org/T428060#11993677 yeah looks like *something* needs to happen :)
[13:01:24] I'm going to do tools nfs cleanup unless someone else already is
[13:04:05] andrewbogott: t.aavi has already opened a task to the tool owners
[13:04:45] great, then I won't :) taavi, did you already do the routine log pruning? Or is log explosion not a thing now that we use loki?
[13:16:23] 🦄🎉 https://etherpad.wikimedia.org/p/WMCS-2026-06-11 | Channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/ | ping cteam | clinic duty: Raymond_Ndibe
[13:18:11] Raymond_Ndibe: I rotated the etherpad but I'm not available for the SRE meeting so hoping you (or someone else) will attend and take notes today.
[13:20:03] I'm also unable to attend the SRE meeting today
[13:22:57] I'll be there if that helps
[13:23:46] !log tools.dimastbkbot webservice python3.9 start (T428139)
[13:23:46] dhinus: Not expecting to hear !log here
[13:23:46] T428139: [toolsdb] Transaction History Length growing too much - https://phabricator.wikimedia.org/T428139
[13:23:51] sorry wrong channel
[13:30:15] volans: it does help! Just add any notes that you think might be of interest to wmcs folks to the 'sre meeting' section of the etherpad. thx!
[13:30:20] (usually that amounts to 'not much')
[13:30:56] sure, will do
[14:57:07] andrewbogott: I’ll join the sre meeting
[14:57:19] great! Sounds like it's going to be a good one
[15:25:40] andrewbogott: if you are available can you help with +1 for https://phabricator.wikimedia.org/T427731 ?
[15:26:34] looking
[15:29:22] I responded with a couple followup questions -- want to make sure they're really using what they asked for, if yes then it's fine.
[15:30:21] Yeaa, I also ran the tool https://cloudvps-quota.toolforge.org on their project, and could see ram and cpu is exhausted
[15:31:07] filled by the existing VMs, but that doesn't mean the VMs need to be as big as they are.
[15:31:51] Yup agree
[16:01:07] Opened T428470 to track the toolsdb replication lag we saw this morning
[16:01:21] T428470: [toolsdb] mysqldump on big db can cause replica lag - https://phabricator.wikimedia.org/T428470
[18:39:35] andrewbogott: thanks for your help with deleting that cluster!
[18:42:15] i'm working through another issue where our k8s api hostname is not included in the api server certificate SANs (because we use haproxy in front of the internal api endpoint). fwict the cluster-api backend for magnum affords more control of that and i'm wondering what the status of that new backend is and how i can track progress
[18:46:39] and in the meantime, with the heat backend is there any way for us to control the heat template parameters? i ask because i noticed there is a `master_hostname` heat parameter that makes its way into the SANs :)
[22:34:25] dduvall: I had filed an upstream bug about better SAN control. It was closed as implemented, but nothing linked to the bug to make it easier to track down -- https://bugs.launchpad.net/magnum/+bug/2116114
[22:35:38] bd808: oh, interesting! i should have asked you :)
[22:35:53] any idea how/if it was implemented?
[22:36:31] just the "status: New → Fix Released" note. I just looked it up to see what ever happened there.
[22:39:48] i don't see anything in stable/2026.1 :/
[22:42:01] I asked on the bug for an explanation. ¯\_(ツ)_/¯
[22:45:20] the script I had referenced was ripped out in https://opendev.org/openstack/magnum/commit/0c6e7d49066431c0ba2350ccad816ae987c0f045