[08:00:16] greetings
[08:13:20] morning!
[08:38:29] morning
[08:40:01] regarding https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040 can I get some review and/or indication on what should I do to get this unblocked so we can make some progress for the tracing?
[08:58:38] I can try giving it a go, quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[08:59:03] in this one you mention though that you were not able to make it work? https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/294
[09:01:00] that's a very small limitation related to lima-kilo exposure only, shouldn't affect production AFAICT and, unless I made an obvious mistake somewhere that I'm sure some of you can spot in code review, it's probably related to an already existing limitation in our lima-kilo setup
[09:01:24] okok, so I can still query it from within lima-kilo then?
[09:01:30] just that it doesn't show up with the other exposed port for some reason (the others are http, not https)
[09:03:05] (updated formatting of the overview of the merge-request, wasn't readable)
[09:03:26] yes you can query internally or with kubectl port-forward
[09:03:31] ack
[09:04:12] thanks!
[09:04:36] so the idea is that it's not reachable from other pods, except through the https gateway right? And it's not gathering any logs, but you can push through the nginx gateway?
[09:04:50] morning
[09:06:12] correct, the reachability from other pods can be discussed, as long as they only see the nginx gateway with basic auth might be fine in case we might have cluster-internal needs
[09:07:39] ack, I meant reachability to loki itself 👍
[09:09:14] volans: I'm getting `MountVolume.SetUp failed for volume "auth" : secret "loki-tracing-basic-auth" not found` when deploying on my already running lima-kilo
[09:09:25] that's for `pod/loki-tracing-gateway-6c75bbcdf5-q9vdr`
[09:09:38] I can try rebuilding lima-kilo but will take a bit
[09:09:40] the steps are described in the MR commit message
[09:09:42] no need
[09:09:59] it's expected, you can't create the secret within that namespace before the namespace is created
[09:10:17] unless there is another way that I don't know
[09:10:41] if there is an easy way to have a commit that just generates the namespace, and then we create the secret that's also an option
[09:10:45] ack, I was expecting `after the first deployment` to mean successful deployment xd
[09:11:03] the other question is if we have a way to automatically generate the secret based on some data in a private repo
[09:11:15] in that case I can commit the htpasswd secret somewhere
[09:11:19] and use that, it would be idea
[09:11:20] *ideal
[09:11:39] you can add the secret to the things to deploy, let me search for the syntax
[09:12:18] but where to commit it?
[09:13:28] you just have to commit a silly secret for lima-kilo, and then add it to the puppetserver of the project you are deploying to (ex. toolsbeta-puppetserver, or tools-puppetserver)
[09:14:12] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/builds-api/values/tools.yaml.gotmpl?ref_type=heads#L8
[09:14:45] for `local` (aka lima-kilo) you can just put the silly value https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/builds-api/values/local.yaml.gotmpl?ref_type=heads
[09:16:07] and that will create a secret?
[09:17:30] that will extract it from the secrets config, that is created by puppet, with the secrets stored as local commits in the private git repository for each puppetserver
[09:18:04] * volans still confused
[09:18:15] as I don't need the credentials but the secret to be created in kubectl
[09:18:43] ack, let's unpack :)
[09:19:17] the upstream chart expects a secret that I named 'loki-tracing-basic-auth' being already there
[09:19:22] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040/diffs#5f955253eed076c3e7cf0e19cb62bd8b13d7ac19_0_18
[09:19:23] * create secret using helm, with the content being a value `{ .Value.config.auth }` or something
[09:20:31] can it not use the `extraObjects:` mechanism to create the secret with the chart?
[09:20:32] * in the per-environment value files (tools.yaml.gotmpl/toolsbeta.yaml.gotmpl/local.yaml.gotmpl) set that value to something, for local, to a silly string/auth, in toolsbeta/tools, to the snippet that runs the get_secrets.sh script
[09:21:00] both should work I think, being a value should be already overridable
[09:21:44] oh, I see, it want the name of the existing secret, yep, you'll have to add something to extraObjects, or use your own chart
[09:21:54] will get_secrets.sh hide the secret in the diffs/logs when deploying?
[09:23:54] Hmm, I'm not sure, though those logs are not exposed anywhere but the cumin nodes (that already have root access to run anything on those VMs)
[09:24:12] ack
[09:25:15] it's the same with the s3 storage btw. the secrets are there, and used in the charts the same way
[09:26:22] s3 is prod only, locally it uses minio
[09:26:26] so that's slightly different
[09:27:07] * volans adding the secret generation, thanks for the pointers
[09:37:48] I was able to send and receive logs, so that's good :)
[09:39:19] gtg to the doctor, I'll be back in a bit
[09:43:26] :D
[11:04:35] I think it's trying to pull images from quay.io directly
[11:04:52] https://www.irccloud.com/pastebin/Zn3VWD9O/
[11:05:01] (just rebuilt from scratch the lima-kilo)
[11:05:39] oh, I have to redeploy from that toolforge-deploy branch
[11:05:40] okok
[11:05:45] yes
[11:25:38] is wmf-stable/raw equivalent or similar to oci://ghcr.io/kvaps/raw ?
[11:26:01] and/or do we have oci://ghcr.io/kvaps/raw in our internal registry?
[11:35:43] does not ring a bell, what's it for?
[11:38:13] AFAIU some easy way to inline the secret creation without the need to create an additional chart
[11:38:42] extraObjects is a bit of a mess because we're already declaring it in common and I need to define the secret after due to inclusion order of the values files
[11:38:46] wmf-stable/raw is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/raw/README.md which sounds like a similar thing
[11:39:21] I found it git grepping toolforge-deploy :D
[11:51:47] * volans trying if it does the trick
[12:02:33] volans: found the issue with the port not working, commented in the task, there's a networkpolicy blocking it
[12:03:04] * dcaro lunch
[12:03:21] checking, thx
[12:03:38] tools-db disk space is going down again, I opened T409716
[12:03:38] T409716: [toolsdb] ibdata1 growing on primary - https://phabricator.wikimedia.org/T409716
[12:32:25] updated MR, left a comment with an issue with the pre-commit check in CI, not sure if false positive
[12:38:47] * volans lunch
[13:05:00] volans: fyi. we discussed in the team meeting, and unless you have anything against it, we decided to split the logging/loki-tracing in two different components (so they can be deployed individually, and tested individually)
[13:20:33] andrewbogott: is codfw1dev horizon still broken? :(
[13:21:04] yes :( T409328
[13:21:05] T409328: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328
[13:21:23] I haven't looked much since Moritz offered
[13:28:33] andrewbogott: do you recall how disruptive neutron restarts have been lately? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203402 triggers restarts even in eqiad1 due to a spacing change in the config files
[13:29:37] The interruption is brief but it does cause connections to reset. Probably worth removing the newline unless that's a real pain.
[13:29:50] I'm pretty sure puppet does not ever actually restart those services though.
[13:30:01] godog: I want to take another (brief) run at T407586. You said you wanted to try swapping in a different grub build; are you building that yourself?
[13:30:01] T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586
[13:31:13] it's perfectly possible to do it without the newline, it'd just make the config file more ugly
[13:31:21] although T380972 suggests you're right re: not restarting
[13:31:21] T380972: openstack: prevent puppet from restarting neutron-openvswitch-agent - https://phabricator.wikimedia.org/T380972
[13:32:21] so I'm tempted to just merge as is
[13:32:39] I can live with that
[13:33:49] merging then
[13:40:01] andrewbogott: ok, can confirm that does not cause any of the important neutron agents to restart
[13:40:26] I'm going to let that roll out in codfw1dev, and then look at a cookbook/cumin command to restart them to actually pick up the change
[13:40:32] quick review? just moving code from one module to another https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[14:02:12] andrewbogott: not atm no, please go ahead!
[14:02:44] godog: can you tell me what grub you were thinking of trying, and where I can get it?
[14:03:34] andrewbogott: yes I tried (and didn't work) with grub2 2.14~git20250718.0e36779-2 from debian unstable
[14:03:53] ok, and it showed the same issue?
[14:04:17] IIRC yes same issue, packages are https://deb.debian.org/debian/pool/main/g/grub2/
[14:04:35] ok, thanks
[14:05:09] np! please keep us posted
[14:15:30] how do you scp to the busybox installer? I briefly thought that I would be able to just wget from the outside internet but that seems to not work
[14:15:56] it does work though you have to set the proxy first
[14:16:32] to answer your question: you'll find the private key used in the installer on cumin hosts, check `install-console` script
[14:16:47] maybe we should also add 'install-scp'
[14:17:11] wget should be enough for now, but...
[14:17:57] https://www.irccloud.com/pastebin/lwUBWonW/
[14:18:16] that must not be what you meant by 'set the proxy first'
[14:18:37] webproxy_url="http://webproxy.eqiad.wmnet:8080" export http_proxy=$webproxy_url https_proxy=$webproxy_url
[14:18:41] is what I use
[14:19:05] ok, trying...
[14:19:44] yeah, that's the same as what I tried before. Hangs on Connecting to deb.debian.org|2a04:4e42:3d::644|:443...
[14:20:31] curious in the sense I do remember trying and succeeding with wget + proxy
[14:21:27] seems like it's ignoring the proxy settings
[14:23:23] ok, HTTPS_PROXY works and https_proxy does not.
[14:23:49] no, that's not it, something else must've happened
[14:23:57] I guess the installer needs to finish before that works
[14:24:24] wait, and there's no dpkg...
[14:24:35] * andrewbogott hates trying to get things done in busybox
[14:25:26] depending on where the installer is atm, if you chroot /target /bin/bash then you'll get a shell in the installed system
[14:27:34] ah! much better :)
[14:32:42] dhinus: maybe you have a moment? quick review? (the same than the last two times) https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[14:33:07] dcaro: looking
[14:33:11] thanks!
[14:33:29] dhinus: sorry, the mr is https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[14:33:42] ah no, pasted ok xd, in my window the link looked broken
[14:34:46] dcaro: which meeting was that? I don't have anything against splitting them, I can also use your approach of having a "common" directory at the components/ level directory to share some bits
[14:34:46] andrewbogott: if you have a moment, do you have any idea why codfw1dev is seeing the neutron metadata agents as down?
[14:35:08] volans: the team meeting on thursday afternoon
[14:35:38] ack, a note on the MR would have been useful, I could have started to split this morning :)
[14:36:00] but it shouldn't take long now that all seems to work (beside CI)
[14:36:07] both are ok yes, as long as there's two different directories under `components` it should be able to deploy and decouple without issues
[14:36:21] taavi Sounds like https://phabricator.wikimedia.org/T395255 I wonder if I didn't fix in codfw1dev? I'll look shortly
[14:36:21] it can be done after too
[14:37:26] taavi: looks like that's it, want me to upgrade them now?
[14:37:44] yes please!
[14:38:41] volans: meeting notes https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/2025-11-06
[14:39:29] dhinus: any idea whether T409734 is related to any of your recent work?
[14:39:30] T409734: Tools Automoderator and Content Translation metrics can't access tools.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T409734
[14:40:21] dcaro: ack, I can't say that from those notes was clear a decision was made :D
[14:40:23] taavi: huh interesting, I haven't changed any DNS yet so that's surprising
[14:41:13] volans: yep, sorry, the decision was that unless you had anything against, nobody else did
[14:42:33] taavi: done
[14:42:46] andrewbogott: thanks!
[14:45:39] ack
[15:02:46] andrewbogott: meeting ?
[15:37:45] fixing the MTU in codfw1dev: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/280
[16:01:02] update: the network mtu was changed, but the existing interfaces didn't at least automatically migrate.
[16:48:28] dang
[16:58:45] does forcing another dhcp lease do the trick wrt mtu ?
[16:59:37] * dhinus off
[17:23:59] * volans errand and off