[08:00:16] <godog>	 greetings
[08:13:20] <dcaro>	 morning!
[08:38:29] <volans>	 morning
[08:40:01] <volans>	 regarding https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040 can I get some review and/or indication on what I should do to get this unblocked so we can make some progress for the tracing?
[08:58:38] <dcaro>	 I can try giving it a go, quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[08:59:03] <dcaro>	 in this one you mention though that you were not able to make it work? https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/294
[09:01:00] <volans>	 that's a very small limitation related to lima-kilo exposure only, shouldn't affect production AFAICT and, unless I made an obvious mistake somewhere that I'm sure some of you can spot in code review, it's probably related to an already existing limitation in our lima-kilo setup
[09:01:24] <dcaro>	 okok, so I can still query it from within lima-kilo then?
[09:01:30] <volans>	 just that it doesn't show up with the other exposed port for some reason (the others are http, not https)
[09:03:05] <volans>	 (updated formatting of the overview of the merge-request, wasn't readable)
[09:03:26] <volans>	 yes you can query internally or with kubectl port-forward
[09:03:31] <dcaro>	 ack
[09:04:12] <volans>	 thanks!
[09:04:36] <dcaro>	 so the idea is that it's not reachable from other pods, except through the https gateway right? And it's not gathering any logs, but you can push through the nginx gateway?
[09:04:50] <dhinus>	 morning
[09:06:12] <volans>	 correct, the reachability from other pods can be discussed, as long as they only see the nginx gateway with basic auth it might be fine, in case we have cluster-internal needs
[09:07:39] <dcaro>	 ack, I meant reachability to loki itself 👍
[09:09:14] <dcaro>	 volans: I'm getting `MountVolume.SetUp failed for volume "auth" : secret "loki-tracing-basic-auth" not found` when deploying on my already running lima-kilo
[09:09:25] <dcaro>	 that's for `pod/loki-tracing-gateway-6c75bbcdf5-q9vdr`
[09:09:38] <dcaro>	 I can try rebuilding lima-kilo but will take a bit
[09:09:40] <volans>	 the steps are described in the MR commit message
[09:09:42] <volans>	 no need
[09:09:59] <volans>	 it's expected, you can't create the secret within that namespace before the namespace is created
[09:10:17] <volans>	 unless there is another way that I don't know
[09:10:41] <volans>	 if there is an easy way to have a commit that just generates the namespace, and then we create the secret that's also an option
[09:10:45] <dcaro>	 ack, I was expecting `after the first deployment` to mean successful deployment xd
[09:11:03] <volans>	 the other question is if we have a way to automatically generate the secret based on some data in a private repo
[09:11:15] <volans>	 in that case I can commit the htpasswd secret somewhere
[09:11:19] <volans>	 and use that, it would be idea
[09:11:20] <volans>	 *ideal
[09:11:39] <dcaro>	 you can add the secret to the things to deploy, let me search for the syntax
[09:12:18] <volans>	 but where to commit it?
[09:13:28] <dcaro>	 you just have to commit a silly secret for lima-kilo, and then add it to the puppetserver of the project you are deploying to (ex. toolsbeta-puppetserver, or tools-puppetserver) 
[09:14:12] <dcaro>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/builds-api/values/tools.yaml.gotmpl?ref_type=heads#L8
[09:14:45] <dcaro>	 for `local` (aka lima-kilo) you can just put the silly value https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/builds-api/values/local.yaml.gotmpl?ref_type=heads
[09:16:07] <volans>	 and that will create a secret?
[09:17:30] <dcaro>	 that will extract it from the secrets config, that is created by puppet, with the secrets stored as local commits in the private git repository for each puppetserver
[09:18:04] * volans still confused
[09:18:15] <volans>	 as I don't need the credentials but the secret to be created in kubectl
[09:18:43] <dcaro>	 ack, let's unpack :)
[09:19:17] <volans>	 the upstream chart expects a secret that I named 'loki-tracing-basic-auth' being already there
[09:19:22] <volans>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040/diffs#5f955253eed076c3e7cf0e19cb62bd8b13d7ac19_0_18
[09:19:23] <dcaro>	 * create secret using helm, with the content being a value `{{ .Values.config.auth }}` or something
[09:20:31] <taavi>	 can it not use the `extraObjects:` mechanism to create the secret with the chart?
[09:20:32] <dcaro>	 * in the per-environment value files (tools.yaml.gotmpl/toolsbeta.yaml.gotmpl/local.yaml.gotmpl) set that value to something, for local, to a silly string/auth, in toolsbeta/tools, to the snippet that runs the get_secrets.sh script
[09:21:00] <dcaro>	 both should work I think, being a value should be already overridable
[09:21:44] <dcaro>	 oh, I see, it wants the name of the existing secret, yep, you'll have to add something to extraObjects, or use your own chart
[09:21:54] <volans>	 will get_secrets.sh hide the secret in the diffs/logs when deploying?
[09:23:54] <dcaro>	 Hmm, I'm not sure, though those logs are not exposed anywhere but the cumin nodes (that already have root access to run anything on those VMs)
[09:24:12] <volans>	 ack
[09:25:15] <dcaro>	 it's the same with the s3 storage btw. the secrets are there, and used in the charts the same way
[09:26:22] <volans>	 s3 is prod only, locally it uses minio
[09:26:26] <volans>	 so that's slightly different
[09:27:07] * volans adding the secret generation, thanks for the pointers
[09:37:48] <dcaro>	 I was able to send and receive logs, so that's good :)
[09:39:19] <dcaro>	 gtg to the doctor, I'll be back in a bit
[09:43:26] <volans>	 :D
[11:04:35] <dcaro>	 I think it's trying to pull images from quay.io directly
[11:04:52] <dcaro>	 https://www.irccloud.com/pastebin/Zn3VWD9O/
[11:05:01] <dcaro>	 (just rebuilt from scratch the lima-kilo)
[11:05:39] <dcaro>	 oh, I have to redeploy from that toolforge-deploy branch
[11:05:40] <dcaro>	 okok
[11:05:45] <volans>	 yes
[11:25:38] <volans>	 is wmf-stable/raw equivalent or similar to oci://ghcr.io/kvaps/raw ?
[11:26:01] <volans>	 and/or do we have oci://ghcr.io/kvaps/raw in our internal registry?
[11:35:43] <dcaro>	 does not ring a bell, what's it for?
[11:38:13] <volans>	 AFAIU some easy way to inline the secret creation without the need to create an additional chart
[11:38:42] <volans>	 extraObjects is a bit of a mess because we're already declaring it in common and I need to define the secret after due to inclusion order of the values files
[11:38:46] <taavi>	 wmf-stable/raw is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/raw/README.md which sounds like a similar thing
[11:39:21] <volans>	 I found it git grepping toolforge-deploy :D
[11:51:47] * volans trying if it does the trick
[12:02:33] <dcaro>	 volans: found the issue with the port not working, commented in the task, there's a networkpolicy blocking it
[12:03:04] * dcaro lunch
[12:03:21] <volans>	 checking, thx
[12:03:38] <dhinus>	 tools-db disk space is going down again, I opened T409716
[12:03:38] <stashbot>	 T409716: [toolsdb] ibdata1 growing on primary - https://phabricator.wikimedia.org/T409716
[12:32:25] <volans>	 updated MR, left a comment with an issue with the pre-commit check in CI, not sure if false positive
[12:38:47] * volans lunch
[13:05:00] <dcaro>	 volans: fyi. we discussed in the team meeting, and unless you have anything against it, we decided to split the logging/loki-tracing in two different components (so they can be deployed individually, and tested individually)
[13:20:33] <taavi>	 andrewbogott: is codfw1dev horizon still broken? :(
[13:21:04] <andrewbogott>	 yes :(  T409328
[13:21:05] <stashbot>	 T409328: sso failure in codfw1dev (labtesthorizon.wikimedia.org) - https://phabricator.wikimedia.org/T409328
[13:21:23] <andrewbogott>	 I haven't looked much since Moritz offered
[13:28:33] <taavi>	 andrewbogott: do you recall how disruptive neutron restarts have been lately? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203402 triggers restarts even in eqiad1 due to a spacing change in the config files
[13:29:37] <andrewbogott>	 The interruption is brief but it does cause connections to reset. Probably worth removing the newline unless that's a real pain.
[13:29:50] <andrewbogott>	 I'm pretty sure puppet does not ever actually restart those services though.
[13:30:01] <andrewbogott>	 godog: I want to take another (brief) run at T407586. You said you wanted to try swapping in a different grub build; are you building that yourself?
[13:30:01] <stashbot>	 T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586
[13:31:13] <taavi>	 it's perfectly possible to do it without the newline, it'd just make the config file more ugly
[13:31:21] <taavi>	 although T380972 suggests you're right re: not restarting
[13:31:21] <stashbot>	 T380972: openstack: prevent puppet from restarting neutron-openvswitch-agent - https://phabricator.wikimedia.org/T380972
[13:32:21] <taavi>	 so I'm tempted to just merge as is
[13:32:39] <andrewbogott>	 I can live with that
[13:33:49] <taavi>	 merging then
[13:40:01] <taavi>	 andrewbogott: ok, can confirm that does not cause any of the important neutron agents to restart
[13:40:26] <taavi>	 I'm going to let that roll out in codfw1dev, and then look at a cookbook/cumin command to restart them to actually pick up the change
[13:40:32] <dcaro>	 quick review? just moving code from one module to another https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[14:02:12] <godog>	 andrewbogott: not atm no, please go ahead!
[14:02:44] <andrewbogott>	 godog: can you tell me what grub you were thinking of trying, and where I can get it?
[14:03:34] <godog>	 andrewbogott: yes I tried (and didn't work) with grub2 2.14~git20250718.0e36779-2 from debian unstable
[14:03:53] <andrewbogott>	 ok, and it showed the same issue?
[14:04:17] <godog>	 IIRC yes same issue, packages are https://deb.debian.org/debian/pool/main/g/grub2/
[14:04:35] <andrewbogott>	 ok, thanks
[14:05:09] <godog>	 np! please keep us posted
[14:15:30] <andrewbogott>	 how do you scp to the busybox installer? I briefly thought that I would be able to just wget from the outside internet but that seems to not work
[14:15:56] <godog>	 it does work though you have to set the proxy first
[14:16:32] <godog>	 to answer your question: you'll find the private key used in the installer on cumin hosts, check `install-console` script
[14:16:47] <godog>	 maybe we should also add 'install-scp'
[14:17:11] <andrewbogott>	 wget should be enough for now, but...
[14:17:57] <andrewbogott>	 https://www.irccloud.com/pastebin/lwUBWonW/
[14:18:16] <andrewbogott>	 that must not be what you meant by 'set the proxy first'
[14:18:37] <godog>	 webproxy_url="http://webproxy.eqiad.wmnet:8080"; export http_proxy=$webproxy_url https_proxy=$webproxy_url
[14:18:41] <godog>	 is what I use
[14:19:05] <andrewbogott>	 ok, trying...
[14:19:44] <andrewbogott>	 yeah, that's the same as what I tried before. Hangs on Connecting to deb.debian.org|2a04:4e42:3d::644|:443...
[14:20:31] <godog>	 curious in the sense I do remember trying and succeeding with wget + proxy
[14:21:27] <andrewbogott>	 seems like it's ignoring the proxy settings
[14:23:23] <andrewbogott>	 ok, HTTPS_PROXY works and https_proxy does not.
[14:23:49] <andrewbogott>	 no, that's not it, something else must've happened
[14:23:57] <andrewbogott>	 I guess the installer needs to finish before that works
[14:24:24] <andrewbogott>	 wait, and there's no dpkg...
[14:24:35] * andrewbogott hates trying to get things done in busybox
[14:25:26] <godog>	 depending on where the installer is atm, if you chroot /target /bin/bash then you'll get a shell in the installed system
[14:27:34] <andrewbogott>	 ah! much better :)
[14:32:42] <dcaro>	 dhinus: maybe you have a moment? quick review? (the same as the last two times) https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[14:33:07] <dhinus>	 dcaro: looking
[14:33:11] <dcaro>	 thanks!
[14:33:29] <dcaro>	 dhinus: sorry, the mr is https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/247
[14:33:42] <dcaro>	 ah no, pasted ok xd, in my window the link looked broken
[14:34:46] <volans>	 dcaro: which meeting was that? I don't have anything against splitting them, I can also use your approach of having a "common" directory at the components/ level to share some bits
[14:34:46] <taavi>	 andrewbogott: if you have a moment, do you have any idea why codfw1dev is seeing the neutron metadata agents as down?
[14:35:08] <dcaro>	 volans: the team meeting on thursday afternoon
[14:35:38] <volans>	 ack, a note on the MR would have been useful, I could have started to split this morning :)
[14:36:00] <volans>	 but it shouldn't take long now that all seems to work (besides CI)
[14:36:07] <dcaro>	 both are ok yes, as long as there are two different directories under `components` it should be able to deploy and decouple without issues
[14:36:21] <andrewbogott>	 taavi Sounds like https://phabricator.wikimedia.org/T395255 I wonder if I didn't fix in codfw1dev?  I'll look shortly
[14:36:21] <dcaro>	 it can be done after too
[14:37:26] <andrewbogott>	 taavi: looks like that's it, want me to upgrade them now?
[14:37:44] <taavi>	 yes please!
[14:38:41] <dcaro>	 volans: meeting notes https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/2025-11-06
[14:39:29] <taavi>	 dhinus: any idea whether T409734 is related to any of your recent work?
[14:39:30] <stashbot>	 T409734: Tools Automoderator and Content Translation metrics can't access tools.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T409734
[14:40:21] <volans>	 dcaro: ack, I can't say it was clear from those notes that a decision was made :D
[14:40:23] <dhinus>	 taavi: huh interesting, I haven't changed any DNS yet so that's surprising
[14:41:13] <dcaro>	 volans: yep, sorry, the decision was that unless you had anything against, nobody else did
[14:42:33] <andrewbogott>	 taavi: done
[14:42:46] <taavi>	 andrewbogott: thanks!
[14:45:39] <volans>	 ack
[15:02:46] <godog>	 andrewbogott: meeting ?
[15:37:45] <taavi>	 fixing the MTU in codfw1dev: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/280
[16:01:02] <taavi>	 update: the network mtu was changed, but the existing interfaces didn't migrate, at least not automatically.
[16:48:28] <andrewbogott>	 dang
[16:58:45] <godog>	 does forcing another dhcp lease do the trick wrt mtu ?
[16:59:37] * dhinus off
[17:23:59] * volans errand and off