[07:44:00] apparently chatgpt still hasn't realized that the 1-hour TTL on *.toolforge.org has expired after the record was changed yesterday
[08:00:19] morning!
[08:03:09] greetings
[08:47:24] hello!
[10:20:46] taavi: this is expected right? summary: Service tools-proxy-10:443 has failed probes (http_toolforge_org_ip4)
[10:21:05] (the url still works for me)
[10:21:29] dcaro: see my !log on -cloud a couple of minutes ago :-)
[10:21:45] I saw yes, just making sure nothing broke
[10:22:12] so yes, expected
[10:22:18] I'll ack the alert
[10:22:24] the alerts will clear as soon as Puppet runs on the prometheus hosts
[10:57:29] * dcaro lunch
[12:19:51] andrewbogott: I'm going to poke T407586 a bit more FYI, hope that's okay
[12:19:51] T407586: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586
[12:26:08] Yes please! There's lots more info on the task about what's going wrong
[12:33:05] ok! yeah the fact that I can't reproduce locally bugs (ah ah) me
[13:13:40] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/290 moving logic to toolforge_deploy_mr
[13:15:37] godog: yesterday I was hacking the apt1002 to make the installer pause at the end before reboot. I can do that again if you want to get a clear view of the server post-install.
[14:06:13] the deployment is running the tests and they pass
[14:06:37] I did see one test fail just before you reverted the change
[14:06:49] https://www.irccloud.com/pastebin/eXicMwpG/
[14:06:53] still running
[14:07:03] " ✗ list build [20796]"
[14:07:26] that's from your run right?
[14:07:32] the logs of the builds-api pod did not show much, except a very high latency for some operations (>3m)
[14:07:41] dcaro: yes from my run 5 mins ago
[14:07:42] graphs show the same yep
[14:07:44] https://usercontent.irccloud-cdn.com/file/rqhInDU6/image.png
[14:08:10] so something was making some api calls very slow, which meant they timed out on one of the layers above
[14:08:31] have a link to the patch that was deployed earlier?
[14:08:32] it's weird that it did not fail the functional tests on deployment though
[14:08:37] yes very weird
[14:08:47] maybe some resource exhaustion that took a few mins after deploy?
[14:08:49] toolforge deploy mr is https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1009
[14:09:06] the builds-api release is https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/releases/0.0.202-20251023122338-895f8556
[14:09:23] the only change in it is https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/144
[14:10:18] what it does is an extra call to get the taskrun for the build, and fetch the image sha, to add it to the image name
[14:10:52] so there's an extra call to k8s tekton, but it should not be that impactful, it's filtering by name too
[14:10:52] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/144/diffs#0cfd90216a04e25257d891d68ee18c147f7f0d8c_220_223
[14:11:00] so no "query everything" kinda call
[14:11:30] (using a labelselector, that should be fast enough I think, annotations are the ones that are more troublesome from what I remember)
[14:11:40] very weird, I have no idea what's going on
[14:12:00] let me try to do that k8s query by hand
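The actual by-hand query and its timings are only visible behind the IRCCloud pastebin links that follow, so here is a minimal sketch of what such a check might look like. The namespace and PipelineRun name are placeholders, and the label key is an assumption based on the label Tekton normally puts on TaskRuns it creates for a PipelineRun:

```
# Hedged sketch, not the exact command from the pastebins: list the TaskRuns
# belonging to one PipelineRun via a label selector, roughly the extra call the
# builds-api change adds per build. Namespace and name below are placeholders.
time kubectl -n example-namespace get taskruns.tekton.dev \
  -l tekton.dev/pipelineRun=example-pipelinerun \
  -o name
```

Comparing the wall-clock time of this against the equivalent pipelineruns query, as done a few messages below, is what narrows the slowness down to the taskruns resource itself.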
[14:13:38] hmm... yep, that might be an issue
[14:13:48] well, it's not superslow though
[14:13:50] https://www.irccloud.com/pastebin/Jp6aSmcc/
[14:14:46] but if you have many builds, and it takes 15s for each build, it multiplies
[14:15:01] hmm.... :oldmanyellsatsky:
[14:15:53] https://www.irccloud.com/pastebin/rBJsdf2f/
[14:16:02] it consistently takes ~15s
[14:16:26] toolsbeta is way smaller, so probably that's why we did not see the issue
[14:16:35] makes sense
[14:16:41] that's a bit annoying, slow enough to matter even with there never being /that/ many builds
[14:17:11] 5 builds might already pass the common 1min timeouts
[14:17:19] as far as I can tell there is 1 taskrun for each pipeline run (or at least there was locally), so I would have expected that to be about the same responsiveness
[14:17:36] yep, there should be, not sure why it's so slow
[14:18:02] the logs I saw showed more than 3 mins, which does not seem to match the number of builds it would take at 15 sec each
[14:18:04] can you try the same query for pipeline just out of curiosity
[14:18:51] https://www.irccloud.com/pastebin/uIcnXYCH/
[14:19:05] had to change the query a bit (different label), but took way less time
[14:19:18] (still not awesome time, but manageable)
[14:20:07] https://www.irccloud.com/pastebin/tbmOtWlq/
[14:20:19] 1-1 mapping between them
[14:21:05] hmmm.... I'm thinking it might be related to crd versioning; it might be trying to transform the pipelineruns from one version to another when fetching them
[14:21:42] I'd have to refresh my memory on how k8s does filtering, but perhaps we can add status/startTime to the filter to speed it up
[14:21:57] hhmmm
[14:21:59] https://www.irccloud.com/pastebin/SCjDdrC3/
[14:22:41] that might be it
[14:22:52] https://www.irccloud.com/pastebin/skwavUPn/
[14:23:25] both of them hit the webhook
[14:23:26] https://www.irccloud.com/pastebin/WPlljGKg/
[14:24:13] * dcaro has had bad experiences with crd versioning
[14:24:15] that could definitely explain some things
[14:24:56] pipelineruns also have it though
[14:24:58] https://www.irccloud.com/pastebin/FuLuvGhv/
[14:25:12] https://www.irccloud.com/pastebin/pxvbxtUb/
[14:25:22] maybe they are simpler to transform?
[14:25:31] I'll open a task to dump all this in
[14:25:39] (might be a red herring too)
[14:26:04] there's quite a lot of 'stuff' in a taskrun
[14:29:22] T408125
[14:29:23] T408125: [builds-api] listing deployments gets very slow when querying taskruns - https://phabricator.wikimedia.org/T408125
[14:33:43] we might have to run this? https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/master/USER_GUIDE.md
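A sketch of how the CRD-versioning suspicion above can be checked from the cluster. The CRD names are Tekton's standard ones and the fields are standard apiextensions.k8s.io/v1 fields, so nothing here is specific to the Toolforge setup:

```
# Hedged sketch: show served/storage versions, the conversion strategy, and the
# storedVersions list for the two Tekton CRDs. A Webhook conversion strategy
# plus an old version still present in status.storedVersions is the situation
# the storage-version migrator linked above is meant to clean up.
for crd in taskruns.tekton.dev pipelineruns.tekton.dev; do
  echo "== ${crd}"
  kubectl get crd "${crd}" \
    -o jsonpath='{range .spec.versions[*]}{.name}{" served="}{.served}{" storage="}{.storage}{"\n"}{end}'
  kubectl get crd "${crd}" \
    -o jsonpath='conversion={.spec.conversion.strategy} storedVersions={.status.storedVersions}{"\n"}'
done
```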
[14:37:55] andrewbogott: thank you! I did poke around a little since I found the installer paused; we're back at the grub rescue prompt now and I'll let it be since I'm out of ideas
[14:38:08] andrewbogott: did I get it right that a reimage with --os bullseye works as expected?
[14:42:02] godog: I haven't tried bullseye recently but bookworm definitely works
[14:42:10] I'm trying a second 4-drive server to see if it's reproducible.
[14:42:43] andrewbogott: ah yes my bad! I meant bookworm
[14:43:09] ok so reimage with --os bookworm is known-good for cloudcontrol2010-dev (?)
[14:43:25] correct
[14:43:48] the task includes grub-installer logs for bookworm, they're very different and generally happy
[14:44:14] fascinating
[14:44:31] with the paused install you can re-run the installer yourself and see it fail in the same way
[14:44:46] on trixie it complains about many things
[14:45:21] heh indeed, I have seen grub complaining on preseed-test too, though then actually working at reboot
[14:46:22] with those same errors?
[14:46:59] I think we're running grub-installer four times (once for each drive) so I would confirm that it's complaining the same number of times with preseed-test
[14:47:52] I'll re-run preseed-test and confirm
[14:49:34] one thing I haven't tried with cloudcontrol2010-dev is doing a thorough wiping of the drives to make sure there isn't some latent partition cruft messing with us.
[14:49:44] although w/out a raid controller I'm not sure I know how to properly wipe the drives
[14:50:35] I did a dd if=/dev/zero of= bs=512 count=2M just to make sure in one of the earlier tests, nothing changed tho
[14:50:59] oh great
[14:51:13] probably good enough, at least to nuke the relevant metadata, I'm 98% sure
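For reference, a hedged sketch of a slightly more thorough software-only wipe than the dd mentioned above, for hosts without a hardware RAID controller. The device names are placeholders, and the availability of wipefs, mdadm and sgdisk in that environment is an assumption:

```
# Hedged sketch, not what was actually run on cloudcontrol2010-dev.
# /dev/sdX is a placeholder; repeat for each disk, and for any old md member
# partition (e.g. /dev/sdX1) left over from a previous install.
wipefs --all /dev/sdX                       # drop filesystem/RAID/LVM signatures
mdadm --zero-superblock /dev/sdX1 || true   # clear leftover md superblocks on members
sgdisk --zap-all /dev/sdX                   # zap GPT/MBR structures at both ends of the disk
```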
[14:51:59] opened T408127 to track the stored version fixes
[14:52:00] T408127: [bulids-builder,tekton] upgrade taskuns to v1 from v1beta1 in storage and delete v1beta1 from stored versions - https://phabricator.wikimedia.org/T408127
[14:53:26] godog: I'm reimaging maps-test2002 to see if I can reproduce this on different hardware. I just noticed that it's currently set to raid 5 though, so if this succeeds I'll try again with raid 10.
[14:53:45] does anyone happen to know if the source draw.io file for https://wikitech.wikimedia.org/wiki/File:Toolforge_k8s_network_topology.png is stored somewhere?
[14:53:51] andrewbogott: ok!
[14:55:16] taavi: no idea
[14:56:54] taavi: I don't know but it would be fine to email arturo and ask if it's still on his laptop.
[14:59:30] Damianz: quick question, have you had issues with missing logs lately? have you noticed any difference?
[15:00:04] godog: grub works on maps-test2002 with raid5. So next step is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198344
[15:00:06] ooh and meeting
[15:00:07] I seem to have been okta'd, joining the meeting in a second
[15:00:39] andrewbogott: good luck
[15:00:53] dcaro: I've been using kubectl logs recently so haven't noticed, was sort of waiting for logs-api
[15:02:01] ack
[15:04:39] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198349
[15:53:06] godog: I can't reproduce the issue on maps-test2002, can't decide if that's good news or bad news. But the next stab in the dark I want to try is switching cloudcontrol2010-dev to raid5, just in case it works around the issue and we can stop thinking about it.
[15:53:12] (assuming you are also out of ideas)
[15:53:39] andrewbogott: yes, might as well eliminate as many variables as we can
[15:54:31] given the latest tests my money is on some kind of kernel and/or hw interaction or bug
[15:56:05] yeah -- given that I can't reproduce it on another host it's tempting to just take that server into the woods and leave it there.
[15:56:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198359 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198358/3
[15:59:53] heh, if it is indeed the hw then to me that begs the question of whether other R450 are affected too
[16:01:38] yeah, maybe I need to find another more-similar test host
[16:01:44] since it's clearly not just '4 drives in sw raid'
[16:02:29] +1
[16:02:49] hmpf... for the taskruns migration, I think it might be worth it to just delete the old ones
[16:06:06] I was hoping https://netbox.wikimedia.org/dcim/device-types/208/ would show me a list of servers and their status, no such luck
[16:07:18] andrewbogott: click on "Devices" under "Related Objects"
[16:07:58] omg there it is! thx taavi
[16:09:07] and cloudcontrol100[8,9,10]-dev are on the list. easy!
[16:09:22] \o/
[16:35:32] I think I have a fix for the taskruns issue https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1010, any reviewers?
[16:36:08] (currently running the tests there, though they did not fail before anyhow)
[16:46:29] hmpf... the tests failed because the build timed out trying to fetch stuff from github
[16:46:59] `[step-inject-buildpacks] 2025-10-23T16:42:06.601924249Z wget: can't connect to remote host (185.199.111.133): Operation timed out`
[16:49:29] godog: reproduced the issue on cloudcontrol1008-dev, and raid5 doesn't help
[16:50:00] I made a ticket for hosting the gitlab artifacts a while back, but the same goes for github really (or "just" rebuild the builder image)
[16:50:35] putting everything in the builder image is more complicated than it looks
[16:52:21] gitlab is where most of our source code lives, and if it's down we can do something about it; for github, we can't do much
[16:54:31] I agree, the gitlab issue is more that things rely on the single master so upgrades take it down... but not being able to do much in a managed service sort of sucks
[16:55:38] agree
[17:00:10] who would have thought! https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/packages
[17:00:31] I can put most of the stuff we use in the pipeline there, reduce the github requests considerably
[17:23:06] testing it here https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/76 while github is rate-limiting us
[17:40:12] * dcaro off
[18:10:06] * dhinus off
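The 17:00 messages point at GitLab's package registry as a way to stop fetching build assets from GitHub during builds. A rough sketch of what that could look like, assuming GitLab's generic package registry and a CI job token; the package name, version and file name below are placeholders, not what MR 76 actually does:

```
# Hedged sketch: upload a file to the GitLab generic package registry from a CI
# job, then fetch it later instead of hitting github.com. All names below are
# placeholders; CI_* variables are GitLab CI's predefined ones.
curl --fail --header "JOB-TOKEN: ${CI_JOB_TOKEN}" \
  --upload-file ./buildpacks.tgz \
  "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/build-assets/0.0.1/buildpacks.tgz"

# Later, e.g. from the build pipeline, assuming a publicly readable registry:
curl --fail -LO \
  "https://gitlab.wikimedia.org/api/v4/projects/repos%2Fcloud%2Ftoolforge%2Fbuilds-builder/packages/generic/build-assets/0.0.1/buildpacks.tgz"
```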