[07:44:00] apparently chatgpt still hasn't realized that the 1-hour TTL on *.toolforge.org has expired after the record was changed yesterday
[08:00:19] morning!
[08:03:09] greetings
[08:47:24] hello!
[10:20:46] taavi: this is expected right? summary: Service tools-proxy-10:443 has failed probes (http_toolforge_org_ip4)
[10:21:05] (the url still works for me)
[10:21:29] dcaro: see my !log on -cloud a couple of minutes ago :-)
[10:21:45] I saw yes, just making sure nothing broke
[10:22:12] so yes, expected
[10:22:18] I'll ack the alert
[10:22:24] the alerts will clear as soon as Puppet runs on the prometheus hosts
[10:57:29] * dcaro lunch
[12:19:51] andrewbogott: I'm going to poke T407586 a bit more FYI, hope that's okay
[12:19:51] T407586: latest Trixie image (as of 2025-10-16) grub failure - https://phabricator.wikimedia.org/T407586
[12:26:08] Yes please! There's lots more info on the task about what's going wrong
[12:33:05] ok! yeah the fact that I can't reproduce locally bugs (ah ah) me
[13:13:40] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/290 moving logic to toolforge_deploy_mr
[13:15:37] godog: yesterday I was hacking the apt1002 to make the installer pause at the end before reboot. I can do that again if you want to get a clear view of the server post-install.
[14:06:13] the deployment is running the tests and they pass
[14:06:37] I did see one test fail just before you reverted the change
[14:06:49] https://www.irccloud.com/pastebin/eXicMwpG/
[14:06:53] still running
[14:07:03] " ✗ list build [20796]"
[14:07:26] that's from your run right?
[14:07:32] the logs of the builds-api pod did not show much, except a very high latency for some operations (>3m)
[14:07:41] dcaro: yes from my run 5 mins ago
[14:07:42] graphs show the same yep
[14:07:44] https://usercontent.irccloud-cdn.com/file/rqhInDU6/image.png
[14:08:10] so something was making some api calls very slow, which meant they timed out on one of the layers above
[14:08:31] have a link to the patch that was deployed earlier?
[14:08:32] it's weird that it did not fail the functional tests on deployment though
[14:08:37] yes very weird
[14:08:47] maybe some resource exhaustion that took a few mins after deploy?
[14:08:49] toolforge deploy mr is https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1009
[14:09:06] the builds-api release is https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/releases/0.0.202-20251023122338-895f8556
[14:09:23] the only change in it is https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/144
[14:10:18] what it does is an extra call to get the taskrun for the build, and fetch the image sha, to add it to the image name
[14:10:52] so there's an extra call to k8s tekton, but it should not be that impactful, it's filtering by name too
[14:10:52] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/144/diffs#0cfd90216a04e25257d891d68ee18c147f7f0d8c_220_223
[14:11:00] so no "query everything" kinda call
[14:11:30] (using a labelselector, that should be fast enough I think, annotations are the ones that are more troublesome from what I remember)
[14:11:40] very weird, I have no idea what's going on
[14:12:00] let me try to do that k8s query by hand
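The actual by-hand query and its timings are only visible behind the IRCCloud pastebin links that follow, so here is a minimal sketch of what such a check might look like. The namespace and PipelineRun name are placeholders, and the label key is an assumption based on the label Tekton normally puts on TaskRuns it creates for a PipelineRun:

```
# Hedged sketch, not the exact command from the pastebins: list the TaskRuns
# belonging to one PipelineRun via a label selector, roughly the extra call the
# builds-api change adds per build. Namespace and name below are placeholders.
time kubectl -n example-namespace get taskruns.tekton.dev \
  -l tekton.dev/pipelineRun=example-pipelinerun \
  -o name
```

Comparing the wall-clock time of this against the equivalent pipelineruns query, as done a few messages below, is what narrows the slowness down to the taskruns resource itself.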
[14:13:38] hmm... yep, that might be an issue
[14:13:48] well, it's not superslow though
[14:13:50] https://www.irccloud.com/pastebin/Jp6aSmcc/
[14:14:46] but if you have many builds, and it takes 15s for each build, it multiplies
[14:15:01] hmm.... :oldmanyellsatsky:
[14:15:53] https://www.irccloud.com/pastebin/rBJsdf2f/
[14:16:02] it consistently takes ~15s
[14:16:26] toolsbeta is way smaller, so probably that's why we did not see the issue
[14:16:35] makes sense
[14:16:41] that's a bit annoying, slow enough to matter even with there never being /that/ many builds
[14:17:11] 5 builds might already pass the common 1min timeouts
[14:17:19] as far as I can tell there is 1 taskrun for each pipeline run (or at least there was locally), so I would have expected that to be about the same responsiveness
[14:17:36] yep, there should be, not sure why it's so slow
[14:18:02] the logs I saw showed more than 3 mins, which does not seem to match the number of builds it would take at 15 sec each
[14:18:04] can you try the same query for pipeline just out of curiosity
[14:18:51] https://www.irccloud.com/pastebin/uIcnXYCH/
[14:19:05] had to change the query a bit (different label), but took way less time
[14:19:18] (still not awesome time, but manageable)
[14:20:07] https://www.irccloud.com/pastebin/tbmOtWlq/
[14:20:19] 1-1 mapping between them
[14:21:05] hmmm.... I'm thinking it might be related to crd versioning; it might be trying to transform the pipelineruns from one version to another when fetching them
[14:21:42] I'd have to refresh my memory on how k8s does filtering, but perhaps we can add status/startTime to the filter to speed it up
[14:21:57] hhmmm
[14:21:59] https://www.irccloud.com/pastebin/SCjDdrC3/
[14:22:41] that might be it
[14:22:52] https://www.irccloud.com/pastebin/skwavUPn/
[14:23:25] both of them hit the webhook
[14:23:26] https://www.irccloud.com/pastebin/WPlljGKg/
[14:24:13] * dcaro has had bad experiences with crd versioning
[14:24:15] that could definitely explain some things
[14:24:56] pipelineruns also have it though
[14:24:58] https://www.irccloud.com/pastebin/FuLuvGhv/
[14:25:12] https://www.irccloud.com/pastebin/pxvbxtUb/
[14:25:22] maybe they are simpler to transform?
[14:25:31] I'll open a task to dump all this in
[14:25:39] (might be a red herring too)
[14:26:04] there's quite a lot of 'stuff' in a taskrun
[14:29:22] T408125
[14:29:23] T408125: [builds-api] listing deployments gets very slow when querying taskruns - https://phabricator.wikimedia.org/T408125
[14:33:43] we might have to run this? https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/master/USER_GUIDE.md
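A sketch of how the CRD-versioning suspicion above can be checked from the cluster. The CRD names are Tekton's standard ones and the fields are standard apiextensions.k8s.io/v1 fields, so nothing here is specific to the Toolforge setup:

```
# Hedged sketch: show served/storage versions, the conversion strategy, and the
# storedVersions list for the two Tekton CRDs. A Webhook conversion strategy
# plus an old version still present in status.storedVersions is the situation
# the storage-version migrator linked above is meant to clean up.
for crd in taskruns.tekton.dev pipelineruns.tekton.dev; do
  echo "== ${crd}"
  kubectl get crd "${crd}" \
    -o jsonpath='{range .spec.versions[*]}{.name}{" served="}{.served}{" storage="}{.storage}{"\n"}{end}'
  kubectl get crd "${crd}" \
    -o jsonpath='conversion={.spec.conversion.strategy} storedVersions={.status.storedVersions}{"\n"}'
done
```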
[14:37:55] andrewbogott: thank you! I did poke around a little since I found the installer paused; we're back at the grub rescue prompt now and I'll let it be since I'm out of ideas
[14:38:08] andrewbogott: did I get it right that a reimage with --os bullseye works as expected?
[14:42:02] godog: I haven't tried bullseye recently but bookworm definitely works
[14:42:10] I'm trying a second 4-drive server to see if it's reproducible.
[14:42:43] andrewbogott: ah yes my bad! I meant bookworm
[14:43:09] ok so reimage with --os bookworm is known-good for cloudcontrol2010-dev (?)
[14:43:25] correct
[14:43:48] the task includes grub-installer logs for bookworm, they're very different and generally happy
[14:44:14] fascinating
[14:44:31] with the paused install you can re-run the installer yourself and see it fail in the same way
[14:44:46] on trixie it complains about many things
[14:45:21] heh indeed, I have seen grub complaining on preseed-test too, though then actually working at reboot
[14:46:22] with those same errors?
[14:46:59] I think we're running grub-installer four times (once for each drive) so I would confirm that it's complaining the same number of times with preseed-test
[14:47:52] I'll re-run preseed-test and confirm
[14:49:34] one thing I haven't tried with cloudcontrol2010-dev is doing a thorough wiping of the drives to make sure there isn't some latent partition cruft messing with us.
[14:49:44] although w/out a raid controller I'm not sure I know how to properly wipe the drives
[14:50:35] I did a dd if=/dev/zero of= bs=512 count=2M just to make sure in one of the earlier tests, nothing changed tho
[14:50:59] oh great
[14:51:13] probably good enough, at least to nuke the relevant metadata, I'm 98% sure
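For reference, a hedged sketch of a slightly more thorough software-only wipe than the dd mentioned above, for hosts without a hardware RAID controller. The device names are placeholders, and the availability of wipefs, mdadm and sgdisk in that environment is an assumption:

```
# Hedged sketch, not what was actually run on cloudcontrol2010-dev.
# /dev/sdX is a placeholder; repeat for each disk, and for any old md member
# partition (e.g. /dev/sdX1) left over from a previous install.
wipefs --all /dev/sdX                       # drop filesystem/RAID/LVM signatures
mdadm --zero-superblock /dev/sdX1 || true   # clear leftover md superblocks on members
sgdisk --zap-all /dev/sdX                   # zap GPT/MBR structures at both ends of the disk
```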
[14:51:59] opened T408127 to track the stored version fixes
[14:52:00] T408127: [bulids-builder,tekton] upgrade taskuns to v1 from v1beta1 in storage and delete v1beta1 from stored versions - https://phabricator.wikimedia.org/T408127
[14:53:26] godog: I'm reimaging maps-test2002 to see if I can reproduce this on different hardware. I just noticed that it's currently set to raid 5 though, so if this succeeds I'll try again with raid 10.
[14:53:45] does anyone happen to know if the source draw.io file for https://wikitech.wikimedia.org/wiki/File:Toolforge_k8s_network_topology.png is stored somewhere?
[14:53:51] andrewbogott: ok!
[14:55:16] taavi: no idea
[14:56:54] taavi: I don't know but it would be fine to email arturo and ask if it's still on his laptop.
[14:59:30] Damianz: quick question, have you had issues with missing logs lately? have you noticed any difference?
[15:00:04] godog: grub works on maps-test2002 with raid5. So next step is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198344
[15:00:06] ooh and meeting
[15:00:07] I seem to have been okta'd, joining the meeting in a second
[15:00:39] andrewbogott: good luck
[15:00:53] dcaro: I've been using kubectl logs recently so haven't noticed, was sort of waiting for logs-api
[15:02:01] ack
[15:04:39] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198349
[15:53:06] godog: I can't reproduce the issue on maps-test2002, can't decide if that's good news or bad news. But the next stab in the dark I want to try is switching cloudcontrol2010-dev to raid5, just in case it works around the issue and we can stop thinking about it.
[15:53:12] (assuming you are also out of ideas)
[15:53:39] andrewbogott: yes, might as well eliminate as many variables as we can
[15:54:31] given the latest tests my money is on some kind of kernel and/or hw interaction or bug
[15:56:05] yeah -- given that I can't reproduce it on another host it's tempting to just take that server into the woods and leave it there.
[15:56:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198359 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198358/3
[15:59:53] heh, if it is indeed the hw then to me that begs the question of whether other R450 are affected too
[16:01:38] yeah, maybe I need to find another more-similar test host
[16:01:44] since it's clearly not just '4 drives in sw raid'
[16:02:29] +1
[16:02:49] hmpf... for the taskruns migration, I think it might be worth it to just delete the old ones
[16:06:06] I was hoping https://netbox.wikimedia.org/dcim/device-types/208/ would show me a list of servers and their status, no such luck
[16:07:18] andrewbogott: click on "Devices" under "Related Objects"
[16:07:58] omg there it is! thx taavi
[16:09:07] and cloudcontrol100[8,9,10]-dev are on the list. easy!
[16:09:22] \o/
[16:35:32] I think I have a fix for the taskruns issue https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1010, any reviewers?
[16:36:08] (currently running the tests there, though they did not fail before anyhow)
[16:46:29] hmpf... the tests failed because the build timed out trying to fetch stuff from github
[16:46:59] `[step-inject-buildpacks] 2025-10-23T16:42:06.601924249Z wget: can't connect to remote host (185.199.111.133): Operation timed out`
[16:49:29] godog: reproduced the issue on cloudcontrol1008-dev, and raid5 doesn't help
[16:50:00] I made a ticket for hosting the gitlab artifacts a while back, but the same goes for github really (or "just" rebuild the builder image)
[16:50:35] putting everything in the builder image is more complicated than it looks
[16:52:21] gitlab is where most of our source code lives, and if it's down we can do something about it; for github, we can't do much
[16:54:31] I agree, the gitlab issue is more that things rely on the single master so upgrades take it down... but not being able to do much in a managed service sort of sucks
[16:55:38] agree
[17:00:10] who would have thought! https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/packages
[17:00:31] I can put most of the stuff we use in the pipeline there, reduce the github requests considerably
[17:23:06] testing it here https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/76 while github is rate-limiting us
[17:40:12] * dcaro off
[18:10:06] * dhinus off
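The 17:00 messages point at GitLab's package registry as a way to stop fetching build assets from GitHub during builds. A rough sketch of what that could look like, assuming GitLab's generic package registry and a CI job token; the package name, version and file name below are placeholders, not what MR 76 actually does:

```
# Hedged sketch: upload a file to the GitLab generic package registry from a CI
# job, then fetch it later instead of hitting github.com. All names below are
# placeholders; CI_* variables are GitLab CI's predefined ones.
curl --fail --header "JOB-TOKEN: ${CI_JOB_TOKEN}" \
  --upload-file ./buildpacks.tgz \
  "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/build-assets/0.0.1/buildpacks.tgz"

# Later, e.g. from the build pipeline, assuming a publicly readable registry:
curl --fail -LO \
  "https://gitlab.wikimedia.org/api/v4/projects/repos%2Fcloud%2Ftoolforge%2Fbuilds-builder/packages/generic/build-assets/0.0.1/buildpacks.tgz"
```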