[12:24:39] andrewbogott: FYI I'm testing a firmware upgrade on cloudcontrol2010-dev
[13:21:40] godog: sounds good, that was going to be my next step as well :)
[13:23:07] dhinus: ready for me to reboot those clouddumps hosts?
[13:23:24] andrewbogott: go ahead
[13:23:46] maybe worth a quick note to cloud-admin and/or irc?
[13:23:56] I expect tools will hang or crash during the reboot
[13:24:01] * andrewbogott runs sudo cookbook sre.hosts.reboot-single --task-id T407110 clouddumps1001.wikimedia.org
[13:25:12] ideally this is something we would notify users one or two days in advance... but I don't expect the disruption to last for more than a few mins so it depends on our SLOs (if we had them) :P
[13:25:48] godog: let's see if NFS clients are more resilient on the new version
[13:26:36] new version of what?
[13:26:41] we didn't upgrade the clients did we? just the homedir nfs server
[13:27:01] so I wouldn't expect any change unless there was some weird interaction between multiple nfs connections
[13:28:00] andrewbogott is right, only the server version was upgrade, so clouddumps are completely unaffected
[13:28:05] *upgraded
[13:28:46] only the tools-nfs server version was upgraded, IIUC
[13:28:50] andrewbogott: yeah the bios/idrac upgrades did nothing, and neither did a newer version of grub, I'll try with an older version
[13:28:56] yes that's correct, only tools-nfs
[13:30:31] godog: an older version of grub? How are you doing that?
[13:31:13] not yet sure if wedging bookworm's grub sideways into trixie will work, that's the idea tho
[13:32:33] do you need to build a custom debian installer for that? Or is there some magic you can do in the busybox shell?
[13:33:51] plan is to stop d-i from rebooting like you did the other day, then via install-console chroot /target and try download/install bookworm grub from there
[13:34:49] makes sense. I don't think I've tried downloading/installing extra stuff mid install but I guess the installer must have a network connection to work at all
[13:35:20] dhinus: I think I'll just do the other one in a hurry, and then maybe we should reboot all the workers just to be safe?
[13:36:05] andrewbogott: I quickly checked I can access the dumps from a tools bastion and it looks good
[13:36:20] let's do the second one and then evaluate the situation
[13:36:26] maybe we don't need mass reboots
[13:36:34] ok. toolschecker fired so something is unhappy but maybe it will recover on its own
[13:37:42] just out of curiosity I launched "ls /mnt/nfs/dumps-clouddumps1002.wikimedia.org/" on tools-bastion-14
[13:37:46] it's hanging as expected
[13:38:01] let's see what happens when the server comes back up
[13:40:10] ls timed out with "Input/output error"
[13:41:38] it's back
[13:41:49] timeout seems kind of good, better than hanging forever
[13:42:00] my second "ls" hanged and now completed successfully
[13:42:16] which seems very good
[13:42:32] so... maybe it's fine? The dumps mounts are r/o right? Maybe less disaster prone somehow.
[13:42:57] yes, "ro" at least on the bastion
[13:43:16] ro and soft? Or just ro?
[13:44:05] ro and soft
[13:44:32] I'm not sure that's true for all clients though
[13:45:03] Seems like it's probably the same? There's no reason for a r/w mount on dumps and it would be work to make it differ by client
[13:45:12] anyway... seems like everything is good?
[13:45:17] yeah looking good
[13:46:41] lots of 'cloudelastic' hosts left on that task but don't think they've ever been up to us to maintain.
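
A rough sketch of the grub plan described at 13:33, assuming the d-i install-console shell; the mirror URL, package name and target disk are placeholders, not the commands actually run:

    # Run from install-console after stopping d-i from rebooting; the lines
    # after "chroot" are typed interactively inside the /target chroot.
    chroot /target /bin/bash
    echo 'deb http://mirrors.wikimedia.org/debian bookworm main' \
        > /etc/apt/sources.list.d/bookworm.list   # mirror URL is an assumption
    apt-get update
    # "pkg/suite" pins the bookworm version; --allow-downgrades is needed since
    # trixie already ships a newer grub. Use grub-efi-amd64 instead on UEFI hosts.
    apt-get install --allow-downgrades grub-pc/bookworm
    grub-install /dev/sda && update-grub          # /dev/sda is a placeholder

And a minimal sketch of the ro/soft mount behaviour discussed above, with a hypothetical export path and made-up timeo/retrans values rather than the real Toolforge configuration. A soft mount returns an I/O error once its retry budget is exhausted instead of blocking forever like a hard mount, which matches the "Input/output error" seen from ls during the reboot:

    # Hypothetical client mount; the real options live in puppet and may differ.
    sudo mount -t nfs -o ro,soft,timeo=300,retrans=3 \
        clouddumps1002.wikimedia.org:/srv/dumps \
        /mnt/nfs/dumps-clouddumps1002.wikimedia.org
    # Confirm what a client actually negotiated:
    grep dumps-clouddumps /proc/mounts
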
[13:46:57] So we're good until the inevitable November reboots :)
[13:47:00] no cloudelastic are not ours despite the cloud* name
[13:47:18] I checked a random toolforge k8s nfs worker and the dumps mounts are working fine
[13:50:01] great, thanks for checking
[13:53:35] the toolforge grafana dashboard is also looking good https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview
[13:55:10] I see the number of container restarts spiked during the reboots and is still a bit high
[13:55:46] let's give it some time to stabilize
[14:25:23] still a bit spikey but seems to be improving
[14:26:28] https://usercontent.irccloud-cdn.com/file/Hi6DGYFV/Screenshot%202025-10-24%20at%2016.25.59.png
[14:29:27] that's a lot of spikes
[14:30:59] godog: not a big deal but for your next round of experiments can you move to cloudcontrol1008-dev? I'm working on a different trixie issue in parallel and it's easier if I have cloudcontrol2010-dev up and running.
[14:34:26] andrewbogott: for sure, I'm done for now/today actually, 2010-dev is all yours
[14:34:45] ok! have a good weekend
[14:36:24] cheers -- wrapping up stuff then off for real
[14:37:19] not sure yet exactly why but this particular hw configuration makes grub_lvm_detect use a ton of memory apparently
[15:02:54] * bd808 wonders if he should remove toolschecker from his stalkwords
[16:01:28] toolforge containers restart have _not_ stabilized
[16:01:44] I suspect the spikes are from a small number of tools that are stuck in some kind of restart loop
[16:02:06] andrewbogott: I would try to dig deeper in the stats at https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview
[16:02:18] and see if you can identify the tools that are causing the spikes in the "total" value
[16:02:34] unfortunately I have to go offline now, I can try to have a look later/tomorrow if it's not resolved
[16:02:52] * dhinus off
[16:03:02] dhinus: I will look. Is there a serious downside to just resarting workers?
[16:03:07] *restarting*
[16:05:42] no, you can try that too
[16:07:47] also, silly question: on cloudcumin how do I get cumin to glob cloud hosts? I've done it 1000 times but can't do it today apparently
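
One possible way to chase the restart spikes from the k8s side rather than the Grafana "total" value; a sketch only, assuming admin kubectl access on the Toolforge cluster and that the first container's restartCount is representative:

    # List the pods with the highest restart counts across all namespaces.
    kubectl get pods --all-namespaces \
        --sort-by='.status.containerStatuses[0].restartCount' | tail -n 20
    # Then inspect a suspect pod's events and last termination state;
    # <toolname> and <pod> are placeholders.
    kubectl describe pod -n tool-<toolname> <pod>
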