[07:05:01] puppetserver unattended jdk upgrades bites again
[07:06:47] I was looking, that's the puppet errors and puppet enc failing?
[07:10:00] yes
[07:10:04] fix is to restart puppetserver
[07:10:16] yes
[07:10:18] ack
[07:10:25] done
[07:10:53] T377803
[07:10:54] T377803: Cloud VPS: 2024-10-22 cloud-wide puppet problem related to java update - https://phabricator.wikimedia.org/T377803
[07:11:01] i restarted tools/toolsbeta/metricsinfra/cloudinfra puppetservers
[07:11:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140572/
[07:14:12] trying to run PCC but that's not working
[07:17:36] there we go
[07:17:50] merged
[07:23:06] my internet is kinda intermittent for some reason
[07:23:18] (fyi. if I don't reply)
[10:03:16] how does one run a global cumin against all VMs these days?
[10:03:22] there's still some puppetservers to fix
[10:59:37] taavi: I think "sudo cumin 'O{*}'" should work from cloudcumin1001, but I haven't used it recently
[11:10:31] dhinus: no, i got an openstack 401
[11:18:06] taavi: I think something must have broken with the recent spicerack updates, I'll have a look
[11:18:46] it was relying on a manually applied patch T346453
[11:18:46] T346453: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453
[13:28:40] one of the tools-static-15 nginx worker processes is stuck in D
[13:30:56] did this happen before? I only remember D processes in NFS workers
[13:33:04] the probe seems to be up now
[13:33:20] yeah the probe is flapping
[13:33:29] i assume because one working worker isn't enough to serve all the things
[13:33:36] makes sense
[13:33:53] i wonder if we could scale up the number of workers beyond the number of cores
[13:34:57] hmm worth a try, at least
[13:35:32] since i don't think serving static files from nfs is particularly cpu heavy
[13:38:02] makes sense to me. any theory on what's causing the D state though?
[13:41:58] nothing more specific than "nfs being nfs"
[13:42:06] trying to attach a debugger to that process is also hanging
[13:55:26] rebooting the instance
[14:43:40] Is someone else already looking at paws-nfs-1?
[14:45:45] I'm in my room now have some time before dinner, is the paws thing still an issue?
[14:46:32] the site itself looks OK to me but there's a puppet alert.
[14:46:35] yes, I can have a look, might be the same java autoupgrade
[14:46:58] ok -- I'm also happy to fix, just didn't want to mess with someone else's work in progress.
[14:47:06] If it's a one-liner go ahead and try, and I'll take over if that doesn't fix it.
[14:48:38] yep, just restart puppetserver
[14:49:06] (the service)
[14:50:03] well that's easy. Should I just issue a 'systemctl restart puppetserver' cloud-wide?
[14:50:43] on puppetservers only yep, just in case
[14:50:55] yep that was it
[14:52:00] oh right but cumin is broken
[14:52:27] ummmm dhinus are you already in the process of re-patching cumin?
[14:52:57] yep, but I was distracted by a few other things
[14:53:07] I see the patch needs rebasing
[14:53:22] let me see if that's easy
[14:53:45] * andrewbogott should probably have another go at getting that patch actually merged
[14:54:10] btw. anyone up for reviewing https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/124 ? (and the cli side)
[14:54:24] andrewbogott: I agreed on a plan with v.olans a few months back (it's detailed in the task)
[14:54:27] there's some users here that would benefit from being able to use the newer buildpacks and such
[14:54:57] (long shot, but if anyone is interested would be nice xd)
[14:59:03] dhinus: which task? T346453?
[14:59:03] T346453: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453
[15:00:03] andrewbogott: yep, in the last few comments there are a set of tests. I tried setting up a test env with devstack but failed
[15:00:28] ok
[15:00:33] devstack is cool and fragile
[15:00:35] in the meantime, I rebased the patch and it rebased cleanly
[15:01:05] I'll apply it to cloudcumin*
[15:04:36] patch applied and working correctly
[15:05:32] nice, thank you!
[15:10:27] I'm restarting all the puppetservers. We're up to 110 alerts for VMs, let's see if that number drops now.
[15:12:44] 👍
[15:59:27] topranks: John is having trouble cleaning up netbox entries for a decom'd server, can you check in on T392686? (Clearly no rush since it's a decom)
[15:59:29] T392686: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686
[16:15:06] dcaro, if you're still around... I'm going to start T393196 unless there's a reason those hosts are still hanging around (my assumption is that it's because I forgot to make a decom ticket.)
[16:15:07] T393196: decommission cloudcephosd100[1-3] - https://phabricator.wikimedia.org/T393196
[16:15:29] 👀
[16:16:34] oh yep, those are not used no, I thought for whichever reason that they were decommissioned already :S
[16:18:23] I still see them in 'osd tree'
[16:21:01] anyway, I will trust in the cookbook
[16:21:59] * andrewbogott runs an errand while ceph rebalances
[16:32:02] I'm around if you need anything, try also telegram if I don't reply, but I have a laptop and I'm doing stuff
[16:32:06] :)
[16:55:12] heh, that errand was not nearly long enough
[20:55:20] andrewbogott: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/226
[21:08:48] Ok! Did it look right, otherwise?
[21:09:32] andrewbogott: yes!
[21:09:38] you may be able to just copy-paste
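
Editor's sketch of the cumin usage discussed above (the global query at 10:59 and the puppetserver restart agreed at 14:50). This is an illustrative reconstruction, not a command taken from the log: the puppetserver host selector is a placeholder, since the exact query used was never shown.

    # Run from cloudcumin1001, as suggested at 10:59; 'O{*}' matches every Cloud VPS VM.
    sudo cumin 'O{*}' 'uptime'
    # The JDK-upgrade fix was applied "on puppetservers only yep, just in case";
    # <puppetserver-selector> is a placeholder for whatever query matches those hosts.
    sudo cumin '<puppetserver-selector>' 'systemctl restart puppetserver'

Note that per the 11:10-15:04 thread, the OpenStack backend only works on cloudcumin* once the T346453 patch has been (re)applied.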
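
Editor's sketch of the nginx tuning floated at 13:33-13:38 (running more workers than cores so the NFS-backed static server keeps serving while one worker is stuck in D state). The value 8 and the file location are illustrative assumptions, not what was actually deployed on tools-static:

    # nginx.conf (on tools-static this would be puppet-managed; path illustrative)
    # 'worker_processes auto;' pins the worker count to the CPU core count; a fixed
    # higher number leaves spare workers when one blocks on NFS I/O.
    worker_processes 8;

The rationale given at 13:35 is that serving static files from NFS is I/O-bound rather than CPU-bound, so the extra workers mostly sit in blocking reads instead of competing for CPU.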