[08:27:49] greetings
[08:31:52] morning
[08:46:12] from -sre we have a small problem with wdqs1028 that got reimaged and changed IP and can't mount the NFS dumps from the clouddumps hosts. AFAICT ferm is fine (it resolved the new IP, as shown in iptables too) but /etc/exports has the FQDN, so exportfs most likely has the old resolved IP.
[08:46:45] do we know what incantation of exportfs is safe to run to re-resolve the FQDNs in /etc/exports without disrupting the existing mounts from all the other clients?
[08:48:24] volans: I'm tempted to just change everything in /etc/exports to use IPs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240605
[08:50:29] taavi: do we reload nfs when the list changes in /etc/exports?
[08:50:48] I don't see a Notify
[08:50:52] in
[08:50:53] file { '/etc/exports':
[08:51:19] otherwise using IPs would not change much I guess
[08:51:49] hmmm
[08:51:59] that seems like a problem in itself
[08:52:23] I was looking at wikitech to see if we had documented the "safe" way to reload it but couldn't find it
[08:53:01] and the manpage suggests that -r might do the trick but also doesn't say explicitly that it is safe
[08:57:42] so looking at /var/lib/nfs/etab (mentioned by the exportfs(8) man page) and at /proc/fs/nfsd/exports there's a single entry per IP address, which to me suggests that adding and removing hosts would not affect other hosts
[08:58:16] so we could try the IP patch and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240609
[08:59:41] do we need -a?
[08:59:51] it seems that -r might be enough?
[09:00:46] but anyway let's leave the puppet patches for after we find the right incantation and test it ;)
[09:01:08] so at least we know it's the right command :D
[09:06:44] we could ofc start with clouddumps1002, which should be less used I think
[09:08:21] so 1002 is already serving these prod realm nfs clients :P
[09:08:29] (with 1001 serving nfs to cloud vps)
[09:08:54] so both used
[09:08:55] :/
[09:09:43] yep.
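The inspection and re-export steps being discussed can be sketched roughly as below. This is a hedged sketch only: it must run as root on the NFS server, and it reflects the behaviour observed later in this conversation rather than a documented guarantee.

```shell
# Inspect what the kernel currently exports (one entry per
# resolved client address):
cat /var/lib/nfs/etab
cat /proc/fs/nfsd/exports

# exportfs -r resynchronises the kernel export table with
# /etc/exports, adding/removing entries per client, so other
# clients' existing mounts should stay untouched; -a covers all
# export points, -v shows what was done:
exportfs -rav
```

Caveat (confirmed later in the log): `-r` does not force fresh DNS resolution of FQDNs that are already exported, which is why the conversation moves toward putting plain IPs in /etc/exports.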
I want to move all NFS clients to 1001, since 1002 is already pretty heavily used thanks to it serving http clients, but haven't gotten around to doing that just yet
[09:09:50] (that is T416677)
[09:09:51] T416677: Move internal dumps NFS clients to clouddumps1001 - https://phabricator.wikimedia.org/T416677
[09:13:25] but I think it's relatively safe to experiment on 1002 now, since those nfs clients are not that heavily used
[09:15:04] so I would maybe just try merging the patch to change the exports file to IPs, and then see how to apply that on 1002?
[09:16:10] I was more inclined to first try to make nfs re-resolve what it already has on file, just doing exportfs -r on 1002 and seeing if others break and if a puppet run on wdqs1028 fixes it
[09:20:44] what do you think taavi? I'm logged in on wdqs1029 (a good host) in the mounted directory and I can try to see how disruptive it is there when we do something
[09:21:41] sure
[09:22:05] but I have a feeling exportfs won't see it changed and will not re-export it, since the hostnames seem to make it all the way to the kernel (according to /proc/fs/nfsd/exports)
[09:23:50] yes, it's totally possible, just wanted to exclude it; if it doesn't work we change to IPs and then retry exportfs -r, which should export the "new" ones with IPs and leave the stale hostnames I guess (how does it handle the duplicates between IPs and FQDNs that resolve to the same IPs?)
[09:24:14] and those will be cleared at the next nfs reload I guess (but that might be disruptive)
[09:24:52] let's give that a try then
[09:26:46] ack, I'm ready on wdqs1029, do you wanna do the honors on clouddumps1002?
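The client-side disruption check mentioned above ("I can try to see how disruptive it is") could look something like this on wdqs1029. The mountpoint path is an assumption, not taken from the log:

```shell
# Hypothetical watch loop, run on the NFS client (wdqs1029)
# while exportfs is re-run on the server. The path is a guess.
MNT=/mnt/dumps
while true; do
    # ls will hang or fail if the server drops our export
    if ls "$MNT" >/dev/null 2>&1; then
        echo "$(date -Is) mount OK"
    else
        echo "$(date -Is) mount BROKEN"
    fi
    sleep 1
done
```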
[09:27:20] I ran `sudo exportfs -ra` there, it exited pretty much immediately and printed nothing
[09:27:46] all good on 1029
[09:27:55] nothing interesting from the debug option: https://phabricator.wikimedia.org/P88894
[09:27:57] running puppet on wdqs1028
[09:28:47] failed as before
[09:30:02] +1ed the IP patch
[09:30:45] taavi: fwiw the content of /proc/fs/nfsd/exports "changed", at least in order, on 1002
[09:30:50] so your command "did something" :D
[09:31:01] huh :D
[09:31:15] but yeah, a useless change
[09:31:45] running puppet
[09:32:37] I'm wondering if, to be slightly safer, we should leave both IPs and FQDNs for now, but it would be messy
[09:33:22] ready for me to run exportfs again?
[09:33:49] ready
[09:34:07] done
[09:34:25] all good on 1029
[09:34:36] re-running puppet on 1028
[09:36:08] mounted 1002 just fine
[09:36:41] before doing 1001 we could do the other patch so that we can test it works
[09:37:01] sure
[09:37:58] running pcc
[09:39:43] https://puppet-compiler.wmflabs.org/output/1240609/8072/
[09:39:46] +1ed
[09:40:25] merging
[09:41:30] is it expected that all clients are connected via v4?
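The FQDN-to-IP change merged above amounts to something like the following in /etc/exports. The path, the export options, and the client IP are made up for illustration; only the hostname appears in the log:

```
# before: the FQDN is resolved only when exportfs runs, so a
# reimage that changes the client's IP leaves a stale export
/srv/dumps  wdqs1028.eqiad.wmnet(ro,sync,no_subtree_check)

# after: IP-based, unaffected by DNS changes at reimage time
/srv/dumps  10.64.0.123(ro,sync,no_subtree_check)
```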
[09:42:34] not really
[09:42:51] from /proc/fs/nfsd/exports I mean
[09:43:45] * volans running puppet on 1028, should fix and be clean
[09:43:47] puppet run on 1001 was fine and ran the command as expected
[09:43:54] yeah, saw the export
[09:47:54] per https://manpages.debian.org/trixie/nfs-kernel-server/exportfs.8.en.html#Exporting_Directories I wonder if we should be using brackets around the IPv6 addresses :/
[09:48:12] checking
[09:52:20] yes, it looks like we have to
[09:52:22] what a syntax
[09:52:37] the range syntax looks horrible
[09:52:52] yeah, [2a02:ec80:a100:4000::]/64 IIUC
[09:53:02] that means we also have to patch the other networks added
[09:53:23] from cloud_networks_public
[09:59:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240621
[10:01:35] great
[10:05:15] ran puppet on 1002, /proc/fs/nfsd/exports still only lists ipv4 things :(
[10:07:33] maybe because they are already connected?
[11:28:27] "Object storage quota by 'objects' is 80.84% full for project tools"
[11:29:26] I'm storing too much stuff into loki?
[11:30:42] the graph is... interesting :) https://usercontent.irccloud-cdn.com/file/iIYryoHZ/Screenshot%202026-02-19%20at%2012.30.17.png
[11:32:10] wait, didn't we increase it to 200GB?
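The exports(5) bracket syntax for IPv6 discussed earlier (IPv6 addresses must be wrapped in `[]`, e.g. `[2a02:ec80:a100:4000::]/64`) can be captured in a tiny helper. This is a sketch; the function name and the sample addresses are made up:

```shell
# Hypothetical helper: render an exports(5) client spec,
# bracketing IPv6 addresses as exportfs(8) requires.
format_client() {
    addr=$1
    prefix=$2
    case $addr in
        *:*) spec="[$addr]" ;;   # IPv6: must be wrapped in []
        *)   spec=$addr ;;       # IPv4 or hostname: as-is
    esac
    if [ -n "$prefix" ]; then
        printf '%s/%s\n' "$spec" "$prefix"
    else
        printf '%s\n' "$spec"
    fi
}

format_client 2a02:ec80:a100:4000:: 64   # -> [2a02:ec80:a100:4000::]/64
format_client 10.64.0.0 22               # -> 10.64.0.0/22
```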
[11:32:32] from horizon I see tools-loki at 51GB and infra-tracing at 25GB
[11:34:05] "2026-02-09 15:10:56 dcaro volans: I'll bump the quota to 200G (2x)"
[11:34:14] this is the objects quota, not the size quota
[11:34:18] ahhhh
[11:35:19] we can bump it as well, but I'm still worried if it grows almost-linearly
[11:35:29] I see 619621 for tools-loki and 704477 for infra-tracing loki
[11:35:47] I would expect it to grow linearly until it hits the retention age (two months iirc)
[11:36:12] correct, but we changed a bit what we store a week ago
[11:36:21] *two weeks ago
[11:36:47] and I think we should have reached the two months from the initial data gathering, let me check
[11:37:53] yep, 2 months have passed from the initial enabling
[11:40:33] time flies when you're having "fun" :)
[11:46:05] eheh, this morning I've enabled tracing on the other VMs outside of k8s
[11:46:36] yep, I saw yesterday, but it's been growing non-stop since december
[11:46:58] I'll double check later that the cleanup after 60d is working
[11:47:18] thanks, I'd say let's check that first and bump the quota later only if needed
[11:47:34] now I need to fix a bug I noticed, to prevent tracing activity in their own homes also for the outside-k8s case (small bug fix, will be quick)
[11:50:46] oh, the quota percentage going down is only because of quota bumps (added a total-number-of-objects graph where that is clearer)
[11:53:39] toolsbeta is using a lot of space....
[11:53:58] brb
[11:56:15] from horizon, toolsbeta has 3MB and 2946 objects for infra-tracing-loki and 47MB and 4645 objects for toolsbeta-loki
[11:56:21] are you sure?
[11:57:16] wait a second, are we talking about the tools or the tools-logging quota?
[11:57:24] because loki lives in the *-logging ones
[11:57:55] I see 76GB for harborstorage on tools and 93GB on toolsbeta
[12:05:08] wait, this alert is about the tools project :D
[12:05:13] so that's about harbor, not about loki
[12:05:23] ehehe ^^
[12:05:47] I guess we got to the same conclusion :)
[13:39:21] yep, toolsbeta is not doing any cleanups, looking; tools is doing some, but looking also
[14:30:40] the quick fix I mentioned above would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240704