[08:24:54] the openstack response time alerts are https://gerrit.wikimedia.org/r/c/operations/alerts/+/1154163
[10:12:28] I opened T396199 to track the toolsdb replication alert
[10:12:30] T396199: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199
[10:12:40] usual problem with a slow query
[10:16:21] I'm gonna stop the replication, manually run the query, then restart replication
[11:32:06] stopping the replication is not as easy as it should be... more details in the task
[11:32:54] I'll do some more research before resorting to a "kill -9" like I did last time
[11:33:54] anyone have interest in looking at the tools nfs usage alert or should I?
[12:37:32] topranks, dcaro, there is an outstanding change for cloudcephosd1048, it shouldn't change anything functionally but I wanted to check with you before pushing it https://www.irccloud.com/pastebin/TcJN5ok4/
[12:38:35] XioNoX: Ack thanks, yeah you can push that, I may have forgotten to run homer after a change yesterday, sry
[12:42:06] all done!
[12:42:12] ty!
[12:54:27] started a tmux on tools-nfs-2 to locate disk hogs again
[13:52:35] https://phabricator.wikimedia.org/T396220
[13:54:50] taavi: I ran a logfile cleanup script last night on tools-nfs-2 but it maybe didn't do much
[13:55:41] andrewbogott: I generally don't find that script very helpful because it'll prune some files but do nothing about the tools generating that many logs or other stuff in the first place
[13:56:12] and there's a lot of stuff it just won't find because it's spread out across a lot of files, like https://phabricator.wikimedia.org/T396222
[13:56:27] yep, definitely diminishing returns if we run it frequently.
[13:56:52] It's just where I start because it can do its business while I'm sleeping :)
[14:15:01] taavi: I am confusing myself about floating IP routing and firewalls.
It seems like when I connect to a VM (from within the cloud) on a floating IP vs on an internal IP, the src IP is different. Does that sound right to you? Is traffic to a floating IP natted even when it's from VM->VM?
[14:15:15] (context is me trying to make https://wikitech.wikimedia.org/wiki/News/2025_Floating_IP_Routing_Change)
[14:16:33] an update on toolsdb replag (T396199): I did not find any way to stop replication cleanly, so I'm gonna force a shutdown of the replica
[14:16:34] T396199: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199
[14:18:13] andrewbogott: traffic to floating IPs always involves some form of NAT, as the target VM never {sees, listens on} the floating IP but rather the private IP
[14:20:35] that makes sense.
[14:20:55] So firewall/secgroup rules need to take that into account if you want the same behavior from within and without.
[14:21:12] I don't offhand remember the behaviour when a VM with a floating IP talks to another VM with a floating IP, but having different behaviour when talking to a floating IP versus a private IP makes sense since the former goes through the neutron router while the latter stays inside the subnet
[14:22:11] andrewbogott: I honestly don't understand your wording of what's going to change on https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_floating_IP_routing_change at all
[14:22:37] yeah, that's because 1) the doc isn't done yet and 2) I'm confused :)
[14:23:10] What's going to change is we're going to stop doing the DNS mutilation, so that DNS lookups inside and outside the cloud will resolve to the same IP, the floating one.
[14:23:28] Which means a few edge cases that were working by accident before will stop working because the src IP will change for that traffic.
[14:23:37] But I need to rewrite the top paragraph entirely.
[14:26:56] And I'm trying to figure out specifically what needs to change in the firewall rules.
But somehow failing to understand tcpdump
[14:27:00] uh oh
[14:27:01] aaa.beta.toolforge.org 185.15.56.238 - - [06/Jun/2025:14:26:38 +0000] "GET / HTTP/2.0" 404 5223 "-" "curl/7.88.1" (backend 172.16.2.161:30000)
[14:27:11] this is from a VM without a floating IP trying to talk to a floating IP
[14:27:20] 185.15.56.238 is the neutron router IP
[14:27:35] so we lose source information entirely there
[14:27:58] yeah, I think that's what I expected...
[14:28:51] The big picture here is: I think the labs-ip-aliaser is an obscure, confusing hack and I want to get rid of it.
[14:29:03] I can easily be convinced to just leave it instead since that requires me to do nothing
[14:29:05] https://phabricator.wikimedia.org/T374129
[14:30:21] alternative proposal: revisit that in a year or two, at which point much more of the cloud-internal traffic will be over v6, which makes this affect far fewer users
[14:32:19] Are there really situations where /all/ traffic will move to v6? Seems like v6 is mostly only useful for cloud-internal things since the outside world isn't really v6 compliant yet.
[14:33:13] this change only affects cloud-internal traffic, not sure how traffic from the outside is relevant here?
[14:33:59] I'm assuming that most servers affected by this change are serving users both inside /and/ outside
[14:34:15] Which means they'll probably continue to think in terms of v4, even for traffic within the cloud
[14:34:19] but maybe I'm missing something
[14:35:25] You're thinking that this will get better because the DNS lookups that are currently mangled will start to return unmangled AAAA records instead
[14:35:31] ?
[14:36:00] i'm thinking that what external clients do is not at all relevant for a change that does not affect anything except how VMs talk to other VMs
[14:36:35] sure, I get that
[14:37:05] But my question is: will users who are serving both external and internal traffic /change/ how they support internal traffic just because ipv6 is an option?
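The access-log line quoted at 14:27:01 shows the problem concretely: for VM->floating-IP traffic the client field holds the neutron router address, not the originating VM, so source attribution is gone. A minimal sketch (Python; the log format is assumed from that single sample line, and the router IP is taken from the conversation) of extracting the client IP and flagging NATed requests:

```python
import re

# Sample line from the Toolforge front proxy log, copied from the chat above.
# Format assumed: "<host> <client-ip> - - [<ts>] \"<request>\" <status> <bytes> ..."
LINE = ('aaa.beta.toolforge.org 185.15.56.238 - - '
        '[06/Jun/2025:14:26:38 +0000] "GET / HTTP/2.0" 404 5223 '
        '"-" "curl/7.88.1" (backend 172.16.2.161:30000)')

NEUTRON_ROUTER = "185.15.56.238"  # router IP mentioned at 14:27:20

LOG_RE = re.compile(
    r'^(?P<host>\S+) (?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+)'
)

def parse(line):
    """Parse one access-log line into a dict, marking SNATed traffic."""
    m = LOG_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    # Once internal VM->floating-IP traffic is SNATed, every such client
    # shows up as the router address, so per-client auditing is impossible.
    d["natted"] = d["client"] == NEUTRON_ROUTER
    return d

info = parse(LINE)
print(info["client"], info["natted"])  # → 185.15.56.238 True
```

This is only an illustration of why "we lose source information entirely": any analytics built on the client column collapses all internal callers into one address.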
[14:37:46] ipv6 only helps with this scenario if users adopt it, I think?
[14:38:15] they don't have to change anything, with both endpoints having v6 support traffic will naturally shift towards v6
[14:38:35] unless those users go hardcoding v4 addresses in their configs, at which point this change won't have any effect either way
[14:39:37] So right now my service running on VM1 looks up VM2.wikimedia.org, gets a v4 address, and contacts it.
[14:39:51] In the future my service will look up VM2.wikimedia.org, get a v6 address, and contact it.
[14:39:55] that's what you're saying, right?
[14:40:22] s/wikimedia.org/wmcloud.org/, but more or less yes
[14:40:42] Oh, you're right.
[14:42:12] So, two questions: 1) why/when will that switch from 'get a v4 address' to 'get a v6 address' happen? And 2) if the v6 address they get is tied directly to a VM, doesn't that take away the value of a floating IP, that being that it's floating?
[14:44:10] 1) once both VMs have been re-created in the dualstack network, and once whoever is running that service has added an AAAA record in addition to the A one (which I'm slowly working on for all of our services)
[14:45:03] 2) see https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_VXLAN_IPv6_migration#Are_there_floating_IPs_for_IPv6_? and the section below
[14:45:59] so, "don't use floating IPs, use cnames"
[14:46:31] so let me try again
[14:46:41] although I only now realized that that won't really work for dual-stack things
[14:46:50] Any user that's supporting external traffic will /also/ have to have a floating v4 address.
[14:47:17] So why would that user bother to support v6 with cnames etc. etc. when they already get what they need with a v4 floating IP and /have/ to have a v4 IP?
[14:47:35] ^ is why I keep bringing up external traffic, does that make sense now?
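The "traffic will naturally shift towards v6" point at 14:38:15 rests on address-selection behaviour: once a name has both A and AAAA records and both hosts are dual-stack, resolvers typically order IPv6 destinations first, so unchanged client code starts using v6. A toy illustration with synthetic lookup results (the addresses and the simplified preference rule are made up for demonstration; real code would call `socket.getaddrinfo()` and rely on its built-in RFC 6724-style ordering):

```python
import socket

# Hypothetical dual-stack results for a VM name; in practice these would
# come from socket.getaddrinfo("vm2.example.wmcloud.org", 443).
results = [
    (socket.AF_INET, ("172.16.2.161", 443)),          # A record (made up)
    (socket.AF_INET6, ("2a02:ec80:a000:1::42", 443)), # AAAA record (made up)
]

# Grossly simplified destination-address selection: prefer IPv6 when present.
ordered = sorted(results, key=lambda r: r[0] != socket.AF_INET6)

first_family, (first_addr, _port) = ordered[0]
print(first_addr)  # → 2a02:ec80:a000:1::42
```

The practical consequence discussed above: adding the AAAA record (plus dual-stack VMs) is enough to move the traffic, with no client-side config change, unless the client hardcodes a v4 literal.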
[14:47:51] oh I see
[14:48:22] I guess the obvious answer is that because in two years when you announce this change you can say "or just set up IPv6 properly for this service to bypass any problems this change is making"
[14:48:23] \o/
[14:48:48] ok
[14:49:27] We can also say that today, it's just that in two years a couple of users will read that and say "Oh, I did that already, I'm good"
[14:49:45] I think we're on the same page now, thank you for your patience with me!
[14:50:01] plus there'll be fewer floating IPs in use at all at that point, because for various use cases we can say "no, use IPv6 instead" or "you don't need a floating IP address, the proxy service supports custom domains now"
[14:50:12] yeah
[14:50:31] but if you do it in two years, many more people will hopefully have re-created their VMs by then anyway so it'll create the illusion of them having to do less work to adapt
[14:51:10] So, rolling back to that 'it comes from the neutron router' thing... is that as simple as saying 'add 185.15.56.238 to your firewall rule' or is it more complicated than that?
[14:51:13] (I'm still hopeful that people will do the work to enable IPv6 because that's the right thing to do, but I guess that's not that common of a view :-)
[14:51:22] well the problem is that you lose source information entirely
[14:51:55] So that means that it opens you up to everything in the cloud, right? And you can't be selective.
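The 14:51:10–14:51:55 exchange can be sketched as a toy model of source-based filtering (the rule and the internal addresses are hypothetical; only the router IP comes from the log earlier in the conversation). Before the change a server-side rule can admit one trusted VM; after it, every internal client arrives SNATed as the router, so a source-keyed rule either blocks everyone or, if it allows the router IP, admits every VM in the cloud indistinguishably:

```python
import ipaddress

NEUTRON_ROUTER = ipaddress.ip_address("185.15.56.238")  # from the log above

# Hypothetical security-group-style rule: allow only this one trusted VM.
trusted_client = ipaddress.ip_address("172.16.2.50")  # made-up internal IP

def allowed(src_ip):
    """Return True if the packet source matches the trusted client."""
    return ipaddress.ip_address(src_ip) == trusted_client

# Before the routing change: the server sees real VM source addresses.
print(allowed("172.16.2.50"))   # → True  (the trusted VM)
print(allowed("172.16.2.99"))   # → False (some other VM)

# After: the only source the server ever sees for internal floating-IP
# traffic is the router, so this rule can no longer be selective.
print(allowed(str(NEUTRON_ROUTER)))  # → False
```

Which is why "add 185.15.56.238 to your firewall rule" is all-or-nothing: the rule works, but it cannot distinguish one internal client from another.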
[14:52:08] yes, and you lose the ability to audit which client is using things
[14:52:20] if you did that change today, we'd for example lose information on which internal users are using the VPS web proxy, or which VMs are sending outbound emails, unless we came up with some workaround to restore them
[14:52:42] * andrewbogott nods
[14:53:07] IMHO neither of those things can happen, so we would need to come up with some sort of a workaround
[14:53:47] well, as you said, the workaround is to just use v6
[14:54:18] yes, but that requires both the client and server VMs to be re-created
[14:54:33] hm
[14:54:34] yeah
[14:55:04] We should probably disable the ability to create v4-only VMs
[14:55:29] i.e. the proxy service already supports v6, but that doesn't help if our users have closer to a thousand VMs still to migrate
[14:55:45] https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_VXLAN_IPv6_migration schedules that to happen in a couple of weeks
[14:56:22] yeah, good.
[14:56:28] and with the default being dualstack, I don't think that makes much of a difference as long as we do that before Debian 13 is out
[14:56:39] (I have a personal calendar reminder for that as well)
[14:56:52] yeah, it's a rare user who is intentionally selecting v4-only instead of the default.
[14:57:06] well I guess the original plan was to disable the old VLAN, but keep the v4-only VXLAN for now
[14:57:38] Since I'm thinking about it I'm going to try to get that document to be correct, but I won't announce or rip out anything, it can just sit there poised and ready for the v6 future.
[15:02:19] the disk space alert seems to be https://phabricator.wikimedia.org/T395020#10891317
[15:02:42] I'm about to log off now, it's at 85% which should last through the weekend
[15:05:01] * andrewbogott moves that task to 'stalled' and finds something else to do.
[15:05:09] have a good weekend taavi
[15:56:46] toolsdb replication is back in sync!
T396199
[15:56:48] T396199: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06 - https://phabricator.wikimedia.org/T396199
[17:00:14] taavi: if you are around please help review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/105
[17:00:38] if not maybe someone else can? dhinus?
[17:01:38] It's for the bug described here https://phabricator.wikimedia.org/T396210
[17:02:47] Raymond_Ndibe: I'm out sorry!
[17:03:55] Ok no problem. I'm on break too lol 😆. Had to fix it though, since I kind of caused it. Maybe I can wait for taavi
[17:04:21] Or maybe I can self-approve it
[20:40:54] openstack-browser is having trouble getting answers from designate. T396256
[20:40:54] T396256: openstack-browser timing out trying to fetch dns zones in multiple projects - https://phabricator.wikimedia.org/T396256
[21:33:07] now something bad is happening either with toolforge or with toolforge monitoring
[21:45:52] wait, no, it's just tools beta