[07:37:18] arturo: when you have a moment, https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/6
[07:43:03] taavi: LGTM +1
[07:43:44] thanks, deploying
[07:48:53] designate is returning 504s? https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/490149
[08:00:06] i restarted things and it seems better
[08:00:09] now running apply
[08:01:47] toolserver.org now has an IPv6 address
[08:20:39] 🎉
[08:24:05] arturo: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/7
[08:39:09] TIL that tools.wikimedia.de still exists (and is a CNAME to toolserver.org)
[08:39:19] heh
[08:40:30] anyway, any concerns about the newest MR or can I merge it?
[08:40:44] I'm looking
[08:41:36] I'm scratching my head regarding why the import block for records don't need the project_id
[08:41:48] I guess is because is the same as the auth
[08:42:14] anyway, approved +1
[08:45:00] thanks
[08:47:14] one last MR: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/8
[08:48:58] 👀
[08:52:08] +1'd
[08:52:14] (but there is a typo)
[08:52:26] thanks, fixing and will merge then
[08:52:53] btw I think we can close T380174 now?
[08:52:53] T380174: CloudVPS: IPv6 in eqiad1 - https://phabricator.wikimedia.org/T380174
[08:53:25] yes
[08:55:02] also I just closed the 13 years old T37947
[08:55:03] T37947: Enable IPv6 on CloudVPS - https://phabricator.wikimedia.org/T37947
[08:55:05] there's also T37947, with a few open subtasks, and that one should probably be open until docs and such have been updated
[08:55:07] ah
[08:55:24] docs is T380054 apparently
[08:55:25] T380054: Cloud VPS: prepare documentation on VXLAN/IPV6 migration - https://phabricator.wikimedia.org/T380054
[08:55:35] i think T380081 is also fixed?
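[Editor's note: the import-block question at 08:41:36 can be illustrated with a sketch. This is hypothetical, not the actual tofu-provisioning code: the resource name is invented and the zone/recordset IDs are placeholders. The point arturo and taavi settle on is that the provider scopes Designate lookups to the project of the authenticated session, so no explicit project_id is needed in the import block.]

```hcl
# Hypothetical sketch of an OpenTofu import block for a Designate recordset,
# using the openstack_dns_recordset_v2 resource from the OpenStack provider.
# IDs are placeholders; the provider resolves the project from the auth
# scope, which is why no project_id appears here.
import {
  to = openstack_dns_recordset_v2.example_aaaa
  id = "<zone_id>/<recordset_id>"
}

resource "openstack_dns_recordset_v2" "example_aaaa" {
  zone_id = "<zone_id>"
  name    = "toolserver.org."
  type    = "AAAA"
  ttl     = 3600
  records = ["<ipv6-address>"]
}
```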
[08:55:35] T380081: horizon: enable the UI to select networks on VM creation panel - https://phabricator.wikimedia.org/T380081
[08:55:46] yeah both of them are nearly done
[08:55:52] just last few checks, then close them
[08:55:57] cool
[09:52:54] taavi: https://usercontent.irccloud-cdn.com/file/dbDK98Ql/image.png
[09:54:27] arturo: i'll look, but suspect it's a monitoring issue as i just fixed T392559 to make it appear as a target in the first place
[09:54:28] T392559: metricsinfra: maintain-projects is broken - https://phabricator.wikimedia.org/T392559
[09:54:37] ok
[09:55:39] yup, metricsinfra is trying to scrape it over v6 from a v4-only instance :D
[09:57:00] i need to get some food, will fix after that
[10:21:21] topranks: I added this FQDN for this addr: https://netbox.wikimedia.org/ipam/ip-addresses/18719/ I hope the PTR zone has been set up already? :-S
[10:22:24] arturo: did you run the cookbook to update the zone files?
[10:22:31] $ sudo cookbook sre.dns.netbox 'neutron updates'
[10:22:38] this one? ^^^
[10:22:40] yes
[10:22:45] is running
[10:22:52] but that one fails if the zone is not set already?
[10:23:03] it shouldn't
[10:23:11] ok
[10:23:14] it might not have any effect, but it won't crash i think
[10:23:51] It's not the "zone" that doesn't exist as such, but the snippet that the Netbox DNS automation file will create is not referenced in the zone that does
[10:24:05] WMF DNS is auth for 0.8.c.e.2.0.a.2.ip6.arpa.
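[Editor's note: the ip6.arpa names in this exchange are mechanical nibble reversals of the IPv6 address, which the Python standard library can compute; that is handy for grepping the gdnsd zone files as done below. A quick sketch, using the tools-legacy-redirector address that appears later in the log:]

```python
import ipaddress

# Compute the reverse-DNS (PTR) name for an IPv6 address.
addr = ipaddress.ip_address("2a02:ec80:a000:1::304")
print(addr.reverse_pointer)
# → 4.0.3.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa
```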
[10:25:24] it should work just fine
[10:25:26] https://www.irccloud.com/pastebin/BWIRKY8A/
[10:26:11] nah
[10:26:42] oh wait
[10:26:46] sorry - yeah I did it already
[10:27:01] there is a record for the cloudgw IP
[10:27:01] https://netbox.wikimedia.org/ipam/prefixes/1100/ip-addresses/
[10:27:21] and the required "INCLUDE" statement is in the 0.8.c.e.2.0.a.2.ip6.arpa zone file:
[10:27:31] cmooney@dns2005:/etc/gdnsd/zones$ grep 4.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa *
[10:27:31] 0.8.c.e.2.0.a.2.ip6.arpa:$INCLUDE netbox/4.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa
[10:27:35] so yeah should be fine
[10:28:02] thanks for double checking
[10:29:19] but anyway please do check this before adding the stuff in Netbox. If it's missing in the dns it causes problems, and blocks SREs doing reimages and other stuff that trigger the dns cookbook
[10:29:35] yes, you are right
[10:30:21] if it's a new range this is the kind of thing you need to add:
[10:30:21] https://gerrit.wikimedia.org/r/c/operations/dns/+/1109732
[10:30:35] the process is quite crunky, I'm hoping to push for something saner in time we don't need to do it this way
[10:31:02] 👍
[10:43:39] arturo: filed T392570 for the prometheus issue, unfortunately it seems a bit complex
[10:43:40] T392570: metricsinfra: Support scraping v6-enabled instances - https://phabricator.wikimedia.org/T392570
[10:44:59] taavi: thanks! all the options you listed seems sensible to me
[10:59:39] functional network tests in eqiad1:
[10:59:41] [2025-04-24 10:50:25] INFO: --- passed tests: 78
[10:59:41] [2025-04-24 10:50:25] INFO: --- failed tests: 0
[10:59:41] [2025-04-24 10:50:25] INFO: --- total tests: 78
[11:22:38] hi, is it expected that tools.wmflabs.org Ipv6 doesn't work?
[11:24:08] paladox: if you mean pinging it doesn't work, it seems like we're missing a security group rule
[11:24:16] i'll have a look once i'm out of meetings
[11:24:22] ah ok, thanks!
[11:24:28] but i think http(s) itself should work?
[11:25:35] yup, that works
[11:26:04] nice :)
[11:26:09] cmooney@wikilap:~$ curl -v https://tools.wmflabs.org
[11:26:09] * Trying 2a02:ec80:a000:1::304:443...
[11:26:09] * Connected to tools.wmflabs.org (2a02:ec80:a000:1::304) port 443 (#0)
[11:43:11] arturo: seemingly the default security group has a allow all ICMP v4 rule. any concerns about duplicating that as is for ipv6? generally that means that all v6 instances will be pingable from the outside (compared to v4 instances where that only applied to those with floating IPs), but I think that's fine
[11:45:26] My general instincts are to allow all ICMP
[11:45:56] one could limit to only ICMP ECHO allowed in from outside, along with RELATED, ESTABLISHED ICMP (i.e. packet too big messages and other necessary things)
[11:47:05] You also need to make sure not to break neighbor discovery, router advertisements and the like, that would mess everything up (though it's only local to the link / LAN)
[11:50:56] taavi: yeah, seems fine
[11:52:04] i added those by hand to tools / toolsbeta
[11:52:12] we should maybe find a way to backfill those to all existing projects
[11:52:26] shame that neutron can't do a single rule that matches both protocols :/
[11:52:29] taavi: tofu-infra tracks that
[11:52:33] confirm it's working
[11:52:36] cmooney@wikilap:~$ ping tools.wmflabs.org
[11:52:36] PING tools.wmflabs.org(tools-legacy-redirector-3.tools.eqiad1.wikimedia.cloud (2a02:ec80:a000:1::304)) 56 data bytes
[11:52:36] 64 bytes from tools-legacy-redirector-3.tools.eqiad1.wikimedia.cloud (2a02:ec80:a000:1::304): icmp_seq=1 ttl=52 time=89.2 ms
[11:52:36] 64 bytes from tools-legacy-redirector-3.tools.eqiad1.wikimedia.cloud (2a02:ec80:a000:1::304): icmp_seq=2 ttl=52 time=88.9 ms
[11:52:56] arturo: I see https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/modules/project/default_sg_rules.tf?ref_type=heads#L69 but that definitely was not applied on tools, so I think that's for new projects only?
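[Editor's note: for illustration, a hypothetical tofu rule mirroring the "allow all ICMP" default for IPv6, in the style of the terraform-provider-openstack resource that default_sg_rules.tf presumably uses. The resource name and the security-group reference are placeholders. As taavi notes, Neutron cannot express one rule covering both ethertypes, so the v6 rule must exist separately from the v4 one.]

```hcl
# Hypothetical sketch: IPv6 counterpart of the default "allow all ICMP" rule.
# Neutron requires separate rules per ethertype (IPv4 vs IPv6).
resource "openstack_networking_secgroup_rule_v2" "default_icmpv6_ingress" {
  direction         = "ingress"
  ethertype         = "IPv6"
  protocol          = "ipv6-icmp"
  remote_ip_prefix  = "::/0"
  security_group_id = var.default_secgroup_id # placeholder
}
```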
[11:53:28] well, tools is definitely interesting: do we want to track that in tofu-infra or in the dedicated tofu-provisioning repo?
[11:54:05] i think the default security group rules should be in tofu-infra to not duplicate them, and anything custom in the toolforge repo
[11:54:15] ok
[11:54:31] then the way to go is to drop all the rules by hand at the time of tofu apply
[11:54:42] there is a brief moment in which there are no rules, but that's generally OK
[11:55:35] I see a "manage_default_secgroup = false," setting in resources/eqiad1/tools/main.tf, I think that's the one that we need to flip
[11:55:43] yes
[11:56:04] can we import the rules instead? not having the rules to allow project-internal traffic there even for a second is very scary
[11:56:25] import rules is tedious, I don't recommend
[11:56:32] ok
[11:56:34] most of the traffic is covered by other specific security groups
[11:56:42] will neutron let us add duplicate rules and remove the old ones afterward?
[11:56:46] no
[11:56:47] :-(
[11:56:50] aha
[11:57:37] most of connections are established already, so if a rule is dropped, the connection wont break anyway
[11:57:48] true
[11:58:13] the brief block only affects new connection (theory)
[11:58:32] ok, let's try it in toolsbeta?
[11:58:37] for the projects I backfilled, I did not notice a cut
[11:58:39] sure
[11:59:14] more info: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/tofu-infra#Managing_the_default_security_group
[12:04:27] taavi: I guess this is not a priv key, right? https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/3/diffs#8ca01d86f9efb98b375fde98a41614a36938a3a8_4_55
[12:04:45] arturo: no, that's the public key
[12:04:53] thanks for confirming
[12:06:16] chuckonwu: do you think https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/3 is ready to merge?
[12:07:41] hmm, I do not think the floating IP records should be imported there, they're already automatically managed via a script in puppet
[12:08:26] the ones with 'MANAGED BY dns-floating-ip-updater.py IN PUPPET'
[12:08:28] ?
[12:08:29] yes
[12:08:38] makes sense
[12:13:14] taavi: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/9 does this make sense?
[12:13:49] lgtm
[12:16:18] s3 in codfw1dev is failing?
[12:16:21] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/jobs/490349
[12:29:59] I plan to decline this: T209011 in favour of IPv6
[12:29:59] T209011: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis - https://phabricator.wikimedia.org/T209011
[12:30:25] sounds good
[12:35:07] arturo: https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_VXLAN_IPv6_migration#Timeline why not make dualstack the default?
[12:35:54] i also would make either of the VXLAN options the default today, instead of waiting two months to change that
[12:36:22] maybe we can discuss later in the weekly team meeting. I wanted to be extra carefull
[13:19:41] i have added a general plan to T379175
[13:19:41] T379175: Enable IPv6 for the Cloud VPS web proxy - https://phabricator.wikimedia.org/T379175
[13:24:31] hmm I just got an error running tofu plan (from the cookbook)
[13:24:33] Error: Failed to get existing workspaces: operation error S3: ListObjectsV2, exceeded maximum number of attempts, 5, https response error StatusCode: 503
[13:24:52] that was for codfw, during "Initializing modules..."
[13:25:39] actually I think it's during "Initializing the backend...", I can reproduce
[13:26:28] just saw arturo's message from 1 hour ago: "s3 in codfw1dev is failing?"
[13:27:35] can I get a +1 anyway for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/202
[13:27:44] I will apply to eqiad1 only
[13:28:19] +1
[13:28:34] thanks
[13:33:41] arturo, yes that branch is ready for merging
[14:07:12] chuckonwu: approved, please hit the merge button and then run the apply pipeline step :-)
[14:08:36] i've just now realized https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/blob/main/toolsbeta/dns.tf?ref_type=heads#L13 is something managed by the web proxy service, not sure if that should be in tofu code as well
[14:09:13] ah, good point
[14:09:40] * arturo now wonders how to interact with the tofu state via a cli
[14:10:10] I haven't yet run the apply stage so this is a quick patch, one second
[14:10:22] oh good
[14:13:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138799 would add a description to any future web proxy records to avoid that confusion
[14:14:00] +1'd
[14:15:26] andrewbogott: I'm going to rebase and merge your dynamicproxy puppet changes
[14:18:37] chuckonwu: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/10 +1'd
[14:18:39] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/10 removes the record that is managed by Puppet
[14:18:42] Great
[14:20:43] thanks!
[14:40:51] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/merge_requests/18 ?
[14:40:55] wdyt?
[14:41:27] who does root@wmcloud.org go to?
[14:41:39] us, I hope.
[14:43:08] do you have other suggestions for the dest addr?
[14:43:43] cloud-admin-feed@lists?
[14:44:11] k
[14:44:12] does sendmail really just work like that from a ci job?
huh
[14:44:31] I was _hoping_
[14:44:33] 🙏
[14:44:42] we can give it a try
[14:46:41] that's failing in codfw1dev for some reason
[14:46:53] yeah, can't somehow find the s3 bucket
[14:47:02] also affects tofu-infra
[14:47:26] may be related to the ceph work andrewbogott is doing on codfw1dev
[14:47:44] +1
[14:48:44] oh, I saw the other day that tofu is adding native locking for s3 bucket state
[14:48:51] (something we discussed at some point)
[14:49:07] if this email thing work, I will add it next to the toolforge repo
[14:50:36] dcaro: nice!
[14:51:25] https://github.com/opentofu/opentofu/commit/eba25e2fedca0a782bdf9d381a3a65f3ad1acff3
[14:55:31] taavi: one thing I notice
[14:55:55] on this page: https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_IP_space
[14:56:08] it seems to only list the eqiad ranges?
[14:56:26] if that's the intent I think the v4 private range should be 172.16.0.0/17 as that is all that is routed to the cloudgw's
[14:56:38] and we have space from the upper half of the /16 used in codfw
[14:57:16] topranks: yeah, i was thinking about that.. in my mind that page is for our users for problems like "what IPs should I set the access list to in my on-wiki credentials and such", which is why that doesn't include codfw1dev with no real user workloads
[14:57:27] i will update it to only have the first /17.
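[Editor's note: the /17 split topranks describes is easy to sanity-check with Python's stdlib ipaddress module. A quick sketch; the interpretation of the two halves follows the discussion above (lower half routed to the eqiad1 cloudgws, upper half holding codfw space):]

```python
import ipaddress

# Split the Cloud VPS private v4 space into its two /17 halves.
lower, upper = ipaddress.ip_network("172.16.0.0/16").subnets(prefixlen_diff=1)
print(lower)  # → 172.16.0.0/17   (routed to the eqiad1 cloudgws)
print(upper)  # → 172.16.128.0/17 (space used in codfw)
```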
thanks
[14:57:36] yeah no worries, not a big deal
[14:57:54] the other alternative would be to include codfw1dev ranges, so we'd need to add the v6 range for there
[14:58:16] but as you say codfw not used by user workloads so I think it makes more sense to change the v4 mask
[14:58:29] https://wikitech.wikimedia.org/w/index.php?title=Help:Cloud_VPS_IP_space&curid=458702&diff=2295736&oldid=2295042
[14:58:42] cool <3
[15:42:52] email proposal:
[15:42:57] https://www.irccloud.com/pastebin/sr5L7tkF/
[15:43:58] arturo: s|make sure VXLAN/IPv6-dualstack is|make sure the VXLAN/IPv6-dualstack network is|
[15:44:06] can you update the timeline on the wikitech page as well?
[15:44:11] yes
[15:46:09] taavi: https://wikitech.wikimedia.org/w/index.php?title=News%2F2025_Cloud_VPS_VXLAN_IPv6_migration&diff=2295759&oldid=2295688
[15:47:08] maybe push back disabling VLAN/legacy a bit further?
[15:47:10] otherwise lgtm
[15:47:42] ok, 2 months
[15:47:51] wfm
[15:48:13] topranks: the v6 BGP session for cloudlb2002-dev should now be using the correct interface, if you want to set the switch-side config
[15:49:25] arturo: it still has the default switch (in a week) and disabling legacy creation in one item?
[15:49:39] i'll fix
[15:50:16] Should it be possible for me to ping6/traceroute6/curl -6 toolserver.org at this point?
[15:50:20] taavi: yes, please fix it. I have my child demanding my attention now 🌈
[15:50:30] bd808: yes
[15:50:47] taavi: I have sad news then. All 3 fail.
[15:50:59] I'll make a paste
[15:56:47] * arturo offline
[15:57:11] huh
[15:57:11] Apr 23 15:33:55 tools-legacy-redirector-3 systemd-networkd[455]: ens3: DHCPv6 address 2a02:ec80:a000:1::304/128 (valid for 23h 59min 59s, preferred for 23h 59min 59s)
[15:57:11] Apr 24 15:33:55 tools-legacy-redirector-3 systemd-networkd[455]: ens3: DHCPv6 lease lost
[15:58:20] taavi: https://phabricator.wikimedia.org/P75450 -- there isn't a lot of information there, but let me know if I can check something more useful.
[16:05:19] aha, the host-level firewall is blocking dhcpv6 response packets for some reason
[16:07:36] bd808: tracking this in T392611 now
[16:07:37] T392611: VMs with ferm host-level firewall do not permit DHCPv6 responses - https://phabricator.wikimedia.org/T392611
[16:08:16] :old-man-shakes-fist-at-ferm:
[16:16:23] this also happens in iptables and nftables I believe
[16:17:23] at least from my home connection doing DHCPv6 IA_PD, the request is sent to a multicast address, reply comes from a unicast one, and thus doesn't match "related, established"
[16:17:38] I need this instead:
[16:17:39] iifname "enp2s0" ip6 saddr fe80::/10 udp dport 546 accept
[16:18:04] topranks: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138837
[16:18:18] taavi: hey was afk just back
[16:18:26] let me look at that and the bgp thing you mentioned
[16:18:32] yeah no hurry for the bgp thing
[16:19:13] PCC at https://puppet-compiler.wmflabs.org/output/1138837/5353/, for both an instance running a firewall and one not
[16:19:16] +1 for the patch makes sense
[16:19:29] thanks!
[16:21:50] bd808: try now?
[16:29:12] taavi: the curl works!
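[Editor's note: the nftables rule quoted at 16:17:39 would translate to ferm roughly as below. This is a hypothetical sketch, not the contents of the actual puppet change (gerrit 1138837). The reason an explicit accept is needed, per topranks above, is that the DHCPv6 request goes to a multicast address while the reply comes from the server's link-local unicast address, so a conntrack "related, established" rule never matches it.]

```
# Hypothetical ferm fragment: accept DHCPv6 replies (client port 546)
# from link-local sources, since conntrack state rules do not match them.
domain ip6 {
    table filter {
        chain INPUT {
            proto udp dport 546 saddr fe80::/10 ACCEPT;
        }
    }
}
```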
:)
[16:34:19] nice :)
[16:34:26] taavI: when you get a minute
[16:34:27] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1138850
[17:22:41] taavi: ok I fixed a few things and the BGP is up and working for the 2a02:ec80:a100:4000::1/128 vip from cloudlb2002
[17:23:02] well, for some value of "working"
[17:23:16] we are missing the required "ip rules" on cloudlb2002 to make it work
[17:23:31] right now it is getting packets but sending the response out the prod realm primary interface, following the default
[17:24:25] we need the equivalent of this for IPv6:
[17:24:31] https://www.irccloud.com/pastebin/nwxPVK4i/
[17:24:50] The IPv6 route is in the "cloud-private" table so we are good there:
[17:24:54] cmooney@cloudlb2002-dev:~$ ip -6 route show table cloud-private
[17:24:54] default via 2a02:ec80:a100:205::1 dev vlan2151 metric 1024 pref medium
[17:25:57] we just need to tell the system to use this table for packets from... well that's not 100% clear
[17:26:19] certainly 2a02:ec80:a100:4000::/64, but perhaps it should be 2a02:ec80:a100::/64
[17:26:31] sry.... 2a02:ec80:a100::/48
[17:26:52] topranks: while you're in there, is cloudlb2004-dev also set up properly? I'm flipping service from 2001-dev to 2004-dev and assume some network things need doing. (2004-dev doesn't actually work yet but that's another story)
[17:28:13] I mean, haproxy doesn't work, the host itself should be fine.
[17:29:34] andrewbogott: no bgp is not set up for it on the cloudsw
[17:29:40] https://www.irccloud.com/pastebin/WjaSwuEY/
[17:30:45] Is it simple to switch service that from 2001 to 2004?
[17:30:54] the alerts on cloudcephosd1029 is me, not sure why they did not work (the cookbook did create a silence :/ )
[17:31:04] *the silences did not work
[17:31:07] oh good, thanks dcaro!
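[Editor's note: the missing piece topranks describes at 17:23-17:26 (reply packets for the cloud VIP leaving via the prod-realm default route) is the classic source-based policy-routing problem. A hypothetical sketch of the IPv6 equivalent of the quoted v4 rules, assuming the broader /48 is the right source match; the chat leaves open whether /64 or /48 is correct, and the rule preference value is invented:]

```
# Hypothetical sketch (on cloudlb2002-dev, privileged): route replies
# sourced from cloud-realm v6 addresses via the cloud-private table
# instead of the default (prod realm) table.
ip -6 rule add from 2a02:ec80:a100::/48 lookup cloud-private pref 100

# Verify:
ip -6 rule show
ip -6 route show table cloud-private
```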
I was just starting to worry
[17:31:24] and also briefly forgot that that server is the lab rab
[17:31:26] andrewbogott: they can both be configured side by side
[17:31:37] however the bigger issue is bird is not running on 2004
[17:31:42] cmooney@cloudlb2004-dev:~$ sudo birdc
[17:31:42] Unable to connect to server control socket (/run/bird/bird.ctl): No such file or directory
[17:31:55] let's get that working before we trigger alerts by adding it on the cloudsw side and failing
[17:32:01] andrewbogott: fyi there's a tmux running in cloudcumin1001 with the cookbook running
[17:32:13] yeah, that's because haproxy is crashing I think. And haproxy is crashing because the conf syntax is super different in Bookworm and I haven't updated things yet.
[17:32:24] ok yes that could be it
[17:32:28] dcaro: meaning I left it running? Or meaning you're leaving it running?
[17:32:41] nonon, I left it running xd
[17:32:45] ok :)
[17:32:55] topranks: this is https://phabricator.wikimedia.org/T392366 if you need a ref
[17:33:24] 2001-dev is already mostly out of service so I don't mind if you break it in bgp
[17:33:28] andrewbogott: sorry maybe we are talking past each other
[17:33:33] I was looking at cloudlb2004-dev
[17:33:36] you are working on that?
[17:33:43] or cloudcephosd2004-dev ?
[17:33:58] andrewbogott: I'll leave it undraining in batches, in case something goes awry feel free to take it stop it
[17:34:41] * dhinus offline
[17:34:45] dcaro: this is putting 1029 back into the pool?
[17:34:57] yep
[17:35:00] topranks: shit, wrong task, hang on
[17:35:33] T377126
[17:35:34] T377126: replace cloudlb2001-dev with cloudlb2004-dev - https://phabricator.wikimedia.org/T377126
[17:35:36] there we go
[17:35:49] ok yeah well that does make more sense
[17:35:54] and I know they run haproxy :)
[17:36:37] ideally they do
[17:36:49] ha yeah
[17:36:56] let me know how you get on with that
[17:37:03] when it's done hopefully bird will be happy
[17:37:14] we can tell by running "sudo birdc" and then "show protocols"
[17:37:27] after which I'll add the session on the switch and we can see how it goes
[17:37:35] ok
[17:40:39] oh, quick question, the support assist file for the hard drive replacement is too big to upload to phabricator, any ideas where can I put it?
[17:44:14] s3?
[17:44:14] topranks: huh, I thought I got those rules added already. anyway, will have a look tomorrow
[17:44:17] Etherpad?
[17:44:21] dcaro: google drive?
[17:44:21] dcaro: people.wikimedia.org?
[17:44:26] dcaro: you're only going to get bad suggestions from me
[17:44:38] depends on if that's needs to be private or not
[17:44:56] it probably does, and google drive is probably the right answer
[17:44:57] I think it does not, the disk procurement tickets are not private right?
[17:45:21] Basically anything to do with procurement or negotiation should be kept private if possible.
[17:45:35] I mean replacement sorry
[17:45:35] andrewbogott: are you planning to update https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps/-/merge_requests/2?
[17:45:38] (same same I guess)
[17:46:09] taavi: yes but also my day is 100% out of control so don't save it for me if you want it done.
[17:46:27] taavi: thanks for the link, I did not know about that
[17:50:17] Can I reset the tofu-infra repo in cloudcontrol1011 or does someone have work in progress there?
(puppet is complaining)
[17:58:20] added the people.wikimedia.org to my wm-lol xd https://wm-lol.toolforge.org/api/v1/search?query=help
[17:58:27] * dcaro off
[18:21:27] * andrewbogott does the reset
[23:58:30] "I have faith that we will get to IPv6 before the heat death of the universe, but I'm not willing to put a more definite date on it than that." -- bd808, https://phabricator.wikimedia.org/T209011#4734161