[09:11:59] we can't easily answer the question "how many vCPUs we have available in eqiad1" because the metrics that power the grafana dashboard are broken
[09:12:05] https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary?orgId=1
[09:31:20] topranks: I just discovered this T376879 when merging the cloudgw patch
[09:31:21] T376879: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879
[09:31:47] would you suggest refactoring the keepalived module to support IPv6, or migrating to BGP-based anycast VIPs?
[09:32:33] The BGP way is the better way and what we should do long term
[09:33:20] the question is how much effort is involved, but I'd already thought it might have been better to move cloudgw to the BGP way when looking at the previous patches
[09:33:51] the question is how do you provide the highly-available GW IP for the cloudnet/neutron to point its default route at without VRRP?
[09:34:00] I don't think we can do BGP between cloudgw and cloudnet, right?
[09:34:23] mmmm
[09:34:47] it's not immediately obvious to me how that would work
[09:35:19] so - if you need to refactor the keepalived module to support v6 for the "inside" of the cloudgw (facing cloudnet/neutron) then I guess we can keep using it for the "outside" of the cloudgw (facing cloudsw) too
[09:35:33] still planning to eventually change the outside to BGP instead of VRRP
[09:37:22] ok
[09:38:09] here are the docs for neutron and BGP: https://docs.openstack.org/neutron/latest/admin/config-bgp-dynamic-routing.html
[09:38:11] You could _potentially_ use radvd to send IPv6 RAs from both cloudgws to the cloudnets, and they'd get their default route from there. The problem with that is that the failover time (if a cloudgw dies and thus cloudnet stops getting RAs from it) is far too slow (10 mins or something by default, I'm not sure how much it can be tuned down).
[09:39:34] ok
[09:40:02] I'm happy with what you choose here tbh, if BGP from neutron seems workable it's probably the better option. But that said, I do wonder if we aren't moving too many things at once, making it hard for ourselves.
[09:42:37] yeah
[09:42:58] keepalived is probably the easiest for now, otherwise additional neutron changes would be a massive sidetrack
[10:12:47] topranks: the change in keepalived may not be that big after all https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079234
[10:14:16] arturo: ok yeah, mainly just the other definition in keepalived.conf then?
[10:14:20] lgtm I think
[10:14:29] yeah
[10:14:42] and yeah I kind of agree it might be best to avoid a major distraction by adding BGP to the current work
[10:17:03] ok
[10:36:00] arturo: you suggested restarting rabbitmq yesterday, would you do it with the procedure described at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/RabbitMQ#Resetting_the_HA_setup ?
[10:36:46] dhinus: I would start with a simpler service restart, to see if that makes any difference
[10:36:52] `sudo cumin cloudcontrol2* 'systemctl restart rabbitmq-server || true'`
[10:37:19] that procedure is a full reset, which may be the stronger measure in case of additional malfunction
[10:37:22] you mean cloudcontrol1?
[10:37:40] well, yeah, I copied from codfw1dev, where I usually have to do it once every couple of days
[10:37:44] ok!
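For context on the keepalived limitation discussed above: the usual workaround for T376879 is to keep the IPv4 and IPv6 VIPs in separate vrrp_instance blocks rather than mixing them in one. A minimal sketch of that layout follows; the interface name, router IDs and addresses are placeholders, not the actual cloudgw values, and this is not claimed to be what the gerrit change implements.

```
# Sketch only: one VRRP instance per address family, since keepalived
# rejects mixed IPv4/IPv6 VIPs inside a single vrrp_instance.
vrrp_instance cloudgw_v4 {
    state BACKUP
    interface eno1                  # placeholder interface
    virtual_router_id 51            # placeholder VRID
    priority 100
    virtual_ipaddress {
        192.0.2.1/24                # placeholder IPv4 VIP
    }
}

vrrp_instance cloudgw_v6 {
    state BACKUP
    interface eno1                  # placeholder interface
    virtual_router_id 52            # must differ from the v4 instance
    priority 100
    virtual_ipaddress {
        2001:db8::1/64              # placeholder IPv6 VIP
    }
}
```

If both address families need to fail over together, keepalived's vrrp_sync_group can tie the two instances to each other. Also, as topranks suspects later in the log, VRRP over IPv6 sources its advertisements from the link-local address; that is expected behaviour rather than something to fix.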
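The rabbitmq "soft restart" being discussed here, spelled out as commands and adjusted for the correction just below that the eqiad1 rabbit nodes are cloudrabbit1* rather than cloudcontrol1*, looks roughly like this; the cluster_status check is an extra suggestion, assuming rabbitmqctl is available on the nodes.

```
# Soft restart of the eqiad1 rabbitmq cluster members via cumin
# (hosts per the correction below: cloudrabbit1*, not cloudcontrol1*).
sudo cumin 'cloudrabbit1*' 'systemctl restart rabbitmq-server || true'

# Optional sanity check afterwards (assumes rabbitmqctl on the nodes).
sudo cumin 'cloudrabbit1*' 'rabbitmqctl cluster_status'
```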
[10:38:02] both the soft restart and the hard reset are good candidates for a cookbook :-)
[10:39:43] ah it's actually cloudrabbit1* :)
[10:40:30] restart completed
[10:41:03] oh, right, eqiad1
[10:43:08] trove is still broken, but I noticed something I didn't think about yesterday: the VM gets created, so I can try to ssh into it and see what happens
[10:43:16] maybe the trove docker container is broken
[10:47:41] ok, the database eventually became healthy just as I was checking
[10:47:51] when I sshed in there was no container running, but one appeared shortly after
[10:49:48] BTW because of T376220 I noticed there are many many entries on wikitech being created by the test project that are never cleaned up
[10:49:49] T376220: Labslogbot needs new SUL OAuth credentials after Wikitech authn changes - https://phabricator.wikimedia.org/T376220
[10:49:56] see https://wikitech.wikimedia.org/wiki/Special:Contributions/Labslogbot
[10:50:29] cloudinfra-internal-puppetserver is failing to resolve names in codfw
[10:50:34] https://www.irccloud.com/pastebin/BwrCMwbe/
[10:50:38] anything going on there?
[10:52:01] dcaro: I'm working on the network in codfw1dev, expect it to be intermittently broken.
[10:52:23] I think it has been failing for a few days (got there from puppet run errors in the email)
[10:52:30] though it might be a mixture?
[10:52:39] yeah, maybe
[10:52:45] let me know when the network is supposed to be stable and I'll recheck
[10:53:32] I believe I started the more invasive changes today
[10:55:41] topranks: do you have a moment to check ipv6 routing in cloudgw?
[10:55:57] this in particular
[10:56:00] https://www.irccloud.com/pastebin/9CFzTT9r/
[10:56:25] I'm trying to answer the question: why do VRRP packets come from `fe80::2eea:7fff:fe7b:e104` rather than the other address?
[10:59:09] without knowing better I suspect VRRP with IPv6 defaults to using the link-local IPs
[11:01:11] do you know if we can control this at the routing table level? via metrics or something?
[11:01:55] no, I'd expect it's a keepalived switch
[11:01:58] is it a problem?
[11:03:24] fwiw these two routes are missing for the routing in/out to work right, but you probably already know
[11:03:44] ip -6 route add vrf vrf-cloudgw default via 2a02:ec80:a100:fe03::1
[11:03:44] ip route add vrf vrf-cloudgw 2a02:ec80:a100::/55 via 2a02:ec80:a100:fe04::2:1
[11:04:57] I'll need to add them
[11:05:16] thanks for double checking
[11:06:34] arturo: the entries in wikitech at https://wikitech.wikimedia.org/wiki/Special:Contributions/Labslogbot seem to predate the oauth changes
[11:06:47] I think it might be a separate issue
[11:07:25] dhinus: yeah, I guess we would need to check the project deletion workflow
[11:08:10] I will open a separate task
[11:10:45] T376888
[11:10:45] T376888: tofuinfratest creates many pages in wikitech - https://phabricator.wikimedia.org/T376888
[11:39:03] topranks: quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079246
[11:47:30] yep that looks ok I think, +1
[11:47:38] thanks!
[12:39:40] * arturo relocating from public library to hoe
[12:39:43] home*
[13:47:06] ok I think I discovered something about the Trove issue: adding the "ssh-from-anywhere" SG to a broken instance makes it move from BUILD to ACTIVE
[13:47:33] I was adding it for debugging, but it looks like just by adding it, without even sshing to it, Trove is able to complete the provisioning
[13:49:26] did any network setting change about 1 week ago?
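The security-group comparison picked up just below can be reproduced from the CLI; a rough sketch, assuming the wmcs-openstack wrapper used elsewhere in this log, with trove_sg_OLD and trove_sg_NEW standing in for the real trove_sg_{id} group names.

```
# Sketch: list the rules of an old (working) and a new (broken) Trove SG
# side by side; the group names below are placeholders.
sudo wmcs-openstack security group rule list trove_sg_OLD --long
sudo wmcs-openstack security group rule list trove_sg_NEW --long
# Per the comparison below, the new groups turn out to be missing the
# allow-all egress rules that the default security group rules used to add.
```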
tf-infra-test started failing on October 4th
[13:52:26] that sounds like a missing default?
[14:21:12] dhinus: I have no idea off the top of my head. Many openstack things have changed lately
[14:21:22] for example, wmfkeystonehook
[14:21:33] not to mention all the stuff we do via tofu-infra
[14:25:45] the timing might coincide with some tofu-infra change, but I'm not sure which one
[14:29:29] trove instances don't use the
[14:29:44] "default" sg, but get their own trove_sg_{id} SG
[14:32:00] comparing with old trove instances, it looks like the new instances have a SG which is missing "ALLOW to 0.0.0.0/0"
[14:38:56] is that SG tracked via tofu-infra?
[14:39:18] no
[14:39:29] wait, is it a brand new project?
[14:39:30] I'm trying to understand where it gets created
[14:39:42] nope, existing project, new trove instance
[14:39:46] for new projects, the default sg is created by neutron
[14:39:50] ok, then nevermind
[14:39:58] maybe tofu is completely unrelated
[14:40:02] ok
[14:40:18] I just wanted to mention that we deleted the default security group rules
[14:40:43] it kinda sounds similar, but it could be a red herring
[14:40:44] which are added 1) to any new SG 2) to the default SG group
[14:40:53] ah maybe 1) is the issue
[14:41:14] allow all egress traffic sounds like a default sg rule
[14:41:16] trove will create a new SG when a new trove instance is created
[14:41:28] oh!
[14:41:31] there you go then
[14:41:50] yep, where is that setting that we changed?
[14:42:04] either we revert that, or we fix the trove SG by adding the required ruels
[14:42:06] *rules
[14:42:13] hardcoded in the openstack DB, because the openstack provider doesn't support it ;.-(
[14:42:19] :(
[14:42:41] actually, let me document that in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[14:44:00] can you change it with the openstack CLI?
[14:44:07] yes
[14:44:21] openstack default security group rule list
[14:44:23] and friends
[14:45:15] I think the solution is to re-add the default security group rules
[14:45:46] and then configure tofu-infra to 'delete default security group rules', so for the SGs we track in tofu there are no rules outside the ones defined in the rpo
[14:45:48] repo*
[14:45:57] yes I agree
[14:46:20] delete_default_rules=true for all tofu-managed SGs
[14:46:30] yeah
[14:50:23] I'm trying to recreate them based on the rules I see in old sec groups
[14:50:44] ok
[14:50:47] should be fairly simple
[14:51:18] something like: egress, all protocols, all destinations
[14:51:29] in IPv4, then the same for IPv6
[14:51:48] (you can omit IPv6 for now, but they were present as well when I deleted them)
[14:51:57] "sudo wmcs-openstack default security group rule create --egress --ethertype IPv4"
[14:52:07] and another one for ipv6
[14:52:15] sounds right
[14:54:02] hmm I think it works, but I created a new SG and it looks slightly different from the old ones
[14:55:15] I wonder if this could be the root cause of the DNS issue that was reported lately
[14:55:50] nah, nevermind, it was reported a month ago
[14:55:54] https://phabricator.wikimedia.org/P69627
[14:56:22] the new ones have additional "normalized_cidr='0.0.0.0/0', remote_ip_prefix='0.0.0.0/0'"
[14:56:30] probably harmless but I wonder why
[14:57:04] no idea
[14:57:51] we already set delete_default_rules in tofu, so I don't think we need any other change
[14:58:22] ok
[18:02:49] * dcaro off
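For reference, this is roughly what the delete_default_rules setting agreed on above looks like on a tofu-managed security group, using the OpenStack provider's openstack_networking_secgroup_v2 resource. The resource and rule names here are illustrative only; the real definitions live in the tofu-infra repo.

```
# Sketch only: with delete_default_rules set, Neutron's default rules are
# dropped at creation time, so the rules declared in the repo are the
# complete rule set for this group.
resource "openstack_networking_secgroup_v2" "example" {
  name                 = "example-sg"   # illustrative name
  delete_default_rules = true
}

# Explicit allow-all egress, mirroring what the re-added default security
# group rules provide for groups created outside tofu (e.g. by Trove).
resource "openstack_networking_secgroup_rule_v2" "example_egress_v4" {
  security_group_id = openstack_networking_secgroup_v2.example.id
  direction         = "egress"
  ethertype         = "IPv4"
  remote_ip_prefix  = "0.0.0.0/0"
}

resource "openstack_networking_secgroup_rule_v2" "example_egress_v6" {
  security_group_id = openstack_networking_secgroup_v2.example.id
  direction         = "egress"
  ethertype         = "IPv6"
  remote_ip_prefix  = "::/0"
}
```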