[09:11:59] we can't easily answer the question "how many vCPUs we have available in eqiad1" because the metrics that power the grafana dashboard are broken
[09:12:05] https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary?orgId=1
[09:31:20] topranks: I just discovered this T376879 when merging the cloudgw patch
[09:31:21] T376879: keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance - https://phabricator.wikimedia.org/T376879
[09:31:47] would you suggest refactoring the keepalived module to support IPv6, or migrating to BGP-based anycast VIPs?
[09:32:33] The BGP way is the better way and what we should do long term
[09:33:20] the question is how much effort is involved, but I'd already thought it might have been better to move cloudgw to the BGP way when looking at the previous patches
[09:33:51] the question is how do you provide the highly-available GW IP for the cloudnet/neutron to point its default route at without VRRP?
[09:34:00] I don't think we can do BGP between cloudgw and cloudnet, right?
[09:34:23] mmmm
[09:34:47] it's not immediately obvious to me how that would work
[09:35:19] so - if you need to refactor the keepalived module to support v6 for the "inside" of the cloudgw (facing cloudnet/neutron) then I guess we can keep using it for the "outside" of the cloudgw (facing cloudsw) too
[09:35:33] still planning to eventually change the outside to BGP instead of VRRP
[09:37:22] ok
[09:38:09] here are the docs for neutron and BGP: https://docs.openstack.org/neutron/latest/admin/config-bgp-dynamic-routing.html
[09:38:11] You could _potentially_ use radvd to send IPv6 RAs from both cloudgws to the cloudnets, and they'd get their default route from there. The problem with that is that the failover time (if a cloudgw dies and thus cloudnet stops getting RAs from it) is far too slow (10 mins or something by default, I'm not sure how much it can be tuned down).
[09:39:34] ok
[09:40:02] I'm happy with what you choose here tbh, if BGP from neutron seems workable it's probably the better option. But that said, I do wonder if we aren't moving too many things at once, making it hard for ourselves.
[09:42:37] yeah
[09:42:58] keepalived is probably the easiest for now, otherwise additional neutron changes would be a massive sidetrack
[10:12:47] topranks: the change in keepalived may not be that big after all https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079234
[10:14:16] arturo: ok yeah, mainly just the other definition in keepalived.conf then?
[10:14:20] lgtm I think
[10:14:29] yeah
[10:14:42] and yeah I kind of agree it might be best to avoid a major distraction by adding BGP to the current work
[10:17:03] ok
[10:36:00] arturo: you suggested restarting rabbitmq yesterday, would you do it with the procedure described at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/RabbitMQ#Resetting_the_HA_setup ?
[10:36:46] dhinus: I would start with a simpler service restart, to see if that makes any difference
[10:36:52] `sudo cumin cloudcontrol2* 'systemctl restart rabbitmq-server || true'`
[10:37:19] that procedure is a full reset, which may be the stronger measure in case of additional malfunction
[10:37:22] you mean cloudcontrol1?
[10:37:40] well, yeah, I copied from codfw1dev, where I usually have to do it once every couple of days
[10:37:44] ok!
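For context on the keepalived limitation discussed above: the usual workaround for T376879 is to keep the IPv4 and IPv6 VIPs in separate vrrp_instance blocks rather than mixing them in one. A minimal sketch of that layout follows; the interface name, router IDs and addresses are placeholders, not the actual cloudgw values, and this is not claimed to be what the gerrit change implements.

```
# Sketch only: one VRRP instance per address family, since keepalived
# rejects mixed IPv4/IPv6 VIPs inside a single vrrp_instance.
vrrp_instance cloudgw_v4 {
    state BACKUP
    interface eno1                  # placeholder interface
    virtual_router_id 51            # placeholder VRID
    priority 100
    virtual_ipaddress {
        192.0.2.1/24                # placeholder IPv4 VIP
    }
}

vrrp_instance cloudgw_v6 {
    state BACKUP
    interface eno1                  # placeholder interface
    virtual_router_id 52            # must differ from the v4 instance
    priority 100
    virtual_ipaddress {
        2001:db8::1/64              # placeholder IPv6 VIP
    }
}
```

If both address families need to fail over together, keepalived's vrrp_sync_group can tie the two instances to each other. Also, as topranks suspects later in the log, VRRP over IPv6 sources its advertisements from the link-local address; that is expected behaviour rather than something to fix.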
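The rabbitmq "soft restart" being discussed here, spelled out as commands and adjusted for the correction just below that the eqiad1 rabbit nodes are cloudrabbit1* rather than cloudcontrol1*, looks roughly like this; the cluster_status check is an extra suggestion, assuming rabbitmqctl is available on the nodes.

```
# Soft restart of the eqiad1 rabbitmq cluster members via cumin
# (hosts per the correction below: cloudrabbit1*, not cloudcontrol1*).
sudo cumin 'cloudrabbit1*' 'systemctl restart rabbitmq-server || true'

# Optional sanity check afterwards (assumes rabbitmqctl on the nodes).
sudo cumin 'cloudrabbit1*' 'rabbitmqctl cluster_status'
```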
[10:38:02] both the soft restart and the hard reset are good candidates for a cookbook :-)
[10:39:43] ah it's actually cloudrabbit1* :)
[10:40:30] restart completed
[10:41:03] oh, right, eqiad1
[10:43:08] trove is still broken, but I noticed something I didn't think about yesterday: the VM gets created, so I can try to ssh into it and see what happens
[10:43:16] maybe the trove docker container is broken
[10:47:41] ok, the database eventually became healthy just as I was checking
[10:47:51] when I sshed in there was no container running, but one appeared shortly after
[10:49:48] BTW because of T376220 I noticed there are many many entries on wikitech being created by the test project that are never cleaned up
[10:49:49] T376220: Labslogbot needs new SUL OAuth credentials after Wikitech authn changes - https://phabricator.wikimedia.org/T376220
[10:49:56] see https://wikitech.wikimedia.org/wiki/Special:Contributions/Labslogbot
[10:50:29] cloudinfra-internal-puppetserver is failing to resolve names in codfw
[10:50:34] https://www.irccloud.com/pastebin/BwrCMwbe/
[10:50:38] anything going on there?
[10:52:01] dcaro: I'm working on the network in codfw1dev, expect it to be intermittently broken.
[10:52:23] I think it has been failing for a few days (got there from puppet run errors in the email)
[10:52:30] though it might be a mixture?
[10:52:39] yeah, maybe
[10:52:45] let me know when the network is supposed to be stable and I'll recheck
[10:53:32] I believe I started the more invasive changes today
[10:55:41] topranks: do you have a moment to check ipv6 routing in cloudgw?
[10:55:57] this in particular
[10:56:00] https://www.irccloud.com/pastebin/9CFzTT9r/
[10:56:25] I'm trying to answer the question: why do VRRP packets come from `fe80::2eea:7fff:fe7b:e104` rather than the other address?
[10:59:09] without knowing better I suspect VRRP with IPv6 defaults to using the link-local IPs
[11:01:11] do you know if we can control this at the routing table level? via metrics or something?
[11:01:55] no, I'd expect it's a keepalived switch
[11:01:58] is it a problem?
[11:03:24] fwiw these two routes are missing for the routing in/out to work right, but you probably already know
[11:03:44] ip -6 route add vrf vrf-cloudgw default via 2a02:ec80:a100:fe03::1
[11:03:44] ip route add vrf vrf-cloudgw 2a02:ec80:a100::/55 via 2a02:ec80:a100:fe04::2:1
[11:04:57] I'll need to add them
[11:05:16] thanks for double checking
[11:06:34] arturo: the entries in wikitech at https://wikitech.wikimedia.org/wiki/Special:Contributions/Labslogbot seem to predate the oauth changes
[11:06:47] I think it might be a separate issue
[11:07:25] dhinus: yeah, I guess we would need to check the project deletion workflow
[11:08:10] I will open a separate task
[11:10:45] T376888
[11:10:45] T376888: tofuinfratest creates many pages in wikitech - https://phabricator.wikimedia.org/T376888
[11:39:03] topranks: quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079246
[11:47:30] yep that looks ok I think, +1
[11:47:38] thanks!
[12:39:40] * arturo relocating from public library to hoe
[12:39:43] home*
[13:47:06] ok I think I discovered something about the Trove issue: adding the "ssh-from-anywhere" SG to a broken instance makes it move from BUILD to ACTIVE
[13:47:33] I was adding it for debugging, but it looks like just by adding it, without even sshing to it, Trove is able to complete the provisioning
[13:49:26] did any network setting change about 1 week ago?
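The security-group comparison picked up just below can be reproduced from the CLI; a rough sketch, assuming the wmcs-openstack wrapper used elsewhere in this log, with trove_sg_OLD and trove_sg_NEW standing in for the real trove_sg_{id} group names.

```
# Sketch: list the rules of an old (working) and a new (broken) Trove SG
# side by side; the group names below are placeholders.
sudo wmcs-openstack security group rule list trove_sg_OLD --long
sudo wmcs-openstack security group rule list trove_sg_NEW --long
# Per the comparison below, the new groups turn out to be missing the
# allow-all egress rules that the default security group rules used to add.
```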
tf-infra-test started failing on October 4th
[13:52:26] that sounds like a missing default?
[14:21:12] dhinus: I have no idea off the top of my head. Many openstack things have changed lately
[14:21:22] for example, wmfkeystonehook
[14:21:33] not to mention all the stuff we do via tofu-infra
[14:25:45] the timing might coincide with some tofu-infra change, but I'm not sure which one
[14:29:29] trove instances don't use the
[14:29:44] "default" sg, but get their own trove_sg_{id} SG
[14:32:00] comparing with old trove instances, it looks like the new instances have a SG which is missing "ALLOW to 0.0.0.0/0"
[14:38:56] is that SG tracked via tofu-infra?
[14:39:18] no
[14:39:29] wait, is it a brand new project?
[14:39:30] I'm trying to understand where it gets created
[14:39:42] nope, existing project, new trove instance
[14:39:46] for new projects, the default sg is created by neutron
[14:39:50] ok, then nevermind
[14:39:58] maybe tofu is completely unrelated
[14:40:02] ok
[14:40:18] I just wanted to mention that we deleted the default security group rules
[14:40:43] it kinda sounds similar, but it could be a red herring
[14:40:44] which are added 1) to any new SG 2) to the default SG group
[14:40:53] ah maybe 1) is the issue
[14:41:14] allow all egress traffic sounds like a default sg rule
[14:41:16] trove will create a new SG when a new trove instance is created
[14:41:28] oh!
[14:41:31] there you go then
[14:41:50] yep, where is that setting that we changed?
[14:42:04] either we revert that, or we fix the trove SG by adding the required ruels
[14:42:06] *rules
[14:42:13] hardcoded in the openstack DB, because the openstack provider doesn't support it ;.-(
[14:42:19] :(
[14:42:41] actually, let me document that in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu
[14:44:00] can you change it with the openstack CLI?
[14:44:07] yes
[14:44:21] openstack default security group rule list
[14:44:23] and friends
[14:45:15] I think the solution is to re-add the default security group rules
[14:45:46] and then configure tofu-infra to 'delete default security group rules', so for the SGs we track in tofu there are no rules outside the ones defined in the rpo
[14:45:48] repo*
[14:45:57] yes I agree
[14:46:20] delete_default_rules=true for all tofu-managed SGs
[14:46:30] yeah
[14:50:23] I'm trying to recreate them based on the rules I see in old sec groups
[14:50:44] ok
[14:50:47] should be fairly simple
[14:51:18] something like: egress, all protocols, all destinations
[14:51:29] in IPv4, then the same for IPv6
[14:51:48] (you can omit IPv6 for now, but they were present as well when I deleted them)
[14:51:57] "sudo wmcs-openstack default security group rule create --egress --ethertype IPv4"
[14:52:07] and another one for ipv6
[14:52:15] sounds right
[14:54:02] hmm I think it works, but I created a new SG and it looks slightly different from the old ones
[14:55:15] I wonder if this could be the root cause of the DNS issue that was reported lately
[14:55:50] nah, nevermind, it was reported a month ago
[14:55:54] https://phabricator.wikimedia.org/P69627
[14:56:22] the new ones have additional "normalized_cidr='0.0.0.0/0', remote_ip_prefix='0.0.0.0/0'"
[14:56:30] probably harmless but I wonder why
[14:57:04] no idea
[14:57:51] we already set delete_default_rules in tofu, so I don't think we need any other change
[14:58:22] ok
[18:02:49] * dcaro off
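For reference, this is roughly what the delete_default_rules setting agreed on above looks like on a tofu-managed security group, using the OpenStack provider's openstack_networking_secgroup_v2 resource. The resource and rule names here are illustrative only; the real definitions live in the tofu-infra repo.

```
# Sketch only: with delete_default_rules set, Neutron's default rules are
# dropped at creation time, so the rules declared in the repo are the
# complete rule set for this group.
resource "openstack_networking_secgroup_v2" "example" {
  name                 = "example-sg"   # illustrative name
  delete_default_rules = true
}

# Explicit allow-all egress, mirroring what the re-added default security
# group rules provide for groups created outside tofu (e.g. by Trove).
resource "openstack_networking_secgroup_rule_v2" "example_egress_v4" {
  security_group_id = openstack_networking_secgroup_v2.example.id
  direction         = "egress"
  ethertype         = "IPv4"
  remote_ip_prefix  = "0.0.0.0/0"
}

resource "openstack_networking_secgroup_rule_v2" "example_egress_v6" {
  security_group_id = openstack_networking_secgroup_v2.example.id
  direction         = "egress"
  ethertype         = "IPv6"
  remote_ip_prefix  = "::/0"
}
```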