[10:07:17] arturo: if you have a moment today, I'd appreciate if you could take a look at https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/8
[10:07:28] sure
[10:07:58] thanks
[10:15:48] <_joe_> hi, who can I ask for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056937/ ?
[10:20:34] _joe_: I can do it
[10:26:54] cc dhinus
[10:27:40] looking
[10:29:12] should be good, let's check the PCC
[10:31:15] pcc failed with 'no parameter named 'target''
[10:39:37] dhinus: do you know if the openstack config used in cloudcumin for cumin should work from my laptop? (or if it's using some special internal-only kind of endpoint)
[10:44:44] I'm getting unauthorized when trying to use novaobserver creds locally to list tools vms
[10:45:22] maybe some clouds.yaml stuff somewhere?
[10:48:58] hmm I haven't used novaobserver recently, so I don't know if something changed
[10:49:04] when was the last time you used it successfully?
[10:49:38] from cloudcumin1001 it works ok
[10:52:36] the auth looks ok to me locally, it reads the config ok
[10:52:45] but ends up in 401
[10:52:46] https://www.irccloud.com/pastebin/Vy5UCrst/
[10:53:02] (putting a pdb line in the middle of the openstack backend code for cumin)
[10:53:49] maybe it generates a token correctly, but then that token is rejected by the openstack service?
[10:53:50] the alternative is hardcoding the tools nodes in the cookbooks (of which we have some already, so might not be too bad for now, then we change all when it works)
[10:54:16] might be
[10:54:58] same cookbook from cloudcumin works? that is also using novaobserver
[10:55:07] (/etc/cumin/config.yaml)
[10:55:33] clouds.yaml should not be involved if you go through spicerack
[10:55:48] yep
[10:55:52] it works yes
[10:56:15] is your local cumin config file similar to the one in cloudcumin?
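[editor's note] One way to narrow down a 401 like the one above is to request a Keystone token directly, outside of cumin: if the token request itself succeeds, the later 401 points at the nova endpoint rejecting the token rather than at bad credentials. A minimal sketch of the Keystone v3 password-auth flow — the auth URL, project, and domain values below are illustrative assumptions; the real ones live in the openstack section of /etc/cumin/config.yaml:

```python
import json
import urllib.request


def keystone_password_auth_body(username, password, project, domain_id="default"):
    """Build a Keystone v3 password-auth request body, scoped to a project."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"id": domain_id},
                        "password": password,
                    }
                },
            },
            "scope": {"project": {"name": project, "domain": {"id": domain_id}}},
        }
    }


def request_token(auth_url, body):
    """POST to Keystone's /v3/auth/tokens; on success the token comes back
    in the X-Subject-Token response header."""
    req = urllib.request.Request(
        auth_url.rstrip("/") + "/auth/tokens",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["X-Subject-Token"]


# Hypothetical usage -- substitute the real endpoint and project:
#   body = keystone_password_auth_body("novaobserver", "<password>", "tools")
#   token = request_token("https://keystone.example.org:25000/v3", body)
```

If the token comes back fine but a subsequent nova call with it still 401s, that supports the "token generated correctly but rejected by the service" theory from the log.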
[10:57:00] yep, quite similar yes, I copy-pasted the openstack part of it
[10:57:14] I'm actually confused why the creds are in the cumin config and not in the spicerack config
[10:57:20] I think I might have messed up the virtualenv when installing the openstack packages :/
[10:57:20] ah I know
[10:57:24] to query the server list
[10:57:47] it's possible it's not reading the config file you expect it to read
[10:57:53] now the cookbook itself is not running (bad numpy version or something)
[10:58:00] I'll try to fix that and see if that helps
[10:58:22] it's using the right config file, I can pdb in the middle of the cumin code, and see that it read the right config (user/pass/url/...)
[10:59:15] it seems the cookbook only supports numpy<2
[11:04:30] `dcaro@urcuchillay$ wmcs-cookbooks wmcs.toolforge.k8s.component.deploy --cluster-name toolsbeta --component builds-cli --git-branch bump_to_0.0.18`
[11:04:33] that almost works :)
[11:13:59] \o/
[11:14:01] https://www.irccloud.com/pastebin/mtLzXrbF/
[11:14:53] did you consider having a different cookbook?
[11:29:07] yes, but I don't want to have to think about which cookbook to use depending on the component I want to deploy
[11:29:46] I'm also considering dropping `k8s` from the prefix, as it's not needed anymore
[11:35:22] same for the ToolforgeK8s* classes
[11:35:25] and inventory
[11:41:57] I keep being surprised by the flakiness of openstack
[11:42:05] T371242
[11:42:05] T371242: openstack: codfw1dev: nova-compute can't contact rabbitmq - https://phabricator.wikimedia.org/T371242
[12:30:31] ready for review :), also deploying packages now https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1057847
[12:47:07] andrewbogott: when you are awake, I would appreciate your input here: T371242
[12:47:10] T371242: openstack: codfw1dev: nova-compute can't contact rabbitmq - https://phabricator.wikimedia.org/T371242
[13:19:41] arturo: is there connectivity from the cloudcontrol to rabbitmk?
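[editor's note] The "cookbook only supports numpy<2" problem above is a common venv issue: numpy 2.x removed APIs that older dependencies still use. A tiny sketch of the version check involved — the helper names are made up for illustration:

```python
def major_version(version: str) -> int:
    """Extract the major component of a dotted version string like '1.26.4'."""
    return int(version.split(".")[0])


def numpy_is_compatible(version: str, max_major: int = 2) -> bool:
    """True if this numpy version satisfies a numpy<2 constraint, which is
    what the cookbooks here apparently require."""
    return major_version(version) < max_major


# In the broken venv you would check the installed version, e.g.:
#   import importlib.metadata
#   numpy_is_compatible(importlib.metadata.version("numpy"))
# and, if it returns False, fix with: pip install 'numpy<2'
```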
[13:19:45] *rabbitmq
[13:26:42] I think the new cloudcontrols might be missing the private vlan
[13:27:12] the 2151 interface is missing on 2009-dev
[13:27:35] the error is in 2005 though, looking
[13:28:13] 2005 can connect
[13:28:18] https://www.irccloud.com/pastebin/DV5fDenv/
[13:41:21] arturo, if you look in nova.conf for 'ALERT' you may find something useful :)
[13:43:06] (that's meaningful if the rabbit IPs changed but not if the control nodes changed)
[14:02:39] dcaro: I'm going to rebuild the harbor DB in a moment.
[14:04:59] done :)
[14:06:17] that was quick :)
[14:06:53] yeah, it turns out to be pretty reliable once I fixed the postgres rebuild path
[14:07:04] things look ok
[14:07:23] cool
[14:17:27] why do you think there was any change to rabbitmq?
[14:18:03] andrewbogott: I haven't changed anything related to rabbitmq or the network
[14:18:14] arturo: I don't think I have the context, what were you doing when things broke?
[14:18:44] I updated python3-nova on all codfw1dev servers because https://phabricator.wikimedia.org/T371240
[14:19:25] hm
[14:19:37] then there's no reason why it should've broken :( Do you know for sure that it was working pre-upgrade?
[14:19:46] no :-(
[14:20:00] I suspect it wasn't working before
[14:20:16] but I only looked at nova-fullstack after the upgrade
[14:20:28] well, maybe there are earlier logs
[14:21:07] hm, I guess now that the fullstack tests delete the oldest broken VMs we can no longer use that to answer this question...
[14:23:37] I see plenty of errors in the logs from days before
[14:23:40] did you do a full restart of nova services recently? I just restarted on a cloudvirt and that seems to have made it happy
[14:23:45] so this has nothing to do with the upgrade
[14:24:12] it's not happy, nova-compute will time out in a bit waiting for rabbitmq
[14:24:38] I have restarted all nova services using the cookbook this morning
[14:24:41] ok.
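[editor's note] The "2005 can connect" check above is a plain reachability probe; a minimal stdlib sketch of the same test is below. The hostname is a placeholder, and 5672 is assumed as the default AMQP port (a TLS deployment would typically use 5671 instead):

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves network reachability: rabbitmq can still reject the
    AMQP handshake or the credentials afterwards, which matches the log's
    later distinction between TCP/IP being fine and the handshake failing.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage -- substitute the real rabbit host:
#   can_connect("cloudrabbit2001-dev.example.wmnet", 5672)
```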
do you mind if I do 'sudo cookbook wmcs.openstack.restart_openstack --all --cluster-name codfw1dev' ?
[14:24:51] sure
[14:25:02] I did the same but with --nova
[14:25:07] (assuming it will produce the same result as when you did it)
[14:25:30] that cookbook doesn't restart rabbitmq BTW
[14:25:37] maybe we should have an option
[14:25:45] but I did restart rabbitmq manually anyway
[14:27:57] the message we're seeing ('timed out waiting for reply') doesn't necessarily mean that nova can't talk to rabbit, does it?
[14:28:14] Couldn't it mean that it can't talk to another service it's trying to talk to /via/ rabbit? e.g. another nova agent, or neutron?
[14:28:15] I think network-wise, as in TCP/IP, we are fine
[14:29:16] I have visually scanned all the services' logs using logstash, and I have found nothing relevant so far
[14:29:32] but also in terms of rabbit connection handshake
[14:30:57] It is hard to navigate the logs anyway, we are generating 2.5M log lines per day in codfw1dev ...
[15:40:33] andrewbogott: I also just created T371274
[15:40:33] T371274: openstack: codfw1dev: manually clean up old hypervisors - https://phabricator.wikimedia.org/T371274
[15:41:28] thx. at the moment nova is convinced that there are still valuable VMs on 2001/2002 even though there aren't
[15:42:08] * arturo offline
[17:33:44] * dcaro off
[20:05:11] andrewbogott: I have a task in need of prod root help at T371289. The credentials that Striker uses to talk to GitLab are expiring and I need someone with similar access rights as you to rotate the config value for me. Details of where to find the new secret are in the task.
[20:05:12] T371289: Rotate StrikerBot GitLab PAT before it expires on 2024-08-01 - https://phabricator.wikimedia.org/T371289
[20:15:45] I will give it a try! Might be a bit before I can look though, trying to get the damn AC to work
[20:24:10] andrewbogott: it can totally wait until tomorrow.
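[editor's note] The point about 'timed out waiting for reply' is worth making concrete: in rabbit-backed RPC, the caller publishes a request and then blocks on a reply queue, so the timeout can mean the *callee* (another nova agent, neutron) never answered, even while the broker connection itself is healthy. A toy stdlib simulation of that call pattern — this is an illustration of the general request/reply-queue idea, not oslo.messaging's actual implementation:

```python
import queue
import threading


def rpc_call(request_q: queue.Queue, timeout: float = 0.2) -> str:
    """Publish a request and block waiting for the reply.

    If no consumer ever answers, we time out -- even though the 'broker'
    (the queues themselves) is perfectly reachable the whole time.
    """
    reply_q: queue.Queue = queue.Queue()
    request_q.put(("ping", reply_q))
    try:
        return reply_q.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("timed out waiting for a reply")


def agent(request_q: queue.Queue) -> None:
    """A healthy consumer: take one request and answer on its reply queue."""
    msg, reply_q = request_q.get()
    reply_q.put("pong")
```

With an agent thread consuming, `rpc_call` returns "pong"; with no consumer attached, the same call raises the timeout, mirroring the distinction made in the log.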
Good luck with local temperature controls
[20:45:24] bd808: I see hieradata/role/eqiad/wmcs/openstack/eqiad1/cloudweb.yaml but not hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml -- cloudweb is the right place, right?
[20:48:00] ah, yep, apparently I renamed it :)
[20:55:36] andrewbogott: I didn't pay much attention to the bits that I was cut-and-pasting from old emails. :) Glad you figured out the rename.
[20:57:41] Can you confirm that the key actually updated in the place it needed to update?
[20:58:15] * bd808 logs into a cloudweb
[20:59:49] andrewbogott: cloudweb1003:/etc/striker/env still has the old value. Should I force a puppet run there?
[21:00:01] yep
[21:00:13] (sorry, still not 100% tuned in)
[21:01:20] yup, it is updated now.
[21:01:23] great
[21:03:50] GitLab decided that if you don't pay them you can only have a PAT that lasts for 365 days so we will get to do this dance once a year now
[21:12:42] Better 365 than 30 :(
[21:27:09] In some ways 30 would be better because it would push us to find a fix for either automated rotation or removing the ridiculous constraint on the TTL.
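[editor's note] Since the 365-day PAT cap turns this rotation into an annual chore, the date math for scheduling the next one is simple to pin down; a small sketch, with function names and the 14-day lead time being illustrative choices rather than anything from the log:

```python
from datetime import date, timedelta


def rotation_due_date(expires: date, lead_days: int = 14) -> date:
    """Date by which the PAT should be rotated, some lead time before expiry."""
    return expires - timedelta(days=lead_days)


def days_until(expires: date, today: date) -> int:
    """Days remaining before the token expires (negative once it has)."""
    return (expires - today).days


# The StrikerBot PAT in the log expired on 2024-08-01, so with a 14-day
# lead a rotation task would have been due on 2024-07-18.
```

A helper like this could back a reminder task or a periodic check, though as the last message notes, automated rotation would be the real fix.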