[07:08:37] hello, when would be a good time to upgrade cloudsw1-b1-codfw ? https://phabricator.wikimedia.org/T416443
[07:08:55] basically that means ~20min downtime for everything connected to the switch
[08:31:38] XioNoX: I'd wait for andrew to confirm, but from a quick look at https://netbox.wikimedia.org/dcim/devices/4567/interfaces/ and the fact that is in codfw I think it might be fine anytime as nothing "production" wise should be running there AFAIK. But I'm not sure if it requires any pre-work to make things smoother.
[08:31:56] yeah exactly
[10:09:36] I'll go ahead and upgrade cloudcumin2001 now
[10:10:28] ack for me
[10:14:21] ack
[10:56:42] cloudcumin2001 is updated and looks fine to me, keyholder is rearmed
[10:56:59] volans: can you please apply the hotfix you mentioned?
[10:57:07] sure
[10:57:35] since cloudcumin2001 is only wired up against codfw1dev, I was unable to find a proper Cumin command to test with
[10:58:07] (my usual test on cloudcumin1001 is against deployment-prep, which isn't on codfw1dev)
[10:58:15] so maybe quickly test what you use there
[10:59:00] hotpatch applied, sudo cumin 'O{*}' returned 49 hosts, testing ssh now
[10:59:19] moritzm: did you restart already keyholder-proxy?
[10:59:58] the whole VM was rebooted, so implicitly yes
[11:02:16] mmmh debugging why ssh doesn't work
[11:06:10] moritzm: so it authenticates to the bastion but fails to auth on the final host, debugging why
[11:06:24] I'm using from root this to debug:
[11:06:24] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -vvv -F /etc/cumin/ssh_config root@acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
[11:07:27] does that work from cloudcumin1001 or was that already broken?
[11:08:22] seems broken too :/
[11:08:49] the fact that 2001 connects to codfw's openstack doesn't help as it's usually not used :/
[11:09:34] Feb 4 11:09:15 acme-chief-2 sshd[370080]: /etc/ssh/userkeys/root.d/cumin:2: Authentication tried for root with correct key but not from a permitted host (host=172.16.129.18, ip=172.16.129.18, required=172.16.129.190,172.16.129.181).
[11:10:36] I can't even ssh with my global key :/
[11:10:55] I can ssh as root from my machine
[11:11:10] taavi: thanks, those are defined in hieradata/cloud/codfw1dev.yaml IIRC
[11:11:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1236694
[11:12:11] perfect
[11:14:35] deployed and now the ssh command from above works
[11:15:43] great, re-testing cumin
[11:16:25] worked on acme-chief-2.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud failed on the others, re-trying in 30m to check if works everywhere
[11:16:34] and then debug why my ssh doesn't
[11:16:49] do we need to keep bastion-codfw1dev-03 and -04 in the allowlist?
[11:17:29] ah simple, my user can't ssh to bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org
[11:19:57] taavi-clouddev@bastion-codfw1dev-06:~$ id volans
[11:19:57] id: ‘volans’: no such user
[11:20:06] remember this is with a completely separate LDAP tree
[11:20:20] dhinus: I would guess -03 and -04 are completely unused at this point
[11:20:44] do I need to register somewhere or send a patch?
[11:21:12] volans: I'm adding you to the bastion project
[11:21:24] <3
[11:21:28] dhinus: adding what user exactly?
[11:21:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1236696
[11:21:50] taavi: right, probably volans does not have an account at all
[11:22:10] I forgot how I created mine
[11:22:30] * volans feels left out
[11:22:38] we should add filippo too if he's not there
[11:22:46] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Testing_deployment#Use_the_LDAP_terminal_UI
[11:23:57] great, thx
[11:24:59] then I guess you can use the cookbook wmcs.vps.add_user_to_project to add yourself to the 'bastioninfra-codfw1dev' project
[11:26:31] k
[11:49:07] andrewbogott: in case you already hit this, it seems that the openstack cli on codfw is unable to add a user, the cookbook failed running:
[11:49:10] sudo -i wmcs-openstack role add --project=bastioninfra-codfw1dev --user=volans member --os-cloud novaadmin
[11:49:33] it gets a 400 saying that 'enabled' is a required property and in the provided json there is no enabled
[11:57:38] That's interesting! Will look when I'm more awake
[12:50:08] shall I proceed with updating cloudcumin1001 or are additional tests needed on 2001?
[12:58:24] sudo cumin 'O{*}' 'id' runs ok on 21 hosts of 50
[12:58:40] fails on 29
[12:59:07] but might be totally unrelated, but I can't debug it as I can't ssh there and openstack cli fails to add my user
[13:00:37] * volans lunch
[13:01:30] moritzm: I would go ahead
[13:02:45] I'm fine either way, we can also wait for Riccardo's access to be sorted out today and then I upgrade cloudcumin1001 tomorrow in the European morning
[13:51:01] it seems like the remaining codfw1dev VMs for which cumin fails don't have DNS names for whatever reason?
[13:51:14] (I'm filtering out trove/octavia/admin-monitoring where failures are expected)
[14:28:08] volans: There may be a few different things going on here. First: it looks to me like you are already a member of the bastioninfra-codfw1dev. Do you know otherwise or are we basing this on ssh failing?
[14:28:46] ...and the followup question is going to be: are you using a codfw1dev-specific ssh key to connect?
[14:29:27] yes ssh failing to the bastion: volans@bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org: Permission denied (publickey).
[14:29:33] unless I use my root key
[14:29:43] the key is the same of eqiad
[14:29:47] for cloud ofc
[14:30:36] What I mean is -- do you know that codfw1dev has a totally different keystore from eqiad1? And do you have a key registered there?
[14:30:49] Backing up... in auth.log I see some 'Invalid user volans from' messages early on. But after a bit they change to
[14:31:07] 'Failed publickey for volans'
[14:31:18] where should I add my pub key for codfw?
[14:31:25] the ldap user interface didn't ask for it
[14:31:30] so I think that means that the process to add you to the project worked (but then also produced a stupid useless error)
[14:31:43] bah
[14:31:46] let me think...
[14:32:05] (the terminal-based ldap thing is pretty new, you might be the first person to travel this path)
[14:32:31] I just followed the linked https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Testing_deployment#Use_the_LDAP_terminal_UI :D
[14:33:52] for the moment why don't you dm me the public key and I'll just ldapvi it in. Then after you're unblocked I can think about a proper process
[14:34:02] unless you want to learn about ldapvi, which no one does
[14:34:35] sure
[14:34:51] thx
[14:39:45] volans: ok, I think we have two followups: 1) document how to add an ssh key in codfw1dev, 2) report that ridiculous openstack cli message upstream.
[14:39:51] Was there a third bump or is that it?
[14:40:55] I think dhinus tried to add me on horizon and got an error, might be a third one
[14:41:18] that's probably the same thing as the cli failure?
[14:41:56] the error in horizon is the same you get in the terminal: "'enabled' is a required property"
[14:42:08] didn't you get an error for listing too?
[14:42:18] yes, even for listing
[14:42:22] just opening https://labtesthorizon.wikimedia.org/project/member/
[14:42:27] the error message is the same
[14:42:29] oh, interesting
[14:42:52] so maybe when you tried to add your user, it _did_ add it, but then failed to list the result
[14:42:59] (just a random guess)
[14:43:29] oddly if I operate on my 'labtestandrewmortal' I don't get those messages and it just works, but if I operate on 'labtestandrew' I see that error. So it seems like they might have added some sensitivity to an ldap field in a recent release.
[14:43:42] Yes, I think that's right, it added it but then errored out when running the query to show you what it did
[14:46:20] T416483
[14:46:20] T416483: Adding a new user to codfw1dev is messed up - https://phabricator.wikimedia.org/T416483
[15:28:50] quick review for https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1133?
[15:29:32] +1d
[15:30:04] ty
[15:30:18] also, https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_Purge says we will be shutting down things at the start of this month. did that happen already?
[15:33:34] it did not. I'm not sure if komla has done the 'chase down individual project maintainers' bit yet. komla, have you? Are we ready to start breaking things?
[15:33:59] at least I got some threatening emails at the end of last month
[15:52:19] I'm always deeply conflicted about shutting down projects from unresponsive users because it makes me feel like I'm a fossil for expecting people to read their email
[15:54:15] yeah, I get that
[15:56:00] on the other hand, as we're giving away resources easily worth hundreds/thousands/etc of $, it doesn't seem that unreasonable to me to expect those people to reply once to several communication attempts over the course of several months
[15:56:33] I'm at least going to start with the projects that say DELETE
[15:57:08] oh yeah, if /you/ want to shut things down I have no problem with that :D
[15:57:22] And as you say, ones that are specifically marked delete, I should've deleted ages ago.
[15:57:55] But also, you know the thing about deleting projects, right? That 'openstack project delete' removes keystone entries but leaves orphan records other places?
[15:58:03] * andrewbogott thinks we still don't have a cookbook for this
[15:58:31] we have a cookbook to just do the deletion, my plan was to convert the checklist at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle to actual checks in the cookbook as a part of this
[16:00:04] * andrewbogott jumps for joy
[17:44:21] if someone else wants to try out the project deletion cookbook, these are still listed as delete on the wiki: https://etherpad.wikimedia.org/p/cloud-vps-purge-deletions
[17:44:37] with a couple patches still up for review: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1236786