[03:55:10] Heads up everyone: I manually executed the scheduled pipeline at https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/pipeline_schedules to force a pre-commit auto-update for our repositories. The reason for this can be found in https://phabricator.wikimedia.org/T372601. Basically, the last run of that scheduled pipeline committed a version of golangci-lint that wasn't working for our CI, and as a result all our golang repositories were failing CI. In case you wake up to a lot of GitLab MR email activity, all of that was my fault, for good reason
[14:40:59] dhinus: I added you to a patch about ceph reimaging, since we were talking about it recently. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064388
[14:42:16] We want to test out removing `$(hostname)-vg` and its associated LVM signatures while reimaging, but leaving all other LVM signatures untouched.
[14:42:54] As the patch stands, this would affect cloudcephosd* hosts as well, but we could exclude them if you prefer.
[14:49:04] thanks! IIRC we actually tried following your vg schema for a cloudcephosd host that was recently reimaged
[14:49:45] it would be good to have a shared list of partition names
[14:50:50] andrewbogott: I was chatting with arnaudb, who is playing with some cloudvps instances in the mariadbtest project
[14:51:24] arnaudb tried using the "keypair" feature, but that seems to have interfered with the standard auth
[14:51:48] so now I can ssh as root to db-mariadbtest-1 but not as a standard user
[14:52:07] Yeah, ideally that panel would appear for unpuppetized images but vanish for puppetized base images.
[14:52:09] do you know if there's a quick way to reset it? or maybe it's easier to delete and recreate that VM
[14:52:27] But the keypair thing really should break standard auth. Is puppet failing there?
[14:52:41] sorry, *really should not break*
[14:54:04] yes puppet is failing, but probably for unrelated reasons
[14:54:23] "Could not request certificate: The certificate retrieved from the master does not match"
[14:54:35] that's probably the local puppet config that's broken
[14:54:41] If it didn't get a proper puppet run ever, then auth won't get set up
[14:54:48] yep that makes sense
[14:55:11] so probably the keypair thing is unrelated
[14:55:18] unless y'all have tested it both ways
[14:55:37] the first puppet run worked
[14:55:42] but then something broke it
[14:55:57] probably it has a project config that switches it to a different puppetserver
[14:56:03] (or someone)
[14:56:05] (me)
[14:56:07] :D
[14:56:31] btw my personal key works fine on that VM
[14:56:34] yep, I think the attempt to link it to the local puppetserver didn't work and things started failing
[14:56:41] so pam/sssd seems to be working fine
[14:56:42] mind that: instead of debugging, if somebody has a magic formula to reset everything to 0 on this project, that'd be good enough, as I was only starting to tinker when I broke things
[14:57:05] arnaudb: what do you mean by 'reset everything to 0'?
[14:57:06] (don't want to make you waste time on empty VMs x)
[14:57:14] ah, now mine works as well, maybe I just didn't wait long enough after adding myself to the project!
[14:57:38] arnaudb: can you check whether you can ssh now to db-mariadbtest-1.mariadbtest.eqiad1.wikimedia.cloud
[14:57:39] andrewbogott: I meant deleting the cumin/puppet nodes
[14:57:53] arnaudb: can you not?
[14:58:16] I wasn't sure it was going to fix my issue with the ssh key, as it seemed to be linked to the key I was not able to delete
[14:58:34] so I tried not to break it further before being able to assess where I went wrong
[14:58:45] dhinus: "Connection closed by UNKNOWN port 65535", I'm being dropped
[14:59:01] It's true that there's no way to remove a keypair from your account, but you can certainly create new VMs without the keyapri
[14:59:05] *keypair
[14:59:07] arnaudb: and does it work with cloud-cumin-01.mariadbtest.eqiad1.wikimedia.cloud ?
[14:59:13] and in any case, I don't think that's related to what's happening
[14:59:37] andrewbogott: ack, dhinus: I can! yay
[15:01:07] on db-mariadbtest-1 I see "fatal: Access denied for user arnaudb by PAM account configuration [preauth]"
[15:01:28] I wonder if ssh is trying to auth with a different key
[15:02:51] it should be trying the available ones successively, no?
[15:03:16] yep, unless it matches the one managed by the openstack keypairs... but that's a long shot
[15:03:38] andrewbogott: is it expected that "openstack keypair list --project mariadbtest" hangs without returning results?
[15:03:48] You should definitely delete and recreate that VM rather than being curious
[15:03:58] I don't see evidence that puppet has ever worked properly there
[15:04:02] andrewbogott: +1
[15:04:04] ack, will try that tomorrow
[15:04:12] thanks for the help <3
[15:05:33] + try a new hostname, there might be a stale cert for that name on the puppetserver
[15:16:20] btullis: I left a comment in the task, but I'm probably not understanding the full reimage+prepare process
[15:17:39] I think your patch is fine to merge anyway
[15:27:26] Thank you kindly.
[17:41:34] slyngs: regarding your recent email, is 'idm-test.wikimedia.org' backed by prod ldap or codfw1dev/labtest ldap?
[18:36:01] dhinus: are you still around? I'm getting an alert about toolsdb
[18:39:27] checking
[18:40:37] I cannot ssh to either tools-db-1 or tools-db-3
[18:40:43] I started a migration to another cloudvirt just in case that matters
[18:41:03] hm, but db-3 is on a different cloudvirt already
[18:41:44] I might be having ssh problems, can you ssh to any instance?
[18:41:57] a different VM on that host seems to work fine
[18:42:08] e.g. deployment-ms-be08.deployment-prep.eqiad1.wikimedia.cloud
[18:42:34] ok, now I can ssh to tools-db-3
[18:42:41] migration finished and now I can reach -1
[18:42:45] hmmmm
[18:43:10] I can see tools-db-3 logged a "lost connection" to -1
[18:43:22] "Aug 21 18:42:34 tools-db-1 kernel: [4760409.912374] mysqld invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-600"
[18:43:25] on -1
[18:43:54] so... why is that happening on different VMs in the same week?
[18:44:01] yep, seems weird
[18:44:04] I'm restarting mariadb on -1
[18:44:10] ok
[18:44:52] this runbook is up to date in case I'm not around: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState#MariaDB_process_killed_by_OOM_killer
[18:45:01] it's also linked from the alert in alertmanager
[18:46:06] I don't see anything in the neutron or nova logs that would explain this
[18:46:12] so /maybe/ it's just coincidental OOMs
[18:47:27] Resolved for now, right? I'm going to clear alertmanager
[18:47:31] replication of toolsdb resumed automatically after I restarted the primary
[18:47:38] so all looks good, alertmanager is already clear
[18:47:38] that's good :)
[18:48:34] * dhinus offline
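
A minimal sketch of the kind of LVM cleanup discussed around 14:42, wiping only the host-named volume group and its signatures while leaving other LVM metadata alone. The device /dev/sdb is a placeholder, and this is not necessarily what the linked puppet patch implements:

```sh
# Sketch only: remove the host-named VG and its LVM signatures, nothing else.
vg="$(hostname)-vg"

# List the physical volumes backing this VG (replace /dev/sdb below with these).
pvs --noheadings -o pv_name --select "vg_name=${vg}"

lvremove -f "${vg}"                          # drop the VG's logical volumes
vgremove -f "${vg}"                          # drop the VG itself
pvremove /dev/sdb                            # clear the PV label (placeholder device)
wipefs --all --types LVM2_member /dev/sdb    # erase only LVM signatures, not other metadata
```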
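
For the "certificate retrieved from the master does not match" failure and the stale-cert hint at 15:05, a sketch of the usual reset when a VM is pointed at a project-local puppetserver. The FQDN is the VM from the conversation, the ssl path assumes the Debian puppet packaging used on Cloud VPS, and deleting/recreating the VM as suggested above is the simpler option:

```sh
# On the project puppetserver: remove the old certificate for the VM's FQDN.
sudo puppetserver ca clean --certname db-mariadbtest-1.mariadbtest.eqiad1.wikimedia.cloud

# On the VM: discard the mismatched local certs and request a fresh one.
sudo rm -rf /var/lib/puppet/ssl    # ssldir location assumed; check puppet config print ssldir
sudo puppet agent --test
```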
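
For the toolsdb incident at the end of the log, a sketch of the checks behind it, assuming the standard mariadb systemd unit on tools-db-1 and root socket auth for the mysql client on tools-db-3; the runbook linked at 18:44 is the authoritative procedure:

```sh
# On tools-db-1 (primary): confirm mysqld was killed by the OOM killer,
# then bring MariaDB back up.
sudo journalctl -k --since "2 hours ago" | grep -i oom-killer
sudo systemctl restart mariadb
sudo systemctl status mariadb --no-pager

# On tools-db-3 (replica): verify replication resumed after the primary restart.
sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
```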