[12:14:50] hey folks, afaics there is something weird going on with the puppet host cert for puppetserver1001
[12:15:02] please avoid doing any reimage etc.. for the time being
[12:19:24] https://phabricator.wikimedia.org/T405580
[12:27:04] elukey: need any help?
[12:27:37] taavi: nah we found the problem thanks, the host's tls cert was cleaned by mistake (this seems to be the root cause)
[12:47:59] it will be fixed soon, I am patching the renew-cert cookbook to allow this use case, so we don't have to do it manually etc..
[13:07:56] the intended way, namely run a patched version of sre.hosts.renew-cert didn't work, very lovely
[13:12:16] (discussing the issue in the team's chan if you are interested)
[13:25:39] Moving it in here since it is getting a bit complicated, and more eyes are needed
[13:26:10] current status is https://phabricator.wikimedia.org/T405580#11214172
[13:26:24] elukey: the rsyslogd profile uses puppet::expose_agent_certs, so I think we can just take the private key from /etc/rsyslog/ssl/ (and the public key from /etc/puppet/puppetserver/ca/signed/ if needed) and put them to correct places in /var/lib/puppet/ssl
[13:27:18] lemme check
[13:28:32] looks good indeed
[13:28:44] at least, cert.pem looks the right one
[13:29:37] the cert in signed/ matches with the key in there
[13:29:42] should I do that?
[13:30:34] I wanted to come up with the cp commands beforehand so we are clear what/where we are copying, if you are those handy could you please post them in here?
[13:30:47] *if you have
[13:31:22] cp /etc/rsyslog/ssl/cert.pem /var/lib/puppet/ssl/certs/puppetserver1001.eqiad.wmnet.pem
[13:31:26] this should be one in theory
[13:32:01] cp /etc/rsyslog/ssl/server.key /var/lib/puppet/ssl/private_keys/puppetserver1001.eqiad.wmnet.pem
[13:32:09] is it what you have in mind taavi ?
[13:33:00] elukey: yeah, basically just take a backup of the current stuff in /var/lib/puppet/ssl (probably not needed, but just in case) and then copy the files to where we need them: https://phabricator.wikimedia.org/P83467
[13:33:54] ah ok you are using /etc/puppet/puppetserver/ca/signed/puppetserver1001.eqiad.wmnet.pem
[13:35:22] looks good
[13:35:35] taavi: +1, please post the snippet in the task :)
[13:36:17] otherwise I can do it, lemme know
[13:36:39] doing
[13:38:24] elukey: it worked, running puppet now
[13:39:37] https://puppetboard.wikimedia.org/report/puppetserver1001.eqiad.wmnet/6849ec1e8514aecd81a8a323e64e0702d74bc6df
[13:39:46] let me try with a newer PCC run
[13:39:55] sorry wrong chan
[13:39:57] :|
[13:40:04] taavi: you rock
[13:40:40] I mean this was a very nice save
[13:42:16] I am going to write some next steps to the task
[13:42:35] I think that we probably need a cookbook to clean up the old certs that we want to destroy
[13:42:48] and also we need backups of the puppetserver's TLS host keys
[13:42:58] this time we were very lucky
[13:44:18] all right I think we are green now, thanks again taavi
[13:44:52] or generally of all host keys, could simply be a systemd timer which does a local copy
[13:45:21] that as well, no strong opinion
[15:24:28] I vaguely remember this but not the resolution
[15:24:35] on the reimaging cookbook, after a failed reimage, I am getting:
[15:24:35] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[15:24:39] er
[15:24:41] has_puppet7 = self.puppet_server.hiera_lookup(self.fqdn, "profile::puppet::agent::force_puppet7")
[15:24:55] but of course the host already has this set at the role level
[15:25:14] does someone remember the resolution to this? because I don't
[15:25:27] that happens when the server is not in puppetdb (which can happen after a reimage fails), running the cookbook with --new should work around that
[15:25:33] tried that, didn't work
[15:25:44] ==> Host durum7003.magru.wmnet was found in PuppetDB but --new was set. Are you sure you want to proceed? The --new option will be unset
[15:25:54] so that's already there
[15:27:12] and I do have a failed Puppet run so I wonder if that should be affecting that. but that's only for trixie and now I am trying to reimage to bookworm, and Puppet should not be in the picture during the initial reimage anyway
[15:27:21] not unless it is trying to load something in the current state and failing
[15:27:49] hmmm
[15:27:51] which host?
[15:27:54] durum7003
[15:28:51] https://puppetboard.wikimedia.org/report/durum7003.magru.wmnet/072be2bf34d477a2c9d865cbd73b1d25eb5374ab
[15:29:07] this is routed ganeti and so requires the bird2 from component
[15:29:18] what I forgot was that that is not in trixie yet, hence the Puppet run fails
[15:29:38] sukhe@apt1002:~$ sudo -i reprepro lsbycomponent bird2
[15:29:38] bird2 | 2.0.7-4.1wm1 | buster-wikimedia | main | amd64, source
[15:29:41] bird2 | 2.17.1+branch.mq.bgp.multilisten.c47b08a1524c-cznic.1 | bookworm-wikimedia | component/bird-routed-ganeti | amd64
[15:30:03] but the reimage back to bookworm should work of course :>
[15:30:32] yeah, it's failing because the catalog is failing to compile: https://phabricator.wikimedia.org/P83470
[15:30:44] I guess one workaround is to remove it from puppetdb and then run with --new
[15:30:52] yeah that's worth a shot I think
[15:31:58] * sukhe pulls up the docs on how to do that
[15:32:26] sudo puppet node deactivate $FQDN
[15:32:33] clean is not required?
[15:32:53] 'clean' cleans up certificates, 'deactivate' removes it from puppetdb
[15:33:07] so presumably then both?
[15:33:16] * taavi is not sure
[15:33:26] hi folks, just a heads up that we'll be switching the deployment server to codfw between 16:00-18:00 UTC today
[15:33:37] will try deactivate and see. thanks!
[15:34:53] self.puppet_server.delete(self.fqdn)
[15:34:53] self.puppet_master.delete(self.fqdn)
[15:35:01] reimage cookbook. but yeah, let's see
[15:35:27] cool that worked
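
Note on the cert restore discussed between 13:26 and 13:38: a minimal sketch of the "back up, then copy" steps, not the actual snippet from P83467 (its contents are not shown in this log). The function name `restore_agent_cert`, its argument order, and the backup path in the usage comment are hypothetical; only the source and destination paths come from the conversation. It does not check that the key and cert match, and it does not touch ownership or permissions, so both would still need checking by hand.

```shell
#!/bin/bash
# Hypothetical helper: back up an agent SSL directory, then copy a
# known-good private key and signed cert into the expected locations.
# Args: FQDN KEY_SRC CERT_SRC SSL_DIR BACKUP_DIR (BACKUP_DIR must not exist yet)
restore_agent_cert() {
    local fqdn="$1" key_src="$2" cert_src="$3" ssl_dir="$4" backup_dir="$5"

    # Refuse to continue if either source file is missing or empty.
    [ -s "$key_src" ] || { echo "missing key: $key_src" >&2; return 1; }
    [ -s "$cert_src" ] || { echo "missing cert: $cert_src" >&2; return 1; }

    # Refuse to overwrite or nest inside an existing backup.
    [ ! -e "$backup_dir" ] || { echo "backup exists: $backup_dir" >&2; return 1; }

    # Back up the current state first (probably not needed, but just in case).
    cp -a "$ssl_dir" "$backup_dir" || return 1

    # Copy key and cert to the per-host file names under the SSL directory.
    mkdir -p "$ssl_dir/private_keys" "$ssl_dir/certs" || return 1
    cp "$key_src" "$ssl_dir/private_keys/$fqdn.pem" || return 1
    cp "$cert_src" "$ssl_dir/certs/$fqdn.pem" || return 1
}

# Example call with the paths mentioned in the log (the backup path is an assumption):
# restore_agent_cert puppetserver1001.eqiad.wmnet \
#     /etc/rsyslog/ssl/server.key \
#     /etc/puppet/puppetserver/ca/signed/puppetserver1001.eqiad.wmnet.pem \
#     /var/lib/puppet/ssl /root/puppet-ssl-backup
```

Keeping the paths as arguments means the same steps can be tried against a scratch directory before running them on the real host.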