[06:26:49] FYI, I'll reboot puppetserver1002/2002 in approx. 10 minutes. best to hold off puppet merges until these are done (otherwise you'll run into timeouts)
[06:46:04] and these are back, puppet merges can resume
[08:31:45] bd808: thanks!
[08:34:50] hey folks, I am going to proceed in a few to roll out the new debmonitor pki intermediate, as outlined in https://phabricator.wikimedia.org/T420993#11812923
[08:35:05] I'll need to disable puppet fleetwide for 10-15 mins
[08:35:17] good luck
[08:41:10] thanks <3
[08:41:15] all right, disabling puppet
[08:53:25] merged private and public keys in puppet private/public
[08:53:33] now I am going to run puppet on the pki hosts
[08:54:24] +1
[09:02:32] running on debmonitor1003
[09:04:45] the refresh looks good and I've just tested a debmonitor submission on sretest1005
[09:05:11] worked just fine: https://debmonitor.wikimedia.org/hosts/sretest1005.eqiad.wmnet (inetutils-telnet is no longer shown as upgradeable)
[09:06:01] debmonitor on 1003 was also correctly restarted during the cert refresh
[09:06:57] I restarted it manually in this case, from systemctl status it didn't seem to have been restarted during the puppet run
[09:07:24] ah, ok. I had only looked at the uwsgi lifetime
[09:08:10] moritzm: how are the certs handled? I see that we also have envoy with a discovery cert, but I guess that part is only for the UI right? The debmonitor server cert is exposed directly by uwsgi?
[09:08:37] indeed, envoy is only for the web UI
[09:10:31] hmmh, actually after a full restart of apache on debmonitor1003 I'm getting a cert error with debmonitor-client now
[09:10:32] ah no ok, not uwsgi, but httpd
[09:10:50] that needs to be restarted
[09:11:09] * volans around if you need a hand
[09:12:46] https://paste.debian.net/hidden/8cad24b6
[09:14:19] lovely
[09:14:40] so is the leaf + intermediate ok on sretest1005?
[09:14:51] maybe the debmonitor-client needs to be restarted?
[09:15:06] it's not a daemon
[09:15:16] okok, it runs as a timer
[09:15:31] or hook, don't recall
[09:15:38] both
[09:15:50] the content of /etc/debmonitor/ssl is correct?
[09:16:56] in theory https://paste.debian.net/hidden/8cad24b6 seems to be complaining about the debmonitor server cert, right?
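(editor's aside: a minimal sketch of how the refreshed certs under /etc/debmonitor/ssl can be inspected and checked against the new intermediate; the client cert path is the one quoted later in the log, while the root/intermediate CA paths are illustrative placeholders, not the real fleet locations)

    # print subject/issuer/expiry of the refreshed client cert
    openssl x509 -in /etc/debmonitor/ssl/debmonitor__sretest1005_eqiad_wmnet.pem \
        -noout -subject -issuer -enddate
    # verify the leaf against the root, treating the new intermediate as untrusted
    openssl verify -CAfile /path/to/pki-root.pem \
        -untrusted /path/to/debmonitor-intermediate.pem \
        /etc/debmonitor/ssl/debmonitor__sretest1005_eqiad_wmnet.pem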
[09:17:36] it seems that the client can't verify the validity of the server cert
[09:18:16] so if sretest1005 has the root CA and debmonitor is sending back both the leaf cert and the intermediate it should work, let me check something
[09:18:21] /etc/debmonitor/ssl/debmonitor__sretest1005_eqiad_wmnet.pem is updated and looks alright to me
[09:19:10] all the certs there I mean
[09:19:53] elukey: debmonitor can simply be run on the commandline, just running "sudo debmonitor-client" is enough
[09:20:49] SSL handshake has read 1712 bytes and written 426 bytes
[09:20:49] Verification error: unable to verify the first certificate
[09:21:04] with
[09:21:04] openssl s_client -connect debmonitor.discovery.wmnet:443 -CApath /etc/ssl/certs -servername debmonitor.discovery.wmnet
[09:21:21] the server is sending only the leaf I think
[09:21:40] mmm lemme check
[09:21:57] I see only one cert in Server certificate from that command
[09:22:40] elukey@debmonitor1003:~$ sudo cat /etc/debmonitor.conf
[09:22:40] [DEFAULT]
[09:22:40] server=debmonitor.discovery.wmnet
[09:22:40] cert=/etc/debmonitor/ssl/debmonitor__debmonitor1003_eqiad_wmnet.pem
[09:22:40] key=/etc/debmonitor/ssl/debmonitor__debmonitor1003_eqiad_wmnet-key.pem
[09:23:01] I think we should use debmonitor__debmonitor1003_eqiad_wmnet.chained.pem
[09:23:21] I agree, but how did it work till now?
[09:23:31] the clients must have had the intermediate
[09:23:34] the old intermediate is part of ca-certificates?
[09:23:39] nope
[09:23:42] or was it directly generated from the root?
[09:23:52] IIRC it was a puppet cert before
[09:24:10] my understanding is that we only have the regular bundle with the root pki public cert on the hosts
[09:24:14] not sure about debmonitor though
[09:24:25] no, it's not part of the bundle
[09:26:45] wasn't the previous one generated by Puppet_Internal_CA.pem?
[09:27:37] puppet runs still currently blocked, yes?
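(editor's aside: "unable to verify the first certificate" is the classic symptom of a server presenting only the leaf without its intermediate; a minimal sketch of how to confirm that, reusing the s_client invocation quoted above with -showcerts added to dump every cert the server sends)

    openssl s_client -connect debmonitor.discovery.wmnet:443 \
        -servername debmonitor.discovery.wmnet -showcerts </dev/null 2>/dev/null \
        | grep -c 'BEGIN CERTIFICATE'   # 1 = leaf only, 2+ = chain included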
[09:27:48] Emperor: yep sorry, we hope to unblock in a few
[09:28:04] ack, np, the cookbook will keep waiting for a while yet :)
[09:28:09] volans: in theory debmonitor has its own intermediate, it shouldn't use the Puppet CA
[09:28:27] Emperor: sorry, we just got a "surprise" that we didn't expect :D
[09:28:30] I'm currently looking at Puppet git, I was under the impression John had changed it away from a Puppet cert some years ago
[09:28:38] for the new one yes, I was talking about how it was working before; anyway, I think if we use the chained one it all works
[09:29:10] volans: trying to modify the config manually so we can test on debmonitor1003
[09:29:15] +1
[09:31:39] ok sorry, the config that I pasted above was for the debmonitor *client* on debmonitor1003, the actual tls config is handled by httpd
[09:31:49] SSLCertificateFile /etc/cfssl/ssl/debmonitor__debmonitor_discovery_wmnet_server/debmonitor__debmonitor_discovery_wmnet_server.pem
[09:31:49] SSLCertificateKeyFile /etc/cfssl/ssl/debmonitor__debmonitor_discovery_wmnet_server/debmonitor__debmonitor_discovery_wmnet_server-key.pem
[09:31:57] but same thing, no chained
[09:32:03] ack
[09:32:37] I wonder if we have to use the chained one in the clients too though
[09:32:56] my gut would say yes
[09:33:12] since it's mutual TLS, the server also needs to be able to verify the client
[09:33:50] debmonitor1003 should be fixed
[09:34:31] elukey: changing the client config to chained works fine
[09:34:32] without it doesn't
[09:34:44] so we need the patch to also change the client's cert path
[09:34:49] lovely
[09:34:56] that IMHO confirms that the issuer of the previous cert was in ca-certificates
[09:35:53] we can check the old public cert for debmonitor, it should contain the data
[09:36:05] anyway, looking into the puppet patch
[09:36:26] the only reference I could find in Phab was https://phabricator.wikimedia.org/T340741#9001934 by John, but he might have confused it with the discovery cert from https://phabricator.wikimedia.org/T281377
[09:40:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1270870
[09:41:11] running pcc
[09:43:02] k
[09:43:04] volans, moritzm https://puppet-compiler.wmflabs.org/output/1270870/8408/ (failed to post)
[09:43:09] it looks good afaics
[09:43:44] why didn't it change the client config on debmonitor1003?
[09:43:53] ah no, it did
[09:43:54] my bad
[09:44:05] looks good to me
[09:44:08] ship it
[09:46:06] running puppet on debmonitor servers + httpd restart
[09:46:27] I really can't wait to do the discovery intermediate rotation
[09:46:40] it will be soo much fun
[09:46:44] * elukey cries in a corner
[09:46:58] ahahaha
[09:47:46] all good on the server side
[09:47:53] let's verify on sretest1005
[09:48:01] confirmed working on sretest1005
[09:48:17] forced a Puppet run and upgraded openssh and all properly recorded: https://debmonitor.wikimedia.org/hosts/sretest1005.eqiad.wmnet
[09:48:29] moritzm: shall we test another couple of hosts just to be sure? then we can re-enable puppet
[09:48:42] yes, let me also quickly validate on a bullseye and trixie node, I'll report back
[09:48:42] pick a bullseye
[09:48:49] great
[09:53:11] bullseye and trixie work fine, we can re-enable Puppet
[09:53:42] \o/
[09:55:26] all right, re-enabling
[09:59:12] all done!
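(editor's aside: a hand-rolled way to reproduce the client-side mutual-TLS check without waiting for the timer or running debmonitor-client; a minimal sketch, assuming the chained client cert path proposed above exists after the patch — curl presents the client cert/key in the handshake the same way the client does)

    curl --cert /etc/debmonitor/ssl/debmonitor__debmonitor1003_eqiad_wmnet.chained.pem \
         --key /etc/debmonitor/ssl/debmonitor__debmonitor1003_eqiad_wmnet-key.pem \
         -o /dev/null -w '%{http_code}\n' https://debmonitor.discovery.wmnet/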
[09:59:19] Emperor: you should be unblocked
[09:59:30] thanks volans and moritzm
[09:59:53] anytime :)
[10:00:50] elukey: as I said before, it will all be fine :-)
[10:02:02] ™
[10:02:08] moritzm: very happy that the new puppet part worked fine, but now I am a little more hesitant about the discovery part. It will probably require a day :D
[10:02:43] and I need to test the cert-manager part of k8s first, to avoid having clusters pulling certs automatically
[10:04:43] elukey: thanks :)
[10:13:21] heads-up: the cumin1003 reboot/update to Cumin 6 will commence shortly, please don't start any new tmuxes/screens there
[10:18:56] I may have a tmux open there but nothing running in it, feel free to boot me
[10:19:41] ack
[10:19:50] I'll report back when it's completed
[10:25:08] cumin1003 can be used again
[10:26:09] Riccardo just upgraded it to Cumin 6, which had already been running on cumin2002, but since more people tend to use cumin1003 than cumin2002, if you see any issues (which are not expected at this point), please speak up
[10:26:22] indeed
[10:28:13] as usual, there are some WIP wmfmariadb-py packages flagged for downgrade, not opening that can of worms right now :-)
[10:28:52] these are probably test packages so I left them at their current versions
[11:57:54] volans, claime, I'm depooling esams for the network OS upgrade (same as last Tuesday, hopefully with fewer crashloop bugs :)
[11:58:16] XioNoX: ack
[11:59:40] XioNoX: ack
[12:53:56] moritzm: seems that puppetserver1002 is having some issues, we have 137 hosts with puppet failed
[12:54:00] https://puppetboard.wikimedia.org/nodes?status=failed
[12:54:11] anything currently in progress puppet-wise?
[12:55:29] they are all timeouts or Early EOF, so potentially also network related
[12:56:00] not really, but puppetserver1002 was rebooted earlier in the day and maybe some SPF is acting up after the reboot? I'll depool it for further analysis
[12:56:24] ack thx, I'll keep an eye on puppetboard
[12:56:30] to see if it improves
[12:56:36] (154 failed now)
[13:01:51] volans: https://gerrit.wikimedia.org/r/c/operations/dns/+/1270924
[13:02:07] //Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production// <-- is this something that should resolve with a subsequent puppet run? [from a reimage cookbook run]
[13:02:29] +1ed
[13:02:43] Emperor: see private
[13:02:49] and here too
[13:03:27] ah, right, I see. I picked a bad day to be doing a bunch of reimages...
[13:03:36] wait for the depool and then retry
[13:04:06] confirmed that it's better now (at least on lvs1019), so it should already be depooled
[13:04:21] fabfur: it's not yet merged
[13:04:27] oh
[13:04:58] now it applied the catalog without timeouts (but can't send the report to puppetserver1002: eof)
[13:07:06] puppetserver1002 is now depooled, but it'll take a little while until clients requery the service record; it should recover soon
[13:07:20] going to add a task for troubleshooting
[13:07:46] thx
[13:11:07] https://phabricator.wikimedia.org/T423282
[13:12:46] thx, added some exception examples from the logs
[13:13:06] not sure why, but the number of failed nodes is increasing :/
[13:14:31] but all the failures I'm seeing are always towards 1002
[13:15:04] the current spike is most certainly linked to the esams maintenance as well?
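(editor's aside: "clients requery the service record" refers to puppet agents picking their server from DNS SRV records; a minimal sketch of how to see what the record currently returns — _x-puppet._tcp is puppet's standard SRV prefix, but the domain here is an assumption)

    dig +short SRV _x-puppet._tcp.eqiad.wmnet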
[13:16:10] there are no new requests incoming to puppetserver1002, so it should decrease now
[13:16:15] it went from ~150 to ~280, so much more than esams :D but it seems to be going down
[13:23:18] back to 161 and recovering
[13:53:47] hey folks, I made a list of hosts impacted by the pki discovery rollout: https://phabricator.wikimedia.org/T420993#11819672 - Please check them and see if I need to rework the categories, the idea would be to go from low to high risk when rolling out
[13:54:15] the other bit is also related to a restart that may be required, if the new cert is not picked up by default
[14:01:00] I've re-imaged ms-be2068 twice now, and both times the initial reboot into a vanilla bullseye image goes fine, but once the initial puppet run has happened the system is no longer able to boot - it just keeps cycling round trying to boot from the (I think correct) disk, pausing a bit, and then rebooting again, with no sign of a GRUB menu or anything. [the cursor moves up and down a bit on an otherwise-empty display] Any ideas?
[14:02:35] (other than open a ticket with infra-foundations, which I will do shortly)
[14:02:54] elukey: That is quite a list!
[14:03:16] Thanks for going there.
[14:08:54] Emperor: Is it possible that this is related to UEFI vs legacy booting? I see you did this https://github.com/wikimedia/operations-puppet/commit/5b96573f4ea35718078bf111551adf6bb7853426 so the hosts are *not* using UEFI, right? I got caught out recently by the fact that the `sre.hosts.provision` cookbook now uses UEFI by default and you have to pass `--legacy` if you want to use legacy BIOS booting.
[14:10:14] btullis: I didn't re-provision the host, so I don't think I'd have made any changes in that regard
[14:13:03] btullis: yeah and we have two weeks to go through it, including the k8s services that are not mentioned
[14:13:38] Emperor: never heard of the issue, but maybe the efi partition duplication logic could be at fault here, it is the only thing that I can think of from your description
[14:13:49] are those hosts new or with a different config?
[14:14:57] elukey: ancient hosts, config-J, I ran the convert-disks cookbook (which did a firmware upgrade by default, to 7.0.0.183)
[14:15:14] I think it's still BIOS, not UEFI
[14:17:35] Emperor: ah ok, and is the idrac upgrade needed? I am wondering if downgrading it and then retrying would make a change
[14:18:09] elukey: the convert-disks cookbook does it (unless you disable it). Is there a better version I should try?
[14:18:10] or upgrading to another firmware version
[14:18:46] 7.0.0.183 was the latest version offered, I think.
[14:19:04] and do you recall the previous version?
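(editor's aside: a quick way to settle the BIOS-vs-UEFI question from a booted host; this is standard Linux behaviour rather than a WMF-specific tool - the kernel exposes /sys/firmware/efi only when the system booted in UEFI mode)

    [ -d /sys/firmware/efi ] && echo 'booted via UEFI' || echo 'booted via legacy BIOS'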
[14:19:05] elukey: I've created T423286
[14:19:06] T423286: Initial puppet run makes ms-be2068 unbootable - https://phabricator.wikimedia.org/T423286
[14:19:16] elukey: 5.something I think
[14:20:19] elukey: 2026-04-14 08:25:55,277 mvernon 3035179 [INFO] ms-be2068.codfw.wmnet (IDRAC): target_version: 7.0.0.183, current_version: 5.0.20.0
[14:20:30] Emperor: I'd suggest trying to roll back to it on one host, then retrying the reimage to see how it goes
[14:23:57] elukey: the cookbook isn't offering me 5.0.20.0, only 6.10.30.00, 6.10.30.20 and the 7 one
[14:24:16] I'll try 6.10.30.20
[14:25:53] the other alternative is to check with dcops if there is a newer firmware
[14:26:07] at the moment the cookbook gets the firmware manually downloaded onto the cumin hosts
[14:26:23] so the list is not updated unless dcops works on it
[14:28:09] I'll see how this goes (though I'm not sure what mechanism might explain the puppet-breaks-bootability being impacted by firmware)
[14:32:21] That might have broken it entirely :(
[14:33:20] yep, now console com2 says 111 (which I think is connection refused), and the cookbook's attempts to get a RedFish connection are likewise failing
[14:34:36] ah, no, perhaps it is now getting there.
[14:36:17] you may need to re-run provisioning
[14:36:37] firmware downgrade complete, retrying re-image
[14:51:56] post-installer reboot OK, now we wait for puppet to see if it breaks it again
[15:15:15] elukey: this time post-puppet it says "Booting from Hard drive C: GRUB" and hangs thus
[15:24:35] Emperor: lovely, but it may give us a good indication that the firmware is the issue
[15:26:02] the sad part is that our 7.x idrac firmware seems to be the last one: https://www.dell.com/support/product-details/it-it/product/poweredge-r740xd2/drivers
[15:26:55] elukey: I'm not entirely convinced - something (presumably puppet) has mucked up the GRUB install enough to render the system entirely unbootable (and at quite an early stage in the GRUB process)
[15:27:55] Could be, yeah. If you have a host with a similar config, you can try to go through the whole process without the firmware upgrade
[15:27:59] (I _could_ try the third available firmware version, but that rather feels like we'll be back in the same spot in another 45 minutes)
[15:28:01] and see if it leads to the same issue
[15:28:22] one thing that I'd try is the idrac firmware and the BIOS one, if they are not coupled together
[15:28:47] I can do that (the disks cookbook has a skip-firmware option, though presumably a too-old bios not working properly is why it defaults to updating)
[15:29:13] elukey: the firmware that's getting changed is the idrac one (and I think only the idrac one)
[15:29:50] ...but I've spent all day on just this one host, so I'll not start on the other possible victim until tomorrow, I think, since otherwise it'll be in a weird state overnight
[15:31:03] yeah, I would try to upgrade the bios too, the Supermicro ones for example couple them together
[15:31:23] I think they are separate, but maybe somewhere in the changelog there is a dependency
[15:32:45] elukey: want to suggest a BIOS version?
[15:33:42] currently it's on 2.12.2. There look to be 2.7.5, 2.5.4, 2.6.5, 2.24.0, 2.17.1 and 2.15.1 available
[15:40:15] most recent, 2.24.0 afaics (upstream provides 2.26 too)
[15:40:26] OK, doing that
[15:51:34] Done, host failed as before, but I'll try a re-image (seems unlikely to fix anything, but...)
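(editor's aside: a minimal sketch of how the iDRAC and BIOS versions can be read over Redfish, along the lines of what the cookbook does; <idrac> and the credentials are placeholders, while System.Embedded.1 / iDRAC.Embedded.1 follow Dell's standard Redfish resource layout)

    # BIOS version
    curl -sk -u root:"$IDRAC_PASS" https://<idrac>/redfish/v1/Systems/System.Embedded.1 | jq -r .BiosVersion
    # iDRAC firmware version
    curl -sk -u root:"$IDRAC_PASS" https://<idrac>/redfish/v1/Managers/iDRAC.Embedded.1 | jq -r .FirmwareVersion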
[16:01:29] lemme know, we can sync tomorrow in case
[16:01:36] (need to go afk in a few)
[16:04:24] sure, installer running now, it'll be a while yet, and I don't think we're trying anything else today.
[16:07:43] post-installer boot OK
[16:33:22] post-puppet no boot, just "GRUB "
[17:01:14] rzl: Progress on the AW/WF caching issues, the connections aren't connecting: T423311 Oops.
[17:01:16] T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?) - https://phabricator.wikimedia.org/T423311
[17:07:14] James_F: I can take a look this afternoon!
[17:13:09] Thanks!
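(editor's aside: a bare "GRUB " with nothing after it typically means legacy-BIOS GRUB's first stage loaded but couldn't hand over to the next stage; a minimal sketch of how one might reinstall it from a rescue shell - /dev/sda and the single root partition are assumptions about ms-be2068's layout, not confirmed in the log)

    mount /dev/sda1 /mnt
    for d in dev proc sys; do mount --bind /$d /mnt/$d; done
    chroot /mnt grub-install /dev/sda
    chroot /mnt update-grub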