[07:03:29] wow nice kafka-logging-codfw almost on Kafka 3.7! \o/
[09:46:38] Hi! I am having problems with Gitlab CI and tracked it down to docker-hub-mirror.cloud.releng.team having problems. Is that a known issue?
[10:21:09] moritzm: o/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1276645 seems to have broken the WMCS central servers where we were using the acme provider
[10:53:12] taavi: I've added you to the fix https://gerrit.wikimedia.org/r/c/operations/puppet/+/1276876
[13:26:02] elukey sorry to bug, but I was wondering if there was a cfssl or curl command we could run to get the new certs until the puppet code is ready? I tried variations on https://phabricator.wikimedia.org/P91477 with no luck so far
[13:30:13] inflatador: np! You can check what profile::pki::get_cert does, but it is a little buried among puppet classes. May I ask why you need to do it manually?
[13:32:34] elukey TBH I'm not sure I do. I thought `profile::pki::get_cert` didn't support the `label` arg yet. If it does or if there is another way to do it w/puppet, that's fine too
[13:33:31] I was looking thru the changes on T420993 but it's entirely possible I missed something, if there is a good example CR I can run with that
[13:33:34] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993
[13:35:42] hmm, maybe I don't need to do anything at all. It looks like the cert alerts for WDQS are resolving
[13:36:04] inflatador: nono it does support it, I am trying to add a lookup(), so you are free to use it with something different than 'discovery'
[13:37:31] OK cool. Any idea why the alerts are resolving now? I should still have to update the puppet code to point to `discovery2026` intermediate somewhere, right?
[13:41:05] ^^ confirmed, the newest cert is still using the old intermediate
[13:41:17] OK, I'll get to work on a Puppet patch that uses the label for the new cert
[13:44:16] inflatador: please don't patch anything, me and Moritz are taking care of it :)
[13:44:29] we need to do some testing etc..
[13:44:45] the intermediate will not expire until early may, and we will reach out to people to migrate
[13:45:22] elukey ACK, we're just getting flooded with certificate alerts. But I'm happy to wait if y'all need more time
[13:48:28] inflatador: you can silence alerts :)
[13:54:28] Point well taken ;) . Have y'all had any luck with the new intermediate on test servers? I'm totally fine with being a guinea pig if it helps
[13:58:41] we are rolling the first one out right now, if you want to help and it is not going to cause major problems we would be more than happy!
[14:00:35] sure, if you want to point me to one of the CRs y'all are using for the test servers I will try and apply the same type of changes for our test WDQS hosts
[14:05:58] I think I got it, just need to update our envoy code to use this param: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/tlsproxy/envoy.pp#60
[14:10:17] inflatador: should be it yes!
[14:12:46] elukey excellent, I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277109 if you want to take a look.
[14:15:35] LGTM!
[14:16:27] cool, will report back shortly
[14:26:33] it's still using the old intermediate for some reason. Don't worry about it though, I'll keep digging
[14:32:07] inflatador: can you give me a hostname? I am not 100% sure if envoy reloads its tls material if it changes
[14:32:58] ah wdqs2025
[14:33:14] elukey nope, that's my screw-up actually. Was looking at the right host. `wdqs2025` is the correct host and it does have the new chain and envoy config
[14:33:32] er...was looking at the **wrong** host that is
[14:33:58] nice!
[14:35:33] envoy is still serving the old cert even after restarting, checking on that now
[14:36:59] so the envoy tls term config points to `discovery2026__query-experimental_eqiad_wmnet_server.chained.pem` but the file on disk is `discovery2026__query-experimental_eqiad_wmnet_server.chain.pem`
[14:43:21] hmm, seems fixed now
[14:43:49] chained contains both leaf and intermediate, chain only the intermediate
[14:44:00] so envoy should expose chained
[14:44:30] I moved all the old certs out of the way and ran puppet again, that seemed to fix it
[14:44:34] otherwise clients with only the root pki public cert wouldn't be able to validate the leaf cert (since it would be signed by unknown ca)
[14:44:59] what old certs do you mean?
[14:45:19] the previous intermediate and chain
[14:45:35] ok so puppet leaves it there, okok
[14:45:38] backed up to `/root/tls` if you wanna see what it looked like after I ran puppet but before I moved the old stuff
[14:46:01] but it shouldn't be a problem to have both
[14:46:08] and you're right, `chained` vs `chain` was not the issue
[14:46:39] so from puppet logs /etc/envoy/listeners.d/00-tls_terminator_443.yaml is changed as well
[14:47:07] and also Scheduling refresh of Service[envoyproxy.service]
[14:47:07] yeah, I one-offed that when I thought `chain` vs `chained` was the problem
[14:47:54] but I don't understand if it worked or not after the puppet run :D
[14:48:17] it was serving the old cert after the puppet run
[14:48:27] so first I tried restarting envoy, that didn't help
[14:48:50] then I messed with the envoy tls terminator config, that didn't help
[14:48:58] I rolled back those changes, that didn't help
[14:50:07] Then I stopped envoy and moved all the files out of `/etc/envoy/ssl` . After that I ran puppet so everything was recreated. That's what actually got envoy to serve the new cert
[14:51:22] Not sure exactly what happened, maybe there was caching or something? I was checking the cert from localhost and from cumin2002, got the same results in both places
[14:51:27] I think we'd need to test on another node, just running puppet without doing anything else
[14:52:06] OK, I'm not sure how late you wanna stay but I can prepare another host if you think it would be useful. Otherwise happy to pick it up next week (or not at all if it doesn't help)
[14:54:42] I tried the sha512sum for /etc/envoy/ssl/discovery2026__query-experimental_eqiad_wmnet_server.chained.pem and the file you backed up on /root/tls, they don't match
[14:57:28] actually, there is no `discovery2026__query-experimental_eqiad_wmnet_server.chained.pem` in /root/tls . I didn't do anything to that file, I think Puppet must not have created it on the first run?
[14:57:53] I was about to say that, my sha512 didn't match because I mistakenly used "chain"
[14:58:06] so envoy didn't pick up a non-existent file
[14:58:22] so it safely kept the other tls cert etc..
[14:58:37] let's check if PCC says anything about `chained`
[14:59:16] yeah, https://puppet-compiler.wmflabs.org/output/1277109/6566/wdqs2025.codfw.wmnet/index.html
[15:06:13] best guess from looking at the Puppet logs is that it's some order of operations thing, looks like `/etc/cfssl/csr/discovery__query-experimental_eqiad_wmnet_server.csr` is removed but `/etc/cfssl/csr/discovery2026__query-experimental_eqiad_wmnet_server.csr` is not created before it tries to require a leaf cert
[15:06:29] errr..**request** not require
[15:08:30] I don't know CFSSL, but maybe it would be easier to request the cert with an API call via python instead of trying to get all the args lined up for the cfssl CLI
[15:13:35] in this case it shouldn't be very different, we'd need to end up with a cfssl-cli-like script anyway (in python)
[15:14:03] andrewbogott: I have a pending change in puppet merge related to setup_capi.sh. Is it safe to merge?
[15:14:19] yes, please merge my patch.
[15:14:22] ok@
[15:14:23] !
[15:14:24] Sorry, doing too many things at once
[15:14:53] that's alright, don't worry
[15:15:47] Yeah, I still think it would be a win for visibility. So far as I can tell the current cfssl run doesn't log anything about what args it's using. A single defaults file that renders all the args as a template and a python script that sources it and logs would be nice
[15:16:20] Obviously not time to work on that, but maybe something to consider in the future
[15:24:32] inflatador: I see another puppet run at 14:20 on the host, with some failures - were you doing some tests? I see that it failed
[15:24:41] it was already applying the new change though
[15:26:33] elukey no, I hadn't changed anything at that point. I guess I didn't realize that the puppet run had failed ;( . But looking back at the logs I think it's what I said above
[15:27:18] I didn't do any one-off changes until after I ran puppet, and even then I was working on the mistaken assumption that puppet ran cleanly
[15:30:57] so
[15:30:59] Apr 24 14:20:39 wdqs2025 puppet-agent[2511044]: (/Stage[main]/Main/Cfssl::Cert[discovery2026__query-experimental_eqiad_wmnet_server]/Exec[create chained cert /etc/envoy/ssl/discovery2026__query-experimental_eqiad_wmnet_server.chain.pem]) Dependency Exec[Generate cert discovery2026__query-experimental_eqiad_wmnet_server refresh on intermediate ca change] has failures: true
[15:32:01] the refresh "refresh on intermediate ca change" is the new code that I added, that worked nicely for debmonitor but it may have issues if the intermediate is changed
[15:32:14] My brain is now fried, I'll check on monday and report back
[15:32:17] thanks inflatador !
[15:32:31] elukey np, good luck and feel free to ping Monday if I can help
[15:32:40] sure! have a good weekend folks
[15:32:41] o/
[15:51:05] I have elevated https://phabricator.wikimedia.org/T423779 to UBN! as I got pinged on my day off about it. I'd really love pointers who could help debugging this further. My investigation points to curl requests to eventbus hanging