[07:32:44] klausman, brouberol, btullis, elukey: I've switched the k8s clusters to use discovery2026 and will start kicking off certificate refreshes on codfw clusters in 10min if there are no objections
[07:33:25] I will stagger the refreshes across 30min per cluster, so we don't have "big bang" expiration time windows
[07:33:54] Super, thanks jayme.
[07:46:25] ack!
[07:57:04] thanks for the heads-up!
[09:44:29] I'm gonna roll https://gerrit.wikimedia.org/r/1277471 which replaces /usr/local/bin/charlie with /usr/bin/charlie. Shouldn't be any problems, just a heads-up.
[09:51:14] We've got an SSL error from opensearch in codfw here: https://airflow-platform-eng.wikimedia.org/dags/spur_download_and_index_anonymous_residential_codfw/grid?dag_run_id=scheduled__2026-04-27T16%3A00%3A00%2B00%3A00&task_id=download_and_index_feed_codfw
[09:52:01] I'm not 100% sure that it's related to this work, but it seems possible. I can roll-restart the cluster, which should pick up any new certificates.
[09:59:23] The istio ingressgateway had the new certificate and was fine, but it also re-encrypts the request and sends it to the upstream opensearch cluster. This might be where it was getting broken. This gave a 503: `curl -I https://opensearch-ipoid.svc.codfw.wmnet:30443/_bulk`
[10:04:29] btullis: from what I recall that's a self-signed cert the opensearch operator issues, right?
[10:05:00] ah, it's not
[10:05:06] Not self-signed. It uses cert-manager.
[10:05:16] but 503 means everything is fine :)
[10:05:18] cert-wise
[10:05:28] Roll-restarting the cluster didn't help.
[10:07:39] Well, it's doing double TLS: it decrypts at the ingressgateway, then re-encrypts to send on to the opensearch cluster itself. It could be this second TLS leg that is having a problem, and then I think istio would be generating the 503 to send back to the client. I'll keep investigating.
[10:08:39] btullis: the opensearch clusters have a reference to the discovery issuer in their chart IIRC
[10:08:45] maybe... but the error from the airflow logs seems different
[10:09:22] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/8f85b131df743fad97527c318a1816264931676d/charts/opensearch-cluster/templates/certificate_wmf.yaml#74
[10:09:26] brouberol: OK, yes. I see this:
[10:09:29] SSLError(HTTPSConnectionPool(host='opensearch-ipoid.svc.codfw.wmnet', port=30443): Max retries exceeded with url: /_bulk (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2393)'))))
[10:09:30] https://www.irccloud.com/pastebin/gINcgkVD/
[10:09:50] hmm, I've never seen this before :/
[10:09:56] brouberol: that's fine. The "issuer" in that regard did not change
[10:10:18] > This error typically occurs during the SSL/TLS handshake when the client fails to authenticate the server, causing the connection to drop unexpectedly (resulting in an "End of File" or EOF).
[10:10:18] https://www.w3tutorials.net/blog/python-sockets-ssl-eof-occurred-in-violation-of-protocol/
[10:10:23] jayme: ack
[10:12:03] I have to step away for a couple of hours, sorry :/
[10:12:27] No worries.
[10:19:29] Oh, I think I might know what it is. It might be unrelated to this cert-manager refresh. Checking something.
[10:21:03] Yes, it's my fault. This gives a 503 too: `curl -I https://opensearch-ipoid.svc.eqiad.wmnet:30443/_bulk` but this doesn't: `curl -I https://opensearch-ipoid.svc.eqiad.wmnet:30443/`
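The `/` versus `/_bulk` comparison above is what separates a certificate problem from a routing problem: if the handshake completes and one path answers while the other returns 503, the certificate served at the NodePort is fine and the breakage is specific to that route. A minimal sketch of the same check, assuming the host and port from the curl commands and a machine that trusts the internal CA through its system trust store; the probe is illustrative, not the script used here:

```python
import http.client
import ssl

HOST, PORT = "opensearch-ipoid.svc.eqiad.wmnet", 30443  # from the curl commands above

ctx = ssl.create_default_context()  # verification against the system trust store
for path in ("/", "/_bulk"):
    conn = http.client.HTTPSConnection(HOST, PORT, context=ctx, timeout=10)
    try:
        conn.request("HEAD", path)  # mirrors `curl -I`
        print(f"{path}: HTTP {conn.getresponse().status}")
    except ssl.SSLError as exc:
        # A failure here would point back at the TLS setup, not the route.
        print(f"{path}: TLS error: {exc}")
    finally:
        conn.close()
```

A clean handshake plus a per-path 503 is consistent with a chart or routing mistake rather than an expired or mis-issued certificate, which matches the conclusion that follows.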
[10:21:26] Therefore, it's a mistake that I made in this patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1277486
[10:41:12] I think that this might fix it: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1278389
[10:41:30] Sorry for wasting your time, jayme
[10:41:37] np
[10:41:56] if all you have is openssl, everything looks like a certificate
[10:57:54] ^ drive-by comment, but I suggest we bash this
[13:42:33] jayme: nice work!! <3
[13:43:07] I really hope it is :D
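On the certificate-refresh side of the thread, the same kind of Python probe can confirm which certificate the client-facing endpoint is actually serving after a refresh, without reaching for openssl. A minimal sketch, assuming the codfw host and port quoted earlier and that the internal CA is in the system trust store; the parsing and output format are illustrative:

```python
import socket
import ssl
from datetime import datetime, timezone

HOST, PORT = "opensearch-ipoid.svc.codfw.wmnet", 30443  # host/port from the log

ctx = ssl.create_default_context()  # system trust store; verification stays on
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()  # parsed leaf certificate as a dict

issuer = dict(pair for rdn in cert["issuer"] for pair in rdn)
expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
days_left = (expires - datetime.now(timezone.utc)).days
print(f"issuer CN: {issuer.get('commonName')}")
print(f"notAfter:  {cert['notAfter']} ({days_left} days left)")
```

This only shows the leaf certificate presented by the istio ingressgateway; the re-encrypted leg to the upstream opensearch pods has to be checked from inside the cluster.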