[08:19:10] filed a patch to follow up on yesterday's trouble with jaeger: https://gerrit.wikimedia.org/r/1269868
[08:19:23] that is the last thing for the aux upgrade, the rest is done
[08:57:47] <_joe_> <3
[09:02:22] elukey: hmm...wonder what changed there and why we did not see something like that for wikikube services
[09:03:00] jayme: maybe it is grpc-related?
[09:03:05] ah...we auto-generate the alt names there and probably include the internal ones right away
[09:03:18] IIRC
[09:03:31] okok, and we don't since jaeger sets an explicit list?
[09:03:38] yeah
[09:04:02] okok makes sense
[09:38:48] Maybe the explicit list should be in addition to the autogenerated alt names?
[09:38:59] (I haven't thought too deeply about it)
[09:39:34] claime: the full objects are hardcoded in the jaeger case since the chart does not use chart modules
[09:39:44] Ah right
[10:20:09] hey folks, if anybody has time https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1269992
[10:20:31] after a chat with traffic we should upgrade that, they no longer maintain the old version
[10:28:13] elukey: lmao I wanted blake to do that
[10:28:20] https://phabricator.wikimedia.org/T422926
[10:29:46] Welp, I'll have them rebuild the image and update the chart etc.
[10:30:05] the change is easy, I can abandon it
[10:30:29] Bah, not worth wasting that time, it's ok
[10:30:48] We'll have other occasions, we have a bunch of bullseye migrations to do as well
[10:31:13] but yeah blake and I will take over from there if that's all right with you
[10:35:17] totally, thanks!
[13:24:37] FYI, after the dse-k8s rolling upgrade, it appears that we lost all our certificate resources (i.e. if you did `kubectl get cert` for a namespace, it would return nothing). Not sure how you did the aux clusters but you might wanna check 'em
[14:00:43] <_joe_> inflatador: eek, that doesn't sound great. but also not as terrible as it could be - those can surely be regenerated, right?
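(Editor's note: a minimal sketch of the alt-names question above, using openssl rather than the actual chart templates. It creates a throwaway self-signed cert with an explicit SAN list — a stand-in for the hardcoded names in the jaeger chart; the hostnames here are invented — and then prints the SANs back to verify which names the cert actually covers.)

```shell
# Generate a throwaway cert with an explicit subjectAltName list.
# The DNS names are illustrative placeholders, not real service names.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/san-key.pem -out /tmp/san-cert.pem -days 1 \
  -subj "/CN=jaeger.example" \
  -addext "subjectAltName=DNS:jaeger.example,DNS:jaeger.svc.cluster.local" \
  2>/dev/null

# Print the SANs the cert carries; an explicit list that omits the
# internal names would show up as a missing DNS entry here.
openssl x509 -noout -ext subjectAltName -in /tmp/san-cert.pem
```

The same `openssl x509` inspection works against a cert pulled from a live endpoint via `openssl s_client`, which is a quick way to check whether a deployed service presents its internal alt names.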
[14:03:41] _joe_ yeah, `helm rollback ${current_release}` will fix it quickly. We just didn't realize that had happened until we lost a few services to cert expiry
[14:04:03] <_joe_> nasty
[14:06:13] Yup. I set up manual cert alerts for a few services, but long-term we might wanna use blackbox k8s autodiscovery for that
[14:28:14] inflatador: that seems off...what certificate objects exactly?
[14:28:41] AIUI you destroyed some releases manually before upgrading...so maybe those were affected?
[14:30:10] there is zero code in k8s for migrating stored objects, so once something is in etcd it will stick around in exactly that format until it is edited (in which case it might be transferred to a new storage format)
[14:30:40] but I doubt that happened for cert objects, since it's a stable api
[14:35:02] this: https://grafana.wikimedia.org/goto/afiodjqddsiyof?orgId=1 looks like you might have removed all the *tls-proxy-certs which are generated by admin_ng together with the namespaces
[14:35:36] but those would not have come back by rolling back a service release
[15:08:00] jayme that's possible, I didn't do the migration myself so will have to check w/teammates
[18:05:06] > AIUI you destroyed some releases manually
[18:05:06] No we didn't, we only scaled down some replicas to 0
[18:06:11] > looks like you might have removed all the *tls-proxy-certs
[18:06:11] Yep, this is what I observed. I'm not sure *how* we did it though
[18:06:51] I do remember part of the migration being about changing some cert-manager behavior and config. We might have messed up something, but I would remember running `kubectl delete certificate -A` in god mode tbh
[18:07:01] so, heh, not sure
[18:09:21] I checked, and we did upgrade k8s on the 27th, so something we did that day caused this
[18:09:40] we have logged all operations in https://docs.google.com/document/d/1q7Amw_XSN_Lfb7fCnaSprpW8Z43iMyD4NOD3Lbq2hR4/edit?tab=t.0, we can investigate if need be
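(Editor's note: the "manual cert alerts" mentioned at [14:06:13] aren't shown in the log; the following is a hedged sketch of what such a check might look like, using `openssl x509 -checkend` against a locally generated throwaway cert. The file paths, validity window, and one-hour threshold are all illustrative, not the actual alerting setup.)

```shell
# Create a short-lived self-signed cert (1 day) to check against.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/key.pem -out /tmp/cert.pem -days 1 \
  -subj "/CN=example.internal" 2>/dev/null

# -checkend N exits 0 iff the cert is still valid N seconds from now;
# a cron job firing an alert on nonzero exit is the simplest manual check.
if openssl x509 -checkend 3600 -noout -in /tmp/cert.pem; then
  echo "cert valid for at least one more hour"
else
  echo "cert expires within the hour"
fi
```

For certs stored in the cluster rather than on disk, the same check can be piped from `kubectl get secret <name> -o jsonpath='{.data.tls\.crt}' | base64 -d` (secret name hypothetical), which is roughly what a blackbox-style probe would automate.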