[09:22:39] <brouberol>	 prometheus_amd_rocm_stats.service is failing on dse-k8s-worker1001 due to temperature metrics being reported as N/A https://phabricator.wikimedia.org/P76464
[09:22:45] <brouberol>	 does that ring a bell for anyone? Thanks!
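(The failing exporter run boils down to rocm-smi reporting N/A for the temperature fields. A minimal sketch of detecting that condition; the CSV field layout and the `check_temps` helper are assumptions for illustration, not the real prometheus_amd_rocm_stats code:)

```shell
# Hypothetical sketch: flag "N/A" temperature readings in rocm-smi-style
# CSV output. The sample input below is illustrative, not real output
# captured from dse-k8s-worker1001.
check_temps() {
  # Print the device name of every row whose temperature field is N/A,
  # and exit non-zero if any were found.
  awk -F',' 'NR > 1 && $2 == "N/A" { print $1; bad = 1 } END { exit bad }'
}

printf 'device,Temperature (C)\ncard0,N/A\n' | check_temps \
  || echo "some temperature readings are N/A"
```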
[09:24:59] <elukey>	 it is weird, rocm-smi doesn't seem to recognize some stuff.. has anything changed recently? reimage etc.?
[09:25:17] <brouberol>	 IIRC Ben rebooted it to load a new kernel
[09:25:26] <brouberol>	 according to https://github.com/ROCm/ROCm/issues/4268 "a reboot fixes it"
[09:26:06] <elukey>	 sigh
[09:26:12] <elukey>	 at least there is a report about it
[09:26:22] <elukey>	 shall we try a drain + reboot to check?
[09:26:52] <brouberol>	 yep, on it
[09:28:34] <brouberol>	 I've cordoned it; I have to wait for an airflow task pod to finish before I can reboot it, as I don't want to impact user jobs
[09:33:26] <brouberol>	 reboot ongoing
[09:39:16] <brouberol>	 I'm still seeing the same issue post reboot
[09:41:25] <brouberol>	 I'm seeing these messages when running `rocm-smi`
[09:41:25] <brouberol>	 ERROR: 2 GPU[0]: power: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
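(One hedged way a collector could degrade gracefully in the face of the per-metric error above is to drop unsupported readings rather than abort the whole scrape; this filtering is an assumption for illustration, not how the real service behaves:)

```shell
# Hypothetical: keep only lines that are not RSMI_STATUS_NOT_SUPPORTED
# errors, so one unsupported metric does not sink the whole scrape.
filter_supported() {
  grep -v 'RSMI_STATUS_NOT_SUPPORTED' || true
}

printf 'GPU[0] temp: 45.0 C\nERROR: 2 GPU[0]: power: RSMI_STATUS_NOT_SUPPORTED: not supported\n' \
  | filter_supported
```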
[09:47:54] <brouberol>	 I know nothing about ROCm. All I know is that that worker has an "old" GPU (more than 3 or 4 years old). Is there a way to upgrade ROCm?
[09:56:17] <elukey>	 back sorry
[09:56:22] <brouberol>	 np
[09:56:23] <elukey>	 lemme check on the node
[09:56:27] <brouberol>	 <3
[09:57:28] <elukey>	 ok so the node seems to follow the recent layout, namely no ROCm packages installed
[09:57:38] <elukey>	 rocm-smi is from bookworm's upstream repos
[09:57:47] <elukey>	 we just install it to get the info for the metrics
[09:58:06] <elukey>	 the idea is that the .so rocm libs are stored in the Docker images themselves, so we can vary the OS etc..
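(The layout elukey describes, userspace ROCm libraries inside the image with only the amdgpu driver coming from the host kernel, could be sketched roughly like this; the base image and paths are assumptions, not the actual production Dockerfiles:)

```dockerfile
# Hypothetical fragment: ship the ROCm userspace .so files in the image so
# the host OS can vary; the driver itself still comes from the host kernel.
FROM debian:bookworm
COPY rocm-libs/ /opt/rocm/lib/
ENV LD_LIBRARY_PATH=/opt/rocm/lib
```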
[09:58:36] <elukey>	 so upgrading ROCm atm is not an option, but I am a little bit puzzled that with a change in the kernel nothing works anymore
[09:59:11] <elukey>	 the drivers are shipped by the kernel, so maybe they dropped something, but it seems strange
[10:02:18] <brouberol>	 (btw, small sidetrack. We're now including the kadmin and kerberos server hostnames in the general-$env.yaml files, to avoid hardcoding the hostnames in configmaps. The deprecation of krb1001 would have led to an interesting outage if I had applied admin_ng, which would have removed the egress rule to the currently configured kadmin)
[10:05:04] <brouberol>	 cf https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1151131
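(The change brouberol mentions, sourcing the hostnames from the per-environment values files instead of hardcoding them in configmaps, might look roughly like this; the key names below are assumptions, see the linked Gerrit change for the real structure:)

```yaml
# Hypothetical general-$env.yaml fragment -- key names are illustrative only.
kerberos:
  kadmin_host: <kadmin fqdn for this environment>
  kdc_hosts:
    - <kerberos server fqdn>
```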
[10:17:50] <elukey>	 brouberol: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151153
[10:18:25] <brouberol>	 approved thank you!
[10:18:35] <elukey>	 brouberol: interesting about krb1001, have you notified Moritz about it? He'll be interested for sure
[10:18:51] <brouberol>	 I was about to :)
[10:18:55] <elukey>	 super :)
[10:19:05] <elukey>	 we have the same GPU on ml-serve1001, but different kernels
[10:19:27] <elukey>	 6.1.137-1 on ml-serve1001, 6.12.22-1~bpo12+1 on dse
[10:20:10] <elukey>	 at some point we'll probably need to remove the old GPUs
[10:20:22] <elukey>	 we added them as a test on dse, but never used them
[10:20:24] <elukey>	 cc: klausman: 
[10:20:55] <elukey>	 (nothing urgent, but let's keep it in mind)
[10:21:40] <brouberol>	 re Moritz: {{done}} 
[10:28:13] <klausman>	 ack. re: gpu removal