[09:53:53] <elukey> hi folks! I have a high-level question about https://github.com/ROCm/k8s-device-plugin/blob/master/cmd/k8s-node-labeller/README.md, namely whether or not it is acceptable to run it on ml-serve
[09:54:42] <elukey> we already run the main device plugin as a daemon to expose the GPUs to the kubelet, so we can assign a pod to a GPU via resource limits etc..
[09:54:59] <elukey> now it would be nice to be able to ask for a specific GPU, like one with X VRAM size etc..
[09:55:14] <elukey> since we don't have the same GPU across all nodes, and we'll likely vary in the future
[09:55:55] <elukey> AMD offers the node labeller controller, which needs to run with high privileges to be able to read from /dev on the worker and modify the nodes' attributes
[09:56:28] <elukey> the corresponding ClusterRole doesn't seem to require a horrible amount of perms: https://github.com/ROCm/k8s-device-plugin/blob/master/cmd/k8s-node-labeller/README.md#prerequisites
[09:57:18] <elukey> https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/templates/rbac.yaml
[09:57:55] <elukey> I don't think that this can run as a daemon :(
[09:58:44] <elukey> I am also thinking about other solutions, like our own logic/script that parses /dev and reports resource labels (like X GPUs with Y VRAM, etc..), but I am not sure it would be safer than the above
[10:03:24] <_joe_> I am not comfortable with the idea of running proprietary binary blobs with high privileges in production, but let's analyze what we're doing already first
[10:04:13] <_joe_> But what problem do you see here specifically?
[10:13:52] <elukey> _joe_ please read above, it is not a proprietary binary blob, the code is fully open :)
[10:14:18] <_joe_> elukey: yeah I was reading the readme
[10:14:33] <elukey> we already run the GPU plugin that informs the kubelet about the presence of a GPU; we build it as a Debian package and deploy it
[10:14:55] <_joe_> yeah I was asking what you think is problematic with that component
[10:15:35] <elukey> oh okok, my question was related to the RBAC settings, and the fact that it needs access to /dev
[10:15:55] <elukey> the former looks reasonable, it needs to touch "only" nodes, not everything else
[10:16:19] <elukey> the main problem atm is that the ml team can schedule a pod on a GPU, but they cannot specify which one
[10:22:47] <_joe_> I mean, my point was - what's the attack surface increase with this? If it's "just" a k8s operator which needs extensive access to the host properties, I don't think it's terrible
[10:26:41] <elukey> yep I agree, I just wanted others' opinions to validate it :D
[10:26:51] <elukey> if I missed anything etc..
[10:27:13] <elukey> I can try to import it in deployment-charts then, following the usual path for review
[10:27:20] <elukey> thanks for the brainbounce :)
[10:29:02] <_joe_> elukey: yeah hence my confusion, I looked at the RBAC stuff and I assumed it was a proprietary component given you were worried :D
[10:31:26] <elukey> _joe_ ahh no, my bad, I was just looking for a second opinion
[10:31:31] <elukey> I'll be clearer next time
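
For reference, the device plugin mentioned at 09:54:42 advertises GPUs to the kubelet as the extended resource amd.com/gpu, so a pod grabs one via its resource limits. A minimal sketch (the image name is illustrative, not what ml-serve actually runs):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-consumer
    spec:
      containers:
        - name: main
          image: rocm/tensorflow:latest     # illustrative image, not the real workload
          resources:
            limits:
              amd.com/gpu: 1                # extended resource exposed by the AMD device plugin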
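
The node labeller addresses the 09:54:59 request by stamping GPU properties onto each node as labels, which a pod can then match with a nodeSelector on top of the resource limit. A sketch assuming a VRAM label; the exact label keys and values (e.g. amd.com/gpu.vram) should be checked against the labeller README:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-consumer-16g
    spec:
      nodeSelector:
        amd.com/gpu.vram: 16G               # assumed label key/value; verify against the README
      containers:
        - name: main
          image: rocm/tensorflow:latest     # illustrative image
          resources:
            limits:
              amd.com/gpu: 1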
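
The ClusterRole discussed at 09:56:28 and 10:15:55 is roughly of the shape below; the linked rbac.yaml is authoritative, but the point stands that it only grants access to node objects, not anything cluster-wide beyond that:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: node-labeller
    rules:
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get", "list", "watch", "update"]   # node objects only; see rbac.yaml for the exact verbs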
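
The DIY alternative floated at 09:58:44 would boil down to something like this sketch: enumerate DRM render nodes under /dev and emit candidate labels. Purely illustrative; the label key is made up, renderD* nodes are not AMD-specific, and the real labeller derives far richer properties (VRAM, family, etc..) than a simple device count:

    // Sketch of the DIY script idea: count GPU render nodes under /dev/dri
    // and print a candidate node label.
    package main

    import (
        "fmt"
        "path/filepath"
    )

    func main() {
        // Each /dev/dri/renderD* device node corresponds to a GPU render node
        // exposed by the kernel's DRM subsystem (any vendor, not only AMD).
        nodes, err := filepath.Glob("/dev/dri/renderD*")
        if err != nil {
            panic(err)
        }
        fmt.Printf("example.com/gpu.count=%d\n", len(nodes)) // hypothetical label key
    }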