[09:53:53] <elukey> hi folks! I have a high-level question about https://github.com/ROCm/k8s-device-plugin/blob/master/cmd/k8s-node-labeller/README.md, namely whether or not it is acceptable to run it on ml-serve
[09:54:42] <elukey> we already run the main device plugin as a daemon to expose the GPUs to the kubelet, so we can assign a pod to a GPU via resource limits etc..
[09:54:59] <elukey> now it would be nice to be able to ask for a specific GPU, like one with X VRAM size etc..
[09:55:14] <elukey> since we don't have the same GPU across all nodes, and we'll likely vary in the future
[09:55:55] <elukey> AMD offers the node labeller controller, which needs to run with high privileges to be able to read from /dev on the worker and modify the nodes' attributes
[09:56:28] <elukey> the corresponding ClusterRole doesn't seem to require a horrible amount of perms: https://github.com/ROCm/k8s-device-plugin/blob/master/cmd/k8s-node-labeller/README.md#prerequisites
[09:57:18] <elukey> https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/templates/rbac.yaml
[09:57:55] <elukey> I don't think that this can run as a daemon :(
[09:58:44] <elukey> I am also thinking about other solutions, like our own logic/script that parses /dev and reports resource labels (like X GPUs with Y VRAM, etc..), but I am not sure it would be safer than the above
[10:03:24] <_joe_> I am not comfortable with the idea of running proprietary binary blobs with high privileges in production, but let's analyze what we're doing already first
[10:04:13] <_joe_> But what problem do you see here specifically?
[10:13:52] <elukey> _joe_ please read above, it is not a proprietary binary blob, the code is fully open :)
[10:14:18] <_joe_> elukey: yeah I was reading the readme
[10:14:33] <elukey> we already run the GPU plugin that informs the kubelet about the presence of a GPU; we build it as a Debian package and deploy it
[10:14:55] <_joe_> yeah I was asking what you think is problematic with that component
[10:15:35] <elukey> oh okok, my question was related to the RBAC settings, and the fact that it needs access to /dev
[10:15:55] <elukey> the former looks reasonable, it needs to touch "only" nodes, not everything else
[10:16:19] <elukey> the main problem atm is that the ml team can schedule a pod on a GPU, but they cannot specify which one
[10:22:47] <_joe_> I mean, my point was - what's the attack surface increase with this? If it's "just" a k8s operator which needs extensive access to the host properties, I don't think it's terrible
[10:26:41] <elukey> yep I agree, I just wanted others' opinions to validate it :D
[10:26:51] <elukey> if I missed anything etc..
[10:27:13] <elukey> I can try to import it in deployment-charts then, following the usual path for review
[10:27:20] <elukey> thanks for the brainbounce :)
[10:29:02] <_joe_> elukey: yeah hence my confusion, I looked at the RBAC stuff and I assumed it was a proprietary component given you were worried :D
[10:31:26] <elukey> _joe_ ahh no, my bad, I was just looking for a second opinion
[10:31:31] <elukey> I'll be clearer next time
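
For reference, the device plugin mentioned at 09:54:42 advertises GPUs to the kubelet as the extended resource amd.com/gpu, so a pod grabs one via its resource limits. A minimal sketch (the image name is illustrative, not what ml-serve actually runs):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-consumer
    spec:
      containers:
        - name: main
          image: rocm/tensorflow:latest     # illustrative image, not the real workload
          resources:
            limits:
              amd.com/gpu: 1                # extended resource exposed by the AMD device plugin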
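
The node labeller addresses the 09:54:59 request by stamping GPU properties onto each node as labels, which a pod can then match with a nodeSelector on top of the resource limit. A sketch assuming a VRAM label; the exact label keys and values (e.g. amd.com/gpu.vram) should be checked against the labeller README:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-consumer-16g
    spec:
      nodeSelector:
        amd.com/gpu.vram: 16G               # assumed label key/value; verify against the README
      containers:
        - name: main
          image: rocm/tensorflow:latest     # illustrative image
          resources:
            limits:
              amd.com/gpu: 1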
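
The ClusterRole discussed at 09:56:28 and 10:15:55 is roughly of the shape below; the linked rbac.yaml is authoritative, but the point stands that it only grants access to node objects, not anything cluster-wide beyond that:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: node-labeller
    rules:
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get", "list", "watch", "update"]   # node objects only; see rbac.yaml for the exact verbs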
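
The DIY alternative floated at 09:58:44 would boil down to something like this sketch: enumerate DRM render nodes under /dev and emit candidate labels. Purely illustrative; the label key is made up, renderD* nodes are not AMD-specific, and the real labeller derives far richer properties (VRAM, family, etc..) than a simple device count:

    // Sketch of the DIY script idea: count GPU render nodes under /dev/dri
    // and print a candidate node label.
    package main

    import (
        "fmt"
        "path/filepath"
    )

    func main() {
        // Each /dev/dri/renderD* device node corresponds to a GPU render node
        // exposed by the kernel's DRM subsystem (any vendor, not only AMD).
        nodes, err := filepath.Glob("/dev/dri/renderD*")
        if err != nil {
            panic(err)
        }
        fmt.Printf("example.com/gpu.count=%d\n", len(nodes)) // hypothetical label key
    }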