[09:06:01] btullis: The new DSE worker (1009) looks to be running ok, but is cordoned. Ok to uncordon?
[09:06:21] (1007 is also cordoned, but I dunno why, so not touching that)
[09:56:12] klausman: Yeah, please feel free. I'll look at why 1007 is cordoned.
[09:56:19] ack
[11:53:38] related to ^: we realized that `amd.com/gpus` should appear in the node resources (visible via kubectl get node -oyaml), but didn't. Nothing that a good reboot didn't fix
[11:54:29] we now have at least 2 generations of GPUs in dse-k8s-eqiad. I was wondering if we had thought about custom node labels that apps could leverage to target one GPU model/generation or another
[12:34:56] Note that the resource bit in the job description is still needed, since k8s needs to know that one GPU is "consumed"
[12:35:22] nothing GPU-specific, but we do have some custom node labels that apps leverage for various things. There are the obvious ones that have to do with topology (dc and row/rack), one for disk types (SSD vs HDD), and there is the dedicated, sessionstore-specific one for kask
[12:36:50] apps define their affinity requiring a specific node label (and in the case of sessionstore, a taint toleration too) during scheduling
[12:52:29] re: GPU labels - we use https://github.com/ROCm/k8s-device-plugin which has specific functionality for labels, but at the time we chose to avoid it due to some weird constraint
[12:52:55] we package the binary in a deb, and deploy it to all k8s GPU nodes
[12:53:05] in theory adding the labeller should be enough
[12:53:11] cc: brouberol: --^
[12:53:46] thanks for the details! I'll have a look
[12:53:53] I think the oddness back then was that the labeller needed a daemonset, and we didn't want that. or sth like that
[12:56:23] the device plugin should be a daemonset as well, but we chose to package it in a deb and run it as an OS daemon instead
[12:56:40] IIRC it was also related to what permissions the labeller needed to run
[13:02:34] ah yes, getting consistent device permissions for things running in pods was the tricky bit
[13:07:11] no no, that part was for the device plugin itself, since we needed to work around the 'render' group on the OS
[13:07:22] I think this was the issue https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/templates/labeller.yaml#L60
[13:07:50] but in theory we could add it to the deb easily
[13:08:14] the device plugin needs to run as root to be able to place unix sockets in the kubelet's dirs
[13:08:26] no idea how it works though
[13:08:32] worth opening a task
[13:17:20] https://phabricator.wikimedia.org/T373806
[13:17:42] Feel free to add subscribers/projects etc. For now I only added ML and DSE tags
[13:24:06] nice thanks
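
To illustrate the 11:53 point about GPUs appearing in the node resources: when the device plugin on a node is healthy, the GPU shows up as an extended resource under the node's capacity and allocatable. A trimmed, illustrative sketch of the relevant part of `kubectl get node -oyaml` output, assuming the plugin advertises the resource as `amd.com/gpu` and the node has a single card (the counts here are placeholders):

```yaml
# Trimmed, illustrative node status; resource name and counts are assumptions.
status:
  capacity:
    amd.com/gpu: "1"
    cpu: "48"
    memory: 131443236Ki
  allocatable:
    amd.com/gpu: "1"
    cpu: "47"
    memory: 130340836Ki
```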
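The 12:34 note about the "resource bit in the job description" refers to the workload requesting the extended resource, so the scheduler counts one GPU as consumed on the node. A minimal, hypothetical pod sketch (pod name and image are placeholders, and `amd.com/gpu` is assumed to be the resource name):

```yaml
# Minimal sketch of a pod that consumes one GPU; names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: worker
      image: example.org/rocm-workload:latest  # placeholder image
      resources:
        limits:
          amd.com/gpu: 1  # extended resources are set via limits; the request defaults to the same value
```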
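Finally, a hypothetical sketch of the affinity/toleration pattern described at 12:35-12:36, as it could be applied to GPU generations once per-generation node labels exist. The label key, label value, and taint shown here are illustrative only, not labels currently present in dse-k8s-eqiad:

```yaml
# Hypothetical scheduling constraints; label and taint names are assumptions.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dse.wikimedia.org/gpu-generation  # hypothetical label key
                operator: In
                values: ["mi100"]                       # hypothetical label value
  tolerations:
    - key: dedicated       # sessionstore-style dedicated-node pattern
      operator: Equal
      value: gpu           # hypothetical taint value
      effect: NoSchedule
```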