[09:06:01] btullis: The new DSE worker (1009) looks to be running ok, but is cordoned. Ok to uncordon?
[09:06:21] (1007 is also cordoned, but I dunno why, so not touching that)
[09:56:12] klausman: Yeah, please feel free. I'll look at why 1007 is cordoned.
[09:56:19] ack
[11:53:38] related to ^: we realized that `amd.com/gpus` should appear in the node resources (visible via kubectl get node -oyaml), but didn't. Nothing that a good reboot didn't fix
[11:54:29] we now have at least 2 generations of GPUs in dse-k8s-eqiad. I was wondering if we had thought about custom node labels that apps could leverage to target one GPU model/generation or another
[12:34:56] Note that the resource bit in the job description is still needed, since k8s needs to know that one GPU is "consumed"
[12:35:22] nothing GPU-specific, but we do have some custom node labels that apps leverage for various things. There are the obvious ones that have to do with topology (dc and row/rack), one for disk types (SSD vs HDD), and there is the dedicated, sessionstore-specific one for kask
[12:36:50] apps define their affinity requiring a specific node label (and in the case of sessionstore, a taint toleration too) during scheduling
[12:52:29] re: GPU labels - we use https://github.com/ROCm/k8s-device-plugin which has specific functionality for labels, but at the time we chose to avoid it due to some weird constraint
[12:52:55] we package the binary in a deb, and deploy it to all k8s GPU nodes
[12:53:05] in theory adding the labeller should be enough
[12:53:11] cc: brouberol: --^
[12:53:46] thanks for the details! I'll have a look
[12:53:53] I think the oddness back then was that the labeller needed a daemonset, and we didn't want that. or sth like that
[12:56:23] the device plugin should be a daemonset as well, but we chose to package it in a deb and run it as an OS daemon instead
[12:56:40] IIRC it was also related to what permissions the labeller needed to run
[13:02:34] ah yes, getting consistent device permissions for things running in pods was the tricky bit
[13:07:11] no no, that part was for the device plugin itself, since we needed to work around the 'render' group on the OS
[13:07:22] I think this was the issue https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/templates/labeller.yaml#L60
[13:07:50] but in theory we could add it to the deb easily
[13:08:14] the device plugin needs to run as root to be able to place unix sockets in the kubelet's dirs
[13:08:26] no idea how it works though
[13:08:32] worth opening a task
[13:17:20] https://phabricator.wikimedia.org/T373806
[13:17:42] Feel free to add subscribers/projects etc. For now I only added ML and DSE tags
[13:24:06] nice thanks
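
To illustrate the 11:53 point about GPUs appearing in the node resources: when the device plugin on a node is healthy, the GPU shows up as an extended resource under the node's capacity and allocatable. A trimmed, illustrative sketch of the relevant part of `kubectl get node -oyaml` output, assuming the plugin advertises the resource as `amd.com/gpu` and the node has a single card (the counts here are placeholders):

```yaml
# Trimmed, illustrative node status; resource name and counts are assumptions.
status:
  capacity:
    amd.com/gpu: "1"
    cpu: "48"
    memory: 131443236Ki
  allocatable:
    amd.com/gpu: "1"
    cpu: "47"
    memory: 130340836Ki
```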
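The 12:34 note about the "resource bit in the job description" refers to the workload requesting the extended resource, so the scheduler counts one GPU as consumed on the node. A minimal, hypothetical pod sketch (pod name and image are placeholders, and `amd.com/gpu` is assumed to be the resource name):

```yaml
# Minimal sketch of a pod that consumes one GPU; names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: worker
      image: example.org/rocm-workload:latest  # placeholder image
      resources:
        limits:
          amd.com/gpu: 1  # extended resources are set via limits; the request defaults to the same value
```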
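Finally, a hypothetical sketch of the affinity/toleration pattern described at 12:35-12:36, as it could be applied to GPU generations once per-generation node labels exist. The label key, label value, and taint shown here are illustrative only, not labels currently present in dse-k8s-eqiad:

```yaml
# Hypothetical scheduling constraints; label and taint names are assumptions.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dse.wikimedia.org/gpu-generation  # hypothetical label key
                operator: In
                values: ["mi100"]                       # hypothetical label value
  tolerations:
    - key: dedicated       # sessionstore-style dedicated-node pattern
      operator: Equal
      value: gpu           # hypothetical taint value
      effect: NoSchedule
```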