[05:44:38] good morning! ☀️
[07:12:14] good morning folks, I'm back :D
[07:53:04] isaranto: welcome back! :D
[08:18:43] good morning folks
[08:23:39] Morning!
[08:23:58] I see nothing caught fire while I was away :)
[08:25:27] hi Tobias, welcome back!
[08:26:03] ty <3
[09:28:35] * isaranto early lunch!
[10:24:49] Machine-Learning-Team, Data-Platform-SRE, Prod-Kubernetes, serviceops, and 2 others: Update kserve to v0.15.2* on ML clusters - https://phabricator.wikimedia.org/T380722#11073256 (JMeybohm) Sorry for the drive-by: I've created a script (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/de...
[11:21:34] Hi folks. Do we know if anyone is actively using the two AMD Radeon Pro WX 9100 cards on the Hadoop cluster? (an-worker1100 and an-worker1101 as per: https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU)
[11:23:05] I'm asking because those two hosts are about to be decommissioned and reused, but I need to make a decision on what to do with the GPU cards. Options include...
[11:24:02] 1) Moving them to different Hadoop workers 2) Moving them to the dse-k8s-cluster 3) Allowing them to be decommissioned as EoL
[11:24:40] maybe also 4) put them in hw storage and decom them in [timeframe] once we're sure they're not needed anymore.
[11:24:53] btullis: also, hi Ben, I'm back :)
[11:25:22] klausman: Welcome back! Hope you had a good time.
[11:25:39] It was rainy and cold the whole 4 weeks, so yes, it was fantastic :)
[11:27:17] Sounds lush :-) I've also cross-posted this in #talk-to-data-engineering on Slack. I'll see if I can find any references to recent YARN jobs that might have targeted the GPUs.
[11:29:20] There is also still a pair of these GPUs in a ml-serve host (1001, I believe)
[11:29:46] Correction: only one
[11:29:50] Oh yes, so 5) add them to ml-serve is also another option, I guess.
[11:29:51] But it is 1001
[11:30:53] https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop#GPU_Support
[11:31:04] > We have 2 hadoop worker nodes each one equipped with one AMD GPU, but we have never been able to use them up to now.
[11:32:00] So I suppose we could leave them in Hadoop, ready for the upgrade to 3.0, which is pretty imminent.
[11:32:06] hmm. I vaguely remember Aiko mentioning use of the GPUs on Hadoop, but I may misremember
[11:32:53] FWIW, I think it's likely more useful to move the ML GPUs to Hadoop than vice versa.
[11:33:04] It's where they came from, initially.
[11:37:03] Ack. It's also about whether they would get any use on Hadoop 3, or whether they would be better in dse-k8s. Or whether they have any value, given that the MI100 and MI210s are available, plus new ones coming up.
[11:43:01] Agreed. Knowing how much utilization happens on Hadoop would be good. I am pretty sure the use of the WX GPUs on ml machines is close to zero.
[11:58:01] Yes, I found some work from Aiko here: T276407 - which led me to here: https://github.com/AikoChou/wikimedia-research-2021/tree/main/distributed-inference - Not sure if anything is still ongoing.
[12:03:25] aiko: we've been discussing GPU usage on Hadoop (i.e. if the old WX series GPUs are useful there), do you have any insights for that? (I repeat things since you weren't on IRC when we started discussing)
[12:13:22] hiiii!
[12:13:49] btullis: yes that was an old project for experimenting with GPUs on hadoop. so cool to see you found that :D the project isn't ongoing. I don't think anyone is actively using those GPUs.
[12:17:41] * klausman late lunch
[12:45:25] aiko: Thanks ever so much. Do you think that there is still any value in those old GPUs for future projects? Or are they so outdated as to be effectively worthless to us, now?
[13:00:45] don't know if it has been mentioned before already but debian trixie stable is here!
[13:00:59] the image is already in prod images and in the registry https://docker-registry.wikimedia.org/trixie/tags/
[13:01:48] isaranto: the two new ml-serve nodes with the new GPU (1012/1013) are already installed with Debian (in a basic install, the stack above the OS still needs to be adapted to trixie)
[13:03:06] great, thanks moritzm!
[13:44:18] btullis: I think they are quite old but are still computing resources. Like stat1008 and stat1010 have the same GPUs and they are still in use. Perhaps we could move them to different Hadoop workers to retain the options of using GPUs on Hadoop? or could we move them to other stat machines?
[13:52:42] aiko: Yes, moving them to other Hadoop workers would be fine. Another option would be the dse-k8s cluster, where we already have 2 of this type of GPU. I'd be a bit less keen on moving them to stat servers, to be honest, because these hosts are already quite custom and fiddly to upgrade/manage.
[17:06:02] going afk folks, cu tomorrow!