[03:53:53] good morning
[05:55:06] good morning!
[06:15:59] Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10970712 (gkyziridis) ###ToneCheck Retraining Docker Image Updates During the experimentation on the ToneCheck retraining pipeline in Airflow I faced some obstacles which are stated bel...
[06:47:27] Good morning.
[06:58:16] Good day!
[07:08:02] Machine-Learning-Team: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533 (elukey) NEW
[07:08:08] isaranto: o/ created --^ to summarize what we discussed yesterday
[07:08:41] awesome, thank you!
[07:22:46] isaranto: I am also going to open another task; I think we'd need to upgrade our k8s-gpu-plugin to include stuff like https://github.com/ROCm/k8s-device-plugin/pull/117
[07:23:01] in theory it should be a matter of upgrading the Debian package
[07:23:13] bonus point would be to also get the node-labeller
[07:23:23] to target specific GPUs
[07:25:51] ack!
[07:27:44] last one - I am working on Pyrra configs etc.; once I've finished I'll upload the ToneCheck config so we'll start checking the dashboards
[07:28:51] elukey: question: I'm wondering about the node-labeller... isn't it something we can already do with the current setup? Like assign node labels and then define a nodeSelector? Or is this the missing piece of the puzzle
[07:28:52] ?
[07:29:09] just curious!
[07:31:48] isaranto: I think that the node-labeller gives you an extra label about the GPU and how big it is, so you can target the one that you need
[07:31:57] rather than "gimme just a gpu"
[07:32:37] so it could be handy, for example, when a pod needs a slice of an MI300 vs maybe something less powerful
[07:33:04] and if we slice the MI300 into different "pieces", we may have a way to target them separately
[07:33:13] iiuc it automatically applies a label then?
cause an alternative would be that we provide manual labels like "mi210", "mi300" and use these in the deployments
[07:33:13] like "this pod needs a 64G slice"
[07:33:31] aa right, I hadn't thought about that. nice one!
[07:33:57] yeah, but how to do it manually is not super clear to me, because in theory it is the k8s plugin that exposes the devices, announcing their capabilities
[07:34:19] there could be a way via deployment charts, but I think it would just target a host, not its GPUs
[07:34:39] we didn't add the labeller at the time since it was IIRC a bit complex and/or required some horror config
[07:34:42] :D
[07:38:38] clear, thanks!
[08:27:56] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10971402 (BWojtowicz-WMF) I've made the Python script for model-upload work with just `urllib3` and `boto3` as external dependencies, both of which are available as...
[09:44:58] Machine-Learning-Team: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600 (elukey) NEW
[09:45:11] aaand created --^
[12:20:56] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10972305 (OKarakaya-WMF) This time I've tried 44 languages in a single model. I see some languages drop significantly, although th...
[12:45:15] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10972381 (elukey) Hello! >>! In T394301#10971402, @BWojtowicz-WMF wrote: > @elukey What would be the next steps to put it inside puppet repository? Should I create...
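[Editor's note] The node-labeller discussion above (targeting a specific GPU model or a "64G slice" via a nodeSelector instead of "gimme just a gpu") could look roughly like the pod spec below. This is a minimal sketch: the label keys and values are assumptions, since the actual labels depend on the AMD node-labeller version deployed; only the `amd.com/gpu` resource name matches what the device plugin is known to expose.

```yaml
# Hypothetical pod spec: request one AMD GPU and pin the pod to nodes
# whose node-labeller advertised a particular GPU class.
# Label keys/values below are illustrative assumptions, not verified
# against the plugin version on the ml-serve clusters.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer-example
spec:
  nodeSelector:
    amd.com/gpu.family: "AI"    # assumed label applied by the node-labeller
    amd.com/gpu.vram: "64G"     # assumed; the "64G slice" case from the chat
  containers:
    - name: worker
      image: example.org/rocm-app:latest   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1    # resource exposed by the AMD k8s device plugin
```

Without the node-labeller, the manual alternative mentioned in the chat would be hand-applied node labels (e.g. `kubectl label node <node> gpu-model=mi300`) and a matching nodeSelector, which targets the host rather than individual GPUs or slices.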
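[Editor's note] The Pyrra config mentioned at [07:27:44] would be a `ServiceLevelObjective` custom resource. A minimal sketch follows; the SLO name, namespace, target, window, and metric selectors are all illustrative assumptions, not the actual ToneCheck config.

```yaml
# Sketch of a Pyrra SLO for ToneCheck; every name, target, and metric
# label below is an assumption for illustration only.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: tonecheck-availability      # placeholder name
  namespace: example-ns             # placeholder namespace
spec:
  target: "99"                      # 99% of requests succeed
  window: 4w                        # rolling four-week window
  indicator:
    ratio:
      errors:
        metric: request_total{service="tonecheck", code=~"5.."}   # assumed metric
      total:
        metric: request_total{service="tonecheck"}                # assumed metric
```

Pyrra generates the Prometheus recording and alerting rules (multi-window burn rates) from this single object, which is what would feed the dashboards mentioned in the chat.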