[03:53:53] good morning
[05:55:06] good morning!
[06:15:59] Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10970712 (gkyziridis) ###ToneCheck Retraining Docker Image Updates During the experimentation on the ToneCheck retraining pipeline in Airflow I faced some obstacles which are stated bel...
[06:47:27] Good morning.
[06:58:16] Good day!
[07:08:02] Machine-Learning-Team: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters - https://phabricator.wikimedia.org/T398533 (elukey) NEW
[07:08:08] isaranto: o/ created --^ to summarize what we discussed yesterday
[07:08:41] awesome, thank you!
[07:22:46] isaranto: I am also going to open another task; I think we'd need to upgrade our k8s-gpu-plugin to include stuff like https://github.com/ROCm/k8s-device-plugin/pull/117
[07:23:01] in theory it should be a matter of upgrading the Debian package
[07:23:13] bonus point would be to also get the node-labeller
[07:23:23] to target specific GPUs
[07:25:51] ack!
[07:27:44] last one - I am working on Pyrra configs etc.; once I've finished I'll upload the ToneCheck config so we'll start checking the dashboards
[07:28:51] elukey: question: I'm wondering about the node-labeller... isn't it something we can already do with the current setup? Like assign node labels and then define a nodeSelector? Or is this the missing piece of the puzzle
[07:28:52] ?
[07:29:09] just curious!
[07:31:48] isaranto: I think that the node-labeller gives you an extra label about the GPU and how big it is, so you can target the one that you need
[07:31:57] rather than "gimme just a gpu"
[07:32:37] so it could be handy, for example, when a pod needs a slice of an MI300 vs maybe something less powerful
[07:33:04] and if we slice the MI300 into different "pieces", we may have a way to target them separately
[07:33:13] iiuc it automatically applies a label then?
cause an alternative would be that we provide manual labels like "mi210", "mi300" and use these in the deployments
[07:33:13] like "this pod needs a 64G slice"
[07:33:31] aa right, I hadn't thought about that. nice one!
[07:33:57] yeah, but how to do it manually is not super clear to me, because in theory it is the k8s plugin that exposes the devices, announcing their capabilities
[07:34:19] there could be a way via deployment charts, but I think it would just target a host, not its GPUs
[07:34:39] we didn't add the labeller at the time since it was IIRC a bit complex and/or required some horror config
[07:34:42] :D
[07:38:38] clear, thanks!
[08:27:56] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10971402 (BWojtowicz-WMF) I've made the Python script for model-upload work with just `urllib3` and `boto3` as external dependencies, both of which are available as...
[09:44:58] Machine-Learning-Team: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600 (elukey) NEW
[09:45:11] aaand created --^
[12:20:56] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10972305 (OKarakaya-WMF) This time I've tried 44 languages in a single model. I see some languages drop significantly, although th...
[12:45:15] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#10972381 (elukey) Hello! >>! In T394301#10971402, @BWojtowicz-WMF wrote: > @elukey What would be the next steps to put it inside puppet repository? Should I create...
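[Editor's note] The node-labeller discussion above (targeting a specific GPU model or a "64G slice" via a nodeSelector instead of "gimme just a gpu") could look roughly like the pod spec below. This is a minimal sketch: the label keys and values are assumptions, since the actual labels depend on the AMD node-labeller version deployed; only the `amd.com/gpu` resource name matches what the device plugin is known to expose.

```yaml
# Hypothetical pod spec: request one AMD GPU and pin the pod to nodes
# whose node-labeller advertised a particular GPU class.
# Label keys/values below are illustrative assumptions, not verified
# against the plugin version on the ml-serve clusters.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer-example
spec:
  nodeSelector:
    amd.com/gpu.family: "AI"    # assumed label applied by the node-labeller
    amd.com/gpu.vram: "64G"     # assumed; the "64G slice" case from the chat
  containers:
    - name: worker
      image: example.org/rocm-app:latest   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1    # resource exposed by the AMD k8s device plugin
```

Without the node-labeller, the manual alternative mentioned in the chat would be hand-applied node labels (e.g. `kubectl label node <node> gpu-model=mi300`) and a matching nodeSelector, which targets the host rather than individual GPUs or slices.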
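[Editor's note] The Pyrra config mentioned at [07:27:44] would be a `ServiceLevelObjective` custom resource. A minimal sketch follows; the SLO name, namespace, target, window, and metric selectors are all illustrative assumptions, not the actual ToneCheck config.

```yaml
# Sketch of a Pyrra SLO for ToneCheck; every name, target, and metric
# label below is an assumption for illustration only.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: tonecheck-availability      # placeholder name
  namespace: example-ns             # placeholder namespace
spec:
  target: "99"                      # 99% of requests succeed
  window: 4w                        # rolling four-week window
  indicator:
    ratio:
      errors:
        metric: request_total{service="tonecheck", code=~"5.."}   # assumed metric
      total:
        metric: request_total{service="tonecheck"}                # assumed metric
```

Pyrra generates the Prometheus recording and alerting rules (multi-window burn rates) from this single object, which is what would feed the dashboards mentioned in the chat.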