[06:53:38] Good morning! [06:58:23] Good morning [07:01:40] Morning! [08:04:57] morning folks! [08:05:04] https://www.digitalocean.com/blog/now-available-amd-instinct-mi300x-gpus is really nice [08:11:58] \o so droplet == VM? [08:16:08] yep [08:17:11] "Customization: AMD Instinct™ MI300X is available both as single and eight GPU configurations and in bare metal configurations" [08:22:28] 06Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10912108 (10OKarakaya-WMF) Just want to share a thought about the training pipeline in mind, but nvm if I'm missing some information: We get the [training data](https://gitlab.wikimed... [08:50:10] I was reading https://rocm.blogs.amd.com/software-tools-optimization/compute-memory-modes/README.html and TIL that the MI300X can be partitioned [08:50:38] amd-smi seems able to do it, and they say that "users will see more GPUs" [08:51:02] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10MediaWiki-Recent-changes, 06Moderator-Tools-Team, 13Patch-For-Review: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668#10912281 (10isarantopoulos) There was an issue during the deployment for si... [08:51:16] not sure what happens at the OS level, namely whether the drivers are able to show multiple devices in /dev [08:51:31] do you know more? Is it something that you have already tested? [08:52:16] yes I think so from https://rocm.blogs.amd.com/software-tools-optimization/compute-memory-modes/README.html#deployment-through-docker [08:52:37] wow really nice [08:54:17] this may open the door to multiple pods safely sharing a huge GPU in the right way [08:58:24] this is our bet! we haven't tested this (unless klausman has and I don't remember) [08:59:16] I've only read up on it from the page Luca mentioned, but haven't tried it.
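[Editor's note] The OS-level question above (whether GPU partitions show up as extra devices in /dev) can be sanity-checked by counting DRM render nodes under /dev/dri, since on ROCm hosts each visible GPU is normally exposed as its own `renderD*` node. A minimal sketch follows; whether each MI300X partition really gets its own render node is exactly the thing to verify on a test host:

```python
import glob
import os


def count_render_nodes(dri_dir="/dev/dri"):
    """Count DRM render nodes (renderD*) exposed by the kernel driver.

    On a ROCm host each visible GPU typically appears as one
    /dev/dri/renderD* node. If MI300X partitioning (e.g. CPX mode via
    amd-smi) surfaces partitions as separate devices, this count should
    rise after repartitioning -- an assumption to verify, not a fact.
    """
    return len(glob.glob(os.path.join(dri_dir, "renderD*")))


# Usage: compare count_render_nodes() before and after changing the
# partition mode to see what the driver actually exposes.
```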
[08:59:26] the reason why this is our bet is because it eliminates the need for a variety of VRAM sizes in GPUs and also allows for better utilization [09:01:40] IIRC we have a testing host somewhere right? If so, should we test rocm-smi to make sure that it does what it is supposed to? [09:02:06] this bit wasn't in the document that I've read, it is a game changer [09:02:23] it makes the whole set of pros/cons way more digestible and ok [09:05:04] klausman: --^ [09:05:27] yeah, it's on my todo-pile [09:06:40] if you want I can help, really interested.. not sure if I can get an account on the test node though [09:07:17] We gave that back a while back, so even I don't have access anymore [09:07:28] So we'll have to wait until the new machines arrive [09:07:56] okok [11:17:15] 06Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10912840 (10achou) Here are the steps I built a [[ https://gitlab.wikimedia.org/aikochou/airflow-dags/-/blob/tone-check/ml/dags/tone_check_dag.py?ref_type=heads | DAG ]] to run a Spark... [11:33:08] georgekyz, kevinbazira: ---^ I added the steps for building and using artifacts in a DAG. let me know if you have any questions or need clarification [11:33:32] thanks for sharing Aiko [11:34:38] I've already built an artifact: https://gitlab.wikimedia.org/kevinbazira/ml-pipelines/-/packages [11:34:53] now working on running it in the DAG [11:46:21] I also built an artifact yesterday, but now I am trying to use the docker option. [11:48:05] https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/tree/retraining_tone_check/tone_check_docker?ref_type=heads [11:48:53] thnx for sharing folks [11:57:21] shall we focus solely on container based workflows since we're going to be running on k8s? 
[11:58:59] I mean since we're starting now I think that running on k8s is the future-proof option, and the only one that would also allow us to utilize one of the newer GPUs for training [11:59:56] plz correct me if I'm wrong or I'm missing sth [12:01:05] I couldn't find a way to run the retraining job from the artifact using the Python operator. Thus I am investigating the docker way, which I think gives us more freedom and independence without generating any artifacts. The catch with this approach is that we need a way to easily push our docker images to a registry in order to run them via the `WMFKubernetesPodOperator`. [12:08:12] ack! if there is something that we can't figure out and it isn't documented on Wikitech then we should ask DPE directly. Perhaps they have either thought of this, planned it already, or will have a suggestion [12:10:37] My thought is to use SparkSubmitOperator for data generation and KubernetesPodOperator for model training and evaluation. They will be in the same pipeline/DAG. For data generation, you need spark. [12:12:29] yes data generation can be done exactly as you said. [12:12:41] I am now trying the KubernetesPodOperator. [12:12:57] Agreed! running spark on the dse cluster has not actually been explored. Let's document this then also [12:13:31] sry I just said something that is contradicting what you said aiko [12:14:08] Ben in this thread https://wikimedia.slack.com/archives/CSV483812/p1749717277061669 said "You can also run Spark jobs on the dse-k8s cluster, if you would like. You would be trailblazing if you were to do this" [12:14:09] I am in favor of using the SparkSubmitOperator, that would mean that we run spark jobs on the dse-k8s cluster, right? [12:15:59] with SparkSubmitOperator, we run spark jobs on hadoop [12:17:27] clear, thanks! [12:17:39] our airflow instance will submit the job to run on hadoop [12:18:23] then what I said is not contradicting.
So we want to run the spark jobs on hadoop, which is the standard way, and use dse for training [12:19:17] yes! that's what I meant [12:19:47] 🙌 [12:34:08] georgekyz: regarding building and pushing docker images to the registry on gitlab, check out kokkuri https://gitlab.wikimedia.org/repos/releng/kokkuri a gitlab CI tool that the release eng team builds [12:34:25] yeap [12:34:28] thnx [12:34:33] and it uses blubber [12:35:33] https://phabricator.wikimedia.org/T396495#10912108 I was also thinking about how to get the dataset into the training step in kubernetes, as the dataset will be generated with spark. [12:37:50] ozge_: yeah we should figure out if the k8s container can access hdfs or how to configure that. The generated dataset will be stored in hdfs [13:17:35] Should we set up a new gitlab-ci pipeline for kokkuri? [13:17:58] we could also see if we can use ceph for the generated dataset or at least ask if this is already feasible https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Ceph [13:18:51] sorry I'm just throwing in wild ideas but it is important for us to explore future-proof implementations and align with DPE's vision [13:19:25] no worries, everything is really valuable at this point. [13:19:55] <3 [13:20:25] although that would only work as a dataset and not an actual table (at least for now) [13:24:52] isaranto: ahh good point! reminds me Fabian mentioned this to me. one of his suggestions is to use ceph db, so the training job can access the data and also save the eval results [13:27:49] so one good question that we need to answer: "what is the suggested way of passing large datasets between tasks? hdfs, ceph, or something else?"
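[Editor's note] The split agreed above (SparkSubmitOperator submitting data generation to hadoop, a pod on dse-k8s for training) could look roughly like the DAG sketch below. This is not the actual WMF setup: the import paths are the stock Airflow provider ones, and the script, connection, and image names are placeholders (the real pipeline would presumably use the `WMFKubernetesPodOperator` wrapper mentioned above):

```python
# Sketch only: module paths, conn ids, script/image names are placeholders,
# not the actual WMF Airflow configuration.
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="tone_check_retraining", schedule=None) as dag:
    # Data generation: spark-submit through the Airflow connection,
    # so the job runs on hadoop (the standard way).
    generate_dataset = SparkSubmitOperator(
        task_id="generate_dataset",
        application="ml/spark/generate_tone_dataset.py",  # hypothetical script
        conn_id="spark_default",  # hypothetical connection id
    )

    # Training: a container on the dse-k8s cluster, e.g. an image built
    # from the ml-pipelines repo with kokkuri.
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="tone-check-training",
        image="example-registry/repos/machine-learning/ml-pipelines:some-tag",  # placeholder
        cmds=["python", "train.py"],  # hypothetical entrypoint
    )

    generate_dataset >> train_model
```

How the training task reads the dataset that the spark job wrote (hdfs access from the pod, or ceph) is exactly the open question raised above.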
[13:28:17] georgekyz: yep let's try it [13:29:15] if both parts (ETL, model training) were run in the same k8s cluster we could even use a k8s volume https://kubernetes.io/docs/concepts/storage/persistent-volumes/ [13:29:20] but for now this is out of the question [13:29:52] I see this: https://gitlab.wikimedia.org/cdobbins/airflow-dags/-/blob/airflow_version_2_9_2/.gitlab-ci.yml?ref_type=tags where they use the kokkuri setup in the dags repo. Should we set it up also in our ml-pipelines repo? [13:36:36] hey folks! Re: SLO for tone check, are we planning to have it done by the end of the quarter? There is no rush, I am just asking since I have a related hypothesis in Asana and I need to do an update :) [13:42:14] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10913161 (10OKarakaya-WMF) working on the decision brief doc [here](https://docs.google.com/document/d/1pL1mCJ-lAf6zL1ffrctYT0w5L_w3kby... [14:14:25] I tried to create a new gitlab-ci pipeline to run kokkuri. I set the target to be my branch for now, you can check it here: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/ci/editor?branch_name=retraining_tone_check [14:32:25] elukey: if you have time we could do it even sooner or at least set up the dashboard in grafana. I started addressing your comments in the task https://phabricator.wikimedia.org/T390706#10894563 but we haven't finalized the SLIs and their SLOs [14:32:49] isaranto: okok I'll review the doc on Monday! [15:15:54] georgekyz: we've never pushed images from gitlab before, only from gerrit. If successful, we should see an image called "repos/machine-learning/ml-pipelines/..." in the docker registry, right? [15:18:08] aiko: I am not entirely sure about that...
the kokkuri pipeline produces an `.env` artifact which states an image tag and two refs: [15:18:08] ``` [15:18:08] BUILD_AND_PUBLISH_IMAGE_IMAGE_TAG=job-536054 [15:18:08] BUILD_AND_PUBLISH_IMAGE_IMAGE_INTERNAL_REF=registry.cloud.releng.team/repos/machine-learning/ml-pipelines:job-536054 [15:18:08] BUILD_AND_PUBLISH_IMAGE_IMAGE_REF=registry.cloud.releng.team/repos/machine-learning/ml-pipelines:job-536054 [15:18:09] ``` [15:18:37] aiko: I am not sure [15:19:32] it seems that it pushes the images to something like: `registry.cloud.releng.team/repos/machine-learning/ml-pipelines:job-536054` [15:24:41] maybe we need to set the KOKKURI_REGISTRY_PUBLIC variable? like https://gitlab.wikimedia.org/cdobbins/airflow-dags/-/blob/airflow_version_2_9_2/.gitlab-ci.yml?ref_type=tags#L35 [15:27:07] I am still not sure if we need to set up the kokkuri pipeline in our ml-pipelines repo or on the DAGs side... For the time being I built it in our repo. I will try to play around with multiple combinations.... [15:28:17] Another question is: how does the DAG code get updated in airflow? Should we always rerun the `./run_dev_instance.sh`? I merged changes into the main branch in my fork, and the instance of airflow is still running old code. [15:29:22] I think it should be in ml-pipelines, not the DAG side, because we want to pack the job logic in a docker image [15:31:31] I agree [15:31:59] I added the variables you shared, I am running the pipeline again [15:33:22] you mean updating DAG code in the dev instance, right? it should automatically sync, it may just need a bit of time [15:36:14] yeah probably [15:43:51] I think that there is some pre-work needed in order to use these Variables: [15:43:55] https://www.irccloud.com/pastebin/XvOuJUJr/ [15:44:41] Anyway, I will look at it again on Monday. Enjoy your weekend all!
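[Editor's note] For reference, the kokkuri wiring in a `.gitlab-ci.yml` generally follows an include-and-extend pattern like the sketch below. The template and variable names here are assumptions to double-check against the kokkuri repo and the cdobbins example linked above (note the job name matches the `BUILD_AND_PUBLISH_IMAGE_` prefix in the `.env` artifact):

```yaml
# Sketch of a kokkuri-based publish job; verify template/variable names
# against https://gitlab.wikimedia.org/repos/releng/kokkuri
include:
  - project: 'repos/releng/kokkuri'
    file: 'includes/images.yaml'

build-and-publish-image:
  extends: .kokkuri:build-and-publish-image
  variables:
    # assumed: selects the Blubber variant to build
    BUILD_VARIANT: production
    # assumed: tag pattern matching the job-536054 style seen above
    PUBLISH_IMAGE_TAG: 'job-${CI_JOB_ID}'
```

Which registry the job publishes to (the releng cloud registry vs the production docker registry) looks like it is controlled by variables such as `KOKKURI_REGISTRY_PUBLIC`, as discussed above.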
[15:48:41] we can find an image that used kokkuri to publish to the docker registry and go to their repo to see how they set up their .gitlab-ci.yml [15:48:51] anyway have a nice weekend :) [16:38:24] have a nice weekend all!
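[Editor's note] Since the kokkuri `.env` artifact discussed above is plain KEY=VALUE lines, a downstream step (e.g. one that templates the image for the pod operator) could read the published image ref with a few lines of Python. The variable names follow the artifact pasted earlier; the helper itself is just an illustration:

```python
def parse_dotenv(text):
    """Parse a simple KEY=VALUE dotenv artifact (no quoting or escaping),
    like the one the kokkuri build-and-publish job emits."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def published_image_ref(text):
    """Return the published image reference recorded in the artifact."""
    return parse_dotenv(text)["BUILD_AND_PUBLISH_IMAGE_IMAGE_REF"]
```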