[06:22:12] mooorning!
[06:39:57] Good morning
[06:56:02] good morning!
[07:24:22] good morning folks
[09:03:55] bartosz: o/
[09:04:24] if you want to check the puppet CI job locally, you can run ./utils/run_ci_locally.sh before sending the patch
[09:04:33] (within the puppet repo I mean)
[09:15:14] o/ elukey: ohh that's super useful to know, thank you!
[10:39:52] georgekyz: I'd recommend not to implement train_test_split. If the issue with the image size is that big of an issue that we can even install a small package like sklearn we should work to resolve it
[10:40:59] isaranto: I already implemented it and it really slim: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/blob/main/training/common_utils/helper_functions.py?ref_type=heads
[10:41:21] regardless of the image size, would it be worth to use transformers datasets for this operation?
[10:42:01] I thought it is a pity to install the whole sk-learn library using just an easy/small function for train_test_split
[10:44:16] isaranto: Since we will use the data_generation pipeline that Aiko built, I am not sure if we really need the transformers datasets. I am using this function for handling data: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/blob/main/training/tone_check/retrain_job/retrain.py?ref_type=heads#L73
[10:44:48] I don't think it is a pity. using software that is battle tested and covers many cases is highly recommended
[10:45:42] otherwise we need more unit tests to cover edge cases etc
[10:47:57] the current unit_tests that I wrote are kinda shallow because I did not focus on that one. We will need extra unit-tests for sure.
[10:48:45] I am gonna try now to include sk-learn and use its `train_test_split` function
[10:50:22] I appreciate the work done here but let's use standard libraries for common operations. It will allow us to focus on the problems that we need to solve and is less error prone.
[10:50:25] thanks <3
[10:53:50] no problem at all!
you are right.
[11:20:48] it cannot push the image :(
[11:21:06] https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/jobs/555248
[11:23:59] have we investigated why? is it due to the size?
[11:24:38] since we don't use GPUs at the moment in airflow, we could just use the cpu version of torch which would result in a much smaller image
[11:26:21] isaranto: the only difference in this version is that I included sklearn
[11:26:22] anyway I'm just throwing some suggestions at the problem y'all folks are working on this so you will better know what to do
[11:27:33] indeed is a good idea to use a slimmer base image than: https://docker-registry.wikimedia.org/amd-pytorch23/tags/ (this is the current one that I am using)
[11:30:07] there are no clear logs that mention that it fails due to big size. Although, when I create a slimmer image it pushes it smoothly.
[11:32:18] Another thought (maybe silly), is there any possibility that it fails due to time instead of size? In that case we could just set up higher time limit but I am not sure if this is the case
[11:36:13] isaranto: I just refresh the job and now it pushed it... so it seems to be kinda flaky... I had experienced that inconsistency in the previous images as well...
[11:36:42] now it pushed successfully (including sk-learn as well): https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/jobs/555271
[11:37:45] sounds good for now.. I think that using the cpu version and a much much smaller image would make sense for now since we're not using gpus which would mean that we postpone this problem for when we use gpus on the dse-k8s cluster
[11:40:02] sounds good
[12:01:46] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#10974951 (BTullis) >>! In T394778#10967994, @elukey wrote: > Thanks for the summary! I see two separate problems being listed: > > 1) Have a separate Docker Registry to be able to push...
[12:27:02] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#10975045 (elukey) >>! In T394778#10974951, @BTullis wrote: >>>! In T394778#10967994, @elukey wrote: >> Thanks for the summary! I see two separate problems being listed: >> >> 1) Have a...
[13:15:48] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10975134 (OKarakaya-WMF) I've checked serving: - We can create a new step to export hdfs tables (and the model) to pkl and then...
[14:04:48] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#10975318 (BTullis) >>! In T394778#10975045, @elukey wrote: > ...Docker Registry is an essential dependency for the K8s clusters and running it on top of them seems to be risky. We do ha...