[06:29:56] Good morning :sunn
[06:30:06] ☀️
[06:57:50] hello!
[06:58:01] good morniiiing
[09:42:10] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#10962279 (santhosh) Update: New version of OpenVINO and OpenVINO model server was released a few days ago. I updated my production bookworm based docker image patch: https...
[10:36:15] Machine-Learning-Team: AI/ML Infrastructure Request: Expand ORES-enabled RevertRisk filters deployment to all wikis, excluding Commons and Wikidata - https://phabricator.wikimedia.org/T398291 (kostajh) NEW
[12:08:40] Hey folks, I have a question. Does the placeholder "lives" in the blubber.yaml correspond to a volume? So when we set:
[12:08:40] ```lives:
[12:08:40] in: /srv/edit_check```
[12:08:40] does that mean that there is a volume `/srv/edit_check` in this case?
[12:10:03] I am asking this because I am having some issues when passing the `model_path` to the retraining container. It seems that it doesn't recognize it as a local path and it tries to download the model from huggingface
[12:11:37] i think it is the WORKDIR equivalent in docker
[12:12:16] so whatever operation you specify in your blubber file (copy etc) will go under that dir
[12:12:47] nothing to do with volumes as far as I know
[12:14:06] are you using the local files only argument when you load the model?
[12:14:09] similar to this https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/src/models/llm/model.py#49
[12:18:05] georgekyz: --^
[12:18:09] yep, I set the flag to true
[12:18:43] but I am receiving the error:
[12:18:48] https://www.irccloud.com/pastebin/NevCEaaA/
[12:20:00] it seems that it cannot find the path
[12:20:58] we managed to set up the s3 client and it successfully downloads the files from the bucket
[12:24:53] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123#10962984 (Seddon)
[12:42:01] Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10963193 (Seddon)
[13:22:03] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize tone check model - https://phabricator.wikimedia.org/T391940#10963454 (achou) Update: * A [[ https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/main/tone-check/data_generation_templates.ipynb?ref_type=heads | clean ver...
[13:24:53] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize tone check model - https://phabricator.wikimedia.org/T391940#10963461 (achou)
[13:48:47] Machine-Learning-Team, Goal: FY2024-25 Q4 Goal: Productionize tone check model - https://phabricator.wikimedia.org/T391940#10963573 (Aklapper)
[14:17:12] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10963727 (OKarakaya-WMF) # Proposed Next Steps Focusing on following goals: - Scale Add-a-Link model across more languages FY202...
[14:20:34] Machine-Learning-Team, Goal: FY2024-25 Q4 Goal: Productionize tone check model - https://phabricator.wikimedia.org/T391940#10963740 (isarantopoulos) Spillovers: - Publishing the SLO - evaluate model performance with using page_title in the input - Continue the work on the airflow DAG - Crea...
[14:25:54] Machine-Learning-Team, Goal: Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs - https://phabricator.wikimedia.org/T391941#10963768 (isarantopoulos) Spillover: - the slimmed docker image is ready, we need to have it in the docker registry. The new...
[14:29:57] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10963781 (isarantopoulos) Remaining work: - Simplewiki and trwiki deployments a...
[15:11:04] georgekyz: can you share the code + blubber file you are using? it is hard to understand from just the HF error
[15:13:37] blubber: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/tree/main/.pipeline/training/tone_check/retrain_job?ref_type=heads
[15:13:37] retrain job: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/blob/main/training/tone_check/retrain_job/retrain.py?ref_type=heads
[15:13:37] airflow-ml-retrain-dag: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/ml_retrain/ml/dags/ml_retraining_dag.py?ref_type=heads
[15:15:25] I am talking with folks from DE and they say that each of the operators runs in a different pod and there is no common space/volume among them. So the logic that I am using is not gonna work. They suggest enabling PVCs, which would use a CephFS- or RBD-based file system as a temporary store for the model.
[15:16:51] The same issue will also occur when loading the data inside the container. Right now loading the data is the next step after loading the model, which is why it is not throwing an error yet.
[15:18:14] The full error that I am currently receiving is:
[15:18:14] ```
[15:18:14] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] INFO:root:List element of: /tmp/ml_training_pipeline/model
[15:18:14] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] INFO:root:Is dir valid: False
[15:18:14] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] Traceback (most recent call last):
[15:18:15] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] File "/srv/edit_check/training/tone_check/retrain_job/retrain.py", line 161, in <module>
[15:18:15] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] main()
[15:18:16] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] File "/srv/edit_check/training/tone_check/retrain_job/retrain.py", line 155, in main
[15:18:16] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] handler.load_model()
[15:18:17] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] File "/srv/edit_check/training/tone_check/retrain_job/retrain.py", line 67, in load_model
[15:18:17] [2025-07-01, 14:51:41 UTC] {pod_manager.py:520} INFO - [base] logging.info(f"{os.listdir(self.model_path)}")
[15:18:17] [2025-07-01, 14:51:42 UTC] {pod_manager.py:539} INFO - [base] FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ml_training_pipeline/model'
[15:18:51] Based on the logs the container started correctly and initialized the model handler, but it fails when it checks the input path.
[15:28:30] ack. that makes sense. Is there a need to have base model download and training as separate tasks? If we add them in the same task we'd solve this problem.
[15:29:06] that said, ofc a pvc would be useful for passing the model (or data) around between training, evaluation tasks etc
[15:41:31] In that case we need to install boto3 inside the retraining container and have the corresponding code which downloads the files from the bucket. That should be easier to be honest, but I am not sure about the access and the size of the image if we install boto3 in the container.
[15:54:37] There is a Slack thread talking about it, I will DM it to you
[15:58:40] ack
[18:11:31] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Patch-For-Review: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668#10964955 (isarantopoulos) I have replicated locally the above behavior an...
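Note: the "download and train in the same task" idea from the 15:28 and 15:41 messages could look roughly like the sketch below. It is a minimal illustration, not the code in retrain.py. The helper names `download_prefix` and `ensure_model_dir`, and the bucket and prefix in the usage comment, are hypothetical. The client is assumed to be a boto3 S3 client, and only its `get_paginator("list_objects_v2")` and `download_file` calls are used.

```python
from __future__ import annotations

import os
from pathlib import Path


def download_prefix(s3_client, bucket: str, prefix: str, dest_dir: str) -> list[str]:
    """Download every object under `prefix` into `dest_dir` and return the local paths.

    `s3_client` is assumed to behave like a boto3 S3 client.
    """
    dest = Path(dest_dir)
    downloaded: list[str] = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip "directory" placeholder keys.
                continue
            # Keep the layout below the prefix; fall back to the file name
            # when the prefix is the full key of a single object.
            relative = key[len(prefix):].lstrip("/") or Path(key).name
            target = dest / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            s3_client.download_file(bucket, key, str(target))
            downloaded.append(str(target))
    return downloaded


def ensure_model_dir(model_path: str) -> str:
    """Fail early with a clear message if the model directory is missing or empty."""
    if not os.path.isdir(model_path) or not os.listdir(model_path):
        raise FileNotFoundError(f"Model directory missing or empty: {model_path}")
    return model_path


# Hypothetical usage inside the single retraining task:
#   client = boto3.client("s3")
#   download_prefix(client, "example-bucket", "tone_check/model", "/tmp/ml_training_pipeline/model")
#   model_path = ensure_model_dir("/tmp/ml_training_pipeline/model")
#   # then load with local_files_only=True so nothing is fetched from the Hugging Face Hub
```

Because download and load happen in the same pod, the files are on that pod's local disk when the model handler reads them, so no shared volume is needed for this step. A PVC would still be the way to pass artifacts between separate tasks.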