[06:01:28] FIRING: [3x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[06:06:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[08:52:53] o/
[08:53:02] I am rolling out the gpu node labeller on all nodes
[08:53:21] aiko: o/ IIUC Kevin's problem was lack of GPU resources, or something different?
[09:04:30] o/ sweet, thank you Luca! <3
[09:30:48] deployment done!
[09:31:00] Machine-Learning-Team, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#11270906 (elukey) Deployed to prod! ` root@deploy2002:~# kubectl get nodes --show-labels | grep vram dse-k8s-worker1001.eqiad.wmnet Ready ...
[09:31:53] so on dse-k8s-eqiad we can now use labels to target specific GPUs, I see that we have 16G and 64G ones. kevinbazira o/ was it something that you needed?
[09:37:21] elukey: thank you for working on this. targeting specific GPUs is something we need and had created this work around that enabled us to filter for on MI210: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/kubernetes.py?ref_type=heads#L201
[09:37:22] yes, the issue in T406958 is caused by lack of GPU resources
[09:38:47] kevinbazira: ah yes now you should be able to just use an extra label
[09:39:06] probably "amd.com/gpu.vram=64G" is enough
[09:39:44] there are other details like GPU's compute units etc.. but I don't think it is really needed atm
[09:40:06] lemme know if you are able to test it and if it works (no rush, anytime)
[09:41:25] super cool. I will definitely test this and let you know how it goes.
[09:42:26] thanksss
[10:06:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[11:50:26] Machine-Learning-Team, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Patch-For-Review: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv - https://phabricator.wikimedia.org/T406958#11271535 (brouberol) Open→In progress
[12:08:50] Machine-Learning-Team, Essential-Work: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration - https://phabricator.wikimedia.org/T407212 (kevinbazira) NEW
[12:29:58] Machine-Learning-Team, Essential-Work: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration - https://phabricator.wikimedia.org/T407212#11271755 (kevinbazira) I have run the `tone_check_training_dag` in staging, and the following tasks succeeded: `generate_training_data`, `split_...
[12:53:11] Machine-Learning-Team, Add-Link-Structured-Task, Growth-Team (FY2025-26 Q2 Sprint 1): Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11271898 (Trizek-WMF)
[13:07:43] Machine-Learning-Team, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#11272001 (brouberol) @elukey Is there anything I need to do to get this running over in dse-k8s-eqiad?
[13:12:47] Machine-Learning-Team, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#11272051 (elukey) @brouberol already deployed on DSE! :)
[13:13:30] Machine-Learning-Team, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#11272057 (brouberol) Woohoo indeed! ` brouberol@deploy2002:/srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-wikidata$ sudo -i root@deploy2002:~#...
[13:40:08] Machine-Learning-Team, Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11272325 (BWojtowicz-WMF) Open→Resolved
[14:06:43] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[14:06:50] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model - https://phabricator.wikimedia.org/T398970#11272410 (kevinbazira) * Our DAGs were granted permission by DPE SRE to spin up pods in the airflow-ml instance (T406302#11250084) * Model training task fails because...
[15:12:03] Machine-Learning-Team, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Patch-For-Review: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv - https://phabricator.wikimedia.org/T406958#11272726 (brouberol) a:brouberol
[15:44:26] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11272915 (FNavas-foundation)
[17:21:28] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11273617 (FNavas-foundation)
[17:37:13] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11273750 (RobH) >>! In T405647#11250698, @RobH wrote: > @klausman, > > Can you provide feedback on when we can migrate these hosts from one network port to th...
[18:06:43] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[20:15:10] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11274339 (achou) > I want to understand what semantics you're aiming for first...
[22:06:43] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
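
Note on the 09:38-09:39 exchange: the suggestion there was that an extra node label, probably "amd.com/gpu.vram=64G", should be enough to target the larger GPUs. A rough sketch of what that could look like as a pod spec fragment is below. Only the label key and the 16G/64G values come from the log; the function name, container name, and image are hypothetical, and the `amd.com/gpu` resource name is assumed to be the one exposed by the AMD device plugin on these clusters. This is not the code in wmf_airflow_common/kubernetes.py.

```python
from typing import Any

# Label key quoted in the chat; the log mentions 16G and 64G values.
GPU_VRAM_LABEL = "amd.com/gpu.vram"


def gpu_pod_spec(image: str, vram: str = "64G", gpus: int = 1) -> dict[str, Any]:
    """Build a plain-dict pod spec fragment pinned to nodes with a given GPU VRAM label.

    Hypothetical helper for illustration only.
    """
    return {
        # Schedule only on nodes carrying the matching VRAM label.
        "nodeSelector": {GPU_VRAM_LABEL: vram},
        "containers": [
            {
                "name": "trainer",  # hypothetical container name
                "image": image,
                # Assumed extended resource name for AMD GPUs.
                "resources": {"limits": {"amd.com/gpu": gpus}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical image name, used only to show the resulting structure.
    print(gpu_pod_spec("example.invalid/tone-check-trainer:latest"))
```

Whether the label alone is sufficient is still to be confirmed by the test mentioned at 09:40-09:41.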