[06:34:21] good morning
[06:35:35] I have a small MR in airflow dags. https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1717 This MR will deploy all models above the release threshold to https://analytics.wikimedia.org/published/wmf-ml-models/addalink/v2/ Can you take a look when you have time? @kevinbazira
[06:36:55] Actually, Kevin is presenting at Wikimania 2025 today. Can anyone else take a look at the MR?
[06:40:12] ozge_: o/ LGTM!
[06:40:33] 🙌
[06:44:44] good morning o/
[06:54:07] ozge_: looks good! can you remind me what the decision threshold is?
[06:54:14] and where it is configured?
[07:03:01] precision_threshold = "0.75"
[07:03:01] recall_threshold = "0.2"
[07:03:01] I've created the list in the MR based on the scores in the spreadsheet https://docs.google.com/spreadsheets/d/1gwneJ5-WvT4ZSsYeHZR-Cu6P5Gz_eKQSx2vQPB0fMDg/edit?gid=282549996#gid=282549996
[07:03:01] We also apply the threshold checks during the staging release https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/ml/dags/add_a_link_release_staging_dag.py?ref_type=heads Actually, I can move this check to the prod release and update the precision threshold. But on the other hand, if a model is not released to staging, it will fail if we try to release it to prod.
[07:10:47] aa right thank you
[07:27:45] morning folks
[07:44:40] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11235879 (BWojtowicz-WMF) @Eevans Thank you very much for elaborating on the history and differences between those two. I w...
[08:38:33] ozge_: I see 0.75 as the precision threshold in https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/ml/dags/add_a_link_release_staging_dag.py?ref_type=heads#L322 is it the one used?
[08:39:05] I'm interested to see how we can better streamline the release process
[08:42:09] This one blocks the staging release if the score is lower than the given value.
[08:42:09] I actually used the values in the spreadsheet and used 0.75 as the threshold.
[08:42:10] Let me create an MR to have this check for the prod release as well.
[08:43:46] ok, thanks for clarifying. I'm just reporting/communicating some metrics so was looking into that
[08:47:27] awesome. https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/blob/main/training/add_a_link/src/add_a_link/release/check_release_threshold.py?ref_type=heads . I'm sharing the implementation for the release threshold check.
[08:51:02] So we have a threshold for converting xgboost prediction probabilities to classification. We run evaluation on multiple thresholds. The release threshold check requires at least one of the (precision, recall) pairs to be above the release thresholds.
[08:58:16] thank you!
[09:02:11] :D you're welcome.
[09:18:09] hello, we already have many models in the new location https://analytics.wikimedia.org/published/wmf-ml-models/addalink/v2/ more are on the way
[09:31:30] hello, I have an MR in airflow dags https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1718/diffs It moves the release threshold check from the staging release dag to the prod release dag. Can you take a look when you have time? @kevinbazira
[09:32:01] ack. looking ...
[10:09:43] klausman: o/ as FYI I noticed a strange behavior of ml-serve1012, namely that after a reboot the GPUs were not recognized (some drm-related errors in the dmesg), and it was consistent after 2/3 reboots. Then I powercycled the host via the Webui, and the gpus reappeared :D
[10:10:47] there must be a horrific-reason why this happens, just keep it in mind if you'll find yourself in the same unlucky situation in the future
[10:54:00] Roger & thanks
[10:54:19] you think that maybe a soft boot (no powercycle) leaves the machine in a bad state from the kernel's POV?
[11:11:09] Machine-Learning-Team, Goal: Export retrained Tone-check model to an S3 bucket - https://phabricator.wikimedia.org/T406217 (gkyziridis) NEW
[11:12:08] Machine-Learning-Team: Export retrained Tone-check model to an S3 bucket - https://phabricator.wikimedia.org/T406217#11236515 (gkyziridis)
[11:30:15] ottomata: Hey, I left a small comment on this review: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1192900 Is it possible to cast an eye over it whenever you have time?
[11:49:40] Machine-Learning-Team, Semantic Search: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11236567 (OKarakaya-WMF) Looking into the question related scores, we generally get low scores in question_relevance_to_title and curiosity. {F66719673} {F66719675} Question quality...
[13:07:03] bartosz: Hey, is this task done: https://phabricator.wikimedia.org/T371021 ??
[13:08:03] o/ georgekyz: The deployment on production is not done yet, I'd like to do it Monday since tomorrow I'm out of office
[13:09:06] Thank youuu
[13:49:50] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11236966 (BWojtowicz-WMF) **Weekly Report** //Sharing a day earlier as I'm OOO on 3rd of October. // Summary of progress: 1. Work ad...
[13:52:41] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11236971 (Eevans) >>! In T402984#11235879, @BWojtowicz-WMF wrote: > > [ ... ] > > I see you filled out the description with...
[14:04:13] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11237025 (OKarakaya-WMF) All models are deployed to the [new location](https://analytics.wikimedia.org/published/wmf-ml-mod...
[15:03:59] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11237262 (Eevans)
[15:08:22] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11237296 (Eevans)
[15:08:37] Machine-Learning-Team: Add support for K8s 1.23 on Trixie - https://phabricator.wikimedia.org/T405891#11237297 (elukey) Open→Resolved a:elukey The ML node is up and running, and it seems working fine. I am going to keep testing in the parent task, but for the moment this task should be marked...
[15:16:19] Lift-Wing, Machine-Learning-Team: Remove old nsfw model from inference-services repo - https://phabricator.wikimedia.org/T405083#11237332 (isarantopoulos) Open→Resolved
[15:16:26] Machine-Learning-Team, Patch-For-Review: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11237333 (elukey) >>! In T403697#11209777, @elukey wrote: > Next steps: > > * Add support for the new amd-smi tool's format to the Prometheus GPU exporter. > * Copy t...
[15:48:39] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11237484 (Eevans)
[15:52:38] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11237488 (Eevans)
[17:00:35] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11237846 (FNavas-foundation)
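
Note on the release threshold check discussed at 07:03:01 and 08:51:02: the rule described there (evaluate the model at several decision thresholds, then release only if at least one precision/recall pair clears both release thresholds) can be sketched roughly as below. This is an illustrative sketch only, not the code in check_release_threshold.py. The names `EvalPoint` and `passes_release_check`, the sample scores, and the choice of `>=` rather than a strict `>` are all assumptions; only the values 0.75 and 0.2 come from the log.

```python
from dataclasses import dataclass

# Release thresholds quoted in the chat at 07:03:01 (given there as strings).
PRECISION_THRESHOLD = float("0.75")
RECALL_THRESHOLD = float("0.2")


@dataclass(frozen=True)
class EvalPoint:
    """Hypothetical record: metrics measured at one decision threshold."""
    decision_threshold: float  # cutoff applied to the predicted probability
    precision: float
    recall: float


def passes_release_check(points, min_precision=PRECISION_THRESHOLD,
                         min_recall=RECALL_THRESHOLD):
    """Return True if at least one evaluation point meets both thresholds.

    Assumption: "above" is treated as >=; the real check may be strict.
    """
    return any(
        p.precision >= min_precision and p.recall >= min_recall
        for p in points
    )


# Made-up scores for illustration; not taken from the spreadsheet.
releasable = [
    EvalPoint(0.3, precision=0.62, recall=0.41),
    EvalPoint(0.5, precision=0.78, recall=0.25),  # clears both thresholds
    EvalPoint(0.7, precision=0.90, recall=0.08),
]
blocked = [
    EvalPoint(0.3, precision=0.62, recall=0.41),  # recall ok, precision too low
    EvalPoint(0.7, precision=0.90, recall=0.08),  # precision ok, recall too low
]

if __name__ == "__main__":
    print("releasable:", passes_release_check(releasable))
    print("blocked:", passes_release_check(blocked))
```

The second list shows why the check is per pair: precision and recall each clear their threshold somewhere, but never at the same decision threshold, so under this reading the model would not be released.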