[04:38:29] <wikibugs>	 Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model - https://phabricator.wikimedia.org/T398970#11212696 (kevinbazira) I pushed an MR to implement the HDFS to PVC copy pattern from in T396495#11151194:  * [Copy airflow DAG](https://gitlab.wikimedia.org/repos/dat...
[06:18:41] <ozge_>	 good morning
[06:43:49] <isaranto>	 good morning!
[07:58:23] <elukey>	 o/
[07:58:59] <elukey>	 TIL that the supermicro hosts with AMD GPUs have a special, super fast connection between the 8 GPUs that allows copying memory (via ROCm tools) without going through the CPU
[07:59:17] <elukey>	 I don't think we'll ever use that but good to know :D
[08:18:47] <klausman>	 yes! there is a tool to test that bw in my homedir on ml-lab1002
[08:20:11] <klausman>	 It's usually ~28GB/s unidirectional/51GB/s bidirectional
[08:20:29] <klausman>	 https://github.com/ROCm/rocm_bandwidth_test is the source
[08:29:01] <klausman>	 Oh, on the MI300 it's about double that bw
[08:30:28] <klausman>	 https://phabricator.wikimedia.org/P83466
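For a sense of scale, the bandwidth figures quoted above can be turned into idealized copy times. This is a back-of-the-envelope sketch in Python, not output from rocm_bandwidth_test: the function name and the 64 GB payload are assumptions made for illustration, and the MI300 figure simply doubles the quoted number, as suggested in the chat.

```python
# Rough copy-time estimate from the bandwidth figures quoted in the chat.
# Assumptions: decimal GB, sustained bandwidth, no latency or setup cost.

def copy_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Return the idealized time to move size_gb at the given bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s

UNIDIRECTIONAL_GB_S = 28.0  # "~28GB/s" unidirectional, as quoted
MI300_FACTOR = 2.0          # "about double" on the MI300, per the chat

payload_gb = 64.0  # hypothetical payload size, chosen only for illustration

print(f"quoted rate: {copy_seconds(payload_gb, UNIDIRECTIONAL_GB_S):.2f} s")
print(f"MI300 (x2):  {copy_seconds(payload_gb, UNIDIRECTIONAL_GB_S * MI300_FACTOR):.2f} s")
```

Real transfers will likely be slower than this estimate, since it ignores setup cost and assumes the link sustains the quoted rate for the whole copy.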
[08:48:20] <elukey>	 IIUC though we cannot easily use it unless the tool explicitly uses ROCm directives, right?
[08:51:15] <wikibugs>	 (PS1) AikoChou: events: construct new prediction classification event independently [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T405067)
[09:02:23] <klausman>	 elukey: AIUI, it's mostly meant for tools that are able to communicate such memory sharing, like multi-stage ML stuff. One program would take the input, transform it on one GPU, and then transfer the relevant data to another that does the next step. Much like normal user-space shared memory, except it actually moves physically between NUMA nodes
[09:02:48] <elukey>	 yep yep, very cool
[09:03:55] <klausman>	 The whole NUMA thing is very much an HPC requirement, when you run these enormous clusters with thousands of GPUs. The next step up is then sharing memory between nodes with special networking. It's a very different programming model to what we usually do.
[09:05:06] <wikibugs>	 (CR) AikoChou: "This is the page_change event schema: https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/page/change/1.3.0.yaml" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T405067) (owner: AikoChou)
[09:15:15] <elukey>	 I guess that those nodes may also be used for "creative" high volume training, maybe happening in stages across multiple GPUs
[09:15:36] <elukey>	 it is sad that we cannot really use those hosts at their full potential
[09:21:28] <klausman>	 We just sit at an odd point in the customer base for these GPUs
[09:23:38] <elukey>	 the main question mark is to understand if we want to keep buying servers like these or not
[09:24:03] <elukey>	 because IIUC the partitioning promise is very limited
[09:30:24] <klausman>	 True. The thing is, the supply of GPU servers like the MI210-based ones we have has somewhat dried up. I mean we can hope that changes before we run out of GPU-based serving capacity --- either mid-tier stuff becoming available, or us not needing it anymore.
[09:34:15] <elukey>	 this is true for supermicro/dell, maybe we can find a specific vendor that can accommodate our needs
[09:34:43] <elukey>	 the new hosts will not be in codfw for the foreseeable future, so it is becoming a big issue
[09:35:02] <elukey>	 (not only the cost, but how to get those beasts running in our DCs, etc.)
[09:35:16] <klausman>	 yeah, the TCO of these machines is dizzying
[10:29:04] <wikibugs>	 Machine-Learning-Team, Add-Link-Structured-Task, Growth-Team: Introduce case sensitivity to machine learning model for Add a Link - https://phabricator.wikimedia.org/T405185#11213543 (OKarakaya-WMF) I'm sharing an [analysis](https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/blob/main...
[11:18:41] <wikibugs>	 Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11213744 (gkyziridis) ==Update==  I had already reviewed applied ad-hoc post process on the languages below.  Please click the link on each w...
[13:38:49] <wikibugs>	 (CR) Ottomata: "Thank you for this!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T405067) (owner: AikoChou)
[13:59:03] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter. - https://phabricator.wikimedia.org/T371021#11214442 (BWojtowicz-WMF) I've done a small analysis on performance implications of introducing the `page_id` para...
[18:28:38] <wikibugs>	 Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647 (RobH) NEW
[23:22:30] <wikibugs>	 (PS5) Zabe: Allow filtering models by rc_source instead of rc_type [extensions/ORES] - https://gerrit.wikimedia.org/r/1187807 (https://phabricator.wikimedia.org/T74157)
[23:35:15] <wikibugs>	 (PS6) Zabe: Allow filtering models by rc_source instead of rc_type [extensions/ORES] - https://gerrit.wikimedia.org/r/1187807 (https://phabricator.wikimedia.org/T74157)
[23:48:06] <wikibugs>	 (PS7) Zabe: Allow filtering models by rc_source instead of rc_type [extensions/ORES] - https://gerrit.wikimedia.org/r/1187807 (https://phabricator.wikimedia.org/T74157)
[23:50:44] <wikibugs>	 (PS6) Zabe: Replace most usages of rc_type with rc_source [extensions/ORES] - https://gerrit.wikimedia.org/r/1187802 (https://phabricator.wikimedia.org/T74157)