[06:07:04] (PS1) Kevin Bazira: article-country: send prediction results to weighted tags stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114600 (https://phabricator.wikimedia.org/T382295)
[06:27:36] (CR) Kevin Bazira: events: add support for the weighted tags event stream (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[08:16:18] hello!
[08:16:33] good morning
[08:33:53] morning!
[08:56:04] (CR) DCausse: [C:+1] "lgtm!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114600 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[08:57:22] (CR) DCausse: [C:+1] "lgtm!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[09:50:03] Machine-Learning-Team, Patch-For-Review: Issues with Reference Need and Reference Risk models - https://phabricator.wikimedia.org/T384172#10499785 (achou) > Expected Issue: The model is running slow. The prediction time for Reference-need model is proportional to page size—or precisely, the number of u...
[09:52:54] Morning!
[09:58:32] Machine-Learning-Team, Patch-For-Review: Issues with Reference Need and Reference Risk models - https://phabricator.wikimedia.org/T384172#10499834 (achou)
[10:16:30] Guten tag Tobias!
[10:52:24] elukey: fyi, pushing the kserve and knative-serving changes from yesterday to prod-codfw.
[10:52:39] correction: prod-eqiad
[11:25:21] and done!
[11:31:22] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:36:22] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:12:13] klausman: nice! Lemme know when prod-codfw is done as well
[12:12:24] I'll do that this afternoon
[12:12:36] proceeding with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1114423 in staging
[12:12:51] The update was so uneventful, I don't think soaking for more than a few hours is really necessary
[12:13:09] ack re: staging
[12:31:29] ml-staging synced!
[12:33:29] ty!
[12:33:32] * klausman lunch
[13:53:14] (CR) AikoChou: [C:+1] "Nice, looks good to me!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[14:00:30] (CR) Kevin Bazira: [C:+2] "Thanks for the reviews :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[14:06:13] (CR) CI reject: [V:-1] events: add support for the weighted tags event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[14:06:55] (CR) Kevin Bazira: [C:+2] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[14:09:42] (CR) CI reject: [V:-1] events: add support for the weighted tags event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[14:24:28] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10500708 (Isaac) > `classification.prediction.articlecountry` Works for me! > topic names did not contain any spaces b...
[14:26:44] Lift-Wing, Machine-Learning-Team: Update ROCm driver version on Lift Wing nodes - https://phabricator.wikimedia.org/T383230#10500717 (isarantopoulos) Open→Declined
[14:27:38] Lift-Wing, Machine-Learning-Team: Update ROCm driver version on Lift Wing nodes - https://phabricator.wikimedia.org/T383230#10500718 (isarantopoulos) Lift Wing uses ROCm inside the docker containers as part of Pytorch. We will need to assess if this is enough for all our workloads, but for the time b...
[14:28:02] Lift-Wing, Machine-Learning-Team: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10500724 (isarantopoulos)
[14:29:32] Lift-Wing, Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10500732 (isarantopoulos) The information from this task has been summarized and documented on a [[ https://wikitech.wikimedia.org/wiki/Machine_Learning/Quant...
[14:29:49] (CR) AikoChou: [C:+1] "I left two nits, but otherwise LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114600 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[14:30:10] Lift-Wing, Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10500739 (isarantopoulos) Open→Resolved
[14:32:23] Machine-Learning-Team: Adding uv as a package manage on Lift Wing/blubber - https://phabricator.wikimedia.org/T384584#10500743 (isarantopoulos) p:Triage→Low
[14:34:02] Lift-Wing, Machine-Learning-Team, Patch-For-Review: [LLM] Use vllm for ROCm in huggingface image - https://phabricator.wikimedia.org/T370149#10500749 (isarantopoulos) An update on the task is available in this [[ https://phabricator.wikimedia.org/P72019 | paste ]]
[14:36:58] Lift-Wing, Machine-Learning-Team, Patch-For-Review: [LLM] Use vllm for ROCm in huggingface image - https://phabricator.wikimedia.org/T370149#10500783 (isarantopoulos) >>! In T370149#10369892, @achou wrote: > I successfully built Triton flash attention using a miniconda env. However, when attempting t...
[14:37:27] Lift-Wing, Machine-Learning-Team: Create SLO dashboards for reference quality models - https://phabricator.wikimedia.org/T384316#10500785 (isarantopoulos)
[14:38:05] Lift-Wing, Machine-Learning-Team: Create SLO dashboard for article-country - https://phabricator.wikimedia.org/T384935 (isarantopoulos) NEW
[14:38:17] Lift-Wing, Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10500799 (isarantopoulos)
[14:38:46] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10500800 (isarantopoulos) p:Medium→High
[14:38:52] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Build and Publish ROCm-Compatible Python Packages - https://phabricator.wikimedia.org/T381859#10500801 (isarantopoulos) p:High→Medium
[14:39:01] Machine-Learning-Team: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10500802 (isarantopoulos) p:High→Medium
[14:40:18] Lift-Wing, Machine-Learning-Team: [LLM] Lift Wing load testing - https://phabricator.wikimedia.org/T377225#10500804 (isarantopoulos) a:achou→None
[14:40:24] Machine-Learning-Team, Patch-For-Review: Update kserve to 0.13.1 - https://phabricator.wikimedia.org/T367048#10500807 (isarantopoulos) a:isarantopoulos→None
[14:47:07] elukey: codfw is now done, too (kserve&&knative-serving)
[14:49:03] nice!
[14:49:41] I am working with Ben in the k8s chan on the istio config, dse has a similar config compared to ml and I think something is off (related to the PSS migration)
[14:49:58] Roger, lmkj if I can help
[14:50:01] -j
[14:51:03] Lift-Wing, Machine-Learning-Team: Create SLO dashboards for reference quality models - https://phabricator.wikimedia.org/T384316#10500858 (isarantopoulos)
[14:52:29] Machine-Learning-Team, ORES, Discovery-Search, Growth-Team, OKR-Work: Investigate what would be required to include countries in ORES and accessible via a search keyword - https://phabricator.wikimedia.org/T301671#10500865 (Isaac) Copying over from T382295#10496113: > The search platform can...
[14:52:50] Lift-Wing, Machine-Learning-Team: Create SLO dashboards for reference quality models - https://phabricator.wikimedia.org/T384316#10500868 (isarantopoulos) We need to follow up on the previous issues caused by liftwing slos on pyrra and see if we can start onboarding new models. T302995#10409335 If everyt...
[14:53:06] Lift-Wing, Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10500873 (isarantopoulos) We need to follow up on the previous issues caused by liftwing slos on pyrra and see if we can start onboarding new models. T302995#10409335 If everything...
[14:54:17] Lift-Wing, Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10500876 (isarantopoulos)
[14:55:14] Lift-Wing, Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10500882 (isarantopoulos)
[14:55:34] Lift-Wing, Machine-Learning-Team: Create SLO dashboards for reference quality models - https://phabricator.wikimedia.org/T384316#10500884 (isarantopoulos)
[15:26:19] Machine-Learning-Team, Data-Platform-SRE (2025.01.11 - 2025.01.31): Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - https://phabricator.wikimedia.org/T380279#10501014 (klausman) Update: I have [a WIP change](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1109044...
[15:58:16] Machine-Learning-Team: Adding uv as a package manager on Lift Wing/blubber - https://phabricator.wikimedia.org/T384584#10501152 (isarantopoulos)
[16:01:04] Happy Lunar New Year!! \o/
[16:06:57] ah wow it is today!
[16:09:18] aiko: is 新年好 ok??? :D
[16:10:17] or better 新年快乐?
[16:12:51] (of course with pinyin I confused 新 with 心 since they are both xīn)
[16:54:35] Machine-Learning-Team, Patch-For-Review: Issues with Reference Need and Reference Risk models - https://phabricator.wikimedia.org/T384172#10501357 (FNavas-foundation) Thanks @AikoChou - @Aitolkyn's testing showed - "75% of revisions in each language are completed within a 500ms time limit." You can see...
[16:55:44] Happy Lunar New Year \o/
[16:56:08] Machine-Learning-Team: Adding uv as a package manager on Lift Wing/blubber - https://phabricator.wikimedia.org/T384584#10501363 (dduvall) >>! In T384584#10489491, @isarantopoulos wrote: > @dduvall Hi! Just wanted to get your thought on this: > - is this something that you have explored already? > - have...
[17:18:55] going afk folks, have a nice evening/rest of day o/