[05:05:48] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11208839 (Strainu) I can confirm the same issues exist in Romanian [06:59:27] Hello! [07:01:58] Good morning! [07:13:38] Good morning o/ [07:50:03] good morning folks [07:56:59] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11209107 (isarantopoulos) Starting from the [[ https://grafana.wikimedia.org/goto/da6aZR3Ng?orgId=1 | Istio grafana dashboard ]] that presents the p90 latency... [08:06:10] Machine-Learning-Team: Fix CI/CD on ml-pipelines repository - https://phabricator.wikimedia.org/T404717#11209119 (10gkyziridis) ==Update== The CI/CD in the [[ https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines | ml-pipelines repo ]] is fixed. Tone-check test image is running on push to branch, no... [08:29:31] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11209163 (elukey) @isarantopoulos it is not that easy :) Passing down Grafana time ranges to Prometheus PromQL, which doesn't support fixed dates, requires s... [08:32:50] klausman: o/ merged the refactoring, and deployed it, looks good :) I am going to reimage ml-serve1013 that is still on bookworm [08:33:03] Roger! [08:41:31] ah and of course the prometheus amd exporter needs to be adapted to amd-smi [08:45:33] Are we using something custom-built there? [08:46:35] ah yes, prometheus-amd-rocm-stats.py [08:46:54] I can take care of updating that [08:47:58] I'll try with a quick test, and report back in the task if it is too long [08:51:50] I don't think amd-smi can be run the same way we used to run rocm-smi for prometheus purposes. 
[08:53:05] `sudo /opt/rocm/bin/amd-smi metric -uptfET` is probably a zeroth approximation of what we need [09:01:09] yep, the format is totally different [09:01:26] I tried a quick change in the python code but it is not enough, fields are named differently etc.. [09:01:49] klausman: you can use --json too [09:02:05] I can take a look at adding code to the existing cronjob-tool to handle amd-smi, but it will take a minute [09:02:23] nah it is fine I'll do it later on [09:02:30] There are a lot of metrics, but most of them seem to have an "N/A" value. [09:02:58] I am not sure whether that is just "this counter hasn't bumped yet" or "this metric is not available on this card". [09:03:36] amd-smi seems to be very unstable to me, I think they are adding a ton of code during each release [09:03:53] so I would expect next iterations to be more robust [09:03:59] Either way, I think we should run the tool with all (except the broken voltage one) enabled, and only dump the non-"N/A" ones into the prometheus file. That way, if the status of a metric ever changes (new hw, better driver), we automagically get that metric. [09:04:02] the current version is at least able to partition the gpu [09:04:33] klausman: you could end up polluting prometheus with a ton of things that we don't need though [09:05:00] and in this case, the tool emits the pretty print of the json vs the older ones emitting a single line [09:05:03] Well, maybe just the subset that might be interesting, even if they're currently "N/A"? [09:05:12] so it would have broken anyway, sigh [09:05:44] well, at least Python has good JSON support. [09:06:29] they are also releasing rocm 7.x, so I'd expect even more changes [09:06:30] there's also an official AMD metrics exporter [09:06:36] https://github.com/rocm/device-metrics-exporter [09:07:22] could be interesting [09:07:30] from a quick look it seems a bit bloated [09:07:49] yeah, agreed. 
I'll give it a spin on one of the lab machines, see how messy it is [09:07:52] but they usually ship super huge tools that do $everything [09:15:46] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11209268 (isarantopoulos) >@isarantopoulos it is not that easy :) I assumed so :) Although this won't allow us to calculate over a fixed window, wouldn't loo... [09:32:53] ml-serve1013 reimaged, the double reboot picks up kernel + firmware + etc.. nicely [09:33:04] excellent [09:33:09] amd-smi doesn't work due to a lib issue, but it seems a stupid one [09:33:20] I need to go afk for a bit, ttl [09:33:59] ack [09:34:40] so far the exporter doesn't even remotely build and as you mentioned, it's an enormous amount of code and stuff. I think us just using amd-smi's JSON output and piling it into node-exporter is way more feasible. [10:02:32] ok found the issue with amd-smi, namely we also need libdrm-amdgpu1 [10:04:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1190989 [10:06:32] Machine-Learning-Team, Patch-For-Review: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11209389 (klausman) Today we found that `amd-smi` is not a drop-in replacement for `rocm-smi` when it comes to exporting metrics to Prometheus. We use [our own Python w... [10:35:38] klausman: o/ Could you help with this task to calculate and report on the metrics defined there https://phabricator.wikimedia.org/T405338? I'm interested in finding how easy or difficult it is to do so [11:04:33] Will take a look [11:21:08] thank you! 
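The plan discussed above — run amd-smi with `--json`, drop anything reported as "N/A", and feed the surviving metrics to node-exporter's textfile collector — can be sketched roughly like this. The JSON shape (per-GPU objects with nested `{"value": ..., "unit": ...}` entries) is an assumption for illustration; the real amd-smi schema differs between ROCm releases, which is exactly the instability complained about in the chat:

```python
import json

def amd_smi_json_to_prom(raw: str, prefix: str = "amd_gpu") -> list[str]:
    """Convert (assumed) `amd-smi metric --json` output into Prometheus
    textfile-collector lines, skipping any metric reported as "N/A"."""
    lines = []
    for gpu in json.loads(raw):
        gpu_id = gpu.get("gpu", 0)
        for section, metrics in gpu.items():
            if not isinstance(metrics, dict):
                continue  # e.g. the bare "gpu" index field
            for name, entry in metrics.items():
                # assumed wrapping: {"value": ..., "unit": ...}; tolerate bare values
                value = entry.get("value") if isinstance(entry, dict) else entry
                if value in (None, "N/A"):
                    continue  # drop unavailable counters instead of exporting junk
                try:
                    num = float(value)
                except (TypeError, ValueError):
                    continue  # non-numeric fields (strings, enums) are not gauges
                metric = f"{prefix}_{section}_{name}".lower()
                lines.append(f'{metric}{{gpu="{gpu_id}"}} {num}')
    return lines

# hypothetical sample payload, not real amd-smi output
sample = json.dumps([{
    "gpu": 0,
    "usage": {"gfx_activity": {"value": 42, "unit": "%"},
              "mem_activity": {"value": "N/A"}},
}])
print("\n".join(amd_smi_json_to_prom(sample)))
```

Skipping "N/A" at emit time matches the suggestion above: if a metric becomes available later (new hardware, better driver), it starts appearing automatically without polluting Prometheus in the meantime.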
[11:21:23] I think I got something workable, will update the phab ticket [11:23:54] Machine-Learning-Team: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11209580 (OKarakaya-WMF) [11:28:09] Machine-Learning-Team: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11209586 (OKarakaya-WMF) [Git branch for the current work](https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/tree/semantic_search_poc/semantic_search_poc/notebooks?ref_type=heads) [11:29:03] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11209591 (klausman) I think this would work: `lang=promql ( sum by (destination_canonical_service) ( increase(istio_requests_total{prometheus="k8s-mlserve... [11:29:12] and done, going for lunch now :) [11:57:17] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11209687 (isarantopoulos) Thank you for the clarification! The above query responds to the availability SLI (1st item in task description). I tried to tackle... [12:00:02] Lift-Wing, Machine-Learning-Team: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter. - https://phabricator.wikimedia.org/T371021#11209691 (BWojtowicz-WMF) [12:26:28] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11209753 (klausman) For latency, we'd use something like this: `lang=promql ( sum by (destination_canonical_service) ( increase(istio_request_duration_mi... 
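The latency query quoted above builds on `istio_request_duration_milliseconds_bucket`, a cumulative Prometheus histogram, and the percentage it yields is just "requests counted in the bucket at or below the threshold, divided by all requests". A minimal sketch of that arithmetic, with made-up bucket counts chosen so the result lands on the 97.2% figure used as an example later in the channel:

```python
import math

def latency_sli(buckets: dict[float, float], threshold_ms: float) -> float:
    """Share of requests completing within threshold_ms, from cumulative
    histogram buckets keyed by their `le` bound (math.inf for +Inf)."""
    total = buckets[math.inf]  # the +Inf bucket counts every request
    # cumulative buckets: the largest le at or below the threshold already
    # includes every faster request
    fast = buckets[max(le for le in buckets if le <= threshold_ms)]
    return fast / total

# illustrative counts, not real service data
buckets = {250.0: 900.0, 500.0: 950.0, 1000.0: 972.0,
           2500.0: 990.0, math.inf: 1000.0}
print(f"{latency_sli(buckets, 1000.0):.1%}")  # → 97.2%
```

In PromQL the same ratio comes from `increase(...)` over the two bucket series; this just spells out why dividing the `le="1000"` count by the total gives the SLI.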
[12:30:05] Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11209777 (elukey) ml-serve1012 and 1013 are now running with Trixie, a 6.16 kernel and up-to-date GPU firmwares. We are also using ROCm 6.5.3 amd-smi to support the GPU partitioning (so we c... [12:32:12] Machine-Learning-Team: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11209794 (OKarakaya-WMF) Alternative ranking strategy from Fabian: https://huggingface.co/BAAI/bge-reranker-v2-gemma [13:59:30] Machine-Learning-Team, Add-Link-Structured-Task, Growth-Team: Introduce case sensitivity to machine learning model for Add a Link - https://phabricator.wikimedia.org/T405185#11210145 (isarantopoulos) a: OKarakaya-WMF [14:00:15] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11210157 (klausman) As for the discrepancy (~97% vs. ~99%), I just ran the equivalent of my query (using `increase` etc) but instead of looking at the destinat... [14:00:55] isaranto: I figured out the source for the %age discrepancy and added a comment regarding that (and 1-2 other things) to the phab task [14:01:16] ack! thank you, will take a look later! [14:01:28] FIRING: [3x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing [14:01:50] also addressing the adminng thing [14:02:12] (PS1) Bartosz Wójtowicz: articletopic: Add `page_id` parameter to the articletopic model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191038 (https://phabricator.wikimedia.org/T371021) [14:05:00] and done. 
[14:06:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing [14:07:17] klausman, isaranto - I am not 100% sure what you are trying to achieve, but you are trying to reproduce what Pyrra already does [14:08:00] using things like increase[20d] will have the same side effects, see https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and_alerts#How_to_read_a_Pyrra_dashboard [14:08:16] you'll always be counting what happened before the window in the first datapoints [14:09:39] and also it is not clear to me what we are comparing - is it the same service being hit by a different MW codepath? If so we could observe if there is a variation of the error budget burned [14:09:57] namely, is the experiment causing the error budget to be burned in a faster way? [14:12:59] AIUI, what we're looking for is a way to tell if the experiment makes the service worse for those two metrics (latency and error rate) [14:16:07] sure sure, but with your calculations every data point of a graph is basically showing how the past 20 days of requests went [14:16:28] that is not wrong, hear me out, pyrra does a similar thing [14:16:37] it is just a different view [14:17:05] so anything that happened before influences the data [14:21:54] AIUI, the graph is not what we're interested in, only the final value, which would be the SLO over the given window, so the last value (e.g. 97.2) would mean "in the N days before the last point of this timeseries, 97.2% of queries were 200s/under 1000ms" [14:23:10] Together with offset, one could then compare A and B setups for changes in latency etc. 
It's quite fiddly with offset, but short of using the Prom REST API (which allows for absolute timestamps), I don't see another way of doing it [14:32:08] yep that part is clear :) [14:33:23] what I tried to bring up is that we may look at variations of error budget being burned [14:33:32] in Pyrra or other tools [14:34:34] you can do it with your queries too, namely checking the variations from A to B. You'll take past data into account too, but the trend should be clear [14:35:01] if it is steady at 100% or burning a little bit every X hours and we see a regression, then the experiment is causing some troubles [14:35:18] I am just trying to think how to approach these problems in an SLO world [14:35:40] because even non-ML people could look at those etc.. [14:35:51] I think if we needed super-precise data, the REST API of prometheus would be the only real option. I think I have a PoC piece of code somewhere that I wrote for some private thing a while back [14:37:05] I also don't know how often we're going to be doing this A/B testing, so how much effort we need to put into making things reusable [14:39:31] I expressed my idea, you folks will decide :) [15:06:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing [15:15:36] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11210465 (gkyziridis) ==Update== During ad hoc postprocess on each wiki we can remove those problematic data points from the samples. The iss... 
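The Prom REST API route mentioned above avoids the offset gymnastics because `/api/v1/query_range` accepts absolute RFC 3339 `start`/`end` timestamps, even though PromQL itself has no notion of fixed dates. A minimal sketch of building such a request with only the standard library; the endpoint URL and the query string are placeholders, not the production setup:

```python
from urllib.parse import urlencode

def build_range_query(base_url: str, promql: str,
                      start_iso: str, end_iso: str, step: str = "1h") -> str:
    """Build a /api/v1/query_range URL with absolute RFC 3339 timestamps.
    Fetching it (e.g. with urllib.request) returns JSON with the series."""
    params = urlencode({"query": promql, "start": start_iso,
                        "end": end_iso, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"

url = build_range_query(
    "https://prometheus.example.org",  # hypothetical endpoint
    'sum(increase(istio_requests_total{response_code="200"}[1h]))',
    "2025-09-01T00:00:00Z",
    "2025-09-21T00:00:00Z",
)
print(url)
```

With fixed `start`/`end`, the A and B windows can each be queried over their exact calendar range, so no `offset` arithmetic is needed to compare them.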
[15:18:31] elukey: we are trying to find a way to report on the metrics for the A/B test in https://phabricator.wikimedia.org/T394463 because it is something that we need to report in the next couple of days (or even today). We are not trying to compare A vs B latencies or availability, we just want to report on service availability and performance overall. The useful thing is that if something is totally off we can further [15:19:57] evaluate if the experiment was useful or it needs to be repeated or something else. In the future we will definitely do this using the SLO dashboards [15:32:38] The numbers we are providing are not 100% accurate due to the way the metrics are actually calculated, but they still provide some good insights. Your write-up helps a lot to understand why this is happening, Luca [15:41:59] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11210612 (isarantopoulos) Ok! so I'm pasting the modified queries for the availability and latency metrics using the last 21d The first one results in 99.9% a... [16:01:28] RESOLVED: [3x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing [16:10:19] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Essential-Work: Make the revert risk predictions datasets available for analysis - https://phabricator.wikimedia.org/T388453#11210747 (Ahoelzl) [16:13:55] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Essential-Work: Make the revert risk predictions datasets available for analysis - https://phabricator.wikimedia.org/T388453#11210764 (Ottomata) FWIW, there is also now a `mediawiki.page_revert_risk_prediction_change.v1` stream and... 
[16:14:19] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Essential-Work: Make the revert risk predictions datasets available for analysis - https://phabricator.wikimedia.org/T388453#11210775 (Ottomata) In progress→Resolved a: Ottomata Being bold and resolving the task. [16:16:10] Machine-Learning-Team, Data-Engineering-Roadmap, Wikimedia Enterprise, Epic, Event-Platform: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792#11210783 (Ottomata) Open→... [22:47:50] Machine-Learning-Team, Add-Link-Structured-Task, Growth-Team: Introduce case sensitivity to machine learning model for Add a Link - https://phabricator.wikimedia.org/T405185#11212352 (Sdkb) @Kerry_Raymond has independently raised this issue and provided some further examples in [[ https://en.wikipedi...