[06:08:49] Good morning
[06:50:49] good morning o/
[07:04:34] good morning!
[07:35:53] Maidin mhaith!
[07:36:08] good morning
[07:36:39] https://www.brendanlong.com/cpu-utilization-is-a-lie.html <- An interesting article on how CPU utilization %-age is usually not what it seems
[07:51:29] morning!
[09:03:25] klausman: o/ I am thinking to reimage ml-serve1012 to trixie to test https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300x/overview.html, since amd-smi is available from Trixie onwards
[09:03:38] basically to start figuring out how it works, drawbacks etc..
[09:03:41] wdyt?
[09:03:44] looking
[09:03:51] yes, go ahead
[09:04:10] I've been tinkering a bit with ...13, but even that can all be wiped
[09:05:49] So since the MI300x can be split 8-ways, splitting the 8 GPUs would give us a 64-GPU machine :D
[09:08:38] very powerful and scary at the same time, if the host is powered off we may lose a ton of models
[09:08:54] what did you test on 13? Anything that can be done in bookworm?
[09:09:14] I ran sdnext, just to get a feel for performance
[09:09:28] it's a StableDiffusion UI/Frontend.
[09:33:57] oh okok, no ROCm config related
[09:34:47] is there a task to collect experiments for the MI300X GPUs? With the aim of finding a target config, test it etc..
[09:35:08] I am almost done in upgrading the amd k8s plugin, but we'll need to configure the GPU first
[09:35:20] and if amd-smi + related packages are only on trixie, it may be an issue
[09:50:46] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11143036 (Michael) >>! In T392283#11141651, @Eevans wrote: > Ke...
[09:50:59] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599 (klausman) NEW
[09:51:37] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11143047 (klausman)
[09:51:43] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11143048 (klausman)
[09:51:55] elukey: I've just created ^^^ and linked in the existing plugin task
[09:55:02] okok perfect, it would feel a real waste to not use this capacity for weeks/months :D
[09:57:38] agreed!
[09:58:47] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11143078 (elukey) @gkyziridis I checked the [[ https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=aWotKxQMz&var-namespace=edit-check&var-backend=$__all&from=now-7d&to=...
[10:00:42] ok reimaging ml-serve1012 to trixie
[10:12:00] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11143122 (gkyziridis) Hey @elukey, yes lets go towards that path. What I also see in [[ https://grafana-rw.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&from=now-2d&to=now&timezone=utc&var-cl...
[10:19:14] * isaranto afk lunch
[10:20:57] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11143195 (elukey) We could simply add `minReplicas: 3`, that would disable autoscaling but leave the current configs. What do you think? Anyway, +1 :)
[10:26:09] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11143223 (elukey) The ml-serve1012 and ml-serve1013 are the first two eqiad hosts available for a test.
Some high level thoughts/notes: 1) From the provisioning and puppet config perspect...
[10:41:51] elukey: disabling autoscaling for edit-check patch ready : https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1184498
[10:42:20] elukey: probably silly question, but is there any setup for edit-check to logstash ?
[11:08:04] Well, the usual Istio logs should be there, AIUI
[11:09:26] https://usercontent.irccloud-cdn.com/file/T1VAyrzX/image.png
[11:09:39] I do not see edit-check
[11:11:31] sec...
[11:12:12] https://logstash.wikimedia.org/goto/1741c9ee3719a009e6107ab59f1c3483
[11:13:01] I see ~2k requests in the last 5h
[11:13:29] I often get confused when Logstash doesn't seem to see the requests I am looking for, and then I realize that my time window may be too short
[11:14:50] https://logstash.wikimedia.org/goto/9cd95aac1dca1b7488a4e0a405ecd0b1 And here's the same timeframe, but the kserve container logs
[11:15:16] (that one is limited to staging, tho)
[11:17:54] oh I was setting wrong filters, thnx @klausman
[11:18:22] yw. I am still not friends with LS, but I am beginning to get used to it :D
[11:50:34] Folks should I go ahead and merge/deploy the autoscaling disablement: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1184498
[11:50:35] ?
[11:53:06] looking
[11:54:55] georgekyz: you can always play around with the deployment in experimental namespace and edit it on the fly so that you can iterate faster