[00:34:49] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:34:49] <jinxer-wm>	 Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[00:34:49] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:55:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[02:55:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[03:05:44] <jinxer-wm>	 FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate  - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[03:10:44] <jinxer-wm>	 RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate  - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:08:47] <kevinbazira>	 morning morning o/
[06:08:59] <kevinbazira>	 TIL that AMD has an LLM benchmarking tool: https://github.com/ROCm/MAD/tree/develop/benchmark/vllm
[06:57:33] <ozge_>	 Good morning!
[07:17:31] <elukey>	 klausman: o/ re: removal of services - Some experimental isvcs were removed, but it seemed to be an unstaged change (namely, a manual removal directly in deploy1003's deployment-charts repo)
[07:21:06] <isaranto>	 hello folks!
[07:37:10] <isaranto>	 kevinbazira: I'm available if you want to sync about testing the gpus
[07:40:11] <kevinbazira>	 isaranto: o/ sure sure. I've sent you an invite: https://meet.google.com/yig-iyjp-ybg?authuser=1
[07:40:32] <isaranto>	 ack, be right there
[08:13:59] <klausman>	 elukey: weird. I did the change live, yes, but made a patch immediately afterwards (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1134991)
[08:15:23] <klausman>	 It was submitted at 12:43 UTC, and your message is from 13:59 UTC, if I got my TZs right
[08:22:09] <elukey>	 sure but you have to unstage your live changes, otherwise git pull will not work
[08:24:24] <klausman>	 I must have forgotten the "git reset" bit
[08:25:34] <elukey>	 yep yep no problem!
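
The exchange above is about a checkout that still carries live, uncommitted edits after the same change was merged upstream, which can make `git pull` refuse to update. A minimal sketch of how one might detect that state, assuming the input is the text produced by `git status --porcelain` (v1 format); the function names and the sample paths are hypothetical and are not part of any Wikimedia tooling:

```python
def classify_porcelain(output: str) -> dict[str, list[str]]:
    """Split `git status --porcelain` (v1) lines into staged, unstaged and untracked paths."""
    result: dict[str, list[str]] = {"staged": [], "unstaged": [], "untracked": []}
    for line in output.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        index_state, worktree_state, path = line[0], line[1], line[3:]
        if index_state == "?" and worktree_state == "?":
            result["untracked"].append(path)
            continue
        if index_state != " ":
            result["staged"].append(path)    # change recorded in the index
        if worktree_state != " ":
            result["unstaged"].append(path)  # change only in the working tree
    return result


def has_tracked_changes(output: str) -> bool:
    """True if tracked files carry staged or unstaged edits that may block a pull."""
    states = classify_porcelain(output)
    return bool(states["staged"] or states["unstaged"])


# Hypothetical status output: one staged deletion, one unstaged edit, one untracked file.
sample = "D  charts/example/values.yaml\n M README.md\n?? notes.txt\n"
states = classify_porcelain(sample)
print(states)
print(has_tracked_changes(sample))
```

If this reports tracked changes and they are already merged upstream, resetting the checkout (for example with `git reset --hard`, which discards local edits) before `git pull` is one way to get back in sync; whether discarding is safe depends on the situation.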
[12:10:00] <wikibugs>	 Machine-Learning-Team, sre-alert-triage: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465 (LSobanski) NEW
[15:05:51] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10726495 (kevinbazira) I was able to run vLLM in the docker image: `rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6` by customizing instructions from this ROCm blog: https://rocm.blogs.a...
[15:07:23] <kevinbazira>	 o/ as discussed in the meeting, I've shared the steps I used to run vLLM in the image created by ROCm: https://phabricator.wikimedia.org/T385173#10726495
[15:27:40] <kevinbazira>	 klausman: o/ whenever you get a minute please help restart EventGate: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Roll_restart_all_pods
[15:27:40] <kevinbazira>	 the rrla topic was recently added: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133603
[15:27:40] <kevinbazira>	 the restart will avoid similar errors that were thrown when the article-country topic was added: https://phabricator.wikimedia.org/P73242#293629
[16:10:06] * isaranto afk