[00:34:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:34:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[00:34:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:55:44] FIRING: LiftWingServiceErrorRate: ...
[02:55:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[03:05:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[03:10:44] RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:08:47] morning morning o/
[06:08:59] TIL that AMD has an LLM benchmarking tool: https://github.com/ROCm/MAD/tree/develop/benchmark/vllm
[06:57:33] Good morning!
[07:17:31] klausman: o/ re: removal of services - Some experimental isvcs were removed, but it seemed a unstaged change (namely, manual remove directly on deploy1003's deployment-chart repo)
[07:21:06] hello folks!
[07:37:10] kevinbazira: I'm available if you want to sync about testing the gpus
[07:40:11] isaranto: o/ sure sure. I've sent you an invite: https://meet.google.com/yig-iyjp-ybg?authuser=1
[07:40:32] ack, be right there
[08:13:59] elukey: weird. I did the change live, yes, but made a patch immediately afterwards (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1134991)
[08:15:23] It was submitted at 1243 UTC, and your message is from 13:59 UTC if I got my TZ's correctly
[08:22:09] sure but you have to unstage your live changes, otherwise git pull will not work
[08:24:24] I must have forgotten the "git reset bit"
[08:25:34] yep yep no problem!
[12:10:00] Machine-Learning-Team, sre-alert-triage: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465 (LSobanski) NEW
[15:05:51] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10726495 (kevinbazira) I was able to run vLLM in the docker image: `rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6` by customizing instructions from this ROCm blog: https://rocm.blogs.a...
[15:07:23] o/ as discussed in the meeting, I've shared the steps I used to run vLLM in the image created by ROCm: https://phabricator.wikimedia.org/T385173#10726495
[15:27:40] klausman: o/ whenever you get a minute please help restart EventGate: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Roll_restart_all_pods
[15:27:40] the rrla topic was recently added: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133603
[15:27:40] the restart will avoid similar errors that were thrown when the article-country topic was added: https://phabricator.wikimedia.org/P73242#293629
[16:10:06] * isaranto afk