[08:35:49] good morning
[09:00:42] hello o/
[09:02:23] morning folks
[09:52:07] o/ going to deploy article-country in prod ...
[10:01:03] ^-- isvc is up and running in prod: https://phabricator.wikimedia.org/P73204
[10:22:51] o/
[10:23:00] something is working in staging now (knative)
[10:23:14] not 100% right but I am getting there
[10:24:01] I am down to "pod or containers "istio-validation", "istio-proxy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost""
[10:24:10] so "only" the istio sidecar stuff
[10:26:15] \o/
[10:27:30] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10524412 (elukey) All right finally in staging we have knative/kserve containers passing the restricted PSS config (if applied to the name...
[13:56:19] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10525449 (isarantopoulos) There is a new image available: [[ https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6/images/sha256-9a12ef62bbbeb5a4c30a01f702c8e025...
[13:58:44] FIRING: LiftWingServiceErrorRate: ...
[13:58:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=article-models&var-backend=article-country-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:09:53] ^--- investigating this issue: https://phabricator.wikimedia.org/P73242
[14:11:11] thanks Kevin
[15:07:39] Thanks Kevin!
[16:08:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:08:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=article-models&var-backend=article-country-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:20:25] ^-- thanks to klausman and elukey who helped restart EventGate for it to recognize the new article-country event stream: https://phabricator.wikimedia.org/P73242#293629
[16:20:41] <3
[16:24:39] \o/
[16:27:03] kevinbazira: I thought it was written on wikitech, if not could you please update the doc?
[16:27:37] okok ...
[16:32:07] elukey: I see it exists in the docs: `Ask to any SRE member of DE/ML/ServiceOps to roll restart EventGate Main's pods (in both DCs).` under https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams#Streams_(Admins_only,_Machine_Learning_team)
[16:38:52] super
[17:06:16] \o/
[17:09:22] \o/
[18:29:25] Machine-Learning-Team, Patch-For-Review: Issues with Reference Need and Reference Risk models - https://phabricator.wikimedia.org/T384172#10526680 (achou) After testing different resource configurations for the model service in the experimental namespace, I found the optimal setup was increasing CPU from...