[00:27:13] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other), Unplanned-Sprint-Work: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11403691 (GMikesell-WMF)
[00:30:15] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other), Unplanned-Sprint-Work: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11403708 (GMikesell-WMF) @SBisson Recommendation API is showing that it's not emptying the c...
[00:31:03] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other), Unplanned-Sprint-Work: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11403711 (GMikesell-WMF)
[02:24:37] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[02:24:37] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[02:24:39] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:28:38] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 3 others: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing - https://phabricator.wikimedia.org/T406179#11404066 (Sucheta-Salgaonkar-WMF)
[06:24:37] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:24:37] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[06:24:40] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:18:58] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:18:58] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[08:18:58] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:23:41] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11404187 (kevinbazira) First deployment shows the model-server in a `CrashLoopBackOff`: ` kevinbazira@deploy2002:~$ kubectl get pods NAME...
[08:44:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:44:49] Deployment aya-llm-predictor-00007-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00007-deployment - ...
[08:44:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:59:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:59:49] Deployment aya-llm-predictor-00007-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00007-deployment - ...
[08:59:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:02:01] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11404330 (kevinbazira) The above error was fixed by setting `BITSANDBYTES_DTYPE` to `None`. Now we are running into OOO issue shown below: ` kevinba...
[09:28:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:28:49] Deployment aya-llm-predictor-00008-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00008-deployment - ...
[09:28:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:48:25] (PS1) Bartosz Wójtowicz: revise-tone-task-generator: Re-enable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211018 (https://phabricator.wikimedia.org/T408538)