[03:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:52:05] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[03:52:05] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:20:12] good morning
[07:51:30] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11373406 (OKarakaya-WMF) # Reporting 14/11/2025 ## Progress update on the hypothesis for the week, including if something has shipped: - All newly onboarded...
[07:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[07:52:05] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:56:31] good morning! I'm creating a patch to deploy the new revise tone using cassandra on staging experimental - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1205036
[08:58:02] Eric mentioned yesterday, that we probably want to set the `datacenter` so that we make sure we communicate with quorum in the same datacenter. Do we know the datacenter name in our case?
This looks like in AQS there is only 1 datacenter called "datacenter1", but maybe I'm not looking at the right thing? https://wikitech.wikimedia.org/wiki/Cassandra/Tools/cdsh
[09:11:36] our servers make me think it should be codfw, but maybe it's configured differently? `cassandra-dev2001-a.codfw.wmnet;cassandra-dev2001-b.codfw.wmnet;...`
[10:16:36] (CR) Gkyziridis: [C:+1] "Thnx for adding loading tests Kevin!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1204730 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:22:22] (CR) Kevin Bazira: [C:+2] locust: add revertrisk-wikidata load test [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1204730 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:23:33] (Merged) jenkins-bot: locust: add revertrisk-wikidata load test [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1204730 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:27:07] bartosz: I think it should be codfw.. we can try if it works
[10:30:44] Machine-Learning-Team, Essential-Work: Fix logging on Revertrisk - https://phabricator.wikimedia.org/T409931#11373736 (gkyziridis)
[10:34:40] aiko: yess, let's do it
[10:40:12] Oki, deployed on experimental, but we're suffering connection timeouts to Cassandra
[10:43:17] I hot-edited datacenter to datacenter1 and the same timeout happens, so it's not this
[11:13:38] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11373923 (BWojtowicz-WMF) **Update / Task on pause** The task has been put on pause for the past month due to moving res...
[11:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[11:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:38:20] (PS1) AikoChou: llm: bump transformers to v4.51.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1205138 (https://phabricator.wikimedia.org/T403599)
[13:14:48] (CR) Dpogorzelski: [C:+1] llm: bump transformers to v4.51.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1205138 (https://phabricator.wikimedia.org/T403599) (owner: AikoChou)
[13:39:29] (CR) AikoChou: [C:+2] llm: bump transformers to v4.51.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1205138 (https://phabricator.wikimedia.org/T403599) (owner: AikoChou)
[13:40:43] (Merged) jenkins-bot: llm: bump transformers to v4.51.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1205138 (https://phabricator.wikimedia.org/T403599) (owner: AikoChou)
[14:36:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[14:36:49] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[14:36:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:53:08] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11374466 (JMonton-WMF) > @Ottomata, what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some h...
[15:19:55] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11374579 (dcausse) If pushing to kafka-main you might need to increase broker's `message.max.bytes` see T344688.
[15:21:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:21:49] Deployment aya-llm-predictor-00003-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00003-deployment - ...
[15:21:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:24:34] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11374597 (tchin) Would we also need to explicitly create the topics in main? Is auto topic creation enabled there?
[15:59:39] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Task generation engine for Revise Tone task - https://phabricator.wikimedia.org/T408341#11374729 (achou) **Weekly Report** Progress update on the hypothesis for the week, including if something has shipped: - Initial task generation T408533...
[17:35:28] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11375060 (dcausse) >>! In T409469#11374597, @tchin wrote: > Would we also need to explicitly create the topics in main? Is auto topic creation...
[17:51:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00003-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:11:50] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11375298 (Trokhymovych) Hi @kevinbazira I have reviewed the [[ https://gerrit.wikimedia.o...
[19:16:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00003-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:35:26] (PS1) Sbisson: Prevent calling wikidata with no titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1205206
[23:17:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:17:04] Deployment aya-llm-predictor-00003-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00003-deployment - ...
[23:17:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas