[03:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[03:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:19:24] (PS1) Kevin Bazira: locust: add revertrisk-wikidata load test [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1204730 (https://phabricator.wikimedia.org/T406179)
[06:03:04] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 2 others: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11369351 (kevinbazira) I have run locust load tests on the revertrisk-wikidata staging isvc for 120s with 2 users, each sen...
[06:03:27] (CR) Kevin Bazira: "This has been tested on the `stat1008` machine as shown in: https://phabricator.wikimedia.org/T406179#11369351" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1204730 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[07:17:04] good morning
[07:23:26] morning!
[07:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[07:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:55:54] (PS5) Bartosz Wójtowicz: revise-tone-task-generator: Add cache to the model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538)
[08:11:22] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11369614 (achou) Hi @Joe! The Machine Learning and Growth teams are collaborating on a GrowthExperiments newcomer task for revising tone (associated hypotheses are WE1.1.2 & WE1.1....
[08:28:01] Machine-Learning-Team, Discovery-Search (2025.10.20 - 2025.11.07): Initial task generation and ingestion to Cassandra and Search weight tags - https://phabricator.wikimedia.org/T408533#11369663 (achou) **Update** We have the initial [[ https://drive.google.com/file/d/1omYeYlLy-lo_EZollxlTef2rrRZFhEQo/vi...
[08:40:04] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11369736 (achou) @BWojtowicz-WMF We have the initial [[ https://drive.google.com/file/d/1omYeYlLy-lo_EZollxlTef2rrRZFhEQo/view?usp=drive_link | dataset ]] for frwiki. We c...
[11:00:30] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11370216 (elukey) >>! In T409414#11350459, @Eevans wrote: >>>! In T409414#11350324, @elukey wrote: >> [ ... ] >> >> @Eevans Hi! Is there a load balancing endpoint in front of the cassandra...
[11:12:45] (CR) AikoChou: [C:+1] "LGTM! I tested it locally and it works like a charm. Just a few nits :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[11:19:25] aiko, klausman, dpogorzelski - as FYI I'll be afk tomorrow and next week :)
[11:29:31] ack!
[11:30:05] Enjoy your time off :D
[11:47:06] aye, cap'n
[11:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[11:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:37:25] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11370721 (achou)
[13:39:40] (PS6) Bartosz Wójtowicz: revise-tone-task-generator: Add cache to the model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538)
[13:40:44] (CR) Bartosz Wójtowicz: revise-tone-task-generator: Add cache to the model. (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[14:48:16] Machine-Learning-Team, Discovery-Search (2025.10.20 - 2025.11.07): Initial task generation and ingestion to Cassandra and Search weight tags - https://phabricator.wikimedia.org/T408533#11370900 (achou) After meeting with @Michael today, we agreed to first enable **Testwiki** for more controlled experimen...
[15:02:30] dpogorzelski, klausman - as FYI I applied https://github.com/ROCm/amdsmi/pull/136/files manually on ml-serve1012, it is basically the PR that upstream made to fix an issue that I reported a while ago. Hopefully they will release it soon-ish, so we'll not need this hack
[15:08:22] just tested it setting CPX on ml-serve1012 and it worked nicely
[15:19:35] :+1:
[15:51:02] aiko: o/ if you have a min - aya fails with https://phabricator.wikimedia.org/P85318
[15:51:06] (kserve container)
[15:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[15:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:52:32] elukey: yes! I was looking into that this morning
[15:52:41] was aya ever tested on an AMD GPU? Namely, do we have the right support for it?
[15:52:56] because it smells like something related to not finding rocm/AMD libs
[15:53:07] the issue seems to be with bitsandbytes, we can try to disable it
[15:53:53] yes aya was tested on AMD GPU
[15:53:54] does it support rocm? It may be cuda only..
[15:53:58] okok perfect
[15:54:04] weird then
[15:54:23] but Ilias tried more stuff after like bitsandbytes
[15:54:41] aiko: lol https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1573
[15:54:53] flash_attention etc
[15:56:24] https://github.com/soghomon-b/transformers/commit/ef4ab7b47aae4b645225797aca2e5d1296118f61
[15:56:42] (CR) Bartosz Wójtowicz: [C:+2] revise-tone-task-generator: Add cache to the model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[15:57:08] sorry this is the one https://github.com/huggingface/transformers/commit/021006e1b00f0ee325e9e17d99985dac7abdc755
[15:57:19] maybe we just need to bump the transformers lib?
[15:58:01] yeah looks like
[16:00:04] we are using transformers==4.46.3 now
[16:00:42] that is v4.57.1
[16:01:58] I'll test it tomorrow :)
[16:02:15] thank you Luca <3
[16:02:32] (Merged) jenkins-bot: revise-tone-task-generator: Add cache to the model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[16:15:02] aiko: it went out with https://github.com/huggingface/transformers/releases/tag/v4.51.0, so anything >= should work!
[18:07:14] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other): Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11371974 (SBisson) p:Medium→High This is happening again. According to log stash all instances have updated th...
[18:34:20] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11372149 (Eevans) >>! In T409414#11370216, @elukey wrote: > [ ... ] > @Eevans was there any discussion about adding an LVS endpoint in front of the Cassandra nodes? It shouldn't be super dif...
[18:48:22] (PS1) Sbisson: Error handling and update schedule [research/recommendation-api] - https://gerrit.wikimedia.org/r/1204944 (https://phabricator.wikimedia.org/T406854)
[19:26:58] (PS2) Sbisson: Error handling and update schedule [research/recommendation-api] - https://gerrit.wikimedia.org/r/1204944 (https://phabricator.wikimedia.org/T406854)
[19:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[19:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[19:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[20:33:02] Machine-Learning-Team, Patch-For-Review: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11372527 (jsn.sherman) Sorry for the slow turn response; I followed the updated readme in the attached patch and I'm able to stand up the model server 🎉;...
[22:07:19] (CR) Jforrester: [C:+2] build: Update MediaWiki requirement to 1.46 [extensions/ORES] - https://gerrit.wikimedia.org/r/1203988 (https://phabricator.wikimedia.org/T409239) (owner: Reedy)
[22:10:46] (CR) Nik Gkountas: [C:+2] Error handling and update schedule [research/recommendation-api] - https://gerrit.wikimedia.org/r/1204944 (https://phabricator.wikimedia.org/T406854) (owner: Sbisson)
[22:11:23] (Merged) jenkins-bot: Error handling and update schedule [research/recommendation-api] - https://gerrit.wikimedia.org/r/1204944 (https://phabricator.wikimedia.org/T406854) (owner: Sbisson)
[22:32:21] (Merged) jenkins-bot: build: Update MediaWiki requirement to 1.46 [extensions/ORES] - https://gerrit.wikimedia.org/r/1203988 (https://phabricator.wikimedia.org/T409239) (owner: Reedy)
[23:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[23:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas