[00:46:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:46:04] Deployment aya-llm-predictor-00005-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00005-deployment - ...
[00:46:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[03:20:59] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11382019 (Sucheta-Salgaonkar-WMF) @AikoChou or @BWojtowicz-WMF...
[04:46:03] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[04:46:03] Deployment aya-llm-predictor-00005-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00005-deployment - ...
[04:46:03] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:18:49] o/
[06:18:55] thanks for the review George.
[06:18:55] going to deploy latest rrwikidata image in staging ...
[06:27:25] pods up and running in LW: https://phabricator.wikimedia.org/P85352
[06:46:38] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11382187 (kevinbazira) @Trokhymovych, thank you for reviewing the revertrisk-wikidata model...
[07:49:16] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11382296 (achou) Hi, thanks all for the input. :) Due to our tight timeline, ML team has decided to move forward with Option D for now. > Tha...
[07:59:09] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11382311 (achou) Open→Resolved a:achou A contin...
[08:00:43] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11382318 (achou) a:DPogorzelski-WMF
[08:01:09] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11382320 (achou) a:DPogorzelski-WMF→None
[08:02:22] Machine-Learning-Team, EditCheck, VisualEditor: [SPIKE] Define process for validating Tone Check model eval data for languages staff members do not speak - https://phabricator.wikimedia.org/T407155#11382325 (achou) a:gkyziridis
[08:02:54] good morning
[08:03:44] (CR) Bartosz Wójtowicz: [C:+2] revise-tone-task-generator: Do not query when deleting from cache. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206358 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[08:06:17] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11382330 (Joe) I'm very happy you're going with the option @jijiki recommended, which sounds like both the path of least resistance and the be...
[08:12:00] (Merged) jenkins-bot: revise-tone-task-generator: Do not query when deleting from cache. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206358 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[08:15:56] Machine-Learning-Team, Essential-Work: Introduce re-try mechanisms for MW API requests in LiftWing models - https://phabricator.wikimedia.org/T407843#11382356 (achou) T363725 might be related, though that task focuses on handling redirects.
[08:21:23] Machine-Learning-Team, Essential-Work: Iterate on Annotool functionality to support more use cases - https://phabricator.wikimedia.org/T409866#11382377 (achou)
[08:25:28] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: AI/ML Infrastructure Request: Assistance in Rolling out Revert Risk to wikis that don't have damaging/goodfaith models - https://phabricator.wikimedia.org/T408607#11382392 (achou) Open→Resolved
[08:27:04] Machine-Learning-Team, Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11382397 (achou) Open→Resolved
[08:43:08] (CR) Nikerabbit: Page collection validation script (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203077 (owner: Sbisson)
[08:46:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:46:04] Deployment aya-llm-predictor-00005-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00005-deployment - ...
[08:46:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:58:00] (PS1) Bartosz Wójtowicz: revise-tone: Temporarily disable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206856 (https://phabricator.wikimedia.org/T408538)
[12:06:07] (CR) CI reject: [V:-1] revise-tone: Temporarily disable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206856 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:14:18] (PS2) Bartosz Wójtowicz: revise-tone: Temporarily disable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206856 (https://phabricator.wikimedia.org/T408538)
[12:19:56] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206856 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:22:09] (CR) Bartosz Wójtowicz: [C:+2] revise-tone: Temporarily disable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206856 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:30:08] (Merged) jenkins-bot: revise-tone: Temporarily disable topic filtering. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206856 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[12:46:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[12:46:04] Deployment aya-llm-predictor-00005-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00005-deployment - ...
[12:46:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:02:31] Machine-Learning-Team, Discovery-Search (2025.10.20 - 2025.11.07): Initial task generation and ingestion to Cassandra and Search weight tags - https://phabricator.wikimedia.org/T408533#11383533 (achou) Hi @pfischer @dcausse, ML team wants to follow up on the initial ingestion process. As you mentioned be...
[13:16:04] (PS1) Dpogorzelski: llm: solve cyclic dependency [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206869
[13:20:40] (CR) Bartosz Wójtowicz: [C:+1] "Looks good, thank you! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206869 (owner: Dpogorzelski)
[13:25:19] (CR) Bartosz Wójtowicz: [C:+2] llm: solve cyclic dependency [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206869 (owner: Dpogorzelski)
[13:26:29] (Merged) jenkins-bot: llm: solve cyclic dependency [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1206869 (owner: Dpogorzelski)
[13:28:40] Machine-Learning-Team, Discovery-Search (2025.10.20 - 2025.11.07): Initial task generation and ingestion to Cassandra and Search weight tags - https://phabricator.wikimedia.org/T408533#11383565 (dcausse) >>! In T408533#11383532, @achou wrote: > Hi @pfischer @dcausse, ML team wants to follow up on the ini...
[13:55:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:55:49] Deployment aya-llm-predictor-00005-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00005-deployment - ...
[13:55:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:58:50] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11383711 (Kgraessle) Stalled→Open
[13:59:29] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11383717 (Kgraessle) This should be ready for anyone who wants to pick it up.
[14:04:44] FIRING: LiftWingServiceErrorRate: ...
[14:04:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:21:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:21:49] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[14:21:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:29:44] RESOLVED: LiftWingServiceErrorRate: ...
[14:29:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:31:44] FIRING: LiftWingServiceErrorRate: ...
[14:31:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:07:43] (PS1) Sbisson: Skip invalid link entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206888
[15:10:13] (PS1) Sbisson: Skip invalid link entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206889
[15:11:23] (PS1) Sbisson: Fix typo in cache update logs [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206890 (https://phabricator.wikimedia.org/T410396)
[15:26:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:26:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:37:26] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11384128 (RobH) Open→Resolved All machine learning hosts have been migrated, resolving this task.
[16:28:44] FIRING: LiftWingServiceErrorRate: ...
[16:28:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:23:44] RESOLVED: LiftWingServiceErrorRate: ...
[17:23:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:01:47] (PS3) Nik Gkountas: Page collections caching: Use sitematrix lang code for all articles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206883 (https://phabricator.wikimedia.org/T410387)
[18:02:20] (CR) CI reject: [V:-1] Page collections caching: Use sitematrix lang code for all articles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206883 (https://phabricator.wikimedia.org/T410387) (owner: Nik Gkountas)
[18:22:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[18:22:04] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[18:22:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:27:17] (CR) Eamedina: "Should this patch be abandoned in favor of Ie355fba0b8a41021abc52697be91b7065549a8fd?" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206888 (owner: Sbisson)
[19:27:54] (CR) Eamedina: [C:+2] Fix typo in cache update logs [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206890 (https://phabricator.wikimedia.org/T410396) (owner: Sbisson)
[19:28:35] (Merged) jenkins-bot: Fix typo in cache update logs [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206890 (https://phabricator.wikimedia.org/T410396) (owner: Sbisson)
[19:29:57] (Abandoned) Sbisson: Skip invalid link entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206888 (owner: Sbisson)
[19:34:58] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Action-API, MediaWiki-Recent-changes, and 2 others: The filter damaging=verylikelybad not availble for API:Feedrecentchanges - https://phabricator.wikimedia.org/T410435#11385657 (Pppery)
[19:50:22] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11385742 (Jclark-ctr) a:klausman→Jclark-ctr
[20:14:50] (CR) Eamedina: [C:+2] Skip invalid link entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206889 (owner: Sbisson)
[20:15:25] (Merged) jenkins-bot: Skip invalid link entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206889 (owner: Sbisson)
[22:22:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:22:04] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[22:22:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas