[02:24:36] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[02:24:36] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[02:24:36] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:38:01] Machine-Learning-Team, Wikimedia Enterprise: Test liftwing wikidata revert risk API for scale and latency - https://phabricator.wikimedia.org/T409388#11399292 (kevinbazira)
[06:24:36] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:24:36] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[06:24:36] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:22:15] (PS1) AikoChou: revise-tone-task-generator: Add paragraph extraction code for French, Arabic, Portuguese. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210399 (https://phabricator.wikimedia.org/T408538)
[08:34:57] aiko: o/
[08:35:00] back :)
[09:15:50] (CR) Bartosz Wójtowicz: [C:+1] "The code looks great to me!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210399 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[09:36:31] bartosz: o/ I am currently testing outlink in staging for the connection issue, lemme know if it is a problem
[09:50:50] elukey: o/ thank you for letting me know and for looking into it! We're not planning on touching the outlink deployment today so feel free to do anything you'd like :D
[09:53:49] (PS2) AikoChou: revise-tone-task-generator: Add paragraph extraction code for French, Arabic, Portuguese. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210399 (https://phabricator.wikimedia.org/T408538)
[09:54:30] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11399806 (elukey) IIUC Knative just sits behind the kube svcs for the inference services to provide autoscaling-like services/buffering etc.. It shouldn't influence the rou...
[10:09:43] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11399890 (elukey) >>! In T408538#11399806, @elukey wrote: > IIUC Knative just sits behind the kube svcs for the inference services to provide autoscaling-like services/buff...
[10:13:42] (PS3) AikoChou: revise-tone-task-generator: Add paragraph extraction code for French, Arabic, Portuguese. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210399 (https://phabricator.wikimedia.org/T408538)
[10:15:25] Machine-Learning-Team: model reference-risk: reference_risk_score is always 0. - https://phabricator.wikimedia.org/T410744#11399899 (Pablo) Thank you, @OKarakaya-WMF! I have just checked this other revision, https://en.wikipedia.org/w/index.php?title=Oleg_Strizhenov&oldid=1323361084, which includes a citati...
[10:24:36] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[10:24:36] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[10:24:37] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:25:56] Machine-Learning-Team, Patch-For-Review: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11399916 (elukey) I always forget that `istioctl` is always a good friend :) ` root@deploy2002:~# istioctl-1.15.7 proxy-status | grep revis reference-need-predictor-00008-...
[10:26:54] klausman: o/ I posted https://phabricator.wikimedia.org/T408538#11399916, lemme know if it works :)
[10:27:02] (CR) AikoChou: [C:+2] "Thanks for the review! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210399 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[10:30:30] (Merged) jenkins-bot: revise-tone-task-generator: Add paragraph extraction code for French, Arabic, Portuguese. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210399 (https://phabricator.wikimedia.org/T408538) (owner: AikoChou)
[10:47:41] elukey: thank you so much!
[10:48:30] elukey: The Host: header thing is something that occurred to me on Saturday afternoon (when cleaning the flat, no less), but I immediately forgot about it again %-)
[10:51:11] yep I got derailed at the beginning as well, istio-proxy is configured to allow all pods in the mesh to call each other by default, but in this case it works at L7 and it needs the right outbound policy to be picked
[10:51:17] otherwise it gets upset
[10:53:54] Makes sense. The Host: header is something I used to forget about a lot in the beginning when using curl against the discovery endpoints, I should have thought of that last week :)
[10:54:32] I always forget about istioctl, it is really really great and useful
[10:54:54] because it is a good way to get insights about how istio-proxy is configured
[10:54:56] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11400003 (OKarakaya-WMF) We got results for itwiki: {F70570186} I've checked two weeks before the release and after the release with 3 days of buffer. bef...
[10:55:08] before hitting other walls
[10:55:57] Yeah, I've put it in my "thoughts" file. Now I just need to start to actually look at it %)
[11:11:35] Machine-Learning-Team: model reference-risk: reference_risk_score is always 0. - https://phabricator.wikimedia.org/T410744#11400050 (OKarakaya-WMF) cool, thank you @Pablo , I see eadaily.com is not valid only in ruwiki: {F70572095} The model uses sqlitedb (features.db) in this link: https://analytics.wiki...
[11:22:24] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11400076 (DMburugu)
[11:41:56] Hey folks, I finished my tasks, is there anything where I can help?
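Side note on the Host: header exchange (10:48 to 10:53): a minimal sketch of what "setting the right Host: header" can look like from a client, for readers who have not hit this before. The URL, host name and payload below are made-up placeholders, not real LiftWing or discovery endpoints, and the request is only built, never sent. The idea is that the connection goes to one address while the Host: header names the service, which is presumably what an L7 proxy such as istio-proxy uses to pick the route.

```python
import urllib.request


def build_request(endpoint_url: str, service_host: str, payload: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST whose Host header differs from the URL's host.

    endpoint_url and service_host are hypothetical placeholders chosen for
    illustration; they do not come from the log above.
    """
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={
            # An explicit Host header overrides the one urllib would derive
            # from the URL when the request is eventually sent.
            "Host": service_host,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_request(
    "https://discovery.example.internal:30443/v1/models/example-model:predict",
    "example-model.example-namespace.example.internal",
    b'{"rev_id": 12345}',
)
print(req.full_url)
print(req.get_header("Host"))
```

With curl the same thing is done with the -H "Host: ..." option; forgetting it means the proxy sees the connection address as the host and may not match any route.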
[11:45:00] Machine-Learning-Team: model reference-risk: reference_risk_score is always 0. - https://phabricator.wikimedia.org/T410744#11400159 (Pablo) Thanks again, @OKarakaya-WMF! (I imagined there was no process to keep these resources updated, so I will check internally with @FNavas-foundation, @diego and @Miriam i...
[11:54:36] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11400166 (OKarakaya-WMF)
[11:56:23] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11400171 (OKarakaya-WMF) Started updating following wikis: The wikis that were released in v1 but they were under the release threshold in v1 and they are a...
[12:08:23] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11400198 (DMburugu)
[13:22:10] (PS2) Nik Gkountas: add support for combining single page collection with topic filter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1207918 (https://phabricator.wikimedia.org/T409338)
[13:22:57] (PS3) Nik Gkountas: add support for combining single page collection with topic filter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1207918 (https://phabricator.wikimedia.org/T409338)
[13:23:49] (CR) Nik Gkountas: "I used the decorator pattern to create a new recommender that handles this case. I'll add support for pageid caching in following patch." [research/recommendation-api] - https://gerrit.wikimedia.org/r/1207918 (https://phabricator.wikimedia.org/T409338) (owner: Nik Gkountas)
[13:49:01] klausman: re: gpu metrics - when multi-partitions are used, the usage metrics show up only for the first GPU/partition. Say you have a mi300x split in 8 partitions, you'll see usage metrics only for the first or "0". AMD confirmed it is like that for the moment, we'll see in the future.
[13:49:07] not great but at least we have some confirmation
[13:49:29] they also merged the fix for the CLI bug that prevented partitioning, it should be available in the next release
[13:49:39] is the zeroth GPU then a summary of all or just the first of the virtual ones?
[14:04:53] Machine-Learning-Team, Wikimedia Enterprise: Test liftwing wikidata revert risk API for scale and latency - https://phabricator.wikimedia.org/T409388#11400499 (gkyziridis) Hey @kevinbazira thank very much for running the loading tests for Revert-Risk wikidata. I think we should change a little bit the co...
[14:05:18] I have no idea, I assume so, but they didn't tell
[14:06:37] Machine-Learning-Team, Moderator-Tools-Team, PersonalDashboard: Surface edits to moderators which may require their review - https://phabricator.wikimedia.org/T404174#11400505 (Samwalton9-WMF)
[14:07:19] Machine-Learning-Team, Moderator-Tools-Team, PersonalDashboard: Surface edits to moderators which may require their review - https://phabricator.wikimedia.org/T404174#11400510 (Samwalton9-WMF)
[14:24:37] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:24:37] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[14:24:37] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:28:38] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 3 others: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11400591 (Sucheta-Salgaonkar-WMF)
[14:30:14] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 3 others: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing - https://phabricator.wikimedia.org/T406179#11400594 (Sucheta-Salgaonkar-WMF)
[14:34:15] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: AI/ML Infrastructure Request: Assistance in Rolling out Revert Risk to wikis that don't have damaging/goodfaith models - https://phabricator.wikimedia.org/T408607#11400607 (DMburugu)
[14:42:14] (PS1) Bartosz Wójtowicz: revise-tone-task-generator: Use quorum for Cassandra writes. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210602 (https://phabricator.wikimedia.org/T408538)
[14:49:02] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11400689 (gkyziridis) Progress update on the hypothesis for the week, including if something has shipped: I've collected data...
[15:06:43] (CR) AikoChou: [C:+1] revise-tone-task-generator: Use quorum for Cassandra writes. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210602 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[15:29:55] (CR) Bartosz Wójtowicz: [C:+2] "Merging this, but leaving the deployment for tomorrow!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210602 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[15:39:44] (Merged) jenkins-bot: revise-tone-task-generator: Use quorum for Cassandra writes. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210602 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[15:50:35] Machine-Learning-Team: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906 (kevinbazira) NEW
[15:57:05] (PS1) Kevin Bazira: llm: update model-server to use default dtype float16 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210620 (https://phabricator.wikimedia.org/T410906)
[15:57:55] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11401152 (kevinbazira)
[16:05:46] (CR) Dpogorzelski: [C:+1] "lgtm" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210620 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[16:10:02] (CR) Kevin Bazira: "This change is similar to the one made to the model-server in: https://phabricator.wikimedia.org/P85486#L19" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210620 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[16:10:20] (CR) Kevin Bazira: [C:+2] llm: update model-server to use default dtype float16 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210620 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[16:10:54] (Merged) jenkins-bot: llm: update model-server to use default dtype float16 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1210620 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[17:29:44] FIRING: LiftWingServiceErrorRate: ...
[17:29:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:24:37] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[18:24:37] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[18:24:39] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:29:44] RESOLVED: LiftWingServiceErrorRate: ...
[18:29:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[19:20:44] FIRING: LiftWingServiceErrorRate: ...
[19:20:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[19:25:44] RESOLVED: LiftWingServiceErrorRate: ...
[19:25:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[22:24:37] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:24:37] Deployment aya-llm-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00006-deployment - ...
[22:24:39] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas