[05:32:50] Machine-Learning-Team: Add Slack notifications for Prometheus Alertmanager for ml-team - https://phabricator.wikimedia.org/T421040#11837485 (isarantopoulos) Open→Resolved
[05:33:04] Machine-Learning-Team, Semantic Search: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11837486 (isarantopoulos) Open→Resolved
[05:33:07] Machine-Learning-Team: Move inference-services repo from Gerrit to GitLab - https://phabricator.wikimedia.org/T408690#11837488 (isarantopoulos) Declined→Resolved
[05:33:08] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11837487 (isarantopoulos) Declined→Resolved
[05:33:11] Machine-Learning-Team, Prod-Kubernetes, ServiceOps new, Essential-Work, Kubernetes: Update kserve to v0.15.2* on ML clusters - https://phabricator.wikimedia.org/T380722#11837492 (isarantopoulos) Declined→Resolved
[05:33:13] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Build and Publish ROCm-Compatible Python Packages - https://phabricator.wikimedia.org/T381859#11837491 (isarantopoulos) Declined→Resolved
[05:33:15] Machine-Learning-Team: Edit Suggestions - Edit suggestion generation with loose edit types - https://phabricator.wikimedia.org/T418097#11837497 (isarantopoulos) Open→Resolved
[05:33:19] Machine-Learning-Team, Prod-Kubernetes, ServiceOps new, Essential-Work, and 2 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#11837494 (isarantopoulos) Declined→Resolved
[05:33:25] Machine-Learning-Team: Add Slack notifications for Prometheus Alertmanager for ml-team - https://phabricator.wikimedia.org/T421040#11837502 (isarantopoulos) Nothing else is needed here. The column trigger that also resolves the task didn't seem to work.
[05:33:35] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#11837504 (isarantopoulos) Invalid→Resolved
[05:33:36] Machine-Learning-Team, Semantic Search, Goal, OKR-Work: Q2 FY2025-26 Goal: Semantic Search - Embeddings Service for MVP - https://phabricator.wikimedia.org/T412338#11837503 (isarantopoulos) Open→Resolved
[05:33:39] Machine-Learning-Team, Goal, OKR-Work: Edit Suggestions - Formatting for html to text - https://phabricator.wikimedia.org/T419840#11837505 (isarantopoulos) Open→Resolved
[05:33:47] Machine-Learning-Team, Add-Link-Structured-Task, Growth-Team: Add a Link: Remove Country and Continent names in suggestions - https://phabricator.wikimedia.org/T414297#11837508 (isarantopoulos) Open→Resolved
[05:33:50] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11837510 (isarantopoulos) Open→Resolved a:isarantopoulos
[05:34:06] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11837515 (isarantopoulos) Open→Resolved a:isarantopoulos
[05:34:07] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11837514 (isarantopoulos) Open→Resolved
[05:34:10] Machine-Learning-Team, Data-Engineering, serviceops-deprecated: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11837512 (isarantopoulos) Declined→Resolved
[05:34:15] Machine-Learning-Team, Essential-Work: errors in revscoring-editquality-goodfaith - https://phabricator.wikimedia.org/T413854#11837518 (isarantopoulos) Open→Resolved
[05:34:27] Machine-Learning-Team, EditCheck, VisualEditor: [SPIKE] Define process for validating Tone Check model eval data for languages staff members do not speak - https://phabricator.wikimedia.org/T407155#11837519 (isarantopoulos) Open→Resolved
[05:34:28] Machine-Learning-Team: model reference-risk: reference_risk_score is always 0. - https://phabricator.wikimedia.org/T410744#11837521 (isarantopoulos) Open→Resolved
[05:34:30] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11837522 (isarantopoulos) Resolved→Declined
[05:34:34] Machine-Learning-Team: Move inference-services repo from Gerrit to GitLab - https://phabricator.wikimedia.org/T408690#11837523 (isarantopoulos) Resolved→Declined
[05:34:38] Machine-Learning-Team, Goal: Merge articletopic outlink model transformer and predictor pods - https://phabricator.wikimedia.org/T404294#11837524 (isarantopoulos) In progress→Resolved
[05:34:46] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Build and Publish ROCm-Compatible Python Packages - https://phabricator.wikimedia.org/T381859#11837526 (isarantopoulos) Resolved→Declined
[05:34:56] Machine-Learning-Team, Prod-Kubernetes, ServiceOps new, Essential-Work, and 2 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#11837527 (isarantopoulos) Resolved→Declined
[07:48:08] Lift-Wing, Machine-Learning-Team, ServiceOps new, ServiceOps-SharedInfra, and 5 others: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804#11837749 (Clement_Goubert) >>! In T422804#11833678, @KartikMistry wrote: > @Clement_Goubert Hi, do we need to change anything on the recom...
[08:46:24] Hello, anyone know where I can find the documentation for the liftwing-inference-services-edit-check API? It doesn't seem to be in https://api.wikimedia.org/wiki/Lift_Wing_API/Reference and I'd like to at least have a test case for it when we start moving traffic from the api-gateway to the rest-gateway
[08:51:36] Best I can come up with on short notice is the httpbb checks (modules/profile/files/httpbb/liftwing/production/test_editcheck.yaml)
[08:51:59] Ah I forgot we had those
[08:52:03] Thanks I'll take a look
[08:59:36] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838027 (DPogorzelski-WMF) Open→Resolved
[09:01:45] Lift-Wing, Machine-Learning-Team, SRE: Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838029 (DPogorzelski-WMF)
[09:13:19] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11838048 (isarantopoulos)
[09:41:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:41:49] Deployment embeddings-staging-predictor-00008-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-staging-predictor-00008-deployment - ...
[09:41:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:48:47] (PS1) Gkyziridis: revertrisk-multilingual: Optimize rr-multilingual model for asynchronous predictions. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1275362 (https://phabricator.wikimedia.org/T415892)
[09:48:59] ^^^ I am fixing the above as we speak
[09:49:09] (PS2) Gkyziridis: revertrisk-multilingual: Optimize rr-multilingual model for asynchronous predictions. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1275362 (https://phabricator.wikimedia.org/T415892)
[09:51:40] Lift-Wing, Machine-Learning-Team, ServiceOps new, ServiceOps-SharedInfra, and 5 others: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804#11838243 (Clement_Goubert)
[09:51:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[09:51:49] Deployment embeddings-staging-predictor-00008-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-staging-predictor-00008-deployment - ...
[09:51:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:02:45] Lift-Wing, Machine-Learning-Team, ServiceOps new, ServiceOps-SharedInfra, and 5 others: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804#11838259 (Clement_Goubert)
[10:20:46] Lift-Wing, Machine-Learning-Team, ServiceOps new, ServiceOps-SharedInfra, and 5 others: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804#11838290 (Clement_Goubert)
[10:21:37] Lift-Wing, Machine-Learning-Team, ServiceOps new, ServiceOps-SharedInfra, and 5 others: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804#11838291 (Clement_Goubert)
[12:12:55] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26): Generate OpenAPI descriptions for Lift Wing APIs - https://phabricator.wikimedia.org/T419455#11838634 (isarantopoulos)
[12:48:20] Machine-Learning-Team, Data-Engineering, serviceops-deprecated: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11838726 (isarantopoulos) Resolved→Declined Reverting as it was accidentally resolved
[13:08:39] Machine-Learning-Team: Add Slack notifications for Prometheus Alertmanager for ml-team - https://phabricator.wikimedia.org/T421040#11838821 (Aklapper) @isarantopoulos: There is no [column trigger](https://www.mediawiki.org/wiki/Phabricator/Project_management#Automated_actions_via_column_triggers) on that...
[14:41:13] (CR) AikoChou: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1270939 (https://phabricator.wikimedia.org/T416384) (owner: AikoChou)
[14:44:37] (Merged) jenkins-bot: python/logging_utils: add configurable framework logger level overrides [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1270939 (https://phabricator.wikimedia.org/T416384) (owner: AikoChou)
[15:47:00] (CR) Ozge: [C:+1] revertrisk-multilingual: Optimize rr-multilingual model for asynchronous predictions. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1275362 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[18:46:44] FIRING: LiftWingServiceErrorRate: ...
[18:46:49] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=article-models&var-backend=articlequality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[19:06:44] RESOLVED: LiftWingServiceErrorRate: ...
[19:06:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=article-models&var-backend=articlequality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[23:48:18] Machine-Learning-Team, Add-Link-Structured-Task, Community Feedback (Growth), Growth-Team: AI/ML model update request: Named Entity Recognition for Add-a-Link - https://phabricator.wikimedia.org/T405185#11842100 (KStoller-WMF) Note that a similar underlying issue is being [[ https://www.mediawiki...