[01:04:41] (CR) Sbisson: [C:+2] Update section suggestion fetching to request multiple at once [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[01:06:23] (CR) CI reject: [V:-1] Update section suggestion fetching to request multiple at once [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[01:42:41] (CR) Nik Gkountas: [V:+2 C:+2] Update section suggestion fetching to request multiple at once [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[01:43:29] (CR) CI reject: [V:-1] Update section suggestion fetching to request multiple at once [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[04:43:03] (CR) KartikMistry: [C:+2] "recheck" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[04:45:11] (Merged) jenkins-bot: Update section suggestion fetching to request multiple at once [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[09:00:44] FIRING: [3x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:03:48] (CR) AikoChou: [C:+2] "I think it could work well! It would automatically handle future schema bumps without needing to update the helpers manually. Just that we" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T426807) (owner: AikoChou)
[09:05:37] (CR) Nik Gkountas: [V:+2 C:+2] "Done" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[09:08:18] (Merged) jenkins-bot: events: construct new prediction classification event independently [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T426807) (owner: AikoChou)
[10:27:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-language-agnostic-predictor-00007-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:40:44] FIRING: LiftWingServiceErrorRate: ...
[10:40:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-language-agnostic-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:47:49] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-language-agnostic-predictor-00007-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:50:44] RESOLVED: LiftWingServiceErrorRate: ...
[10:50:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-language-agnostic-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:51:34] Machine-Learning-Team (Q4 FY2025-26): ml-serve: assess and raise resource quotas to support rolling deployments in revertrisk (and other) namespaces - https://phabricator.wikimedia.org/T426947 (achou) NEW
[11:37:35] Machine-Learning-Team (Q4 FY2025-26): Fix text normalization edge cases in TTS prototype - https://phabricator.wikimedia.org/T426756#11944373 (kevinbazira) One thing we found while fixing subscript/superscript edge cases is that Wikipedia's plain-text extract API: https://en.wikipedia.org/w/api.php?action=q...
[11:39:27] Machine-Learning-Team (Q4 FY2025-26): Fix text normalization edge cases in TTS prototype - https://phabricator.wikimedia.org/T426756#11944379 (kevinbazira)
[11:40:50] Machine-Learning-Team (Q4 FY2025-26): Fix text normalization edge cases in TTS prototype - https://phabricator.wikimedia.org/T426756#11944388 (kevinbazira) Following T426756#11944373, we integrated the [NeMo text processing](https://github.com/NVIDIA/NeMo-text-processing) library into the TTS prototype sinc...
[11:42:31] Machine-Learning-Team (Q4 FY2025-26): Fix text normalization edge cases in TTS prototype - https://phabricator.wikimedia.org/T426756#11944396 (kevinbazira) We have also added a [nemo_whitelist.tsv](https://gitlab.wikimedia.org/toolforge-repos/wiki-tts/-/blob/master/wiki_tts/nemo_whitelist.tsv). Without it, N...
[11:43:22] Machine-Learning-Team (Q4 FY2025-26): Fix text normalization edge cases in TTS prototype - https://phabricator.wikimedia.org/T426756#11944398 (kevinbazira)
[13:28:14] Machine-Learning-Team (Q4 FY2025-26), OKR-Work: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P) - https://phabricator.wikimedia.org/T421461#11944915 (elukey) Open→Resolved Rolled out to all nodes!
[13:51:56] Lift-Wing, Machine-Learning-Team, Essential-Work: Requesting write access to ml-serve-{eqiad,codfq} for ML team - https://phabricator.wikimedia.org/T381883#11945071 (DPogorzelski-WMF) a:DPogorzelski-WMF
[13:53:32] Lift-Wing, Machine-Learning-Team, Essential-Work: Requesting write access to ml-serve-{eqiad,codfw} for ML team - https://phabricator.wikimedia.org/T381883#11945072 (isarantopoulos)
[14:00:04] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26): Generate OpenAPI descriptions for Lift Wing APIs - https://phabricator.wikimedia.org/T419455#11945084 (gkyziridis) Hey @apaskulin, thnx for your comments. We are still figuring out how we will expose the openapi-specs docs. We decided to go first with the...
[14:11:56] Machine-Learning-Team, Ceph, Infrastructure-Foundations, SRE-swift-storage, Patch-For-Review: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11945134 (elukey) I had a chat with @JMeybohm the other day, and he pointed out a very wise thing - when w...
[14:50:27] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26): Generate OpenAPI descriptions for Lift Wing APIs - https://phabricator.wikimedia.org/T419455#11945316 (apaskulin)
[14:53:14] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26): Generate OpenAPI descriptions for Lift Wing APIs - https://phabricator.wikimedia.org/T419455#11945338 (apaskulin) >>! In T419455#11945084, @gkyziridis wrote: > We decided to go first with the current models that are configured in [[ https://gerrit.wikimedi...
[15:58:29] Machine-Learning-Team, Add-Link-Structured-Task, Community Feedback (Growth), Growth-Team: AI/ML model update request: Named Entity Recognition for Add-a-Link - https://phabricator.wikimedia.org/T405185#11945641 (KStoller-WMF) @Sucheta-Salgaonkar-WMF - @Sdkb I discussed this task again as this...