[08:20:08] good morning!
[08:45:03] good morning
[09:02:02] Machine-Learning-Team, collaboration-services, Discovery-Search, Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10612503 (Gehel) I see that you are interested in using the Search API, in partic...
[09:09:28] Machine-Learning-Team, collaboration-services, Discovery-Search (2025.03.01 - 2025.03.21), Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10612530 (Gehel)
[09:16:16] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10612535 (isarantopoulos) Providing an update with no clear improvement at the moment. The below results is testing just a small sample of requests against the same da...
[09:18:51] Machine-Learning-Team: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211 (achou) NEW
[09:21:11] Machine-Learning-Team: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211#10612579 (achou)
[09:27:00] (PS1) Ilias Sarantopoulos: reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380
[09:29:09] aiko: I want to try increasing the workers. If it doesn't work as expected I was considering separating the service to 2 different ones. This could happen in parallel in a new deployment so the current one won't be affected. wdyt?
[09:45:04] yeah let's try it
[09:45:08] (CR) AikoChou: [C:+1] reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380 (owner: Ilias Sarantopoulos)
[09:47:41] I'm looking at ur test results: ref-risk from 2126 -> 134, ref-need from 3546 -> 2423. ref-risk improves a lot!
[09:50:30] since ref-need is non blocking anymore ref-risk can continue to serve requests. the bottleneck seems to be the predict of reference-need because of the sequential inference as you mentioned
[09:53:35] I'll update the above patch to use ray instead. It seems it has been stabilized in latest releases
[09:54:28] running some local tests with this one first. perhaps I'll try both
[09:56:00] Machine-Learning-Team: Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215 (achou) NEW
[09:59:29] hmm getting an error with ray when I try to use the decorator @serve.deployment(name="reference-need", num_replicas=2)
[09:59:34] ```
[09:59:34] File "/opt/lib/venv/lib/python3.11/site-packages/ray/serve/deployment.py", line 123, in __init__
[09:59:34] raise RuntimeError(
[09:59:34] RuntimeError: The Deployment constructor should not be called directly. Use `@serve.deployment` instead.
[09:59:34] ```
[10:00:16] I suspect it has to do with the fact that one model server extends the other so adding the decorator in both messes things up.
going to just try the 2 workers for now
[10:02:06] (CR) Ilias Sarantopoulos: [C:+2] reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380 (owner: Ilias Sarantopoulos)
[10:02:50] (Merged) jenkins-bot: reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380 (owner: Ilias Sarantopoulos)
[10:17:38] Machine-Learning-Team: Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10612774 (achou)
[10:26:57] Machine-Learning-Team: Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10612824 (achou) Related task: T387925
[10:37:12] Machine-Learning-Team: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211#10612844 (achou) Another [[ https://gitlab.wikimedia.org/-/snippets/161 | training code ]] (mBERT - trained on enwiki) for reference.
[11:12:42] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10612894 (isarantopoulos) I tried adding a ray worker but I'm getting an error which is probably due to the fact that the ReferenceRisk class extends the ReferenceNeed...
[11:28:57] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10612919 (isarantopoulos) There is also an issue with the resource quotas in the revision models namespace and when the max number of replicas are...
[11:33:09] * isaranto afk lunch!
[13:33:06] I'm facing the same issue with the deployment now -- we've hit the resource quota.
I think if we bump it a bit it will give some room for the new revision to be scheduled https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1125422
[13:50:38] isaranto: o/ as a rule of thumb, calculate 150% of the max capacity for quotas, so you have room for deployments etc..
[13:50:42] this is what I try to do
[13:51:41] ack, thanks! I think this service needs to be redesigned :(
[13:56:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:56:49] Deployment reference-quality-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00010-deployment ...
[13:56:49] - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:02:17] that is me! there is no available replica for revision reference-quality-predictor-00010
[14:40:25] guys is gerrit down? or slow?
[14:43:13] it is yes
[14:43:23] sres are working on it :)
[14:43:24] works fine here
[14:46:12] yeap it works again now
[14:46:15] thnx folks
[14:57:54] isaranto: if you want me to +2 and push the quota change, lmk
[14:58:56] klausman: yes please do! I hope it will allow the current deployment to succeed
[15:01:27] danke schön!
[15:08:45] isaranto: eqiad has been done. lmk if-when I should proceed with codfw
[15:08:52] Machine-Learning-Team: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10613535 (achou) **Summary** SHAP values come from cooperative game theory and help explain the contribution of each feature (in our case, each word or token) to a model’s prediction....
[15:12:33] klausman: you can go ahead. thanks!
I see the new revision scheduled now on eqiad
[15:12:49] ack, starting
[15:13:07] the issue I see is that not all replicas can be scheduled at once so there might be some errors until all are up
[15:13:27] I'll monitor that in future deployments
[15:13:44] and done
[15:15:43] thank yoouu
[15:16:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[15:16:49] Deployment reference-quality-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00010-deployment ...
[15:16:49] - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:29:36] (PS1) Ilias Sarantopoulos: Revert "reference-quality: test async predict" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125460
[15:29:56] (PS2) Ilias Sarantopoulos: Revert "reference-quality: test async predict" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125460
[15:37:23] I'm going to merge the above and deploy it as it seems that predict latencies are increasing. Preprocess on the other hand has dropped significantly (due to using 2 workers)
[15:45:31] (CR) Ilias Sarantopoulos: [C:+2] Revert "reference-quality: test async predict" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125460 (owner: Ilias Sarantopoulos)
[15:54:12] sry for the above -- never deploy on a friday afternoon :D
[16:41:38] no improvement
[16:41:40] * isaranto sighs
[16:42:15] I'll monitor and revert completely if I see nothing changes
[16:42:47] aiko: I was thinking then to separate the 2 services so that we can handle them individually. wdyt?
[16:44:06] going afk folks for now, have a nice weekend!
[18:37:30] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing liftwing access to beta cluster / patchdemo - https://phabricator.wikimedia.org/T388269 (VPuffetMichel) NEW
[18:37:51] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing liftwing access to beta cluster / patchdemo - https://phabricator.wikimedia.org/T388269#10614355 (VPuffetMichel)
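
Note on the quota rule of thumb from the 13:50 message ("calculate 150% of the max capacity for quotas"): a minimal sketch of that arithmetic, to show why a quota sized at exactly the max capacity leaves no room for a new revision to be scheduled next to the old one. The function names, the replica count, and the per-replica CPU figure are hypothetical and are not taken from the deployment-charts change.

```python
def quota_with_headroom(max_replicas: int, per_replica: float, headroom: float = 1.5) -> float:
    """Size a namespace quota at `headroom` times the peak steady-state usage.

    `per_replica` is the resource request of one replica (e.g. CPUs).
    The 1.5 default mirrors the 150% rule of thumb quoted in the log.
    """
    if max_replicas < 0 or per_replica < 0 or headroom < 1:
        raise ValueError("replicas and requests must be >= 0, headroom >= 1")
    return max_replicas * per_replica * headroom


def fits_during_rollout(quota: float, max_replicas: int, per_replica: float, new_replicas: int) -> bool:
    """Check whether old and new replicas can coexist under the quota.

    During a rollout the old revision keeps running while `new_replicas`
    of the new revision are being scheduled.
    """
    return (max_replicas + new_replicas) * per_replica <= quota


# Hypothetical numbers: 4 replicas at peak, 2 CPUs requested per replica.
peak_replicas, cpus_each = 4, 2.0
quota = quota_with_headroom(peak_replicas, cpus_each)  # 4 * 2.0 * 1.5 = 12.0
print(quota)
print(fits_during_rollout(quota, peak_replicas, cpus_each, new_replicas=2))  # 12.0 <= 12.0
print(fits_during_rollout(8.0, peak_replicas, cpus_each, new_replicas=1))    # 10.0 > 8.0
```

With the hypothetical 12-CPU quota, two replicas of a new revision fit next to the four existing ones; with a quota equal to the 8-CPU peak, not even one does, which matches the unavailable-replicas situation described above.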