[08:20:08] <isaranto>	 good morning!
[08:45:03] <georgekyz>	 good morning
[09:02:02] <wikibugs>	 Machine-Learning-Team, collaboration-services, Discovery-Search, Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10612503 (Gehel) I see that you are interested in using the Search API, in partic...
[09:09:28] <wikibugs>	 Machine-Learning-Team, collaboration-services, Discovery-Search (2025.03.01 - 2025.03.21), Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10612530 (Gehel)
[09:16:16] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10612535 (isarantopoulos) Providing an update with no clear improvement at the moment.  The below results is testing just a small sample of requests against the same da...
[09:18:51] <wikibugs>	 Machine-Learning-Team: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211 (achou) NEW
[09:21:11] <wikibugs>	 Machine-Learning-Team: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211#10612579 (achou)
[09:27:00] <wikibugs>	 (PS1) Ilias Sarantopoulos: reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380
[09:29:09] <isaranto>	 aiko: I want to try increasing the workers. If it doesn't work as expected, I was considering splitting the service into 2 separate ones. This could happen in parallel in a new deployment so the current one won't be affected. wdyt?
[09:45:04] <aiko>	 yeah let's try it
[09:45:08] <wikibugs>	 (CR) AikoChou: [C:+1] reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380 (owner: Ilias Sarantopoulos)
[09:47:41] <aiko>	 I'm looking at ur test results: ref-risk from 2126 -> 134, ref-need from 3546 -> 2423. ref-risk improves a lot!
[09:50:30] <isaranto>	 since ref-need is no longer blocking, ref-risk can continue to serve requests. the bottleneck seems to be the predict step of reference-need because of the sequential inference as you mentioned
[09:53:35] <isaranto>	 I'll update the above patch to use ray instead. It seems it has been stabilized in the latest releases
[09:54:28] <isaranto>	 running some local tests with this one first. perhaps I'll try both
[09:56:00] <wikibugs>	 Machine-Learning-Team: Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215 (achou) NEW
[09:59:29] <isaranto>	 hmm getting an error with ray when I try to use the decorator @serve.deployment(name="reference-need", num_replicas=2)
[09:59:34] <isaranto>	 ```
[09:59:34] <isaranto>	   File "/opt/lib/venv/lib/python3.11/site-packages/ray/serve/deployment.py", line 123, in __init__
[09:59:34] <isaranto>	     raise RuntimeError(
[09:59:34] <isaranto>	 RuntimeError: The Deployment constructor should not be called directly. Use `@serve.deployment` instead.
[09:59:34] <isaranto>	 ```
[10:00:16] <isaranto>	 I suspect it has to do with the fact that one model server extends the other, so adding the decorator to both messes things up. going to just try the 2 workers for now
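The traceback above is consistent with that guess. If a class decorator replaces the class with a wrapper object, then subclassing the wrapper makes Python call the wrapper type's constructor with (name, bases, namespace). A minimal sketch of that mechanism in plain Python follows. `FakeDeployment` and `deployment` are hypothetical stand-ins, not Ray's actual code, and it is an assumption that Ray Serve fails here for exactly this reason.

```python
class FakeDeployment:
    """Hypothetical stand-in for the wrapper object a class decorator returns."""

    def __init__(self, *args, _internal=False):
        # Only the decorator is allowed to build this object.
        if not _internal:
            raise RuntimeError("constructor should not be called directly")
        self.wrapped = args[0]


def deployment(cls):
    """Hypothetical decorator: replaces the class with a wrapper instance."""
    return FakeDeployment(cls, _internal=True)


@deployment
class ReferenceNeed:
    pass


def try_subclass():
    """Subclass the decorated name and report the error message, if any."""
    try:
        # ReferenceNeed is now a FakeDeployment instance, not a class, so
        # Python uses type(ReferenceNeed) as the metaclass and calls
        # FakeDeployment("ReferenceRisk", (ReferenceNeed,), namespace).
        class ReferenceRisk(ReferenceNeed):
            pass
    except RuntimeError as exc:
        return str(exc)
    return None


print(try_subclass())
```

If this is what happens in the real service, one option would be to keep the undecorated classes for inheritance and apply the decorator only to the final class that gets deployed.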
[10:02:06] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380 (owner: Ilias Sarantopoulos)
[10:02:50] <wikibugs>	 (Merged) jenkins-bot: reference-quality: increase concurrency by adding a second worker [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125380 (owner: Ilias Sarantopoulos)
[10:17:38] <wikibugs>	 Machine-Learning-Team: Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10612774 (achou)
[10:26:57] <wikibugs>	 Machine-Learning-Team: Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10612824 (achou) Related task: T387925
[10:37:12] <wikibugs>	 Machine-Learning-Team: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211#10612844 (achou) Another [[ https://gitlab.wikimedia.org/-/snippets/161  | training code ]] (mBERT - trained on enwiki) for reference.
[11:12:42] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10612894 (isarantopoulos) I tried adding a ray worker but I'm getting an error which is probably due to the fact that the  ReferenceRisk class extends the ReferenceNeed...
[11:28:57] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10612919 (isarantopoulos) There is also an issue with the resource quotas in the revision models namespace and when the max number of replicas are...
[11:33:09] * isaranto afk lunch!
[13:33:06] <isaranto>	 I'm facing the same issue with the deployment now -- we've hit the resource quota. I think if we bump it a bit it will give some room for the new revision to be scheduled https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1125422
[13:50:38] <elukey>	 isaranto: o/ as a rule of thumb, calculate 150% of the max capacity for quotas, so you have room for deployments etc..
[13:50:42] <elukey>	 this is what I try to do 
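The 150% rule of thumb above can be sketched as simple arithmetic. The replica count and per-replica CPU/memory figures below are hypothetical, not the actual values for the revision-models namespace, and `quota_with_headroom` is an illustrative helper name.

```python
import math


def quota_with_headroom(max_replicas, per_replica, headroom=1.5):
    """Quota = peak usage (max replicas x per-replica request) x headroom factor."""
    return math.ceil(max_replicas * per_replica * headroom)


# Hypothetical service: up to 4 replicas, each requesting 2 CPUs and 6 GiB.
cpu_quota = quota_with_headroom(4, 2)
memory_quota_gib = quota_with_headroom(4, 6)
print(cpu_quota, memory_quota_gib)
```

The extra 50% leaves room for a new revision's pods to be scheduled while the old revision is still running at full capacity.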
[13:51:41] <isaranto>	 ack, thanks! I think this service needs to be redesigned :(
[13:56:49] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:56:49] <jinxer-wm>	 Deployment reference-quality-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00010-deployment ...
[13:56:49] <jinxer-wm>	 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:02:17] <isaranto>	 that is me! there is no available replica for revision reference-quality-predictor-00010
[14:40:25] <georgekyz>	 guys is gerrit down? or slow ?
[14:43:13] <elukey>	 it is yes
[14:43:23] <elukey>	 sres are working on it :)
[14:43:24] <klausman>	 works fine here
[14:46:12] <georgekyz>	 yeap it works again now 
[14:46:15] <georgekyz>	 thnx folks
[14:57:54] <klausman>	 isaranto: if you want me to +2 and push the quota change, lmk
[14:58:56] <isaranto>	 klausman: yes please do! I hope it will allow the current deployment to succeed
[15:01:27] <isaranto>	 danke schön!
[15:08:45] <klausman>	 isaranto: eqiad has been done. lmk if-when I should proceed with codfw
[15:08:52] <wikibugs>	 Machine-Learning-Team: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10613535 (achou) **Summary**  SHAP values come from cooperative game theory and help explain the contribution of each feature (in our case, each word or token) to a model’s prediction....
[15:12:33] <isaranto>	 klausman: you can go ahead. thanks! I see the new revision scheduled now on eqiad
[15:12:49] <klausman>	 ack, starting
[15:13:07] <isaranto>	 the issue I see is that not all replicas can be scheduled at once so there might be some errors until all are up
[15:13:27] <isaranto>	 I'll monitor that in future deployments
[15:13:44] <klausman>	 and done
[15:15:43] <isaranto>	 thank yoouu
[15:16:49] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[15:16:49] <jinxer-wm>	 Deployment reference-quality-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00010-deployment ...
[15:16:49] <jinxer-wm>	 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:29:36] <wikibugs>	 (PS1) Ilias Sarantopoulos: Revert "reference-quality: test async predict" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125460
[15:29:56] <wikibugs>	 (PS2) Ilias Sarantopoulos: Revert "reference-quality: test async predict" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125460
[15:37:23] <isaranto>	 I'm going to merge the above and deploy it as it seems that predict latencies are increasing. Preprocess latencies, on the other hand, have dropped significantly (due to using 2 workers)
[15:45:31] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] Revert "reference-quality: test async predict" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125460 (owner: Ilias Sarantopoulos)
[15:54:12] <isaranto>	 sry for the above -- never deploy on a friday afternoon :D 
[16:41:38] <isaranto>	 no improvement 
[16:41:40] * isaranto sighs
[16:42:15] <isaranto>	 I'll monitor and revert completely if I see nothing changes
[16:42:47] <isaranto>	 aiko: in that case I was thinking of separating the 2 services so that we can handle them individually. wdyt?
[16:44:06] <isaranto>	 going afk folks for now, have a nice weekend!
[18:37:30] <wikibugs>	 Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing liftwing access to beta cluster / patchdemo - https://phabricator.wikimedia.org/T388269 (VPuffetMichel) NEW
[18:37:51] <wikibugs>	 Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing liftwing access to beta cluster / patchdemo - https://phabricator.wikimedia.org/T388269#10614355 (VPuffetMichel)