[02:05:36] <wikibugs>	 Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10608392 (Tsevener) Adding another angle to expected requests per second: Per https://phabricator.wiki...
[08:09:50] <isaranto>	 howdy!
[08:39:39] <georgekyz>	 morning morning 
[09:23:30] <aiko>	   morning :)
[09:28:36] <isaranto>	 \o
[11:26:42] <isaranto>	 aiko: I didn't get any improvements for the ref quality models on ml-staging :(
[11:30:20] <wikibugs>	 (PS1) Kevin Bazira: article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970)
[11:35:44] <isaranto>	 however, I'll still proceed to deploy this in prod since it is an improvement
[11:39:06] <wikibugs>	 (CR) Kevin Bazira: "for more context, this patch was implemented based on Isaac's recommendations in: https://phabricator.wikimedia.org/P73436#296960 and" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[11:50:19] <aiko>	 isaranto: :(
[11:50:23] <aiko>	 ok!
[12:13:29] <wikibugs>	 (PS1) Ilias Sarantopoulos: reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129
[12:15:05] <isaranto>	 aiko: I want to make a test with asyncio for the predict function. I think that ideally we should try multiprocessing for this. I noticed high predict latencies in ml-staging: https://grafana.wikimedia.org/goto/A60O28pHg?orgId=1
[12:16:03] <isaranto>	 since this is a blocking call it would benefit if we use some sort of concurrency or multiprocessing
[12:17:14] <isaranto>	 Wdyt?
[12:20:07] <aiko>	 isaranto: predict latency being high is not new. the model scores every sentence without a reference in the article. another option is using batching for this predict function https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/reference_need/bert.py?ref_type=heads#L8
[12:21:16] <isaranto>	 Yes you're right I remembered
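The two options raised above (keeping a blocking predict call off the event loop, and scoring sentences in batches) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: `blocking_predict`, `batched`, `predict`, and the batch size are hypothetical names and values, not the code in the patch or in knowledge_integrity. Note that a worker thread only helps a CPU-bound model if the underlying library releases the GIL; otherwise multiprocessing would be the closer fit.

```python
import asyncio
import time


def blocking_predict(sentences):
    """Stand-in for a blocking model call (hypothetical, not the real model)."""
    time.sleep(0.05)  # simulate inference time for one batch
    return [len(s) % 2 for s in sentences]


def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def predict(sentences, batch_size=4):
    """Score sentences batch by batch without blocking the event loop."""
    scores = []
    for batch in batched(sentences, batch_size):
        # asyncio.to_thread runs the blocking call in a worker thread, so the
        # event loop stays free to serve other requests in the meantime.
        scores.extend(await asyncio.to_thread(blocking_predict, batch))
    return scores


if __name__ == "__main__":
    sample = [f"sentence {i}" for i in range(10)]
    print(asyncio.run(predict(sample)))
```

Batching reduces the number of model calls per article, while the thread offload only changes where the waiting happens; the two are independent and could be tested separately.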
[12:22:13] <wikibugs>	 (CR) AikoChou: [C:+1] reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129 (owner: Ilias Sarantopoulos)
[12:38:08] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129 (owner: Ilias Sarantopoulos)
[12:38:53] <wikibugs>	 (Merged) jenkins-bot: reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129 (owner: Ilias Sarantopoulos)
[13:11:20] <wikibugs>	 (PS1) Ilias Sarantopoulos: reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147
[13:13:17] <isaranto>	 aiko: sry about that --^
[14:02:02] <wikibugs>	 (CR) AikoChou: [C:+1] reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147 (owner: Ilias Sarantopoulos)
[14:05:44] <aiko>	 isaranto: I didn't spot that either ^^"
[14:08:20] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147 (owner: Ilias Sarantopoulos)
[14:09:05] <wikibugs>	 (Merged) jenkins-bot: reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147 (owner: Ilias Sarantopoulos)
[14:50:25] <wikibugs>	 (PS1) Ilias Sarantopoulos: reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172
[14:51:51] <wikibugs>	 (CR) AikoChou: [C:+1] reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172 (owner: Ilias Sarantopoulos)
[14:53:24] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172 (owner: Ilias Sarantopoulos)
[14:54:10] <wikibugs>	 (Merged) jenkins-bot: reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172 (owner: Ilias Sarantopoulos)
[17:07:30] <wikibugs>	 (CR) AikoChou: [C:+1] article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[17:12:06] <wikibugs>	 (PS2) Kevin Bazira: article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970)
[17:13:30] <wikibugs>	 (CR) Kevin Bazira: [C:+2] "thanks for the review, Aiko! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[17:14:15] <wikibugs>	 (Merged) jenkins-bot: article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[17:31:29] <isaranto>	 I have an issue with a deployment in cluster: ml-serve-eqiad, namespace: revision-models. I made a new deployment, and because we have reached the resource quota (autoscaling has reached maxReplicas) the new revision can't be scheduled. What would be a good approach? I guess for now deleting the old revision would start removing the current pods. but for the future? increasing the resource quotas?
[17:31:50] <isaranto>	 in codfw I did the deployment fine as there were still resources left
[17:41:49] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:41:49] <jinxer-wm>	 Deployment reference-quality-predictor-00009-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00009-deployment ...
[17:41:49] <jinxer-wm>	 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:43:36] <isaranto>	 and there is the alert for it
[17:43:38] <isaranto>	 :D
[17:53:19] <isaranto>	 klausman: if you're here can you advise? otherwise we can take a look in the morning
[17:53:50] <isaranto>	 I'm going to go afk for 1h but I can check later
[18:17:11] <klausman>	 If it can't be scheduled, it is _usually_ a resource issue, but I see 8 replicas and none pending atm
[18:26:57] <klausman>	 I don't think there is an easy way to move forward except adding quota. The service just needs that kinda breathing room
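The quota problem described above comes down to simple arithmetic: while a new revision rolls out, old and new pods coexist, so the namespace quota has to cover the old revision at its replica ceiling plus at least some pods of the new one. A minimal sketch of that check follows; the function names and all numbers (CPU per pod, quota size, surge count) are hypothetical and not taken from the actual cluster configuration.

```python
def can_schedule_new_revision(quota_cpu, cpu_per_pod, running_pods, new_pods):
    """Return True if `new_pods` extra pods fit under the namespace CPU quota."""
    used = cpu_per_pod * running_pods
    needed = cpu_per_pod * new_pods
    return used + needed <= quota_cpu


def quota_for_rollout(cpu_per_pod, max_replicas, surge_pods):
    """CPU quota needed to roll out while the old revision sits at max replicas."""
    return cpu_per_pod * (max_replicas + surge_pods)


if __name__ == "__main__":
    # Hypothetical numbers: 8 running replicas at 2 CPUs each, quota of 16 CPUs.
    print(can_schedule_new_revision(16, 2, running_pods=8, new_pods=1))
    print(quota_for_rollout(2, max_replicas=8, surge_pods=1))
```

With these made-up values, a quota sized exactly for the replica ceiling leaves no room for even one pod of the new revision, which matches the advice to add quota as headroom.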
[18:55:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[18:55:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:56:49] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment reference-quality-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:01:49] <jinxer-wm>	 RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment reference-quality-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:05:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[19:05:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[19:25:36] <isaranto>	 ack, thank you. I see the pods from the new revision are up at the moment