[02:05:36] Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10608392 (Tsevener) Adding another angle to expected requests per second: Per https://phabricator.wiki...
[08:09:50] howdy!
[08:39:39] morning morning
[09:23:30] morning :)
[09:28:36] \o
[11:26:42] aiko: I didn't get any improvements for the ref quality models on ml-staging :(
[11:30:20] (PS1) Kevin Bazira: article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970)
[11:35:44] however, I'll still proceed to deploy this in prod since it is an improvement
[11:39:06] (CR) Kevin Bazira: "for more context, this patch was implemented based on Isaac's recommendations in: https://phabricator.wikimedia.org/P73436#296960 and" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[11:50:19] isaranto: :(
[11:50:23] ok!
[12:13:29] (PS1) Ilias Sarantopoulos: reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129
[12:15:05] aiko: I want to make a test with asyncio for the predict function. I think that ideally we should try multiprocessing for this. I noticed high predict latencies in ml-staging https://grafana.wikimedia.org/goto/A60O28pHg?orgId=1
[12:16:03] since this is a blocking call it would benefit if we use some sort of concurrency or multiprocessing
[12:17:14] Wdyt?
[12:20:07] isaranto: predict latency being high is not new. the model scores every sentence without a reference in the article. another option is using batch for this predict function https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/reference_need/bert.py?ref_type=heads#L8
[12:21:16] Yes you're right I remembered
[12:22:13] (CR) AikoChou: [C:+1] reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129 (owner: Ilias Sarantopoulos)
[12:38:08] (CR) Ilias Sarantopoulos: [C:+2] reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129 (owner: Ilias Sarantopoulos)
[12:38:53] (Merged) jenkins-bot: reference-quality: test async predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125129 (owner: Ilias Sarantopoulos)
[13:11:20] (PS1) Ilias Sarantopoulos: reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147
[13:13:17] aiko: sry about that --^
[14:02:02] (CR) AikoChou: [C:+1] reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147 (owner: Ilias Sarantopoulos)
[14:05:44] isaranto: I didn't spot that either ^^"
[14:08:20] (CR) Ilias Sarantopoulos: [C:+2] reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147 (owner: Ilias Sarantopoulos)
[14:09:05] (Merged) jenkins-bot: reference-quality: fix ref-risk classify [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125147 (owner: Ilias Sarantopoulos)
[14:50:25] (PS1) Ilias Sarantopoulos: reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172
[14:51:51] (CR) AikoChou: [C:+1] reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172 (owner: Ilias Sarantopoulos)
[14:53:24] (CR) Ilias Sarantopoulos: [C:+2] reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172 (owner: Ilias Sarantopoulos)
[14:54:10] (Merged) jenkins-bot: reference-quality: revert async for ref-risk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125172 (owner: Ilias Sarantopoulos)
[17:07:30] (CR) AikoChou: [C:+1] article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[17:12:06] (PS2) Kevin Bazira: article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970)
[17:13:30] (CR) Kevin Bazira: [C:+2] "thanks for the review, Aiko! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[17:14:15] (Merged) jenkins-bot: article-country: update score normalization to support wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125126 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[17:31:29] I have an issue with a deployment in cluster: ml-serve-eqiad, namespace: revision-models. I made a new deployment and because we have reached the resource quota (as autoscaling has reached the maxreplicas) the new revision can't be scheduled. What would be a good approach? I guess for now deleting the old revision would start removing the current pods. but for the future? increasing the resource quotas?
[17:31:50] in codfw I did the deployment fine as there were still resources left
[17:41:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:41:49] Deployment reference-quality-predictor-00009-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00009-deployment ...
[17:41:49] - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:43:36] and there is the alert for it
[17:43:38] :D
[17:53:19] klausman: if you're here can you advise? otherwise we can take a look in the morning
[17:53:50] I'm going to go afk for 1h but I can check later
[18:17:11] If it can't be scheduled, it is _usually_ a resource issue, but I see 8 replicas and none pending atm
[18:26:57] I don't think there is an easy way to move forward except adding quota. The service just needs that kinda breathing room
[18:55:44] FIRING: LiftWingServiceErrorRate: ...
[18:55:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:56:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment reference-quality-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:01:49] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment reference-quality-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[19:05:44] RESOLVED: LiftWingServiceErrorRate: ...
[19:05:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[19:25:36] ack, thank you. I see the pods from the new revision are up at the moment
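
Note on the quota discussion above (17:31 to 18:26): the scheduling problem can be read as simple arithmetic. While a new revision rolls out, its pods have to fit into the namespace quota alongside the old revision's pods, so a service already sitting at its autoscaling ceiling leaves no room. The sketch below is only an illustration under stated assumptions: the Rollout class, its field names, and every figure in the example (CPU per pod, quota size, one surge pod) are hypothetical and do not come from the log or from the actual revision-models configuration. Only the replica count of 8 echoes the number mentioned at 18:17, and the log does not say that it is the configured maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Hypothetical rollout inputs; the figures are illustrative, not taken from the cluster."""

    max_replicas: int        # autoscaler ceiling the old revision has reached
    cpu_per_pod: float       # CPU request per pod, in cores
    quota_cpu: float         # namespace CPU quota, in cores
    surge_pods: int = 1      # new-revision pods that must start before old pods drain
    other_cpu: float = 0.0   # CPU already requested by other workloads in the namespace


def cpu_needed(r: Rollout) -> float:
    """CPU that must fit in the quota while old and new revisions coexist."""
    return (r.max_replicas + r.surge_pods) * r.cpu_per_pod + r.other_cpu


def extra_quota_needed(r: Rollout) -> float:
    """Cores to add to the quota so the rollout can proceed; 0.0 if it already fits."""
    return max(0.0, cpu_needed(r) - r.quota_cpu)


# Example with made-up numbers: 8 pods at 2 cores each exactly fill a 16-core quota,
# so even a single surge pod for the new revision cannot be admitted.
example = Rollout(max_replicas=8, cpu_per_pod=2.0, quota_cpu=16.0)
print(extra_quota_needed(example))  # 2.0
```

Under these assumptions, the two options raised in the channel map onto the two terms: deleting the old revision lowers the left-hand side for now, while raising the quota gives future rollouts permanent headroom. The same calculation would apply to memory requests.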