[08:04:23] hello folks!
[08:16:11] so I'm trying to deal with the reference-need horror
[08:16:23] actually both reference-need & risk
[08:45:01] isaranto: Do you need any help? Is there anything that we can support?
[08:49:19] o/ there are a couple of steps that we could take. I will write the options on the task https://phabricator.wikimedia.org/T387019
[08:49:38] this patch can be reviewed, thanks! https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1126094
[08:49:57] but I also need to fix CI :)
[08:50:17] I'm thinking of 2 things: separate the model servers into different deployments and use multiprocessing
[09:00:16] (PS1) Ilias Sarantopoulos: reference-quality: allow to deploy models separately [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126499 (https://phabricator.wikimedia.org/T387019)
[09:01:58] (PS3) Ilias Sarantopoulos: reference-quality: allow to deploy models separately [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126094 (https://phabricator.wikimedia.org/T387019)
[09:17:17] (Abandoned) Ilias Sarantopoulos: reference-quality: allow to deploy models separately [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126094 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[09:25:03] isaranto: o/ lemme know if you need any help
[09:25:37] IIRC having multiple uvicorn workers was not ideal from the kserve point of view, I recall that we thought about it when we added the multi-process pool
[09:25:38] ok, thanks!
[09:26:39] yeah it isn't stable and as you had already said ray would be preferred. I reverted that change but it is not the only issue
[09:29:44] the kserve-container CPU graphs show very little throttling (that is good) but a ton of CPU usage
[09:30:02] and from something like https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revision-models&var-component=All&var-model_name=reference-risk it seems that predict takes little time
[09:30:06] (PS1) Santhosh: Update dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504
[09:30:12] and preprocess is the usual horror
[09:30:27] but you were saying yesterday that it may not be preprocess this time?
[09:30:50] I have no idea what the model-server does, but I guess it happens mostly in KI
[09:31:31] there are 2 models served from the same pod. there is also reference-need where predict takes a lot of time https://grafana.wikimedia.org/goto/iU0jRnhHg?orgId=1
[09:31:31] (CR) CI reject: [V:-1] Update dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504 (owner: Santhosh)
[09:32:50] I want to try multiprocessing for preprocess but it is tricky because again things happen in KI (a mwapi request + some postprocessing of the response). the tricky bit is that we'll use async + multiprocessing
[09:33:48] yeah very sneaky indeed
[09:34:08] it would be nice if KI was capable of accepting a pool to run in, or similar
[09:34:15] but it would probably require a big change
[09:34:39] I would suggest separating the model servers into two separate isvcs as a starter
[09:34:59] if it is feasible I mean, so they don't influence each other's perf
[09:37:29] yes because both have heavy load but different needs in terms of resources. I'm just updating the task as we speak. I will ping again for help if needed. thank you!
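For illustration, the "async + multiprocessing" idea discussed above (keep the async mwapi request on the event loop, push the CPU-bound KI post-processing into a process pool) could look roughly like the minimal sketch below. All names in it (Preprocessor, fetch_revision, cpu_bound_postprocess) and the worker count are hypothetical placeholders, not the actual inference-services or knowledge_integrity code.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


# Hypothetical stand-in for the CPU-bound KI post-processing step; the real
# knowledge_integrity functions may look quite different.
def cpu_bound_postprocess(api_response: dict) -> dict:
    return {"num_revisions": len(api_response.get("revisions", []))}


class Preprocessor:
    def __init__(self, max_workers: int = 4):
        # One pool per model server; max_workers is a tuning knob, not a verified value.
        self.pool = ProcessPoolExecutor(max_workers=max_workers)

    async def fetch_revision(self, rev_id: int) -> dict:
        # Placeholder for the async mwapi request done inside KI.
        await asyncio.sleep(0)
        return {"rev_id": rev_id, "revisions": []}

    async def preprocess(self, payload: dict) -> dict:
        # The I/O-bound part stays on the event loop...
        api_response = await self.fetch_revision(payload["rev_id"])
        # ...while the CPU-bound part runs in the process pool so it does not
        # block other in-flight requests served by the same asyncio loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.pool, cpu_bound_postprocess, api_response)


async def main() -> None:
    result = await Preprocessor().preprocess({"rev_id": 12345})
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

Passing something like self.pool into the library explicitly is essentially what "KI accepting a pool to run in" would mean in this sketch.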
[09:46:24] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10622651 (isarantopoulos) We can explore the following options to make the service more reliable: 1. Separate the 2 model server deployments:...
[09:47:18] I summarized some options, lemme know if I missed sth
[09:48:20] makes sense yes
[10:17:54] Morning!
[10:44:07] \o
[10:48:08] (PS2) Ilias Sarantopoulos: reference-quality: allow to deploy models separately [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126499 (https://phabricator.wikimedia.org/T387019)
[10:50:47] I plan to merge this one first to separate the services https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1126499
[10:51:40] (PS3) Ilias Sarantopoulos: reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019)
[10:57:35] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10622927 (isarantopoulos) **re: model deployment separation** I want to propose the following process: # Merge the patch in [[ https://gerrit....
[10:58:14] and this patch would likely do the trick for the API GW https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126523 (this is one of the last steps after we deploy the new services)
[11:13:12] are you planning to run the combined service along with the two split ones until the APIGW change is done?
[11:13:25] (CR) Gkyziridis: [C:+1] "Thnx for working on this one." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126499 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[11:33:34] klausman: yes exactly, I outlined a deployment plan here https://phabricator.wikimedia.org/T387019#10622927. lemme know if this makes sense
[11:34:31] (CR) Ilias Sarantopoulos: [C:+2] reference-quality: allow to deploy models separately [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126499 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[11:35:18] (Merged) jenkins-bot: reference-quality: allow to deploy models separately [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126499 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[11:36:13] isaranto: yeah, that sgtm
[11:43:24] (CR) Ilias Sarantopoulos: inference-services: Develop loading peacock model logic. (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:45:16] * isaranto lunch o'clock!
[12:06:44] * aiko lunch 2
[12:40:39] o/ whenever you get a minute, please review: https://gerrit.wikimedia.org/r/1125661
[12:40:39] thanks!
[12:43:42] kevinbazira: o/ I'm on it!
[12:44:09] ack, thanks Aiko!
[13:01:46] I created the new deployments --> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126545
[13:02:11] Looking...
[13:03:56] isaranto: AIUI, this will immediately remove the old joint service in staging, but we don't expect anyone to use it, anyway. Do we have enough quota in the prod NS to run both joint and split services?
[13:04:40] it will remove it from the experimental ns in staging (not revision-models) so I can play around with resources and do load tests
[13:05:08] I think it will be enough for prod to spin up 1 pod for each. perhaps we'll need to boost the quota though
[13:05:15] Roger
[13:59:59] (PS4) Ilias Sarantopoulos: reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019)
[15:05:11] Tobias and I are deploying the changes to the API GW in a bit
[15:06:33] please drop a note in #wikimedia-sre
[15:07:07] there is a mediawiki deployment window ongoing
[15:07:20] it shouldn't clash, but we shouldn't overlap with it as much as possible
[15:08:09] ack
[15:10:55] and done
[15:50:49] so we did the deployment. reference-risk is looking pretty awesome https://grafana.wikimedia.org/goto/UkhLEn2HR?orgId=1
[15:50:56] BUT reference-need is struggling
[15:56:14] (PS3) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[15:57:38] (PS4) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[16:02:23] (PS5) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[16:02:29] klausman: I made a patch to remove the old deployment and also increase/decrease resources according to https://grafana.wikimedia.org/goto/rPCdsnhHR?orgId=1
[16:02:29] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126592
[16:02:49] if you approve I'll merge/deploy this
[16:03:01] (CR) CI reject: [V:-1] inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[16:03:12] the previous revision for ref-need (0002) is stuck and the new pods (0003) have no traffic
[16:03:38] it seems that they are unresponsive after being throttled for a while
[16:05:08] (PS6) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[16:05:46] (CR) CI reject: [V:-1] inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[16:08:23] (PS7) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[16:13:59] isaranto: that change would revert the minreplicas to 1 (from 8), is that the intent?
[16:14:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[16:14:49] Deployment reference-need-predictor-00003-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00003-deployment - ...
[16:14:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:15:00] uh oh
[16:15:20] they will eventually spin up I guess. My worry is that there is no wiggle room atm for more pods to spin up
[16:15:42] let's go with that and see
[16:15:48] ack
[16:16:00] +1'd
[16:16:50] ty
[16:17:13] (CR) Gkyziridis: "I tested it locally, hope it is gonna work in the server as well." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[16:18:35] (CR) Gkyziridis: inference-services: Develop loading peacock model logic. (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[16:19:01] (CR) AikoChou: "Thanks for working on this, Kevin! :) I have a few minor comments and a suggestion for readability." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[16:24:27] klausman: for some reason the old revision (0002) won't go away and the new one that I just deployed is not ready although the pod is up (this is 00004)
[16:24:33] isaranto: maybe we should consider deleting the 002 pods (or the whole deployment)
[16:24:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[16:24:49] Deployment reference-need-predictor-00003-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00003-deployment - ...
[16:24:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:25:01] I can try deleting one of the 002 pods... and why did it recover just now?
[16:25:33] if you can please delete the revision 0002
[16:25:37] ack
[16:25:46] the service is stuck anyway so we're not doing any additional harm
[16:26:03] done
[16:26:12] thanks!
[16:26:16] 🤞
[16:26:43] aaand they're restarting
[16:27:31] yes because the latest ready revision was this one (0002)
[16:27:34] argh
[16:27:42] ah, good point
[16:29:14] but I don't see any reason why the revision shouldn't be considered ready
[16:29:58] that was the whole point of setting minreplicas to 1
[16:30:42] we could try setting maxreplicas to one to get rid of everything, then bump it up again?
[16:31:01] but that would likely scale the "ready" deployment again, back to square 1
[16:31:42] hang on, min/maxreplicas is still 8/8
[16:32:42] and that is because the deployment didn't change from Helm's pov, probably
[16:32:59] oh ok
[16:33:04] fixed that
[16:33:11] thanks that makes sense
[16:33:20] and now #4 is terminating %-)
[16:34:06] ok that was it
[16:34:14] ah, #5 showed up, became ready and now #2 is going away
[16:34:44] so my sync didn't affect the minreplicas and it remained 8, so the new revision was never ready because it had no resources to spin up 8 replicas
[16:34:59] yep
[16:35:18] More #5s starting
[16:38:33] the dash says it wants 8 replicas, but has only 6. Maybe a resource issue on the NS?
[16:39:49] yes, it seems like it
[16:40:34] although I didn't see a specific event, which is weird
[16:42:42] I need to go afk for a bit. I'll be back in ~30, is that ok?
[16:44:44] FIRING: LiftWingServiceErrorRate: ...
[16:44:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:49:43] I will have to go as well in a bit
[16:50:39] I will put a silence and report on the task. The pods are throttled again so we need to try the other options (batch inference and then multiprocessing)
[16:52:15] (PS5) Ilias Sarantopoulos: reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019)
[16:52:44] aiko: is it ok if I merge this one? https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1126070
[16:52:56] (CR) CI reject: [V:-1] reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[16:53:01] we have already merged the MR on the KI side
[16:53:34] (PS6) Ilias Sarantopoulos: reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019)
[16:55:50] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10624928 (isarantopoulos) We have separated the service and it solved the problem for reference-risk which is now serving at low latencies https:/...
[17:02:42] I silenced the alert for now :(
[17:02:44] * isaranto afk
[17:03:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:03:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[17:03:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:08:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[17:08:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[17:08:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:12:58] (CR) AikoChou: [C:+1] reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[17:45:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:45:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[17:45:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:15:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[18:15:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[18:15:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:29:19] (CR) Ilias Sarantopoulos: [C:+2] reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[18:30:06] (Merged) jenkins-bot: reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[21:36:32] Machine-Learning-Team, EditCheck, VisualEditor, Editing-team (Kanban Board): Evaluate efficacy of Peacock Check model output (internal review) - https://phabricator.wikimedia.org/T384651#10626528 (ppelberg) a: ppelberg
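For illustration, the "set reference-need batch size through env var" change merged above (https://gerrit.wikimedia.org/r/1126070) could boil down to something like the sketch below: read a batch size from the environment and chunk the inputs so the model does one forward pass per batch instead of one per sentence. The variable name, default value, and model.score_batch call are assumptions for illustration, not the contents of the actual patch.

```python
import os
from typing import Iterator, List

# Hypothetical env var name and default; the merged patch may use different ones.
BATCH_SIZE = int(os.environ.get("REFERENCE_NEED_BATCH_SIZE", "16"))


def batched(items: List[str], batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield fixed-size chunks so inference runs per batch rather than per item."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]


def predict(sentences: List[str], model) -> List[float]:
    scores: List[float] = []
    for batch in batched(sentences):
        # model.score_batch is a placeholder for the real batched predict call.
        scores.extend(model.score_batch(batch))
    return scores
```

Exposing the batch size as an environment variable presumably lets it be tuned per deployment (e.g. from the deployment-charts values) without rebuilding the image, which fits the "batch inference" option listed on T387019.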