[06:05:44] FIRING: LiftWingServiceErrorRate: ...
[06:05:49] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:10:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:55:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:00:44] RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:12:43] Hello!
[08:39:31] good morning
[08:42:04] I'm taking a look at the alerts above
[08:49:44] FIRING: LiftWingServiceErrorRate: ...
[08:49:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:54:44] RESOLVED: LiftWingServiceErrorRate: ...
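[Editor's note] The LiftWingServiceErrorRate alert above fires when too many responses fall outside the "expected" status codes (2xx, 3xx, and 400). The real rule is a Prometheus expression over Istio metrics; the sketch below is only an illustrative reimplementation of that condition, and the 5% threshold and sample counts are assumptions, not values from the actual alert.

```python
from collections import Counter

def error_rate(status_counts: Counter) -> float:
    """Fraction of responses that are neither 2xx/3xx nor 400.

    Illustrative mirror of the LiftWingServiceErrorRate condition:
    2xx, 3xx and 400 count as expected; everything else (5xx, 429, ...)
    counts toward the error rate.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    expected = sum(
        n for code, n in status_counts.items()
        if 200 <= code < 400 or code == 400
    )
    return (total - expected) / total

# Invented numbers, not taken from this incident.
counts = Counter({200: 900, 400: 20, 503: 80})
rate = error_rate(counts)
THRESHOLD = 0.05  # assumed threshold, not the real rule's value
print(f"error rate: {rate:.2%}, firing: {rate > THRESHOLD}")
```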
[08:54:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:55:11] georgekyz: do you want to go through the alerts together?
[08:55:40] yeah sure
[08:56:31] pinging you in 3'
[08:56:43] ok
[09:34:25] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10566952 (10dcausse) Unfortunately I had to pause the backfill, we've burnt quite some budget of our update lag SLO (https://gra...
[09:35:02] we're looking at the Istio access logs dashboard but it seems broken (can't see the requests) - https://logstash.wikimedia.org/goto/fabe47bbf64a4f898f7aadcadbf06fe1
[09:35:28] trying to figure out why it is broken but if anyone has an idea help is welcome
[09:38:15] Morning!
[09:38:20] having a look as well
[09:39:34] we are here if you want to join https://meet.google.com/tew-iwdw-fqn
[09:39:42] thanks for looking into it
[09:47:59] I think Logstash is having an issue. I have poked o11y about it
[09:54:02] we are also seeing something else that is weird (apart from the increased latencies :P ). Although there are 3 pods deployed in the revision-models ns in ml-serve-eqiad, we only see 1 of them in grafana https://grafana.wikimedia.org/goto/dOdYMBcHR?orgId=1
[10:02:33] I see only one pod in serve-codfw/revision-models
[10:02:42] (on the deployment host)
[10:03:02] yes codfw has 1 but eqiad has 3
[10:03:14] The grafana link you sent was for codfw
[10:03:22] yes sry my bad
[10:03:32] and the eqiad dashboard shows three pods :)
[10:04:00] thanks!
my mind was stuck thinking I had the right filters
[10:04:21] it happens. At least you're not doing what I have done in the past: restart the pods on the wrong DC :D
[10:06:50] :D
[10:07:10] istio dashboard working now :)
[10:07:22] Logstash should work again (thanks to Luca and Filippo)
[10:07:37] thanks a million!
[10:08:16] The traffic spike around 09:42 is mostly WME, as far as I can tell.
[10:20:45] yes. all traffic goes to the reference-quality model server (reference-risk and reference-need)
[10:21:27] aiko: we are seeing very high latencies https://grafana.wikimedia.org/goto/QmnH4fcHg?orgId=1
[10:21:27] and high cpu usage on these pods https://grafana.wikimedia.org/goto/FgQHVf5HR?orgId=1
[10:24:36] and it causes throttling both in the preprocess and the predict stage. Anything else we could look into apart from increasing cpu limits?
[11:19:03] * klausman lunch
[11:48:06] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10567209 (10isarantopoulos) I've managed to build the base vllm image on ml-lab and it is 34GB. Compressing the image brings it down to 7.0GB. I'm proceeding to also build the final image which...
[11:48:26] finally built the base vllm image on ml-lab! 🎉
[12:00:21] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10567272 (10isarantopoulos)
[12:00:52] georgekyz: --^ I've updated the task description to define the steps/patches we need to make
[12:04:36] isaranto: thnx a lot I will check it
[12:26:20] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10567375 (10isarantopoulos) I've replicated building locally on ml-lab both images using the Dockerfiles. The final image is 35.7GB and 7.6GB compressed (as expected cause this is also the upstrea...
[12:52:43] isaranto: maybe increase maxReplicas for reference-quality model server?
would it help?
[13:02:14] you're right, it might help. Going to do that and then see about increasing cpu
[13:03:04] it would be good if we figure out what is causing this throttling (mwapi? sqlite?) to see which would be the best way to handle horizontal vs vertical scaling
[13:05:50] increasing it to 5 to start with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121362
[13:07:42] aiko: ok if I merge the above?
[13:08:51] thank you!
[13:10:57] deployed!
[13:13:04] it seems we'd better also limit the knative autoscaling targets to something lower than 20 rps. revertrisk is much more lightweight as a service and has 15
[13:13:35] the traffic at the moment is a bit less than 20 rps https://grafana.wikimedia.org/goto/B5fZ0BcHg?orgId=1
[13:14:04] which means that probably autoscaling won't kick in.
[13:14:10] * isaranto going for lunch and will check again
[13:14:29] isaranto: grafana shows high latencies in the preprocess but not the predict stage for reference-risk, so it would probably be mwapi, not sqlite, causing the throttling
[13:14:54] I also saw it in predict from the logs, but only after throttling had already happened
[13:15:10] check here for example https://logstash.wikimedia.org/goto/7050fe1163e34736ab4302cc5b563bee
[13:17:48] I didn't remember that sqlite is used in predict, thanks for pointing it out!
[13:18:07] can we tell from the logs whether the request is calling reference-need or reference-risk?
[13:23:29] because for reference-need it looks like both preprocess and predict have high latencies, while for reference-risk only preprocess shows high latencies
[13:25:17] yeah sqlite is used in reference-risk's prediction.
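[Editor's note] The exchange above hinges on how Knative's RPS-based autoscaler sizes a deployment: roughly, enough pods that each stays at or below the per-pod target, clamped to min/maxReplicas. That is why ~19 rps of traffic against a 20 rps target never triggers a scale-out. The sketch below is a simplification of the real autoscaler (which also averages over windows and applies panic-mode logic); maxReplicas=5 matches the patch discussed above, the rest is assumed for illustration.

```python
import math

def desired_replicas(observed_rps: float, target_rps: float,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Rough model of Knative's RPS autoscaler: ceil(rps / per-pod
    target), clamped to the configured replica bounds."""
    raw = math.ceil(observed_rps / target_rps)
    return max(min_replicas, min(max_replicas, raw))

# With the default target of 20 rps, ~19 rps of traffic never scales out:
print(desired_replicas(19, target_rps=20))   # 1
# Lowering the target to 15 (as revertrisk uses) would add a pod:
print(desired_replicas(19, target_rps=15))   # 2
```

This is also why lowering the autoscaling target (rather than only raising maxReplicas) was the lever being discussed: with sub-target traffic, a higher replica cap alone changes nothing.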
there is no model, just fetching data from the db
[13:25:43] https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/reference_risk/model.py#L163
[13:37:05] we set the knative autoscaling target to 3 here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/revision-models/values.yaml#59
[13:39:24] it should override the 20 at line 50?
[13:48:44] FIRING: LiftWingServiceErrorRate: ...
[13:48:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:48:45] isaranto: if you have time, could you please print the layers of the docker image that you built in the task?
[13:49:58] yes, I also have the full build logs saved
[13:50:27] I am asking since I'd like to see if we can break down the layers into max 4GB each
[13:50:57] for example, if you RUN something with && and you pip install pytorch, vllm etc., maybe doing multiple RUNs could help (since you'd create more layers)
[13:51:23] ack
[13:51:23] having 34G of image is something really sad, but it is mostly a rant at the ecosystem, not at you :)
[13:51:49] shall I append these to the task or in a paste?
[13:52:07] in the task is fine, is it something 1000 lines long?
[13:52:14] otherwise a paste
[13:53:10] no it is ~20-30
[13:54:25] then it should be fine
[13:58:59] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10567735 (10isarantopoulos) Below we can see the uncompressed layer sizes for both images.
going to get you the full created_by column as well but I have to modify the output a bit to paste it her...
[14:12:59] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10567758 (10Isaac) Thanks for the update -- given that work is still in progress on the keyword and we've paused the use of cirr...
[15:57:58] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10568210 (10isarantopoulos) These are the base image layers with the full instructions for each one. It seems that there is a 29GB layer that we could break into smaller ones. ` IMAGE... [16:23:05] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10568321 (10isarantopoulos) Looking at the large layer and the things it installs, we focus on `rocm-dev` and `rocm-libs`, which are meta packages and contain the following: ` apt-cache depends rocm-...
[16:25:51] now there is no traffic on reference-quality so we won't know; however, I'll make a patch to reduce the autoscaling limits
[16:25:56] *triggers not limits
[16:30:06] actually there is just one pod serving 500s
[16:30:15] :(
[16:34:31] hmm there is no traffic at the moment. last request was made @ 14:50 UTC.
However I do see 503 error messages on the istio dashboard and can't understand what this is https://grafana.wikimedia.org/goto/49txXf5HR?orgId=1
[16:35:09] I also saw 500s from the MWAPI, but it was just before the mtg
[16:35:39] yes these were the ones I saw coming from mwapi
[17:02:02] (03PS1) 10Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[17:02:17] * isaranto afk
[17:02:48] (03PS2) 10Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[17:48:44] FIRING: LiftWingServiceErrorRate: ...
[17:48:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:49:32] I'll add a silence for that ^^^
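[Editor's note] Earlier in the log, the suggestion for the ~34GB vllm image was to split one big `RUN` into several, since each `RUN` instruction produces one layer and the goal was to keep every layer under ~4GB. The sketch below shows that budget check over `docker history`-style output (size in bytes, tab, creating instruction). The sample rows are invented for illustration, not the actual image's layers.

```python
# Flag image layers above a size budget. Input mimics
# `docker history --human=false --format '{{.Size}}\t{{.CreatedBy}}'`.
MAX_LAYER = 4 * 1024**3  # the ~4GB per-layer budget mentioned in the log

# Invented sample, loosely shaped like the rocm/vllm image discussion.
sample_history = """\
29000000000\tRUN apt-get install -y rocm-dev rocm-libs
1200000000\tRUN pip install vllm
500000000\tCOPY model_server /srv/app
"""

def oversized_layers(history: str, budget: int = MAX_LAYER):
    """Yield (size_bytes, instruction) for layers exceeding the budget."""
    for line in history.splitlines():
        size, created_by = line.split("\t", 1)
        if int(size) > budget:
            yield int(size), created_by

for size, instr in oversized_layers(sample_history):
    print(f"{size / 1024**3:.1f} GiB  {instr}")
```

Since each `RUN` is one layer, splitting the single huge `apt-get` instruction into several `RUN`s (e.g. one per ROCm meta-package, as the `apt-cache depends` breakdown in the task suggests) would spread that size across multiple layers, at the cost of slightly more apt metadata per layer.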